# News Popularity in Multiple Social Media Platforms
This project analyzes the **News Popularity in Multiple Social Media Platforms** dataset from the UCI Machine Learning Repository. The data contains ~93k news items collected between November 2015 and July 2016, with their final popularity on Facebook, Google+ and LinkedIn across four topics: *economy*, *microsoft*, *obama* and *palestine*.
---
## 1. Exploratory Data Analysis
### 1.1 Data overview and cleaning
We work primarily with `Data/News_Final.csv`, which has **93,239** rows and 11 variables:
- `IDLink` – numeric id of the article
- `Title`, `Headline` – short text fields
- `Source` – news outlet that originally published the story
- `Topic` – one of {economy, microsoft, obama, palestine}
- `PublishDate` – publication timestamp
- `SentimentTitle`, `SentimentHeadline` – numeric sentiment scores derived from title and headline text
- `Facebook`, `GooglePlus`, `LinkedIn` – final popularity on each social media platform
According to the dataset documentation, **-1** in the popularity variables indicates that no final popularity value was observed. In the code, any value `< 0` in `Facebook`, `GooglePlus`, or `LinkedIn` is therefore replaced with `NaN`. Missing popularity values are later dropped on a per-model basis. `PublishDate` is converted to a proper timestamp, and a numeric time feature
```text
DaysSinceEpoch = days since 1970-01-01
```
is created to allow inclusion of temporal trends in the models. We also log-transform Facebook popularity:
```text
log_Facebook = log1p(Facebook)
```
which is used as the target for regression models.
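These cleaning steps condense to a few lines of pandas (a sketch mirroring the full analysis script included at the end of this report):

```python
import numpy as np
import pandas as pd

news = pd.read_csv("Data/News_Final.csv")

# -1 encodes "no final popularity observed" -> treat as missing
for col in ["Facebook", "GooglePlus", "LinkedIn"]:
    news.loc[news[col] < 0, col] = np.nan

# numeric time feature: whole days since the Unix epoch
news["PublishDate"] = pd.to_datetime(news["PublishDate"])
news["DaysSinceEpoch"] = (news["PublishDate"] - pd.Timestamp("1970-01-01")).dt.days

# log1p keeps zero-share articles defined and compresses the heavy tail
news["log_Facebook"] = np.log1p(news["Facebook"])
```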
---
### 1.2 Popularity distributions
A histogram of Facebook share counts on a **logarithmic x-axis**, after removing missing and zero values, is shown in Figure 1:
![distribution of facebook popularity](imgs/eda_facebook_hist.png)
*Figure 1: Distribution of Facebook popularity on a log x-axis.*
The distribution is extremely right-skewed:
- Most articles receive very few shares.
- A small number of “viral” articles receive thousands of shares.
On the cleaned data, summary statistics for Facebook shares are approximately:
- median ≈ 8
- mean ≈ 129
- 90th percentile ≈ 214
- 99th percentile ≈ 2,322
- max = 49,211
Google+ and LinkedIn exhibit similar heavy-tailed patterns (with smaller absolute scales), which matches the dataset creators' description ([arXiv][1]).
The distribution of `log1p(Facebook)` is shown in Figure 2:
![distribution of log-transformed facebook popularity](imgs/eda_log_facebook_hist.png)
*Figure 2: Distribution of log-transformed Facebook popularity.*
The log transform compresses the heavy tail and produces a more regular, unimodal distribution. This justifies using `log1p(popularity)` as the regression target: it reduces the influence of rare extreme outliers while keeping them in the data, which is important because viral stories are precisely the phenomenon of interest.
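These summary statistics can be recomputed directly from the cleaned `Facebook` column (a quick check; the quoted values are approximate):

```python
fb = news["Facebook"].dropna()
print(fb.median())                # ~8
print(fb.mean())                  # ~129
print(fb.quantile([0.90, 0.99]))  # ~214 and ~2,322
print(fb.max())                   # 49,211
```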
---
### 1.3 Topic effects
The four topics are not equally represented:
- economy: 33,928 items
- obama: 28,610 items
- microsoft: 21,858 items
- palestine: 8,843 items
The mean log-Facebook popularity by topic is shown in Figure 3:
![average facebook popularity by topic](imgs/eda_mean_by_topic.png)
*Figure 3: Mean log-Facebook popularity by topic.*
Key observations:
- **obama** stories clearly have the highest average popularity.
- **microsoft** is slightly above **economy** and **palestine**.
- In original share counts, obama articles average roughly an order of magnitude more shares than economy/microsoft/palestine stories, but all topics remain strongly skewed.
This suggests that topic is an important categorical predictor of popularity and motivates including it as a one-hot encoded feature in the models.
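The encoding itself is one call; with `drop_first=True`, pandas drops the alphabetically first category (economy), which becomes the implicit baseline:

```python
X = pd.get_dummies(
    news[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]],
    columns=["Topic"], drop_first=True,
)
# yields Topic_microsoft, Topic_obama, Topic_palestine dummy columns
```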
---
### 1.4 Sentiment and popularity
Sentiment scores from the title and headline are continuous values roughly in the interval [-1, 1]. Their empirical distributions are centered very close to 0 with standard deviations around 0.14, indicating that most titles and headlines are only mildly positive or negative.
A 5,000-row sample of `SentimentTitle` vs `log_Facebook` is plotted in Figure 4:
![title sentiment vs facebook popularity](imgs/eda_sentiment_vs_popularity.png)
*Figure 4: Scatter of title sentiment vs log-Facebook popularity (sample of 5,000 articles).*
The scatter plot shows:
- A dense vertical band near sentiment 0, reflecting many neutral titles.
- Viral and non-viral articles scattered across the full sentiment range, with no obvious linear trend.
Empirically, the correlation between title sentiment and Facebook popularity is almost zero (|r| ≈ 0.01). This suggests that sentiment alone is a weak predictor of popularity; we still include it in the models because it may interact with topic or time, but we do not expect it to explain much variance by itself.
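The near-zero correlation is a one-liner to verify:

```python
r = news[["SentimentTitle", "log_Facebook"]].dropna().corr().iloc[0, 1]
print(f"Pearson r = {r:.3f}")  # ~0.01 in magnitude
```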
---
### 1.5 EDA conclusions
From the exploratory analysis we conclude:
1. **Popularity variables are non-negative, highly skewed, and heavy-tailed.**
   - Log-transforming shares yields more regular distributions, so regression models should target `log1p(popularity)` instead of raw counts.
2. **Topic has a strong effect on expected popularity.**
   - In particular, obama-related news is more popular on Facebook; microsoft is relatively stronger on LinkedIn (from descriptive statistics, not shown here).
3. **Title/headline sentiment has little linear relationship with popularity.**
- It should not be expected to drive predictions strongly.
4. **There are many extreme outliers (viral stories), but these are the signal we care about.**
- We choose *not* to remove them; instead, we rely on robust models and logtransformed targets.
These observations motivate a modeling strategy that combines:
- **Linear models** (to quantify simple topic/sentiment effects on log-popularity).
- **Nonlinear tree-based models** (to capture complex relationships and heavy-tailed behaviour).
- **Classification** of viral vs non-viral stories.
- **Clustering** of time-series trajectories to identify typical growth patterns.
The next section formalizes these ideas.
---
## 2. Model Development, Validation and Optimization
We develop **five** models: three regression models (including a dimension-reduced variant), one classification model, and one clustering model. This covers regression, classification, and unsupervised learning objectives, and explicitly examines the impact of dimensionality reduction.
All supervised models use:
- Train/test split: **80% training, 20% test**, `random_state=42`.
- Evaluation on the heldout test set only (no peeking).
- Metrics:
  - Regression: R² and RMSE on the log scale (using `root_mean_squared_error`).
- Classification: accuracy, F1 for the positive class, ROC AUC and confusion matrix.
### 2.1 Common preprocessing
For each model:
1. Replace `-1` in `Facebook`, `GooglePlus`, `LinkedIn` with `NaN`.
2. Drop rows with missing values in the specific target variable.
3. Use `DaysSinceEpoch` as a numeric representation of `PublishDate`.
4. Where appropriate, use `log_Facebook = log1p(Facebook)` as the regression target.
5. Encode `Topic` using one-hot encoding with economy as the reference level (`drop_first=True`).
For time-series models we also use `Data/Facebook_Economy.csv`, which stores Facebook popularity snapshots TS1–TS144, one every 20 minutes, for economy articles. We join it with `News_Final.csv` on `IDLink` and restrict to:
- `Topic == "economy"`
- Time slices **TS1–TS50** as predictors (roughly the first 16–17 hours)
- Final log-Facebook popularity as the target
Negative TS values are interpreted as “no observed popularity yet” and are set to 0.
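The join and cleaning follow the same pattern as the analysis script at the end of the report (a sketch; note that negative snapshots are zeroed, not dropped):

```python
fb_econ = pd.read_csv("Data/Facebook_Economy.csv")

# integer id on the news side, then an inner join on article id
econ = news[news["Topic"] == "economy"].copy()
econ["IDLink_int"] = econ["IDLink"].astype(int)
fb_econ_merged = fb_econ.merge(econ, left_on="IDLink", right_on="IDLink_int", how="inner")

ts_cols = [c for c in fb_econ.columns if c.startswith("TS")]
for col in ts_cols:
    fb_econ_merged.loc[fb_econ_merged[col] < 0, col] = 0  # "not yet observed" -> 0

fb_econ_merged = fb_econ_merged[fb_econ_merged["Facebook"].notna()].copy()
fb_econ_merged["log_Facebook"] = np.log1p(fb_econ_merged["Facebook"])
ts_cols_early = ts_cols[:50]  # TS1..TS50, ~first 16-17 hours
```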
---
### 2.2 Regression Model 1: Linear regression on static features
**Goal.** Predict log-Facebook popularity using only static metadata (no early popularity feedback).
- **Target:** `y = log_Facebook` for all topics.
- **Features:**
- `SentimentTitle`, `SentimentHeadline`
- `DaysSinceEpoch` (publication time)
  - Topic one-hot dummies: `Topic_microsoft`, `Topic_obama`, `Topic_palestine` (economy is the implicit baseline).
We fit an ordinary least squares linear regression on the training split and evaluate on the test set.
**Results (test set):**
- **R² ≈ 0.157**
- **RMSE ≈ 1.86** in log space
Actual vs predicted log-Facebook values are shown in Figure 5:
![model 1: actual vs predicted](imgs/model1_actual_vs_predicted.png)
*Figure 5: Model 1 predictions vs actual log-Facebook values.*
The predictions are compressed into a narrow band, underpredicting viral articles and overpredicting low-popularity ones. Key coefficients:
- `Topic_obama` ≈ +1.78 (a large positive shift vs economy)
- `Topic_microsoft` ≈ +0.10
- `Topic_palestine` ≈ +0.02
- `SentimentTitle` ≈ -0.38, `SentimentHeadline` ≈ -0.06
- `DaysSinceEpoch` ≈ -0.0007 (a tiny downward trend over time)
Interpretation:
- Topic has a clear effect (especially obama).
- Sentiment effects are small and slightly negative.
- The model explains only ~16% of the variance in log-popularity, confirming that static features alone are weak predictors.
---
### 2.3 Regression Model 2: Random forest on early time slices
**Goal.** Predict final log-Facebook popularity for **economy** stories using early Facebook popularity time slices and sentiment.
- **Target:** `log_Facebook` for the economy topic, joined with the Facebook_Economy time series.
- **Features:**
  - TS1–TS50 (early cumulative popularity counts, cleaned: negative → 0)
- `SentimentTitle`, `SentimentHeadline`
We fit a `RandomForestRegressor` with:
- 120 estimators,
- `min_samples_leaf=2`,
- `max_depth=None` (trees grow fully),
- `n_jobs=-1`, `random_state=42`.
**Results (test set):**
- **R² ≈ 0.746**
- **RMSE ≈ 0.86** (log scale)
Feature importances indicate:
- `TS50` alone contributes ~81% of total importance.
- Combined sentiment variables contribute ~17%.
- Earlier TS features each have very small marginal importance.
Thus, knowing an article's popularity after ~17 hours (TS50) is already highly predictive of its final 2-day popularity. Early engagement is a much stronger signal than sentiment or publish time.
---
### 2.4 Regression Model 3: PCA + random forest (dimension reduction)
Model 3 examines the effect of **dimension reduction** on performance.
Instead of using all 50 TS features directly, we:
1. Standardize TS1–TS50 with `StandardScaler`.
2. Apply PCA with `n_components=10`.
3. Concatenate the 10 PCA components with the two sentiment features (`SentimentTitle`, `SentimentHeadline`).
4. Train the same `RandomForestRegressor` as Model 2 on this 12-dimensional feature space.
PCA results:
- The 1st component explains ≈ **93.5%** of the variance.
- The first 10 components together explain ≈ **99.9%** of the variance.
**Results (test set):**
- **R² ≈ 0.745**
- **RMSE ≈ 0.87**
Compared to Model 2:
- R² decreases only slightly (0.746 → 0.745).
- RMSE increases minimally (0.862 → 0.865).
So PCA reduces dimensionality from 50 TS features to 10 components with **negligible loss of predictive performance**. The first PCA components effectively summarize overall popularity level and growth pattern, which are the dominant signals for final popularity.
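Here `n_components=10` was fixed a priori; a common alternative (a sketch, reusing `fb_econ_merged` and `ts_cols_early` from the script below) is to pick the smallest number of components reaching a target explained-variance ratio:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(fb_econ_merged[ts_cols_early])
cum = PCA().fit(X_scaled).explained_variance_ratio_.cumsum()
n_components = int((cum < 0.999).sum()) + 1  # smallest n explaining >= 99.9%
print(n_components)
```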
---
### 2.5 Classification Model 4: Logistic regression for viral vs non-viral
**Goal.** Classify whether an article is *viral* on Facebook, defined as being in the top 10% of final popularity.
- **Target:**
- `viral_fb = 1` if `Facebook ≥ 214` (90th percentile), otherwise 0.
- Class distribution: ~10% positive, ~90% negative.
- **Features:**
- `SentimentTitle`, `SentimentHeadline`
- `DaysSinceEpoch`
- Topic dummies as before
We intentionally **do not use time-slice features** here, to simulate making a decision at or before publication, when no engagement data is available yet.
We fit a `LogisticRegression` with `max_iter=500` and `class_weight="balanced"` to counter class imbalance.
**Results (test set):**
- **Accuracy ≈ 0.73**
  - A naive classifier that always predicts "non-viral" would obtain ≈ 0.90 accuracy, highlighting that raw accuracy is misleading under imbalance.
- **F1 (viral class) ≈ 0.36**
- **ROC AUC ≈ 0.75**
The ROC AUC of 0.75 indicates decent **ranking ability**: the model tends to assign higher probabilities to truly viral articles than to non-viral ones. However, at the default 0.5 threshold it generates many false positives; in practice the probability threshold would need tuning, depending on the trade-off between missing viral content and wasting attention on non-viral items.
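One way to tune the threshold is to sweep the precision-recall curve and, for example, maximize F1 (a sketch; `y_test` and `y_proba` as in `run_model_4` below, and maximizing F1 is just one possible criterion):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

prec, rec, thr = precision_recall_curve(y_test, y_proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = int(np.argmax(f1[:-1]))  # last (precision, recall) point has no threshold
print(f"threshold ~ {thr[best]:.2f}, F1 ~ {f1[best]:.2f}")
y_pred_tuned = (y_proba >= thr[best]).astype(int)
```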
---
### 2.6 Clustering Model 5: K-means on time-series shapes
To understand typical growth trajectories of popularity, we cluster early time-series patterns.
- **Features:** TS1–TS50, standardized with `StandardScaler`.
- **Sample:** random subset of 5,000 economy+Facebook articles to keep computation manageable.
- **Algorithm:** `KMeans(n_clusters=3, n_init=10, random_state=42)`.
**Results:**
- **Silhouette score ≈ 0.97**, indicating well-separated clusters (although this is partly an artifact of one very large cluster versus a few small ones).
- Cluster sizes and mean final Facebook shares:
| cluster | count | mean shares | median | max |
| ------: | ----: | ----------: | -----: | ----: |
| 0 | 4,978 | ~37 | 3 | 7,045 |
| 1 | 1 | 1,886 | 1,886 | 1,886 |
| 2 | 21 | ~2,478 | 1,291 | 8,010 |
Inspecting the centroid time series (TS1, TS10, TS25, TS50):
- **Cluster 0:** low TS1 (~0.3), slow growth, TS50 ≈ 17 → the "normal/low popularity" baseline; almost all articles.
- **Cluster 2:** TS1 ≈ 23, TS10 ≈ 211, TS50 ≈ 1,388 → early rapid take-off and sustained growth; these are clearly **viral** trajectories.
- **Cluster 1:** a single extreme **super-viral** outlier whose trajectory is essentially flat at TS1 ≈ TS50 ≈ 1,886.
Clustering therefore uncovers distinct popularity regimes: ordinary stories, viral stories, and rare super-viral events.
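Given that the high silhouette is partly an artifact of cluster-size imbalance, a quick scan over k is a cheap sanity check (a sketch; `X_sample` as in `run_model_5` below):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sample)
    print(k, round(silhouette_score(X_sample, labels), 3))
```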
---
## 3. Decisions and Practical Use
### 3.1 What do the models tell us?
**1. Static metadata is not enough for precise prediction.**
Model 1, using only topic, time, and sentiment, explains only about 16% of the variance in log-Facebook popularity. The EDA already indicated weak correlations between sentiment and engagement, and the model confirms that topic is the only strong static predictor. This means:
- Before any user feedback is observed, we can form only a rough guess about popularity (e.g., “obama stories tend to do better”), but detailed predictions are unreliable.
**2. Early engagement is the key signal.**
Models 2 and 3 show that once ~16–17 hours of Facebook feedback are available:
- Random forests can explain ~75% of the variance in final log-popularity.
- PCA compresses the 50-dimensional TS inputs to 10 components with essentially no loss in performance.
In practice, this means that **monitoring the early time series of shares is crucial**. Stories that are already accumulating shares quickly by TS50 are extremely likely to end up among the most popular items after two days.
**3. Logistic regression is useful for ranking, not for definitive labels.**
The viral vs non-viral classifier has:
- Good ranking ability (ROC AUC ~0.75).
- Moderate F1 score and relatively low accuracy compared to the majority baseline.
This makes it better suited as a **priority score** than as a hard decision rule. For example, an editorial team might sort draft stories by predicted viral probability to decide where to invest additional editorial resources, but should not automatically discard stories predicted to be nonviral.
**4. Clustering uncovers growth archetypes.**
K-means reveals three typical growth shapes:
1. Slow/low growth (most items).
2. Clearly viral trajectories.
3. A tiny number of super-viral events.
Recognizing that an article's early TS pattern matches the viral or super-viral cluster can trigger decisions such as:
- Featuring the article more prominently on the homepage.
- Allocating budget for promoted posts.
- Producing follow-up content while interest is high.
### 3.2 How useful are these models for real decisions?
A practical decision workflow informed by this analysis could be:
1. **Pre-publication / immediately at publication**
   Use the logistic regression model and static features (topic, sentiment, time) to assign each new article a baseline probability of becoming viral. This can help prioritize which stories to monitor more closely, but should not be the sole basis for publication decisions.
2. **Early post-publication (first few hours)**
   Once some time-slice information is available (TS1–TS10), use clustering to see whether the article's early trajectory resembles known viral patterns. Articles already in the viral cluster are good candidates for early promotion.
3. **Mid-window (around TS50)**
   At ~16–17 hours, feed TS1–TS50 into the PCA + random forest regressor (Model 3) to estimate final reach. This estimate can guide decisions about:
   - How long to keep the story on front pages.
   - Whether to schedule follow-ups or derivative content.
   - Where to allocate marketing/promotional resources.
4. **Limitations**
   - Popularity is still highly stochastic; even with R² ≈ 0.75 in the best case, there is considerable residual uncertainty.
   - Models trained on this dataset cover four specific topics and a particular time period (2015–2016). Performance may degrade when applied to different domains, languages, or time spans. ([arXiv][1])
Overall, these models are best used for **relative ranking and triage**, helping decide which articles deserve extra attention, rather than for exact point predictions of future share counts. Combining static features, early engagement signals, and growth-pattern clustering yields a practical decision-support tool for newsrooms and social media teams working with limited resources.
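As a concrete (and deliberately simplistic) illustration of the triage idea, the three model outputs could be folded into a single priority score; the function, weights, and inputs below are hypothetical, not part of the analysis code:

```python
def triage_score(p_viral: float, in_viral_cluster: bool, predicted_log_shares: float) -> float:
    """Hypothetical priority score combining the three model outputs (illustrative weights)."""
    score = p_viral                       # step 1: baseline viral probability (Model 4)
    if in_viral_cluster:                  # step 2: early trajectory matches the viral cluster (Model 5)
        score += 0.5
    score += 0.1 * predicted_log_shares  # step 3: mid-window reach estimate (Model 3)
    return score

# rank candidate stories by score and review the top of the list
print(triage_score(p_viral=0.4, in_viral_cluster=True, predicted_log_shares=6.2))
```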
If you actually read this far...nice! :D
[1]: https://arxiv.org/abs/1801.07055 "Multi-Source Social Feedback of Online News Feeds"
---
*Analysis script (new file, 363 lines):*

```python
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
r2_score,
root_mean_squared_error,
accuracy_score,
f1_score,
roc_auc_score,
confusion_matrix,
silhouette_score,
)
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
# ensure imgs dir exists
os.makedirs("imgs", exist_ok=True)
# data loading
zip_path = "news+popularity+in+multiple+social+media+platforms.zip"
with zipfile.ZipFile(zip_path, "r") as zf:
with zf.open("Data/News_Final.csv") as f:
news = pd.read_csv(f)
# basic cleaning
pop_cols = ["Facebook", "GooglePlus", "LinkedIn"]
# encode -1 as missing
for col in pop_cols:
news.loc[news[col] < 0, col] = np.nan
# convert publishdate and add numeric time feature
news["PublishDate"] = pd.to_datetime(news["PublishDate"])
news["DaysSinceEpoch"] = (
news["PublishDate"] - pd.Timestamp("1970-01-01")
).dt.days
# log transform facebook popularity where available
news["log_Facebook"] = np.log1p(news["Facebook"])
# eda helpers (optional plotting)
def plot_eda():
plt.figure()
vals = news["Facebook"].dropna()
vals = vals[vals > 0]
vals.plot.hist(bins=50)
plt.xlabel("facebook shares")
plt.ylabel("count")
plt.title("distribution of facebook popularity")
plt.xscale("log")
plt.tight_layout()
plt.savefig("imgs/eda_facebook_hist.png")
plt.close()
plt.figure()
news["log_Facebook"].dropna().plot.hist(bins=50)
plt.xlabel("log1p(facebook shares)")
plt.ylabel("count")
plt.title("distribution of log-transformed facebook popularity")
plt.tight_layout()
plt.savefig("imgs/eda_log_facebook_hist.png")
plt.close()
mean_by_topic = (
news.groupby("Topic")["log_Facebook"].mean().sort_values()
)
plt.figure()
mean_by_topic.plot(kind="bar")
plt.ylabel("mean log1p(facebook shares)")
plt.title("average facebook popularity by topic")
plt.tight_layout()
plt.savefig("imgs/eda_mean_by_topic.png")
plt.close()
sample = news.dropna(
subset=["log_Facebook", "SentimentTitle"]
).sample(5000, random_state=42)
plt.figure()
plt.scatter(
sample["SentimentTitle"],
sample["log_Facebook"],
alpha=0.3,
)
plt.xlabel("sentimenttitle")
plt.ylabel("log1p(facebook shares)")
plt.title("title sentiment vs facebook popularity (sample)")
plt.tight_layout()
plt.savefig("imgs/eda_sentiment_vs_popularity.png")
plt.close()
# model 1: linear regression
def run_model_1():
df = news.dropna(subset=["log_Facebook"]).copy()
X = df[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]]
X = pd.get_dummies(X, columns=["Topic"], drop_first=True)
y = df["log_Facebook"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
print("model 1 linear regression")
print("r2:", r2)
print("rmse:", rmse)
print("coefficients:")
print(pd.Series(linreg.coef_, index=X.columns))
# optional diagnostic plot
plt.figure()
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel("actual log1p(facebook)")
plt.ylabel("predicted log1p(facebook)")
plt.title("model 1: actual vs predicted")
plt.tight_layout()
plt.savefig("imgs/model1_actual_vs_predicted.png")
plt.close()
return linreg, (X_test, y_test, y_pred)
# prepare economy + facebook time-slice data
with zipfile.ZipFile(zip_path, "r") as zf:
with zf.open("Data/Facebook_Economy.csv") as f:
fb_econ = pd.read_csv(f)
# ensure integer id on the news side for the join
news_econ = news[news["Topic"] == "economy"].copy()
news_econ["IDLink_int"] = news_econ["IDLink"].astype(int)
fb_econ_merged = fb_econ.merge(
news_econ, left_on="IDLink", right_on="IDLink_int", how="inner"
)
# clean time-slice features
ts_cols = [c for c in fb_econ.columns if c.startswith("TS")]
for col in ts_cols:
fb_econ_merged.loc[fb_econ_merged[col] < 0, col] = 0
# drop rows with missing facebook target
fb_econ_merged = fb_econ_merged[fb_econ_merged["Facebook"].notna()].copy()
fb_econ_merged["log_Facebook"] = np.log1p(fb_econ_merged["Facebook"])
ts_cols_early = ts_cols[:50]
# model 2: random forest on raw early ts
def run_model_2():
    X = fb_econ_merged[ts_cols_early + ["SentimentTitle", "SentimentHeadline"]]
    y = fb_econ_merged["log_Facebook"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    rf = RandomForestRegressor(
        n_estimators=120,
        random_state=42,
        n_jobs=-1,
        max_depth=None,
        min_samples_leaf=2,
    )
    rf.fit(X_train, y_train)
    # evaluate the raw-feature forest itself; the PCA variant is model 3
    y_pred = rf.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)
    print("model 2 random forest on raw ts")
    print("r2:", r2)
    print("rmse:", rmse)
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    print("top importances:")
    print(importances.sort_values(ascending=False).head(10))
    return rf, (X_test, y_test, y_pred)
# model 3: pca + random forest
def run_model_3():
ts = fb_econ_merged[ts_cols_early]
sent = fb_econ_merged[["SentimentTitle", "SentimentHeadline"]]
X = pd.concat([ts, sent], axis=1)
y = fb_econ_merged["log_Facebook"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[ts_cols_early])
X_test_scaled = scaler.transform(X_test[ts_cols_early])
pca = PCA(n_components=10, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
train_sent = X_train[["SentimentTitle", "SentimentHeadline"]].values
test_sent = X_test[["SentimentTitle", "SentimentHeadline"]].values
X_train_final = np.hstack([X_train_pca, train_sent])
X_test_final = np.hstack([X_test_pca, test_sent])
rf = RandomForestRegressor(
n_estimators=120,
random_state=42,
n_jobs=-1,
max_depth=None,
min_samples_leaf=2,
)
rf.fit(X_train_final, y_train)
y_pred = rf.predict(X_test_final)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
print("model 3 random forest on pca(ts)")
print("r2:", r2)
print("rmse:", rmse)
print("pca variance explained (first 10):", pca.explained_variance_ratio_)
print("total variance explained:", pca.explained_variance_ratio_.sum())
return rf, (X_test, y_test, y_pred), (pca, scaler)
# model 4: logistic regression (viral vs non-viral)
def run_model_4():
df = news.copy()
df = df[df["Facebook"].notna()].copy()
threshold = df["Facebook"].quantile(0.9)
df["viral_fb"] = (df["Facebook"] >= threshold).astype(int)
X = df[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]]
X = pd.get_dummies(X, columns=["Topic"], drop_first=True)
y = df["viral_fb"]
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
stratify=y,
)
clf = LogisticRegression(
max_iter=500,
class_weight="balanced",
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)
cm = confusion_matrix(y_test, y_pred)
print("model 4 logistic regression (viral vs non-viral)")
print("threshold (shares):", threshold)
print("accuracy:", acc)
print("f1 (positive class):", f1)
print("roc auc:", auc)
print("confusion matrix:\n", cm)
return clf, (X_test, y_test, y_pred, y_proba)
# model 5: k-means clustering on ts shapes
def run_model_5():
X = fb_econ_merged[ts_cols_early].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
rng = np.random.RandomState(42)
idx = rng.choice(X_scaled.shape[0], size=5000, replace=False)
X_sample = X_scaled[idx]
fb_sample = fb_econ_merged["Facebook"].values[idx]
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_sample)
labels = kmeans.labels_
sil = silhouette_score(X_sample, labels)
print("model 5 kmeans on ts shapes")
print("silhouette score:", sil)
cluster_df = pd.DataFrame(
{"cluster": labels, "Facebook": fb_sample}
)
print(cluster_df.groupby("cluster")["Facebook"].agg(
["count", "mean", "median", "max"]
))
centers_scaled = kmeans.cluster_centers_
centers = scaler.inverse_transform(centers_scaled)
centers_df = pd.DataFrame(centers, columns=ts_cols_early)
summary = pd.DataFrame({
"cluster": list(range(centers_df.shape[0])),
"avg_ts": centers_df.mean(axis=1),
"ts1": centers_df["TS1"],
"ts10": centers_df["TS10"],
"ts25": centers_df["TS25"],
"ts50": centers_df["TS50"],
})
print("cluster centroid summary:\n", summary)
return kmeans, scaler, summary
if __name__ == "__main__":
run_model_1()
run_model_2()
run_model_3()
run_model_4()
run_model_5()
plot_eda()
```

*Script output (new file, 53 lines):*

```text
model 1 linear regression
r2: 0.1566089012155698
rmse: 1.8625218879551908
coefficients:
SentimentTitle -0.383499
SentimentHeadline -0.064708
DaysSinceEpoch -0.000678
Topic_microsoft 0.101848
Topic_obama 1.779152
Topic_palestine 0.023738
dtype: float64
model 2 random forest on raw ts
r2: 0.7441325592979975
rmse: 0.8661035218490399
top importances:
TS50 0.810814
SentimentHeadline 0.099992
SentimentTitle 0.067386
TS49 0.001883
TS48 0.000589
TS15 0.000503
TS18 0.000503
TS13 0.000498
TS24 0.000498
TS10 0.000480
dtype: float64
model 3 random forest on pca(ts)
r2: 0.7442278904925559
rmse: 0.8659421602173341
pca variance explained (first 10): [9.38529911e-01 3.24317512e-02 1.76049987e-02 7.50439628e-03
1.90148973e-03 6.83679307e-04 3.57135169e-04 2.12058930e-04
1.33577763e-04 9.66846072e-05]
total variance explained: 0.9994556829781833
model 4 logistic regression (viral vs non-viral)
threshold (shares): 214.0
accuracy: 0.7287481626653601
f1 (positive class): 0.35709101466105386
roc auc: 0.7530964866530827
confusion matrix:
[[10669 4023]
[ 406 1230]]
model 5 kmeans on ts shapes
silhouette score: 0.9732852082508215
count mean median max
cluster
0 4978 36.751708 3.0 7045.0
1 1 1886.000000 1886.0 1886.0
2 21 2477.761905 1291.0 8010.0
cluster centroid summary:
cluster avg_ts ts1 ts10 ts25 ts50
0 0 8.317766 0.297710 2.959221 7.836079 17.221977
1 1 1885.920000 1885.000000 1886.000000 1886.000000 1886.000000
2 2 640.917143 22.761905 211.142857 579.047619 1387.619048
```

*Assignment brief (new file, 25 lines):*
Conduct the following analysis for the dataset:
1. Exploratory Data Analysis
Explore the statistical aspects of the dataset. Analyze the
distributions and provide summaries of the relevant statistics. Perform any cleaning,
transformations, interpolations, smoothing, outlier detection/removal, etc. required on the
data. Include figures and descriptions of this exploration and a short description of what
you concluded (e.g. nature of distribution, indication of suitable model approaches you
would try, etc.). Min. 1 page text + graphics (required).
2. Model Development, Validation and Optimization
Develop and evaluate three (4000-level) or four (6000-level) or more models. If possible,
these models should cover more than one objective, i.e. regression, classification,
clustering. Consider the effect of dimension reduction of the dataset on model
performance. Different models means different combinations of an algorithm and a
formula (input and output features). The choice of independent and response variables is
up to you. Explain why you chose them. Construct the models, test/validate them. Briefly explain the
validation approach. You can use any method(s) covered in the course. Include your code
in your submission. Compare model results if applicable. Report the results of the models
(fits, coefficients, sample trees, other measures of fit/importance, etc., predictors and
summary statistics). Min. 2 pages of text + graphics (required).
3. Decisions
Describe your conclusions from the model
fits and predictions, and how well (or not) they could be used for decisions and why. Min. 1/2 page
of text + graphics.