
News Popularity in Multiple Social Media Platforms

This project analyzes the News Popularity in Multiple Social Media Platforms dataset from the UCI Machine Learning Repository. The data contains ~93k news items collected between November 2015 and July 2016, with their final popularity on Facebook, Google+ and LinkedIn across four topics: economy, microsoft, obama and palestine.


1. Exploratory Data Analysis

1.1 Data overview and cleaning

We work primarily with Data/News_Final.csv, which has 93,239 rows and 11 variables:

  • IDLink: numeric ID of the article
  • Title, Headline: short text fields
  • Source: news outlet that originally published the story
  • Topic: one of {economy, microsoft, obama, palestine}
  • PublishDate: publication timestamp
  • SentimentTitle, SentimentHeadline: numeric sentiment scores derived from the title and headline text
  • Facebook, GooglePlus, LinkedIn: final popularity on each social media platform

According to the dataset documentation, -1 in the popularity variables indicates that no final popularity value was observed. In the code, any value < 0 in Facebook, GooglePlus, or LinkedIn is therefore replaced with NaN. Missing popularity values are later dropped on a per-model basis. PublishDate is converted to a proper timestamp, and a numeric time feature

DaysSinceEpoch = days since 1970-01-01

is created to allow inclusion of temporal trends in the models. We also log-transform Facebook popularity:

log_Facebook = log1p(Facebook)

which is used as the target for regression models.
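
A minimal pandas sketch of this cleaning and feature construction (column names follow the dataset; the exact code in the repository may differ in details):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("Data/News_Final.csv")

# Negative popularity (-1) means "no final value observed" -> treat as missing
for col in ["Facebook", "GooglePlus", "LinkedIn"]:
    df[col] = df[col].where(df[col] >= 0, np.nan)

# Proper timestamp plus a numeric time feature for the models
df["PublishDate"] = pd.to_datetime(df["PublishDate"], errors="coerce")
df["DaysSinceEpoch"] = (df["PublishDate"] - pd.Timestamp("1970-01-01")).dt.days

# Log-transformed regression target; log1p keeps zero-share articles defined
df["log_Facebook"] = np.log1p(df["Facebook"])
```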


1.2 Popularity distributions

A histogram of Facebook share counts on a logarithmic x-axis, after removing missing and zero values:

Figure 1: Distribution of Facebook popularity on a log x-axis.

The distribution is extremely right-skewed:

  • Most articles receive very few shares.
  • A small number of “viral” articles receive thousands of shares.

On the cleaned data, summary statistics for Facebook shares are approximately:

  • median ≈ 8
  • mean ≈ 129
  • 90th percentile ≈ 214
  • 99th percentile ≈ 2,322
  • max = 49,211
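
These figures can be reproduced directly from the cleaned column (a quick sketch reusing df from the cleaning step above):

```python
fb = df["Facebook"].dropna()
print(fb.quantile([0.50, 0.90, 0.99]))  # median, 90th, 99th percentiles
print(fb.mean(), fb.max())
```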

Google+ and LinkedIn exhibit similar heavy-tailed patterns (with smaller absolute scales), which matches the dataset creators' description (arXiv).

The distribution of log1p(Facebook):

Figure 2: Distribution of log-transformed Facebook popularity.

The log transform compresses the heavy tail and produces a more regular, unimodal distribution. This justifies using log1p(popularity) as the regression target: it reduces the influence of rare extreme outliers while keeping them in the data, which is important because viral stories are the phenomena of interest.


1.3 Topic effects

The four topics are not equally represented:

  • economy: 33,928 items
  • obama: 28,610
  • microsoft: 21,858
  • palestine: 8,843

The mean log-Facebook popularity by topic:

Figure 3: Mean log-Facebook popularity by topic.

Key observations:

  • obama stories clearly have the highest average popularity.
  • microsoft is slightly above economy and palestine.
  • In original share counts, obama articles average roughly an order of magnitude more shares than economy/microsoft/palestine stories, but all topics remain strongly skewed.

This suggests that topic is an important categorical predictor for popularity, and motivates including it as a one-hot encoded feature in the models.


1.4 Sentiment and popularity

Sentiment scores from the title and headline are continuous values roughly in the interval [-1, 1]. Their empirical distributions are centered very close to 0 with standard deviations around 0.14, indicating that most titles and headlines are only mildly positive or negative.

A 5,000-row sample of SentimentTitle vs log_Facebook:

Figure 4: Scatter of title sentiment vs log-Facebook popularity (sample of 5,000 articles).

The scatter plot shows:

  • A dense vertical band near sentiment 0, reflecting many neutral titles.
  • Viral and nonviral articles scattered across the full sentiment range, with no obvious linear trend.

Empirically, the correlation between sentiment and Facebook popularity is almost zero (|r| ≈ 0.01). This suggests that sentiment alone is a weak predictor of popularity; we still include it in models because it may interact with topic or time, but we do not expect it to explain much variance by itself.


1.5 EDA conclusions

From the exploratory analysis we conclude:

  1. Popularity variables are non-negative, highly skewed, and heavy-tailed.

    • Log-transforming shares yields more regular distributions, so regression models should target log1p(popularity) instead of raw counts.
  2. Topic has a strong effect on expected popularity.

    • In particular, obama-related news is more popular on Facebook; microsoft is relatively stronger on LinkedIn (from descriptive statistics, not shown here).
  3. Title/headline sentiment has little linear relationship with popularity.

    • It should not be expected to drive predictions strongly.
  4. There are many extreme outliers (viral stories), but these are the signal we care about.

    • We choose not to remove them; instead, we rely on robust models and log-transformed targets.

These observations motivate a modeling strategy that combines:

  • Linear models (to quantify simple topic/sentiment effects on log-popularity).
  • Non-linear tree-based models (to capture complex relationships and heavy-tailed behaviour).
  • Classification of viral vs non-viral stories.
  • Clustering of time-series trajectories to identify typical growth patterns.

The next section formalizes these ideas.


2. Model Development, Validation and Optimization

We develop five models: three regression models (including a dimension-reduced variant), one classification model, and one clustering model. This covers regression, classification and unsupervised learning objectives, and explicitly examines the impact of dimensionality reduction.

All supervised models use:

  • Train/test split: 80% training, 20% test, random_state=42.

  • Evaluation on the held-out test set only (no peeking).

  • Metrics:

    • Regression: R² and RMSE on the log scale (using root_mean_squared_error).
    • Classification: accuracy, F1 for the positive class, ROC AUC and confusion matrix.

2.1 Common preprocessing

For each model:

  1. Replace -1 in Facebook, GooglePlus, LinkedIn with NaN.
  2. Drop rows with missing values in the specific target variable.
  3. Use DaysSinceEpoch as a numeric representation of PublishDate.
  4. Where appropriate, use log_Facebook = log1p(Facebook) as the regression target.
  5. Encode Topic using one-hot encoding with economy as the reference level (drop_first=True).

For time-series models we also use Data/Facebook_Economy.csv, which stores Facebook popularity snapshots TS1–TS144 every 20 minutes for economy articles. We join it with News_Final.csv on IDLink and restrict to:

  • Topic == "economy"
  • Time slices TS1–TS50 as predictors (roughly the first 16–17 hours)
  • Final log-Facebook popularity as the target

Negative TS values are interpreted as “no observed popularity yet” and are set to 0.
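
A sketch of the join and time-slice cleaning described above, assuming the snapshot columns are literally named TS1 ... TS144 as in the dataset:

```python
ts_cols = [f"TS{i}" for i in range(1, 51)]  # TS1-TS50, roughly the first 16-17 hours

econ_ts = pd.read_csv("Data/Facebook_Economy.csv")
econ = df[df["Topic"] == "economy"].merge(econ_ts, on="IDLink", how="inner")

# Negative snapshots mean "no observed popularity yet"
econ[ts_cols] = econ[ts_cols].clip(lower=0)
econ = econ.dropna(subset=["log_Facebook"])
```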


2.2 Regression Model 1: Linear regression on static features

Goal. Predict log-Facebook popularity using only static metadata (no early popularity feedback).

  • Target: y = log_Facebook for all topics.

  • Features:

    • SentimentTitle, SentimentHeadline
    • DaysSinceEpoch (publication time)
    • Topic one-hot dummies: Topic_microsoft, Topic_obama, Topic_palestine (economy is the implicit baseline).

We fit an ordinary least squares linear regression on the training split and evaluate on the test set.
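
A minimal version of Model 1 (one-hot encoding and split as described; anything not stated in the text is left at scikit-learn defaults):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, root_mean_squared_error
from sklearn.model_selection import train_test_split

data = df.dropna(subset=["log_Facebook"])
X = pd.get_dummies(
    data[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]],
    columns=["Topic"], drop_first=True)  # drops Topic_economy -> baseline level
y = data["log_Facebook"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
lin = LinearRegression().fit(X_tr, y_tr)
pred = lin.predict(X_te)
print(r2_score(y_te, pred), root_mean_squared_error(y_te, pred))
```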

Results (test set):

  • R² ≈ 0.157
  • RMSE ≈ 1.86 in log space

Actual vs predicted log-Facebook values:

Figure 5: Model 1 predictions vs actual log-Facebook values.

The predictions are compressed into a narrow band, underpredicting viral articles and overpredicting low-popularity ones. Key coefficients:

  • Topic_obama ≈ +1.78 (large positive shift vs economy)
  • Topic_microsoft ≈ +0.10
  • Topic_palestine ≈ +0.02
  • SentimentTitle ≈ -0.38, SentimentHeadline ≈ -0.06
  • DaysSinceEpoch ≈ -0.0007 (tiny downward trend over time)

Interpretation:

  • Topic has a clear effect (especially obama).
  • Sentiment effects are small and slightly negative.
  • The model explains only ~16% of the variance in log-popularity, confirming that static features alone are weak predictors.

2.3 Regression Model 2: Random forest on early time slices

Goal. Predict final log-Facebook popularity for economy stories using early Facebook popularity time slices and sentiment.

  • Target: log_Facebook for the economy topic, joined with the Facebook_Economy time series.

  • Features:

    • TS1–TS50 (early cumulative popularity counts, cleaned: negative → 0)
    • SentimentTitle, SentimentHeadline

We fit a RandomForestRegressor with:

  • 120 estimators,
  • min_samples_leaf=2,
  • max_depth=None (trees grow fully),
  • n_jobs=-1, random_state=42.
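
The corresponding fit, as a sketch (econ and ts_cols come from the preprocessing sketch above):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

feat_cols = ts_cols + ["SentimentTitle", "SentimentHeadline"]
X_tr, X_te, y_tr, y_te = train_test_split(
    econ[feat_cols], econ["log_Facebook"], test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=120, min_samples_leaf=2, max_depth=None,
                           n_jobs=-1, random_state=42).fit(X_tr, y_tr)
print(r2_score(y_te, rf.predict(X_te)))

# Per-feature importances, e.g. to check how dominant TS50 is
imp = pd.Series(rf.feature_importances_, index=feat_cols)
print(imp.sort_values(ascending=False).head())
```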

Results (test set):

  • R² ≈ 0.746
  • RMSE ≈ 0.86 (log scale)

Feature importances indicate:

  • TS50 alone contributes ~81% of total importance.
  • Combined sentiment variables contribute ~17%.
  • Earlier TS features each have very small marginal importance.

Thus, knowing an article's popularity after ~17 hours (TS50) is already highly predictive of its final 2-day popularity. Early engagement is a much stronger signal than sentiment or publish time.


2.4 Regression Model 3: PCA + random forest (dimension reduction)

Model 3 examines the effect of dimension reduction on performance.

Instead of using all 50 TS features directly, we:

  1. Standardize TS1–TS50 with StandardScaler.
  2. Apply PCA with n_components=10.
  3. Concatenate the 10 PCA components with the two sentiment features (SentimentTitle, SentimentHeadline).
  4. Train the same RandomForestRegressor as Model 2 on this 12-dimensional feature space.
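
A sketch of this pipeline (one reasonable implementation, reusing the Model 2 split; here the scaler and PCA are fit on the training split only to avoid leaking test information):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_tr[ts_cols])
pca = PCA(n_components=10, random_state=42).fit(scaler.transform(X_tr[ts_cols]))
print(pca.explained_variance_ratio_)  # the leading component dominates

def make_features(X):
    # 10 PCA components + the two raw sentiment columns -> 12 features
    comps = pca.transform(scaler.transform(X[ts_cols]))
    return np.hstack([comps, X[["SentimentTitle", "SentimentHeadline"]].to_numpy()])

rf_pca = RandomForestRegressor(n_estimators=120, min_samples_leaf=2,
                               n_jobs=-1, random_state=42).fit(make_features(X_tr), y_tr)
print(r2_score(y_te, rf_pca.predict(make_features(X_te))))
```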

PCA results:

  • The first component explains ≈ 93.5% of the variance.
  • The first 10 components together explain ≈ 99.9% of the variance.

Results (test set):

  • R² ≈ 0.745
  • RMSE ≈ 0.87

Compared to Model 2:

  • R² decreases only slightly (0.746 → 0.745).
  • RMSE increases minimally (0.862 → 0.865).

So PCA reduces dimensionality from 50 TS features to 10 components with negligible loss of predictive performance. The first PCA components effectively summarize overall popularity level and growth pattern, which are the dominant signals for final popularity.


2.5 Classification Model 4: Logistic regression for viral vs non-viral

Goal. Classify whether an article is viral on Facebook, defined as being in the top 10% of final popularity.

  • Target:

    • viral_fb = 1 if Facebook ≥ 214 (90th percentile), otherwise 0.
    • Class distribution: ~10% positive, ~90% negative.
  • Features:

    • SentimentTitle, SentimentHeadline
    • DaysSinceEpoch
    • Topic dummies as before

We intentionally do not use time-slice features here, to simulate making a decision at or before publication, when no engagement data is available yet.

We fit a LogisticRegression with max_iter=500 and class_weight="balanced" to counter class imbalance.
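
A sketch of the classifier (the 90th-percentile cutoff is computed from the data rather than hard-coding 214; df and the train/test utilities come from the sketches above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

static = df.dropna(subset=["Facebook"]).copy()
static["viral_fb"] = (static["Facebook"] >= static["Facebook"].quantile(0.90)).astype(int)

Xc = pd.get_dummies(
    static[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]],
    columns=["Topic"], drop_first=True)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    Xc, static["viral_fb"], test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=500, class_weight="balanced").fit(Xc_tr, yc_tr)
proba = clf.predict_proba(Xc_te)[:, 1]
print(f1_score(yc_te, clf.predict(Xc_te)), roc_auc_score(yc_te, proba))
```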

Results (test set):

  • Accuracy ≈ 0.73

    • A naive classifier that always predicts “non-viral” would obtain ≈ 0.90 accuracy, highlighting that raw accuracy is misleading under imbalance.
  • F1 (viral class) ≈ 0.36

  • ROC AUC ≈ 0.75

The ROC AUC of 0.75 indicates decent ranking ability: the model tends to assign higher probabilities to truly viral articles than to non-viral ones. However, at the default 0.5 threshold it generates many false positives; tuning the probability threshold would be necessary in practice, depending on the business trade-off between missing viral content and wasting attention on non-viral items.
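
One simple way to pick a better operating point than 0.5, not part of the original analysis, is to sweep thresholds on held-out data (ideally a separate validation split rather than the final test set) and keep the one maximizing F1:

```python
from sklearn.metrics import precision_recall_curve

prec, rec, thr = precision_recall_curve(yc_te, proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)  # avoid divide-by-zero
best_threshold = thr[f1[:-1].argmax()]  # the final (prec, rec) point has no threshold
print(best_threshold)
```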


2.6 Clustering Model 5: K-means on time-series shapes

To understand typical growth trajectories of popularity, we cluster early time-series patterns.

  • Features: TS1–TS50, standardized with StandardScaler.
  • Sample: random subset of 5,000 economy+Facebook articles to keep computation manageable.
  • Algorithm: KMeans(n_clusters=3, n_init=10, random_state=42).
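
A sketch of the clustering step (sampling and hyperparameters as listed above; econ and ts_cols come from the earlier preprocessing sketch):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

sample = econ.sample(n=5000, random_state=42)
Z = StandardScaler().fit_transform(sample[ts_cols])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(Z)
print(silhouette_score(Z, km.labels_))

# Size and final-popularity profile of each cluster
print(sample.groupby(km.labels_)["Facebook"].agg(["size", "mean", "median", "max"]))
```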

Results:

  • Silhouette score ≈ 0.97, indicating well-separated clusters (although partly due to one large cluster vs a few small ones).
  • Cluster sizes and mean final Facebook shares:

    | cluster | count | mean shares | median | max |
    |---------|-------|-------------|--------|-------|
    | 0 | 4,978 | ~37 | 3 | 7,045 |
    | 1 | 1 | 1,886 | 1,886 | 1,886 |
    | 2 | 21 | ~2,478 | 1,291 | 8,010 |

Inspecting the centroid time series (TS1, TS10, TS25, TS50):

  • Cluster 0: low TS1 (~0.3), slow growth, TS50 ≈ 17 → the “normal/low popularity” baseline; almost all articles.
  • Cluster 2: TS1 ≈ 23, TS10 ≈ 211, TS50 ≈ 1,388 → early rapid takeoff and sustained growth; these are clearly viral trajectories.
  • Cluster 1: a single extreme super-viral outlier with TS1 ≈ TS50 ≈ 1,886, i.e. already at its final level from the first snapshot.

Clustering therefore uncovers distinct popularity regimes: ordinary stories, viral stories, and rare super-viral events.


3. Decisions and Practical Use

3.1 What do the models tell us?

1. Static metadata is not enough for precise prediction.

Model 1, using only topic, time and sentiment, explains only about 16% of the variance in log-Facebook popularity. The EDA already indicated weak correlations between sentiment and engagement, and the model confirms that topic is the only strong static predictor. This means:

  • Before any user feedback is observed, we can form only a rough guess about popularity (e.g., “obama stories tend to do better”), but detailed predictions are unreliable.

2. Early engagement is the key signal.

Models 2 and 3 show that once ~16 hours of Facebook feedback are available:

  • Random forests can explain ~75% of the variance in final log-popularity.
  • PCA compresses the 50-dimensional TS inputs to 10 components with essentially no loss in performance.

In practice, this means that monitoring the early time series of shares is crucial. Stories that are already accumulating shares quickly by TS50 are extremely likely to end up as the most popular items after two days.

3. Logistic regression is useful for ranking, not for definitive labels.

The viral vs non-viral classifier has:

  • Good ranking ability (ROC AUC ~0.75).
  • Moderate F1 score and relatively low accuracy compared to the majority baseline.

This makes it better suited as a priority score than as a hard decision rule. For example, an editorial team might sort draft stories by predicted viral probability to decide where to invest additional editorial resources, but should not automatically discard stories predicted to be non-viral.

4. Clustering uncovers growth archetypes.

K-means reveals three typical growth shapes:

  1. Slow/low growth (most items).
  2. Clearly viral trajectories.
  3. A tiny number of super-viral events.

Recognizing that an article's early TS pattern matches the viral or super-viral cluster can trigger decisions such as:

  • Featuring the article more prominently on the homepage.
  • Allocating budget for promoted posts.
  • Producing follow-up content while interest is high.

3.2 How useful are these models for real decisions?

A practical decision workflow informed by this analysis could be:

  1. Pre-publication / immediately at publication

    Use the logistic regression model and static features (topic, sentiment, time) to assign each new article a baseline probability of becoming viral. This can help prioritize which stories to monitor more closely, but should not be the sole basis for publication decisions.

  2. Early post-publication (first few hours)

    Once some time-slice information is available (TS1–TS10), use clustering to see whether the article's early trajectory resembles known viral patterns. Articles already in the viral cluster are good candidates for early promotion.

  3. Mid-window (around TS50)

    At ~16–17 hours, feed TS1–TS50 into the PCA + random forest regressor (Model 3) to estimate final reach. This estimate can guide decisions about:

    • How long to keep the story on front pages.
    • Whether to schedule follow-ups or derivative content.
    • Where to allocate marketing/promotional resources.
  4. Limitations

    • Popularity is still highly stochastic; even with R² ≈ 0.75 in the best case, there is considerable residual uncertainty.
    • Models trained on this dataset focus on four specific topics and a particular time period (2015–2016). Performance may degrade when applied to different domains, languages or time spans. (arXiv)

Overall, these models are best used for relative ranking and triage, helping decide which articles deserve extra attention, rather than for exact point predictions of future share counts. Combining static features, early engagement signals, and growth-pattern clustering yields a practical decision-support tool for newsrooms and social media teams working with limited resources.

If you actually read this far...nice! :D