Compare commits
6 Commits
2667c06e09
..
main
| Author | SHA1 | Date | |
|---|---|---|---|
| | 6afdd43fb8 | | |
| | b3eccdab43 | | |
| | a9f73e4314 | | |
| | 091831c67c | | |
| | 5581493cc0 | | |
| | 4f6434ff72 | | |
@@ -2,9 +2,7 @@
|
||||
|
||||
## Measuring How Generative AI Adoption Reshaped Stack Overflow Participation 2018–2025
|
||||
|
||||
Itamar Oren-Naftalovich
|
||||
|
||||
<!-- **Repository Artifacts:** `analysis.r`, `data/*.csv`, `imgs/*.png`, `out.log` (model console output) -->
|
||||
Itamar Oren-Naftalovich (6000-Level)
|
||||
|
||||
---
|
||||
|
||||
|
||||
Binary file not shown.
@@ -0,0 +1,171 @@
|
||||
### Slide 1 – Title
|
||||
|
||||
**Title:** Did Stack Overflow Answers Increase After ChatGPT?
|
||||
- Changes in Stack Overflow answer activity post-ChatGPT launch
|
||||
- Impact of related policy events
|
||||
- Developer behavior balancing Stack Overflow vs. AI tools
|
||||
|
||||
---
|
||||
|
||||
### Slide 2 – Research Question
|
||||
|
||||
**Research Questions:**
|
||||
1. Volume of answers:
|
||||
- Did Stack Overflow answers change systematically after ChatGPT launched (late 2022)?
|
||||
2. Policy/event impact:
|
||||
- Did AI-answer policies and moderation events create additional shifts?
|
||||
3. Substitution effect:
|
||||
- Are heavy ChatGPT users visiting/answering less on Stack Overflow?
|
||||
|
||||
**Approach:**
|
||||
- Look for structural breaks in answer time series
|
||||
- Link site-level patterns to developer survey data
|
||||
|
||||
---
|
||||
|
||||
### Slide 3 – Data Sources
|
||||
|
||||
**Dataset 1:**
|
||||
- Monthly new answer counts (2018–2025)
|
||||
- Pulled from Stack Exchange Data Explorer
|
||||
- Includes deleted posts
|
||||
- Provides pre-ChatGPT baseline and post-event window
|
||||
|
||||
**Dataset 2:**
|
||||
- Microdata from Stack Overflow Developer Surveys (2023–2025)
|
||||
- Focus:
|
||||
- Visit frequency
|
||||
- Adoption of AI tools like ChatGPT
|
||||
|
||||
**Exploratory Plots:**
|
||||
- Raw time series
|
||||
- Pre/post comparisons
|
||||
- Seasonality
|
||||
- Moving averages
|
||||
|
||||
---
|
||||
|
||||
### Slide 4 – Preliminary Patterns
|
||||
|
||||
**Key Observations:**
|
||||
- Long-run time series:
|
||||
- Downward drift in answers pre-2022
|
||||
- Sharper drop in level and slope post-ChatGPT launch
|
||||
- Pre/post comparison:
|
||||
- Post-ChatGPT period sits lower, even after accounting for seasonal dips (e.g., summer, year-end)
|
||||
- Seasonal plots:
|
||||
- 2018–2025 share consistent within-year rhythm
|
||||
- Confirms changes aren’t due to seasonality
|
||||
|
||||
---
|
||||
|
||||
### Slide 5 – Methodology
|
||||
|
||||
**Modelling Strategies:**
|
||||
1. **Interrupted Time-Series Regression (ITS):**
|
||||
- Predictors: time trend, level jump (ChatGPT launch), slope change
|
||||
- Optional indicators: policy/moderation periods
|
||||
2. **Poisson/Negative-Binomial Count Models:**
|
||||
- Predictors: same as ITS
|
||||
- Suitable for count data
|
||||
- Quantifies percentage changes per month
|
||||
3. **ARIMA Model:**
|
||||
- Trained on pre-ChatGPT data
|
||||
- Forecasts counterfactual trajectory
|
||||
- Compares observed vs. predicted post-event counts
|
||||
4. **Survey Logistic Regression:**
|
||||
- Predicts frequent Stack Overflow visits
|
||||
- Predictors: ChatGPT usage, demographics
|
||||
|
||||
**Diagnostics:**
|
||||
- Residual checks
|
||||
- Over-dispersion
|
||||
- Out-of-sample performance
|
||||
|
||||
---
|
||||
|
||||
### Slide 6 – Model Fits & Counterfactuals
|
||||
|
||||
**Findings:**
|
||||
- **Interrupted Time-Series Regression:**
|
||||
- Downward level shift post-2022
|
||||
- Steeper negative slope post-ChatGPT
|
||||
- Controls for pre-existing trend
|
||||
- **Poisson Model:**
|
||||
- Pre-ChatGPT: mild monthly contraction
|
||||
- Post-ChatGPT: steeper decline (compounds over time)
|
||||
- **ARIMA Forecast:**
|
||||
- Trained on pre-ChatGPT data
|
||||
- Post-2022 counts fall below 80% prediction interval
|
||||
- Observed counts never recover
|
||||
|
||||
**Takeaway:**
|
||||
- Structural break in answer supply post-ChatGPT and policy changes
|
||||
- Changes not explained by trend/seasonality alone
|
||||
|
||||
---
|
||||
|
||||
### Slide 7 – Survey Results
|
||||
|
||||
**Key Insights:**
|
||||
- **ChatGPT Adoption (2023):**
|
||||
- Widespread among developers, especially heavy coders
|
||||
- Daily use common in workflows
|
||||
- **Visit Frequency (2023–2024):**
|
||||
- 2023: Heavy ChatGPT users visit Stack Overflow daily at rates similar to non-users
|
||||
- 2024: Frequent visits drop more for heavy ChatGPT users
|
||||
- **Logistic Regression:**
|
||||
- ChatGPT usage alone: weak predictor of visit frequency (accuracy in the low 50% range)
|
||||
- Combined with cross-tabs: supports partial substitution (marginal questions shifted to ChatGPT)
|
||||
|
||||
---
|
||||
|
||||
### Slide 8 – Key Findings
|
||||
|
||||
**Summary:**
|
||||
- Monthly answers on Stack Overflow:
|
||||
- Sharp drop post-ChatGPT release
|
||||
- Continued lower trend (even after controlling for pre-existing decline)
|
||||
- Policy/moderation events:
|
||||
- Additional dips align with governance decisions
|
||||
- Suggest amplification of ChatGPT effect
|
||||
- ARIMA counterfactuals:
|
||||
- Post-2022 counts outside expected range of pre-ChatGPT dynamics
|
||||
- Substitution effect:
|
||||
- Heavy ChatGPT users less likely to visit Stack Overflow daily over time
|
||||
|
||||
---
|
||||
|
||||
### Slide 9 – Limitations
|
||||
|
||||
**Caveats:**
|
||||
1. **Causality:**
|
||||
- Overlap of ChatGPT, AI policies, moderation strike
|
||||
- Broader economic/tooling trends also in play
|
||||
2. **SEDE Data:**
|
||||
- Doesn’t capture moderation queues/private spaces
|
||||
- Some activity may be invisible
|
||||
3. **Survey Data:**
|
||||
- Self-reported
|
||||
- May under-represent active answerers or certain regions/roles
|
||||
|
||||
**Interpretation:**
|
||||
- Results are **correlational evidence** of shifts in answer supply/usage patterns
|
||||
- Not a precise causal estimate of “ChatGPT effect”
|
||||
|
||||
---
|
||||
|
||||
### Slide 10 – Implications & Future Work
|
||||
|
||||
**Implications:**
|
||||
- Answer supply sensitive to:
|
||||
- Assistance tooling
|
||||
- Governance decisions
|
||||
- Platforms should:
|
||||
- Carefully consider AI policies/moderation capacity
|
||||
- Explore integration with conversational assistants (e.g., structured answer APIs)
|
||||
|
||||
**Future Work:**
|
||||
- Tag-level/user-cohort analyses
|
||||
- Stronger quasi-experimental designs (e.g., synthetic controls)
|
||||
|
||||
@@ -0,0 +1,220 @@
|
||||
# Did Stack Overflow Answers Increase After ChatGPT? — Term Project Report
|
||||
|
||||
## Itamar Oren-Naftalovich
|
||||
|
||||
## 1. Abstract and introduction
|
||||
|
||||
This project asks whether the number of answers posted on Stack Overflow (SO) increased or decreased after the public launch of ChatGPT on 2022-11-30 and after subsequent community policy events. To study this, we combine (i) site-level activity from the Stack Exchange Data Explorer (SEDE) with (ii) developer sentiment and usage data from the annual Stack Overflow Developer Survey. Together, these data allow us both to measure changes in answer volume and to contextualize those changes using self-reported behavior.
|
||||
|
||||
Our core hypothesis is that the arrival of high-quality, conversational code assistance would noticeably change the supply of answers on SO, because developers have a new place to go for immediate help. We further treat moderation policies and community events as additional shocks that may amplify or dampen this effect.
|
||||
|
||||
We frame the problem as a quasi-experimental time-series analysis with interrupted trends around several key dates:
|
||||
|
||||
* ChatGPT public launch (2022-11-30)
|
||||
* Initial Stack Overflow policy banning AI-generated answers (policy posted 2022-12-05)
|
||||
* Later moderation and governance events (including the moderation strike)
|
||||
|
||||
Throughout, we pay attention to:
|
||||
|
||||
* **Internal validity:** controlling for pre-existing trends and seasonality, rather than treating pre/post averages as independent.
|
||||
* **External validity:** comparing site-level patterns to changes in developer behavior reported in survey data.
|
||||
* **Measurement caveats:** handling deleted content, moderation queues, and sampling or survey-response effects.
|
||||
|
||||
**Prior work and context.** The accompanying slide deck summarizes the key posts and data sources (OpenAI’s announcement, Meta Stack Overflow policy discussions, moderation-strike posts, traffic analyses, and SO survey documentation). These references define the timeline and motivate the research question, so they are not restated in full here.
|
||||
|
||||
---
|
||||
|
||||
## 2. Data description and preliminary analysis
|
||||
|
||||
### Datasets
|
||||
|
||||
We use two complementary datasets:
|
||||
|
||||
* **Dataset 1 — Site activity.**
|
||||
Monthly counts of **new answers** on Stack Overflow from the public Stack Exchange Data Explorer (SEDE). The extract includes both non-deleted and deleted answers so that we can separate organic activity from moderation effects. The main analysis window is **2018–2025**, which provides several pre-ChatGPT baseline years and a meaningful post-event period.
|
||||
|
||||
* **Dataset 2 — Developer survey.**
|
||||
Selected questions from the **Stack Overflow Developer Survey (2023–2025)**, focusing on visit frequency (e.g., daily vs. weekly) and adoption of AI tools such as ChatGPT. These variables are used to understand shifts in demand for on-site answers and how they correlate with AI usage.
|
||||
|
||||
### Criteria and rationale
|
||||
|
||||
* Dataset 1 directly measures the outcome of interest: **answer supply on Stack Overflow**.
|
||||
* Dataset 2 provides **behavioral context**: whether developers who use ChatGPT heavily also report visiting Stack Overflow less frequently.
|
||||
|
||||
By combining logs and surveys, we can triangulate between observational activity data and self-reported changes in workflow. The goal is not to claim strict causality, but to see whether the patterns in these sources align.
|
||||
|
||||
### Preliminary views
|
||||
|
||||
The first set of plots (Figure 1) provides high-level structure for later modelling:
|
||||
|
||||
* Time-series plots of monthly answers reveal overall levels and long-run trends.
|
||||
* Comparisons of pre- and post-ChatGPT periods highlight visible changes in level or slope.
|
||||
* Seasonal views (e.g., by calendar month) show systematic patterns such as summer slowdowns or end-of-year dips.
|
||||
|
||||
These descriptive views inform the later modelling choices such as adding interrupted trend terms and seasonal controls.
|
||||
|
||||

|
||||
|
||||
*Figure 1. Preliminary monthly trend in new answers on Stack Overflow.*
|
||||
|
||||
---
|
||||
|
||||
## 3. Exploratory analysis
|
||||
|
||||
We first clean and harmonize the SEDE extract, collapsing to **monthly answer counts** and separating deleted from non-deleted answers. We then:
|
||||
|
||||
* Scan for structural breaks or anomalies around key event dates.
|
||||
* Apply short moving averages to highlight medium-run shifts in the time series.
|
||||
* Plot seasonality by calendar month to visualize recurring within-year patterns.
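
The smoothing and seasonality steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the project's analysis script; the function names and the toy counts are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def moving_average(values, window=3):
    """Trailing moving average; the first window-1 entries are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(mean(values[i + 1 - window : i + 1]))
    return out


def monthly_profile(months, values):
    """Average value per calendar month (1-12) across all years."""
    buckets = defaultdict(list)
    for month, value in zip(months, values):
        buckets[month].append(value)
    return {m: mean(v) for m, v in sorted(buckets.items())}


# Made-up counts for illustration only (not SEDE data).
counts = [100, 90, 80, 70, 60, 50]
smoothed = moving_average(counts, window=3)
profile = monthly_profile([1, 2, 1, 2, 1, 2], counts)
```

A trailing window is used here so that each smoothed value depends only on past months; a centered window would be an equally reasonable choice for purely descriptive plots.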
|
||||
|
||||
On the survey side, we:
|
||||
|
||||
* Construct indicators for **frequent SO visits** (e.g., daily or almost daily) and **ChatGPT usage**.
|
||||
* Compare distributions across years to detect shifts in visiting behavior and AI adoption.
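
As a rough sketch of how such indicators might be built, the snippet below derives two 0/1 flags per respondent and compares the share of frequent visitors between ChatGPT users and non-users. The column names (`visit_freq`, `ai_tools`), the answer labels, and the semicolon-separated format are hypothetical; the actual survey wording and layout may differ by year.

```python
# Hypothetical answer labels; the real survey wording may differ by year.
FREQUENT_VISIT_LABELS = {"Daily or almost daily", "Multiple times per day"}


def build_indicators(rows):
    """Return (frequent_visitor, uses_chatgpt) 0/1 pairs for each respondent."""
    out = []
    for row in rows:
        frequent = int(row.get("visit_freq") in FREQUENT_VISIT_LABELS)
        # Assumes a semicolon-separated multi-select column of AI tools.
        tools = (row.get("ai_tools") or "").split(";")
        uses_chatgpt = int("ChatGPT" in [t.strip() for t in tools])
        out.append((frequent, uses_chatgpt))
    return out


def share_frequent_by_usage(pairs):
    """Share of frequent visitors among ChatGPT non-users (0) and users (1)."""
    shares = {}
    for flag in (0, 1):
        group = [freq for freq, uses in pairs if uses == flag]
        shares[flag] = sum(group) / len(group) if group else float("nan")
    return shares


# Made-up respondents for illustration only.
rows = [
    {"visit_freq": "Daily or almost daily", "ai_tools": "ChatGPT;GitHub Copilot"},
    {"visit_freq": "A few times per week", "ai_tools": "ChatGPT"},
    {"visit_freq": "Daily or almost daily", "ai_tools": ""},
    {"visit_freq": "Less than once per month", "ai_tools": None},
]
pairs = build_indicators(rows)
shares = share_frequent_by_usage(pairs)
```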
|
||||
|
||||
### Sources of uncertainty and bias
|
||||
|
||||
We explicitly track several sources of uncertainty:
|
||||
|
||||
* **Policy and moderation effects.**
|
||||
Deletions and review backlogs can move answers between months or suppress visible counts. To address this, we track **deleted and non-deleted answers separately** and compare them over time.
|
||||
|
||||
* **Seasonality and macro conditions.**
|
||||
Holidays, hiring cycles, and broader market conditions can confound naive pre/post comparisons. We therefore visualize **within-year seasonality** and include time controls in the models.
|
||||
|
||||
* **Survey representativeness.**
|
||||
Survey respondents may not be a random sample of all SO users. Active answerers and enthusiastic AI adopters might be over- or under-represented. For this reason, we treat survey-based findings as **correlational**, not causal.
|
||||
|
||||

|
||||
|
||||
*Figure 2. Example exploratory views: seasonal patterns, moving averages, and distributional summaries.*
|
||||
|
||||

|
||||
|
||||
*Figure 3. Additional exploratory views, emphasizing seasonality and pre/post differences.*
|
||||
|
||||
---
|
||||
|
||||
## 4. Model development and application
|
||||
|
||||
To move beyond descriptive plots, we implement four modelling approaches. Together, they test for structural changes in answer volume and connect survey behavior to on-site activity.
|
||||
|
||||
1. **Interrupted time-series linear regression (ITS).**
|
||||
|
||||
* **Outcome:** monthly new answers.
|
||||
* **Predictors:** a linear time trend, post-ChatGPT **level change**, and **slope change**, plus optional indicator variables for policy and moderation periods.
|
||||
* **Goal:** test for discrete jumps and gradual trend shifts relative to the pre-event trajectory.
|
||||
|
||||
2. **Poisson / negative-binomial regression for counts.**
|
||||
|
||||
* Same predictors as ITS but with a **log link** for count data.
|
||||
* We compare Poisson and negative-binomial versions to account for over-dispersion and to avoid relying on normal residuals.
|
||||
|
||||
3. **ARIMA time-series forecasting.**
|
||||
|
||||
* Fit solely on **pre-ChatGPT** data to produce a counterfactual forecast.
|
||||
* Compare out-of-sample forecasts to observed post-event answer counts.
|
||||
* Large and sustained deviations beyond forecast bands signal additional shocks beyond trend and seasonality.
|
||||
|
||||
4. **Logistic classification on survey microdata.**
|
||||
|
||||
* **Target:** whether a respondent is a **“frequent SO visitor”** (daily or almost daily).
|
||||
* **Predictors:** a ChatGPT-usage indicator plus demographic and role controls.
|
||||
* **Evaluation:** accuracy, precision/recall, and calibration curves, with a hold-out split for validation.
|
||||
* **Purpose:** test whether heavy ChatGPT users are **less likely** to report frequent SO visits, even after adjusting for other factors.
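
To make the ITS specification concrete, the sketch below builds the design matrix (intercept, time trend, post-event level change, post-event slope change) and fits it by ordinary least squares with NumPy. It is an illustration on a noise-free synthetic series, not the fitted model from this report; `its_design`, the break index, and the coefficient values are assumptions chosen for the example.

```python
import numpy as np


def its_design(n_months, break_index):
    """Columns: intercept, time, post-event level, post-event slope."""
    t = np.arange(n_months, dtype=float)
    post = (t >= break_index).astype(float)
    since = np.where(post == 1.0, t - break_index, 0.0)
    return np.column_stack([np.ones(n_months), t, post, since])


# Synthetic series for illustration only: baseline 1000, trend -2 per month,
# then a level drop of 100 and an extra slope of -5 from month 24 onward.
n, k = 48, 24
X = its_design(n, k)
true_beta = np.array([1000.0, -2.0, -100.0, -5.0])
y = X @ true_beta

# Least-squares estimate of (intercept, trend, level change, slope change).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The third coefficient is read as the discrete jump at the event date and the fourth as the change in monthly slope relative to the pre-event trend. Policy or moderation indicators would enter as additional 0/1 columns.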
|
||||
|
||||
### Validation and diagnostics
|
||||
|
||||
For each model family, we run basic diagnostic checks:
|
||||
|
||||
* **ITS models:**
|
||||
|
||||
* Inspect residuals for autocorrelation and remaining seasonality.
|
||||
* Re-fit with seasonal terms or alternative specifications where necessary.
|
||||
|
||||
* **Count models (Poisson/NB):**
|
||||
|
||||
* Check over-dispersion indicators and compare Poisson vs. negative-binomial fits.
|
||||
* Examine goodness-of-fit plots and residual patterns.
|
||||
|
||||
* **ARIMA forecasts:**
|
||||
|
||||
* Select model orders using information criteria on the training window.
|
||||
* Inspect forecast errors and confidence bands to ensure reasonable counterfactual behavior.
|
||||
|
||||
* **Classification models:**
|
||||
|
||||
* Use a separate hold-out set for evaluation.
|
||||
* Report confusion matrices and standard performance metrics.
|
||||
* Inspect calibration to verify that predicted probabilities match observed frequencies.
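
For the count-model checks listed above, two small calculations are often useful: the Pearson dispersion statistic (values well above 1 suggest over-dispersion relative to Poisson) and the conversion of a log-link coefficient into a percentage change per month. The sketch below shows both on made-up numbers; the function names are illustrative and the values are not results from this project.

```python
import math


def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)


def monthly_percent_change(log_coef):
    """Convert a log-link slope into a percent change per month."""
    return (math.exp(log_coef) - 1.0) * 100.0


# Made-up numbers for illustration only.
observed = [120, 80, 150, 60, 110]
fitted = [100.0, 100.0, 100.0, 100.0, 100.0]
ratio = pearson_dispersion(observed, fitted, n_params=1)
change = monthly_percent_change(-0.02)  # about -1.98% per month
```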
|
||||
|
||||

|
||||
|
||||
*Figure 4. Example model fits (ITS) and moving-average smoothed trends around intervention dates.*
|
||||
|
||||

|
||||
|
||||
*Figure 5. Illustrative Poisson / negative-binomial fits versus observed counts.*
|
||||
|
||||

|
||||
|
||||
*Figure 6. Additional count-model diagnostics and fit comparisons.*
|
||||
|
||||

|
||||
|
||||
*Figure 7. ARIMA counterfactual forecast vs. observed post-event answer volumes.*
|
||||
|
||||
---
|
||||
|
||||
## 5. Conclusions and discussion
|
||||
|
||||
Across the descriptive plots and models, the period after November 2022 shows both **level** and **slope** changes that are consistent with a structural shift in answer supply on Stack Overflow. These changes coincide with the availability of ChatGPT and closely timed policy and moderation events.
|
||||
|
||||
ARIMA counterfactuals trained on pre-event data give a baseline trajectory. When we compare this baseline to observed post-event values, we see deviations that fall outside typical forecast bands, supporting the idea that there was a shock beyond existing trends and seasonality.
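
One simple way to summarize such deviations, assuming the forecast's lower and upper bounds are already available as monthly lists, is to count the months that fall outside the band and the longest consecutive run below it. The helper names and values below are hypothetical and do not reproduce the fitted forecast.

```python
def months_outside_band(observed, lower, upper):
    """Return indices where the observed value falls outside [lower, upper]."""
    return [
        i
        for i, (y, lo, hi) in enumerate(zip(observed, lower, upper))
        if y < lo or y > hi
    ]


def longest_run_below(observed, lower):
    """Length of the longest consecutive run of months below the lower bound."""
    best = run = 0
    for y, lo in zip(observed, lower):
        run = run + 1 if y < lo else 0
        best = max(best, run)
    return best


# Made-up values for illustration only (not the fitted forecast).
observed = [95, 88, 70, 65, 60]
lower = [90, 85, 80, 78, 75]
upper = [110, 108, 106, 104, 102]
outside = months_outside_band(observed, lower, upper)
run = longest_run_below(observed, lower)
```

A long unbroken run below the band is more informative than isolated misses, since a well-calibrated interval is expected to be exceeded occasionally by chance.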
|
||||
|
||||
The survey-based classifiers reinforce this picture: heavy ChatGPT adoption is associated with **lower self-reported visit frequency**, even after controlling for observable demographics and roles. This pattern lines up with the site-level decline in new answers and suggests that some developers are partially substituting conversational AI for Stack Overflow visits.
|
||||
|
||||
### Limitations
|
||||
|
||||
* **Causality is tentative.**
|
||||
|
||||
* Policy changes and the moderation strike overlap with the ChatGPT rollout, making it difficult to cleanly attribute changes in answer volume to any single event.
|
||||
* External shocks—such as labor-market cycles, ecosystem-tooling changes, or shifts in documentation quality—may also contribute.
|
||||
|
||||
* **Survey constraints.**
|
||||
|
||||
* Survey responses are self-reported and subject to recall and response biases.
|
||||
* The sample may not represent the full SO user base or the most active answerers.
|
||||
|
||||
Because of these limitations, we interpret the results as **strong correlational evidence** of a shift in answer supply and usage patterns, not as a sharp causal estimate. Future work should:
|
||||
|
||||
* Incorporate richer covariates (e.g., tag-level activity, user cohorts, question complexity).
|
||||
* Explore quasi-experimental designs (such as synthetic controls) to better isolate the effect of AI tools and platform policies.
|
||||
|
||||
### Implications
|
||||
|
||||
For knowledge platforms, the analysis suggests that answer supply can be **sensitive to rapid changes in assistance tooling and governance**. In particular:
|
||||
|
||||
* Sustainable moderation capacity and clear, transparent AI guidance appear important to avoid destabilizing answer quality and volume.
|
||||
* As conversational assistants become part of everyday developer workflows, platforms like Stack Overflow may need deeper integration paths (for example, exposing structured answers or metadata that assistants can consume directly).
|
||||
* Balancing open contribution, quality control, and integration with external AI tools may be key to retaining community participation in an environment where “first-line help” increasingly comes from chatbots.
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
* OpenAI. “Introducing ChatGPT.” OpenAI, 30 Nov. 2022.
|
||||
* Prasnikar, D. “Policy: Generative AI (e.g., ChatGPT) Is Banned.” Meta Stack Overflow, 5 Dec. 2022.
|
||||
* Mithical. “Moderation Strike: Stack Overflow, Inc. Cannot Consistently Ignore, Mistreat, and Malign Its Volunteers.” Meta Stack Exchange, 5 June 2023.
|
||||
* Makyen. “Moderation Strike: Conclusion and the Way Forward.” Meta Stack Exchange, 7 Aug. 2023.
|
||||
* Carr, D. F. “Stack Overflow Is ChatGPT Casualty: Traffic Down 14% in March.” Similarweb Insights, 19 Apr. 2023.
|
||||
* “Database Schema Documentation for the Public Data Dump and SEDE.” Meta Stack Exchange (FAQ), 4 Oct. 2022.
|
||||
* Stack Overflow. *Stack Overflow Developer Survey* (2023–2025).
|
||||
|
||||
## Image gallery (additional figures)
|
||||
|
||||

|
||||
* *Figure 8.* Additional pre/post comparison plots.
|
||||
|
||||

|
||||
* *Figure 9.* Additional figure from the provided results.
|
||||
Binary file not shown.
@@ -0,0 +1,410 @@
|
||||
# News Popularity in Multiple Social Media Platforms
|
||||
|
||||
This project analyzes the **News Popularity in Multiple Social Media Platforms** dataset from the UCI Machine Learning Repository. The data contains ~93k news items collected between November 2015 and July 2016, with their final popularity on Facebook, Google+ and LinkedIn across four topics: *economy*, *microsoft*, *obama* and *palestine*.
|
||||
|
||||
---
|
||||
|
||||
## 1. Exploratory Data Analysis
|
||||
|
||||
### 1.1 Data overview and cleaning
|
||||
|
||||
We work primarily with `Data/News_Final.csv`, which has **93,239** rows and 11 variables:
|
||||
|
||||
- `IDLink` – numeric id of the article
|
||||
- `Title`, `Headline` – short text fields
|
||||
- `Source` – news outlet that originally published the story
|
||||
- `Topic` – one of {economy, microsoft, obama, palestine}
|
||||
- `PublishDate` – publication timestamp
|
||||
- `SentimentTitle`, `SentimentHeadline` – numeric sentiment scores derived from title and headline text
|
||||
- `Facebook`, `GooglePlus`, `LinkedIn` – final popularity on each social media platform
|
||||
|
||||
According to the dataset documentation, **-1** in the popularity variables indicates that no final popularity value was observed. In the code, any value `< 0` in `Facebook`, `GooglePlus`, or `LinkedIn` is therefore replaced with `NaN`. Missing popularity values are later dropped on a per-model basis. `PublishDate` is converted to a proper timestamp, and a numeric time feature
|
||||
|
||||
```text
DaysSinceEpoch = days since 1970-01-01
```
|
||||
|
||||
is created to allow inclusion of temporal trends in the models. We also log‑transform Facebook popularity:
|
||||
|
||||
```text
log_Facebook = log1p(Facebook)
```
|
||||
|
||||
which is used as the target for regression models.
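
The cleaning steps described in this subsection can be sketched with the standard library alone. This is an illustration, not the project's code: the helper names are hypothetical, and the timestamp format and the UTC assumption are guesses that should be checked against the actual CSV.

```python
import math
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def clean_popularity(value):
    """Map the negative sentinel (-1) to NaN; keep valid counts as floats."""
    return float("nan") if value < 0 else float(value)


def days_since_epoch(timestamp):
    """Fractional days between 1970-01-01 and a 'YYYY-MM-DD HH:MM:SS' string."""
    # Assumption: timestamps are treated as UTC.
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return (dt.replace(tzinfo=timezone.utc) - EPOCH).total_seconds() / 86400.0


def log_facebook(value):
    """log1p of a cleaned count; NaN stays NaN."""
    return math.log1p(value)
```

Using `log1p` rather than `log` keeps articles with zero shares in the data, because `log1p(0) = 0` while `log(0)` is undefined.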
|
||||
|
||||
---
|
||||
|
||||
### 1.2 Popularity distributions
|
||||
|
||||
Figure 1 shows a histogram of Facebook share counts on a **logarithmic x‑axis**, after removing missing and zero values.
|
||||
|
||||

|
||||
|
||||
*Figure 1: Distribution of Facebook popularity on a log x‑axis.*
|
||||
|
||||
The distribution is extremely right‑skewed:
|
||||
|
||||
- Most articles receive very few shares.
|
||||
- A small number of “viral” articles receive thousands of shares.
|
||||
|
||||
On the cleaned data, summary statistics for Facebook shares are approximately:
|
||||
|
||||
- median ≈ 8
- mean ≈ 129
- 90th percentile ≈ 214
- 99th percentile ≈ 2,322
- max = 49,211
|
||||
|
||||
Google+ and LinkedIn exhibit similar heavy‑tailed patterns (with smaller absolute scales), which matches the description of the dataset creators ([arXiv][1]).
|
||||
|
||||
Figure 2 shows the distribution of `log1p(Facebook)`.
|
||||
|
||||

|
||||
|
||||
*Figure 2: Distribution of log‑transformed Facebook popularity.*
|
||||
|
||||
The log transform compresses the heavy tail and produces a more regular, unimodal distribution. This justifies using `log1p(popularity)` as the regression target: it reduces the influence of rare extreme outliers while keeping them in the data, which is important because viral stories are the phenomena of interest.
|
||||
|
||||
---
|
||||
|
||||
### 1.3 Topic effects
|
||||
|
||||
The four topics are not equally represented:
|
||||
|
||||
- economy: 33,928 items
|
||||
- obama: 28,610
|
||||
- microsoft: 21,858
|
||||
- palestine: 8,843
|
||||
|
||||
Figure 3 shows the mean log‑Facebook popularity by topic.
|
||||
|
||||

|
||||
|
||||
*Figure 3: Mean log‑Facebook popularity by topic.*
|
||||
|
||||
Key observations:
|
||||
|
||||
- **obama** stories clearly have the highest average popularity.
|
||||
- **microsoft** is slightly above **economy** and **palestine**.
|
||||
- In original share counts, obama articles average roughly an order of magnitude more shares than economy/microsoft/palestine stories, but all topics remain strongly skewed.
|
||||
|
||||
This suggests that topic is an important categorical predictor for popularity, and motivates including it as a one‑hot encoded feature in the models.
|
||||
|
||||
---
|
||||
|
||||
### 1.4 Sentiment and popularity
|
||||
|
||||
Sentiment scores from the title and headline are continuous values roughly in the interval [-1, 1]. Their empirical distributions are centered very close to 0 with standard deviations around 0.14, indicating that most titles and headlines are only mildly positive or negative.
|
||||
|
||||
Figure 4 plots `SentimentTitle` against `log_Facebook` for a 5,000‑row sample.
|
||||
|
||||

|
||||
|
||||
*Figure 4: Scatter of title sentiment vs log‑Facebook popularity (sample of 5,000 articles).*
|
||||
|
||||
The scatter plot shows:
|
||||
|
||||
- A dense vertical band near sentiment 0, reflecting many neutral titles.
|
||||
- Viral and non‑viral articles scattered across the full sentiment range, with no obvious linear trend.
|
||||
|
||||
Empirically, the correlation between sentiment and Facebook popularity is almost zero (|r| ≈ 0.01). This suggests that sentiment alone is a weak predictor of popularity; we still include it in models because it may interact with topic or time, but we do not expect it to explain much variance by itself.
|
||||
|
||||
---
|
||||
|
||||
### 1.5 EDA conclusions
|
||||
|
||||
From the exploratory analysis we conclude:
|
||||
|
||||
1. **Popularity variables are non‑negative, highly skewed, and heavy‑tailed.**
|
||||
|
||||
- Log‑transforming shares yields more regular distributions, so regression models should target `log1p(popularity)` instead of raw counts.
|
||||
|
||||
2. **Topic has a strong effect on expected popularity.**
|
||||
|
||||
- In particular, obama‑related news is more popular on Facebook; microsoft is relatively stronger on LinkedIn (from descriptive statistics, not shown here).
|
||||
|
||||
3. **Title/headline sentiment has little linear relationship with popularity.**
|
||||
|
||||
- It should not be expected to drive predictions strongly.
|
||||
|
||||
4. **There are many extreme outliers (viral stories), but these are the signal we care about.**
|
||||
|
||||
- We choose *not* to remove them; instead, we rely on robust models and log‑transformed targets.
|
||||
|
||||
These observations motivate a modeling strategy that combines:
|
||||
|
||||
- **Linear models** (to quantify simple topic/sentiment effects on log‑popularity).
|
||||
- **Non‑linear tree‑based models** (to capture complex relationships and heavy‑tailed behaviour).
|
||||
- **Classification** of viral vs non‑viral stories.
|
||||
- **Clustering** of time‑series trajectories to identify typical growth patterns.
|
||||
|
||||
The next section formalizes these ideas.
|
||||
|
||||
---
|
||||
|
||||
## 2. Model Development, Validation and Optimization
|
||||
|
||||
We develop **five** models: three regression models (including a dimension‑reduced variant), one classification model, and one clustering model. This covers regression, classification and unsupervised learning objectives, and explicitly examines the impact of dimensionality reduction.
|
||||
|
||||
All supervised models use:
|
||||
|
||||
- Train/test split: **80% training, 20% test**, `random_state=42`.
|
||||
- Evaluation on the held‑out test set only (no peeking).
|
||||
- Metrics:
|
||||
|
||||
- Regression: R² and RMSE on log‑scale (using `root_mean_squared_error`).
|
||||
- Classification: accuracy, F1 for the positive class, ROC AUC and confusion matrix.
|
||||
|
||||
### 2.1 Common preprocessing
|
||||
|
||||
For each model:
|
||||
|
||||
1. Replace negative values (the `-1` sentinel) in `Facebook`, `GooglePlus`, `LinkedIn` with `NaN`.
|
||||
2. Drop rows with missing values in the specific target variable.
|
||||
3. Use `DaysSinceEpoch` as a numeric representation of `PublishDate`.
|
||||
4. Where appropriate, use `log_Facebook = log1p(Facebook)` as the regression target.
|
||||
5. Encode `Topic` using one‑hot encoding with economy as the reference level (`drop_first=True`).
|
||||
|
||||
For time‑series models we also use `Data/Facebook_Economy.csv`, which stores Facebook popularity snapshots TS1–TS144 every 20 minutes for economy articles. We join it with `News_Final.csv` on `IDLink` and restrict to:
|
||||
|
||||
- `Topic == "economy"`
|
||||
- Time slices **TS1–TS50** as predictors (roughly first 16–17 hours)
|
||||
- Final log‑Facebook popularity as the target
|
||||
|
||||
Negative TS values are interpreted as “no observed popularity yet” and are set to 0.
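
The slice selection and cleaning rule can be illustrated as follows. This is a small sketch on a made-up snapshot row; the constants mirror the values stated above (20-minute slices, first 50 slices), while the helper names are assumptions.

```python
SLICE_MINUTES = 20  # sampling interval stated for TS1-TS144
N_EARLY = 50        # number of early slices used as predictors


def clean_slices(row):
    """Keep the first N_EARLY slices and replace negative values with 0."""
    return [max(v, 0) for v in row[:N_EARLY]]


def hours_covered(n_slices, minutes=SLICE_MINUTES):
    """Elapsed hours represented by the first n_slices snapshots."""
    return n_slices * minutes / 60.0


# Made-up snapshot row: -1 until the article is first observed.
row = [-1, -1, 0, 2, 5] + [7] * 139
early = clean_slices(row)
```

With these constants, `hours_covered(50)` is about 16.7 hours and `hours_covered(144)` is exactly 48 hours, which matches the "16–17 hours" and two-day horizons used in this report.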
|
||||
|
||||
---
|
||||
|
||||
### 2.2 Regression Model 1 – Linear regression on static features
|
||||
|
||||
**Goal.** Predict log‑Facebook popularity using only static metadata (no early popularity feedback).
|
||||
|
||||
- **Target:** `y = log_Facebook` for all topics.
|
||||
- **Features:**
|
||||
|
||||
- `SentimentTitle`, `SentimentHeadline`
|
||||
- `DaysSinceEpoch` (publication time)
|
||||
- Topic one‑hot dummies: `Topic_microsoft`, `Topic_obama`, `Topic_palestine` (economy is implicit baseline).
|
||||
|
||||
We fit an ordinary least squares linear regression on the training split and evaluate on the test set.
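
As a rough illustration of this setup, the sketch below builds the design matrix with economy as the dropped reference level and solves the least-squares problem with NumPy. It runs on synthetic data with coefficients chosen for the example, so it shows the mechanics only; it is not the project's scikit-learn pipeline and does not reproduce the reported results.

```python
import numpy as np

TOPICS = ["economy", "microsoft", "obama", "palestine"]  # economy = baseline


def design_matrix(sent_title, sent_headline, days, topics):
    """Intercept + numeric features + topic dummies (economy dropped)."""
    cols = [np.ones(len(topics)), sent_title, sent_headline, days]
    for level in TOPICS[1:]:
        cols.append(np.array([1.0 if t == level else 0.0 for t in topics]))
    return np.column_stack(cols)


# Tiny synthetic data set for illustration only.
rng = np.random.default_rng(42)
n = 200
topics = rng.choice(TOPICS, size=n)
X = design_matrix(
    rng.normal(0, 0.14, n),        # title sentiment
    rng.normal(0, 0.14, n),        # headline sentiment
    rng.uniform(16750, 17000, n),  # days since epoch (late 2015 to mid 2016)
    topics,
)
# Order: intercept, title, headline, days, microsoft, obama, palestine.
true_coef = np.array([2.0, -0.4, -0.1, 0.0, 0.1, 1.8, 0.0])
y = X @ true_coef + rng.normal(0, 0.01, n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because economy is the omitted level, each topic coefficient is read as a shift in log‑popularity relative to an economy article with the same sentiment and publication date.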
|
||||
|
||||
**Results (test set):**
|
||||
|
||||
- **R² ≈ 0.157**
- **RMSE ≈ 1.86** in log‑space
|
||||
|
||||
Figure 5 compares actual and predicted log‑Facebook values.
|
||||
|
||||

|
||||
|
||||
*Figure 5: Model 1 predictions vs actual log‑Facebook values.*
|
||||
|
||||
The predictions are compressed into a narrow band, under‑predicting viral articles and over‑predicting low‑popularity ones. Key coefficients:
|
||||
|
||||
- `Topic_obama` ≈ +1.78 (large positive shift vs economy)
- `Topic_microsoft` ≈ +0.10
- `Topic_palestine` ≈ +0.02
- `SentimentTitle` ≈ −0.38, `SentimentHeadline` ≈ −0.06
- `DaysSinceEpoch` ≈ −0.0007 (tiny downward trend over time)
|
||||
|
||||
Interpretation:
|
||||
|
||||
- Topic has a clear effect (especially obama).
|
||||
- Sentiment effects are small and slightly negative.
|
||||
- The model explains only ~16% of the variance in log‑popularity, confirming that static features alone are weak predictors.
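
Because the target is `log1p(Facebook)`, each coefficient acts multiplicatively on `1 + shares`. The short calculation below back-transforms two of the reported coefficients; the helper name is illustrative, and the 0.14 step for sentiment is the approximate standard deviation quoted in Section 1.4.

```python
import math


def multiplicative_effect(coef, delta=1.0):
    """Factor applied to (1 + shares) when a feature increases by delta."""
    return math.exp(coef * delta)


obama_factor = multiplicative_effect(1.78)         # topic dummy: 0 -> 1
title_factor = multiplicative_effect(-0.38, 0.14)  # +1 SD of title sentiment
```

Under this reading, an obama article is predicted to have roughly 5.9 times the `1 + shares` of a comparable economy article, while a one-standard-deviation increase in title sentiment corresponds to a change of only about 5%.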
|
||||
|
||||
---
|
||||
|
||||
### 2.3 Regression Model 2 – Random forest on early time slices
|
||||
|
||||
**Goal.** Predict final log‑Facebook popularity for **economy** stories using early Facebook popularity time slices and sentiment.
|
||||
|
||||
- **Target:** `log_Facebook` for economy topic, joined with Facebook_Economy time‑series.
|
||||
- **Features:**
|
||||
|
||||
- TS1–TS50 (early cumulative popularity counts, cleaned: negative → 0)
|
||||
- `SentimentTitle`, `SentimentHeadline`
|
||||
|
||||
We fit a `RandomForestRegressor` with:
|
||||
|
||||
- 120 estimators,
|
||||
- `min_samples_leaf=2`,
|
||||
- `max_depth=None` (trees grow fully),
|
||||
- `n_jobs=-1`, `random_state=42`.
|
||||
|
||||
**Results (test set):**
|
||||
|
||||
- **R² ≈ 0.746**
- **RMSE ≈ 0.86** (log‑scale)
|
||||
|
||||
Feature importances indicate:
|
||||
|
||||
- `TS50` alone contributes ~81% of total importance.
|
||||
- Combined sentiment variables contribute ~17%.
|
||||
- Earlier TS features each have very small marginal importance.
|
||||
|
||||
Thus, knowing an article’s popularity after ~17 hours (TS50) is already highly predictive of its final 2‑day popularity. Early engagement is a much stronger signal than sentiment or publish time.
|
||||
|
||||
---
|
||||
|
||||
### 2.4 Regression Model 3 – PCA + random forest (dimension reduction)
|
||||
|
||||
Model 3 examines the effect of **dimension reduction** on performance.
|
||||
|
||||
Instead of using all 50 TS features directly, we:
|
||||
|
||||
1. Standardize TS1–TS50 with `StandardScaler`.
|
||||
2. Apply PCA with `n_components=10`.
|
||||
3. Concatenate the 10 PCA components with the two sentiment features (`SentimentTitle`, `SentimentHeadline`).
|
||||
4. Train the same `RandomForestRegressor` as Model 2 on this 12‑dimensional feature space.
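
Steps 1 to 3 can be sketched with NumPy alone, using a singular value decomposition of the standardized matrix. This is a stand-in for `StandardScaler` plus PCA on synthetic curves, not the project's code; `pca_reduce` is a hypothetical helper. In a real train/test workflow the scaling statistics and components should be estimated on the training split only.

```python
import numpy as np


def pca_reduce(ts, n_components=10):
    """Standardize columns, then project onto the leading principal components."""
    mean = ts.mean(axis=0)
    std = ts.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    z = (ts - mean) / std
    # SVD of the standardized matrix: rows of vt are principal directions.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return z @ vt[:n_components].T, explained[:n_components]


# Synthetic cumulative curves for illustration only (300 articles x 50 slices).
rng = np.random.default_rng(0)
ts = np.cumsum(rng.poisson(2.0, size=(300, 50)), axis=1).astype(float)
sentiment = rng.normal(0, 0.14, size=(300, 2))

components, ratio = pca_reduce(ts, n_components=10)
features = np.hstack([components, sentiment])  # 10 + 2 = 12 columns
```

Cumulative popularity curves are strongly correlated across slices, which is why a single leading component can capture most of the variance.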
|
||||
|
||||
PCA results:
|
||||
|
||||
- 1st component explains ≈ **93.5%** of variance.
|
||||
- First 10 components together explain ≈ **99.9%** of variance.
|
||||
|
||||
**Results (test set):**
|
||||
|
||||
- **R² ≈ 0.745**
|
||||
- **RMSE ≈ 0.87**
|
||||
|
||||
Compared to Model 2:
|
||||
|
||||
- R² decreases only slightly (0.746 → 0.745).
|
||||
- RMSE increases minimally (0.862 → 0.865).
|
||||
|
||||
So PCA reduces dimensionality from 50 TS features to 10 components with **negligible loss of predictive performance**. The first PCA components effectively summarize overall popularity level and growth pattern, which are the dominant signals for final popularity.
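
To see how quickly the variance accumulates, the explained‑variance ratios printed for Model 3 in the console output can be summed cumulatively. The sketch below hard‑codes those ten ratios; `components_needed` is an illustrative helper, not a function from the script.

```python
from itertools import accumulate

# explained_variance_ratio_ values as printed in the Model 3 console output
ratios = [
    9.38529911e-01, 3.24317512e-02, 1.76049987e-02, 7.50439628e-03,
    1.90148973e-03, 6.83679307e-04, 3.57135169e-04, 2.12058930e-04,
    1.33577763e-04, 9.66846072e-05,
]
cumulative = list(accumulate(ratios))

def components_needed(target: float) -> int:
    """Smallest number of leading components whose cumulative ratio reaches `target`."""
    for k, total in enumerate(cumulative, start=1):
        if total >= target:
            return k
    raise ValueError("target not reached with the available components")

print(components_needed(0.95), components_needed(0.99))
```

On these numbers, noticeably fewer than 10 components would already pass common variance thresholds, which suggests `n_components=10` is a conservative choice.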
|
||||
|
||||
---
|
||||
|
||||
### 2.5 Classification Model 4 – Logistic regression for viral vs non‑viral
|
||||
|
||||
**Goal.** Classify whether an article is *viral* on Facebook, defined as being in the top 10% of final popularity.
|
||||
|
||||
- **Target:**
|
||||
|
||||
- `viral_fb = 1` if `Facebook ≥ 214` (90th percentile), otherwise 0.
|
||||
- Class distribution: ~10% positive, ~90% negative.
|
||||
- **Features:**
|
||||
|
||||
- `SentimentTitle`, `SentimentHeadline`
|
||||
- `DaysSinceEpoch`
|
||||
- Topic dummies as before
|
||||
|
||||
We intentionally **do not use time‑slice features** here to simulate making a decision at or before publication, when no engagement data is available yet.
|
||||
|
||||
We fit a `LogisticRegression` with `max_iter=500` and `class_weight="balanced"` to counter class imbalance.
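
`class_weight="balanced"` reweights each class by `n_samples / (n_classes * class_count)`, the formula scikit‑learn documents for this option. A small sketch of that arithmetic on a made‑up 90/10 label vector follows; the data and the helper name are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weights n_samples / (n_classes * count), as used by class_weight='balanced'."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# toy labels with roughly the same imbalance as viral_fb (10% positive)
toy_labels = [0] * 900 + [1] * 100
weights = balanced_class_weights(toy_labels)
print(weights)
```

With this imbalance, each viral example counts about nine times as much as a non‑viral one during fitting, which pushes the model toward predicting the positive class more often.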
|
||||
|
||||
**Results (test set):**
|
||||
|
||||
- **Accuracy ≈ 0.73**
|
||||
|
||||
- A naive classifier that always predicts “non‑viral” would obtain ≈ 0.90 accuracy, highlighting that raw accuracy is misleading under imbalance.
|
||||
- **F1 (viral class) ≈ 0.36**
|
||||
- **ROC AUC ≈ 0.75**
|
||||
|
||||
The ROC AUC of 0.75 indicates decent **ranking ability**: the model tends to assign higher probabilities to truly viral articles than to non‑viral ones. However, at the default 0.5 threshold it generates many false positives; tuning the probability threshold would be necessary in practice depending on the business trade‑off between missing viral content and wasting attention on non‑viral items.
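
The size of the false‑positive problem can be read off the test‑set confusion matrix in the console output (`[[10669, 4023], [406, 1230]]`, rows = actual, columns = predicted). The sketch below recomputes the headline metrics from those four counts; the variable names are illustrative.

```python
# test-set confusion matrix from the Model 4 console output
tn, fp = 10669, 4023
fn, tp = 406, 1230

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)             # share of "viral" predictions that are right
recall = tp / (tp + fn)                # share of viral articles that are caught
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / total  # always predicting "non-viral"

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} baseline={majority_baseline:.3f}")
```

Roughly three quarters of viral articles are caught, but fewer than one in four “viral” predictions is correct. That is the trade‑off a tuned threshold would shift.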
|
||||
|
||||
---
|
||||
|
||||
### 2.6 Clustering Model 5 – K‑means on time‑series shapes
|
||||
|
||||
To understand typical growth trajectories of popularity, we cluster early time‑series patterns.
|
||||
|
||||
- **Features:** TS1–TS50, standardized with `StandardScaler`.
|
||||
- **Sample:** random subset of 5,000 economy+Facebook articles to keep computation manageable.
|
||||
- **Algorithm:** `KMeans(n_clusters=3, n_init=10, random_state=42)`.
|
||||
|
||||
**Results:**
|
||||
|
||||
- **Silhouette score ≈ 0.97**, indicating well‑separated clusters (although this is largely driven by one dominant cluster of 4,978 articles versus two tiny ones).
|
||||
- Cluster sizes and mean final Facebook shares:
|
||||
|
||||
| cluster | count | mean shares | median | max |
| ------: | ----: | ----------: | -----: | ----: |
| 0 | 4,978 | ~37 | 3 | 7,045 |
| 1 | 1 | 1,886 | 1,886 | 1,886 |
| 2 | 21 | ~2,478 | 1,291 | 8,010 |
|
||||
|
||||
Inspecting centroid time‑series (TS1, TS10, TS25, TS50):
|
||||
|
||||
- **Cluster 0:** low TS1 (~0.3), slow growth, TS50 ≈ 17 → “normal/low popularity” baseline; almost all articles.
|
||||
- **Cluster 2:** TS1 ≈ 23, TS10 ≈ 211, TS50 ≈ 1,388 → early rapid take‑off and sustained growth; these are clearly **viral** trajectories.
|
||||
- **Cluster 1:** single extreme **super‑viral** outlier with TS1 ≈ TS50 ≈ 1,886 (essentially all of its shares arrived within the first slice).
|
||||
|
||||
Clustering therefore uncovers distinct popularity regimes: ordinary stories, viral stories, and rare super‑viral events.
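
One way to use these centroids operationally is to label a new article by its nearest centroid. The sketch below is a simplification and should be read as such: it uses only the four centroid coordinates reported above (TS1, TS10, TS25, TS50) and compares them on a `log1p` scale, whereas the fitted `KMeans` model works on all 50 standardized slices. The function name and the example trajectories are hypothetical.

```python
import math

# centroid values (TS1, TS10, TS25, TS50) from the cluster centroid summary
CENTROIDS = {
    0: (0.30, 2.96, 7.84, 17.22),         # baseline
    1: (1885.0, 1886.0, 1886.0, 1886.0),  # single super-viral outlier
    2: (22.76, 211.14, 579.05, 1387.62),  # viral
}

def nearest_cluster(trajectory):
    """Return the cluster whose centroid is closest in log1p space (a sketch, not KMeans.predict)."""
    logged = [math.log1p(v) for v in trajectory]

    def distance(centroid):
        return math.dist(logged, [math.log1p(c) for c in centroid])

    return min(CENTROIDS, key=lambda k: distance(CENTROIDS[k]))

print(nearest_cluster((0, 2, 5, 12)))         # slow-growing article
print(nearest_cluster((30, 250, 600, 1500)))  # fast take-off
```

The log scale is a design choice made here so that the comparison is not dominated by the largest slice values; a production version would more plausibly reuse the fitted scaler and `KMeans` object.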
|
||||
|
||||
---
|
||||
|
||||
## 3. Decisions and Practical Use
|
||||
|
||||
### 3.1 What do the models tell us?
|
||||
|
||||
**1. Static metadata is not enough for precise prediction.**
|
||||
|
||||
Model 1, using only topic, time and sentiment, explains only about 16% of the variance in log‑Facebook popularity. The EDA already indicated weak correlations between sentiment and engagement, and the model confirms that topic is the only strong static predictor. This means:
|
||||
|
||||
- Before any user feedback is observed, we can form only a rough guess about popularity (e.g., “obama stories tend to do better”), but detailed predictions are unreliable.
|
||||
|
||||
**2. Early engagement is the key signal.**
|
||||
|
||||
Models 2 and 3 show that once ~17 hours of Facebook feedback (TS1–TS50) are available:
|
||||
|
||||
- Random forests can explain ~75% of the variance in final log‑popularity.
|
||||
- PCA compresses the 50‑dimensional TS inputs to 10 components with essentially no loss in performance.
|
||||
|
||||
In practice, this means that **monitoring early time‑series of shares is crucial**. Stories that are already accumulating shares quickly by TS50 are extremely likely to end up as the most popular items after two days.
|
||||
|
||||
**3. Logistic regression is useful for ranking, not for definitive labels.**
|
||||
|
||||
The viral vs non‑viral classifier has:
|
||||
|
||||
- Good ranking ability (ROC AUC ~0.75).
|
||||
- A modest F1 score (≈ 0.36) and lower accuracy (≈ 0.73) than the majority baseline (≈ 0.90).
|
||||
|
||||
This makes it better suited as a **priority score** than as a hard decision rule. For example, an editorial team might sort draft stories by predicted viral probability to decide where to invest additional editorial resources, but should not automatically discard stories predicted to be non‑viral.
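
Used as a priority score, the classifier's output only needs to order stories, not label them. A minimal sketch with made‑up headlines and probabilities follows; in practice the scores would come from something like `clf.predict_proba(X)[:, 1]`.

```python
def top_k_by_score(scored_items, k):
    """Return the k items with the highest score, best first."""
    return sorted(scored_items, key=lambda item: item[1], reverse=True)[:k]

# hypothetical drafts with hypothetical predicted viral probabilities
drafts = [
    ("economy: rate decision", 0.31),
    ("obama: press conference", 0.72),
    ("microsoft: earnings", 0.44),
    ("palestine: talks resume", 0.38),
]

for title, score in top_k_by_score(drafts, k=2):
    print(f"{score:.2f}  {title}")
```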
|
||||
|
||||
**4. Clustering uncovers growth archetypes.**
|
||||
|
||||
K‑means reveals three typical growth shapes:
|
||||
|
||||
1. Slow/low growth (most items).
|
||||
2. Clearly viral trajectories.
|
||||
3. A tiny number of super‑viral events.
|
||||
|
||||
Recognizing that an article’s early TS pattern matches the viral or super‑viral cluster can trigger decisions such as:
|
||||
|
||||
- Featuring the article more prominently on the homepage.
|
||||
- Allocating budget for promoted posts.
|
||||
- Producing follow‑up content while interest is high.
|
||||
|
||||
### 3.2 How useful are these models for real decisions?
|
||||
|
||||
A practical decision workflow informed by this analysis could be:
|
||||
|
||||
1. **Pre‑publication / immediately at publication**
|
||||
|
||||
Use the logistic regression model and static features (topic, sentiment, time) to assign each new article a baseline probability of becoming viral. This can help prioritize which stories to monitor more closely, but should not be the sole basis for publication decisions.
|
||||
|
||||
2. **Early post‑publication (first few hours)**
|
||||
|
||||
Once some time‑slice information is available (TS1–TS10), use clustering to see whether the article’s early trajectory resembles known viral patterns. Articles already in the viral cluster are good candidates for early promotion.
|
||||
|
||||
3. **Mid‑window (around TS50)**
|
||||
|
||||
At ~16–17 hours, feed TS1–TS50 into the PCA + random forest regressor (Model 3) to estimate final reach. This estimate can guide decisions about:
|
||||
|
||||
- How long to keep the story on front pages.
|
||||
- Whether to schedule follow‑ups or derivative content.
|
||||
- Where to allocate marketing/promotional resources.
|
||||
|
||||
4. **Limitations**
|
||||
|
||||
- Popularity is still highly stochastic; even with R² ≈ 0.75 in the best case, there is considerable residual uncertainty.
|
||||
- Models trained on this dataset focus on four specific topics and a particular time period (2015–2016). Performance may degrade when applied to different domains, languages or time spans. ([arXiv][1])
|
||||
|
||||
Overall, these models are best used for **relative ranking and triage**, that is, for deciding which articles deserve extra attention, rather than for exact point predictions of future share counts. Combining static features, early engagement signals, and growth‑pattern clustering yields a practical decision support tool for newsrooms and social media teams working with limited resources.
|
||||
|
||||
If you actually read this far...nice! :D
|
||||
|
||||
[1]: https://arxiv.org/abs/1801.07055 "Multi-Source Social Feedback of Online News Feeds"
|
||||
Binary file not shown.
@@ -0,0 +1,363 @@
|
||||
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    r2_score,
    root_mean_squared_error,
    accuracy_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    silhouette_score,
)
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

# ensure imgs dir exists
os.makedirs("imgs", exist_ok=True)

# data loading

zip_path = "news+popularity+in+multiple+social+media+platforms.zip"

with zipfile.ZipFile(zip_path, "r") as zf:
    with zf.open("Data/News_Final.csv") as f:
        news = pd.read_csv(f)

# basic cleaning

pop_cols = ["Facebook", "GooglePlus", "LinkedIn"]

# encode -1 as missing
for col in pop_cols:
    news.loc[news[col] < 0, col] = np.nan

# convert publishdate and add numeric time feature
news["PublishDate"] = pd.to_datetime(news["PublishDate"])
news["DaysSinceEpoch"] = (
    news["PublishDate"] - pd.Timestamp("1970-01-01")
).dt.days

# log transform facebook popularity where available
news["log_Facebook"] = np.log1p(news["Facebook"])
|
||||
|
||||
# eda helpers (optional plotting)
|
||||
|
||||
|
||||
def plot_eda():
    plt.figure()
    vals = news["Facebook"].dropna()
    vals = vals[vals > 0]
    vals.plot.hist(bins=50)
    plt.xlabel("facebook shares")
    plt.ylabel("count")
    plt.title("distribution of facebook popularity")
    plt.xscale("log")
    plt.tight_layout()
    plt.savefig("imgs/eda_facebook_hist.png")
    plt.close()

    plt.figure()
    news["log_Facebook"].dropna().plot.hist(bins=50)
    plt.xlabel("log1p(facebook shares)")
    plt.ylabel("count")
    plt.title("distribution of log-transformed facebook popularity")
    plt.tight_layout()
    plt.savefig("imgs/eda_log_facebook_hist.png")
    plt.close()

    mean_by_topic = (
        news.groupby("Topic")["log_Facebook"].mean().sort_values()
    )
    plt.figure()
    mean_by_topic.plot(kind="bar")
    plt.ylabel("mean log1p(facebook shares)")
    plt.title("average facebook popularity by topic")
    plt.tight_layout()
    plt.savefig("imgs/eda_mean_by_topic.png")
    plt.close()

    sample = news.dropna(
        subset=["log_Facebook", "SentimentTitle"]
    ).sample(5000, random_state=42)
    plt.figure()
    plt.scatter(
        sample["SentimentTitle"],
        sample["log_Facebook"],
        alpha=0.3,
    )
    plt.xlabel("sentimenttitle")
    plt.ylabel("log1p(facebook shares)")
    plt.title("title sentiment vs facebook popularity (sample)")
    plt.tight_layout()
    plt.savefig("imgs/eda_sentiment_vs_popularity.png")
    plt.close()
|
||||
|
||||
# model 1: linear regression
|
||||
|
||||
|
||||
def run_model_1():
    df = news.dropna(subset=["log_Facebook"]).copy()

    X = df[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]]
    X = pd.get_dummies(X, columns=["Topic"], drop_first=True)
    y = df["log_Facebook"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)

    print("model 1 – linear regression")
    print("r2:", r2)
    print("rmse:", rmse)
    print("coefficients:")
    print(pd.Series(linreg.coef_, index=X.columns))

    # optional diagnostic plot
    plt.figure()
    plt.scatter(y_test, y_pred, alpha=0.3)
    plt.xlabel("actual log1p(facebook)")
    plt.ylabel("predicted log1p(facebook)")
    plt.title("model 1: actual vs predicted")
    plt.tight_layout()
    plt.savefig("imgs/model1_actual_vs_predicted.png")
    plt.close()

    return linreg, (X_test, y_test, y_pred)
|
||||
|
||||
# prepare economy + facebook time-slice data
|
||||
|
||||
|
||||
with zipfile.ZipFile(zip_path, "r") as zf:
    with zf.open("Data/Facebook_Economy.csv") as f:
        fb_econ = pd.read_csv(f)

# ensure integer id for join
news["IDLink_int"] = news["IDLink"].astype(int)

news_econ = news[news["Topic"] == "economy"].copy()
news_econ["IDLink_int"] = news_econ["IDLink"].astype(int)

fb_econ_merged = fb_econ.merge(
    news_econ, left_on="IDLink", right_on="IDLink_int", how="inner"
)

# clean time-slice features
ts_cols = [c for c in fb_econ.columns if c.startswith("TS")]
for col in ts_cols:
    fb_econ_merged.loc[fb_econ_merged[col] < 0, col] = 0

# drop rows with missing facebook target
fb_econ_merged = fb_econ_merged[fb_econ_merged["Facebook"].notna()].copy()
fb_econ_merged["log_Facebook"] = np.log1p(fb_econ_merged["Facebook"])

ts_cols_early = ts_cols[:50]
|
||||
|
||||
# model 2: random forest on raw early ts
|
||||
|
||||
|
||||
def run_model_2():
    X = fb_econ_merged[ts_cols_early + ["SentimentTitle", "SentimentHeadline"]]
    y = fb_econ_merged["log_Facebook"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    rf = RandomForestRegressor(
        n_estimators=120,
        random_state=42,
        n_jobs=-1,
        max_depth=None,
        min_samples_leaf=2,
    )
    rf.fit(X_train, y_train)

    # evaluate the forest trained on the raw features;
    # the scaler + PCA step belongs to model 3 only
    y_pred = rf.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)

    print("model 2 – random forest on raw ts")
    print("r2:", r2)
    print("rmse:", rmse)

    importances = pd.Series(rf.feature_importances_, index=X.columns)
    print("top importances:")
    print(importances.sort_values(ascending=False).head(10))

    return rf, (X_test, y_test, y_pred)
|
||||
|
||||
# model 3: pca + random forest
|
||||
|
||||
|
||||
def run_model_3():
    ts = fb_econ_merged[ts_cols_early]
    sent = fb_econ_merged[["SentimentTitle", "SentimentHeadline"]]
    X = pd.concat([ts, sent], axis=1)
    y = fb_econ_merged["log_Facebook"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train[ts_cols_early])
    X_test_scaled = scaler.transform(X_test[ts_cols_early])

    pca = PCA(n_components=10, random_state=42)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    train_sent = X_train[["SentimentTitle", "SentimentHeadline"]].values
    test_sent = X_test[["SentimentTitle", "SentimentHeadline"]].values

    X_train_final = np.hstack([X_train_pca, train_sent])
    X_test_final = np.hstack([X_test_pca, test_sent])

    rf = RandomForestRegressor(
        n_estimators=120,
        random_state=42,
        n_jobs=-1,
        max_depth=None,
        min_samples_leaf=2,
    )
    rf.fit(X_train_final, y_train)
    y_pred = rf.predict(X_test_final)

    r2 = r2_score(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)

    print("model 3 – random forest on pca(ts)")
    print("r2:", r2)
    print("rmse:", rmse)
    print("pca variance explained (first 10):", pca.explained_variance_ratio_)
    print("total variance explained:", pca.explained_variance_ratio_.sum())

    return rf, (X_test, y_test, y_pred), (pca, scaler)
|
||||
|
||||
# model 4: logistic regression (viral vs non-viral)
|
||||
|
||||
|
||||
def run_model_4():
    df = news.copy()
    df = df[df["Facebook"].notna()].copy()

    threshold = df["Facebook"].quantile(0.9)
    df["viral_fb"] = (df["Facebook"] >= threshold).astype(int)

    X = df[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]]
    X = pd.get_dummies(X, columns=["Topic"], drop_first=True)
    y = df["viral_fb"]

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,
        random_state=42,
        stratify=y,
    )

    clf = LogisticRegression(
        max_iter=500,
        class_weight="balanced",
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)
    cm = confusion_matrix(y_test, y_pred)

    print("model 4 – logistic regression (viral vs non-viral)")
    print("threshold (shares):", threshold)
    print("accuracy:", acc)
    print("f1 (positive class):", f1)
    print("roc auc:", auc)
    print("confusion matrix:\n", cm)

    return clf, (X_test, y_test, y_pred, y_proba)
|
||||
|
||||
# model 5: k-means clustering on ts shapes
|
||||
|
||||
|
||||
def run_model_5():
    X = fb_econ_merged[ts_cols_early].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    rng = np.random.RandomState(42)
    idx = rng.choice(X_scaled.shape[0], size=5000, replace=False)
    X_sample = X_scaled[idx]
    fb_sample = fb_econ_merged["Facebook"].values[idx]

    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    kmeans.fit(X_sample)
    labels = kmeans.labels_

    sil = silhouette_score(X_sample, labels)
    print("model 5 – kmeans on ts shapes")
    print("silhouette score:", sil)

    cluster_df = pd.DataFrame(
        {"cluster": labels, "Facebook": fb_sample}
    )
    print(cluster_df.groupby("cluster")["Facebook"].agg(
        ["count", "mean", "median", "max"]
    ))

    centers_scaled = kmeans.cluster_centers_
    centers = scaler.inverse_transform(centers_scaled)
    centers_df = pd.DataFrame(centers, columns=ts_cols_early)

    summary = pd.DataFrame({
        "cluster": list(range(centers_df.shape[0])),
        "avg_ts": centers_df.mean(axis=1),
        "ts1": centers_df["TS1"],
        "ts10": centers_df["TS10"],
        "ts25": centers_df["TS25"],
        "ts50": centers_df["TS50"],
    })
    print("cluster centroid summary:\n", summary)

    return kmeans, scaler, summary
|
||||
|
||||
|
||||
if __name__ == "__main__":
    run_model_1()
    run_model_2()
    run_model_3()
    run_model_4()
    run_model_5()
    plot_eda()
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 19 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 23 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 23 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 130 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 80 KiB |
Binary file not shown.
@@ -0,0 +1,53 @@
|
||||
model 1 – linear regression
|
||||
r2: 0.1566089012155698
|
||||
rmse: 1.8625218879551908
|
||||
coefficients:
|
||||
SentimentTitle -0.383499
|
||||
SentimentHeadline -0.064708
|
||||
DaysSinceEpoch -0.000678
|
||||
Topic_microsoft 0.101848
|
||||
Topic_obama 1.779152
|
||||
Topic_palestine 0.023738
|
||||
dtype: float64
|
||||
model 2 – random forest on raw ts
|
||||
r2: 0.7441325592979975
|
||||
rmse: 0.8661035218490399
|
||||
top importances:
|
||||
TS50 0.810814
|
||||
SentimentHeadline 0.099992
|
||||
SentimentTitle 0.067386
|
||||
TS49 0.001883
|
||||
TS48 0.000589
|
||||
TS15 0.000503
|
||||
TS18 0.000503
|
||||
TS13 0.000498
|
||||
TS24 0.000498
|
||||
TS10 0.000480
|
||||
dtype: float64
|
||||
model 3 – random forest on pca(ts)
|
||||
r2: 0.7442278904925559
|
||||
rmse: 0.8659421602173341
|
||||
pca variance explained (first 10): [9.38529911e-01 3.24317512e-02 1.76049987e-02 7.50439628e-03
|
||||
1.90148973e-03 6.83679307e-04 3.57135169e-04 2.12058930e-04
|
||||
1.33577763e-04 9.66846072e-05]
|
||||
total variance explained: 0.9994556829781833
|
||||
model 4 – logistic regression (viral vs non-viral)
|
||||
threshold (shares): 214.0
|
||||
accuracy: 0.7287481626653601
|
||||
f1 (positive class): 0.35709101466105386
|
||||
roc auc: 0.7530964866530827
|
||||
confusion matrix:
|
||||
[[10669 4023]
|
||||
[ 406 1230]]
|
||||
model 5 – kmeans on ts shapes
|
||||
silhouette score: 0.9732852082508215
|
||||
count mean median max
|
||||
cluster
|
||||
0 4978 36.751708 3.0 7045.0
|
||||
1 1 1886.000000 1886.0 1886.0
|
||||
2 21 2477.761905 1291.0 8010.0
|
||||
cluster centroid summary:
|
||||
cluster avg_ts ts1 ts10 ts25 ts50
|
||||
0 0 8.317766 0.297710 2.959221 7.836079 17.221977
|
||||
1 1 1885.920000 1885.000000 1886.000000 1886.000000 1886.000000
|
||||
2 2 640.917143 22.761905 211.142857 579.047619 1387.619048
|
||||
@@ -0,0 +1,25 @@
|
||||
Conduct the following analysis for the dataset:
|
||||
1. Exploratory Data Analysis
|
||||
Explore the statistical aspects of the dataset. Analyze the
|
||||
distributions and provide summaries of the relevant statistics. Perform any cleaning,
|
||||
transformations, interpolations, smoothing, outlier detection/ removal, etc. required on the
|
||||
data. Include figures and descriptions of this exploration and a short description of what
|
||||
you concluded (e.g. nature of distribution, indication of suitable model approaches you
|
||||
would try, etc.) Min. 1 page of text + graphics (required).
|
||||
|
||||
2. Model Development, Validation and Optimization
|
||||
Develop and evaluate three (4000-level) or four (6000-level) or more models. If possible,
|
||||
these models should cover more than one objective, i.e. regression, classification,
|
||||
clustering. Consider the effect of dimension reduction of the dataset on model
|
||||
performance. Different models means different combinations of an algorithm and a
|
||||
formula (input and output features). The choice of independent and response variables is
|
||||
up to you. Explain why you chose them. Construct the models, test/ validate them. Briefly explain the
|
||||
validation approach. You can use any method(s) covered in the course. Include your code
|
||||
in your submission. Compare model results if applicable. Report the results of the model
|
||||
(fits, coefficients, sample trees, other measures of fit/ importance, etc., predictors and
|
||||
summary statistics). Min. 2 pages of text + graphics (required).
|
||||
|
||||
3. Decisions
|
||||
Describe your conclusions from the model
|
||||
fits, predictions and how well (or not) it could be used for decisions and why. Min. 1/2 page
|
||||
of text + graphics.
|
||||
Binary file not shown.
File diff suppressed because it is too large