13 Commits

Author SHA1 Message Date
ION606 6afdd43fb8 added presentation 2025-12-08 18:16:50 -05:00
ION606 b3eccdab43 small fixes 2025-12-08 17:51:50 -05:00
ION606 a9f73e4314 added presentation 2025-12-08 17:38:01 -05:00
ION606 091831c67c update 2025-12-08 17:04:32 -05:00
ION606 5581493cc0 what 2025-12-08 13:51:18 -05:00
ION606 4f6434ff72 added assignment VI 2025-12-05 19:59:00 -05:00
ION606 2667c06e09 updates 2025-12-04 13:07:25 -05:00
ION606 fa9a358415 added Assignment IV 2025-11-22 15:25:17 -05:00
ION606 18a911f9d3 added lab 5 2025-11-04 21:00:48 -05:00
ION606 414a4ac5a3 god I am DUMB 2025-11-04 17:43:39 -05:00
ION606 4eff5a6378 merge 2025-11-04 17:34:25 -05:00
ION606 5adb4119f5 Merge branch 'transfer' of https://git.ion606.com/ION606/Data-Analytics into transfer 2025-11-04 17:34:00 -05:00
ION606 cd3ababd59 added lab 5 2025-11-04 17:33:38 -05:00
42 changed files with 609197 additions and 0 deletions
+2
@@ -0,0 +1,2 @@
data/
1.json
+4
@@ -0,0 +1,4 @@
{
"r.linting.lineLength": false,
"r.editor.tabSize": 4,
}
@@ -0,0 +1,192 @@
# Data Analytics Fall 2025 Assignment IV
## Measuring How Generative AI Adoption Reshaped Stack Overflow Participation 2018–2025
Itamar Oren-Naftalovich (6000-Level)
---
## 1. Abstract and Introduction
On 30 November 2022, ChatGPT became publicly available. Within days, the Stack Overflow community faced two major shocks: developers suddenly had a new source of code-specific answers, and Stack Overflow introduced a temporary ban on AI-generated content on 5 December 2022 while already struggling with limited (and often terrible) moderation capacity. In this project I will look at whether the combination of generative AI adoption and these policy changes produced a statistically detectable shift in Stack Overflow content creation, and whether developers who say they use ChatGPT still treat Stack Overflow as a daily resource.
My initial hypothesis was that monthly answer counts would show a break after the ChatGPT launch and AI policy ban, even after controlling for the pre-2022 downward trend. I also expected that respondents who explicitly name ChatGPT as an AI assistant would be less likely to visit Stack Overflow daily. To test these ideas, I built two complementary datasets:
1. A Stack Exchange Data Explorer (SEDE) export of monthly deleted and non-deleted answers from January 2018 through November 2025
2. Microdata from the 2023 and 2024 Stack Overflow Developer Surveys, which record both visit frequency and generative AI usage
If you want to see *how* I did this (the code), see `analysis.r`.
The analysis relies on four modeling strategies: an interrupted time-series (ITS) linear regression, a Poisson regression for counts, a seasonal ARIMA model trained only on pre-ChatGPT data, and a logistic regression relating survey-reported AI usage to daily Stack Overflow visitation. Taken together, these models indicate that Stack Overflow answer production fell by more than 53% in the post-ChatGPT period (mean 90.5 vs. 193.0 answers per month). At the same time, daily visitors skew toward younger age cohorts (respondents aged 35 and over are less likely to visit daily), and survey respondents who explicitly mention ChatGPT do not differ meaningfully from others in how often they visit the site. The following sections describe the datasets, exploratory patterns, modeling choices, and implications for the community.
---
## 2. Data Description and Preliminary Analysis
### 2.1 Stack Overflow Answer Volume (Dataset 1)
* **Source and Scope**
`data/so_new_answers_per_month_2018_2025.csv` is a SEDE export of every new answer (deleted and non-deleted) by month from January 2018 through November 2025 (95 monthly observations). The script standardizes month formats, aggregates across deletion statuses, and adds indicators for the ChatGPT release (30 Nov 2022), the AI policy ban (5 Dec 2022), and the Stack Exchange moderator strike (5 Jun–7 Aug 2023).
* **Variables**
After cleaning, the main table `answers_monthly` contains `answers_total`, `answers_non_deleted`, `answers_deleted`, calendar year and month, a sequential `time_index`, binary indicators for the events listed above, and a categorical `period` flagging pre- vs. post-ChatGPT months. A 3-month moving average (`answers_ma3`) is computed to smooth short-term noise for exploratory plots.
* **Quality Checks**
Duplicate rows were removed by grouping on `month`, and all transformations are recorded in `out.log`. The only missing values arise in the first two moving-average entries, which plotting functions simply omit. Because SEDE distinguishes deleted from non-deleted answers, the analysis keeps both so that any changes in moderation are visible in the time series.
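
The indicator and smoothing steps described above can be sketched in a few lines of base R. This is a minimal illustration on made-up counts, not the code in `analysis.r`: the data frame `toy` and its values are hypothetical, and a trailing (right-aligned) window is assumed because only the first two moving-average entries are missing.

```r
# Hypothetical monthly counts around the ChatGPT launch (values are invented).
toy <- data.frame(
  month = seq(as.Date("2022-09-01"), by = "month", length.out = 6),
  answers_total = c(150, 144, 138, 120, 111, 105)
)

chatgpt_launch_date <- as.Date("2022-11-30")

# Months are stored as first-of-month dates, so November 2022 counts as "pre".
toy$post_chatgpt <- toy$month >= chatgpt_launch_date
toy$period <- ifelse(toy$post_chatgpt, "post_chatgpt", "pre_chatgpt")

# Trailing 3-month mean; stats::filter() leaves the first two entries as NA.
toy$answers_ma3 <- as.numeric(
  stats::filter(toy$answers_total, rep(1 / 3, 3), sides = 1)
)

print(toy)
```

Calling `stats::filter()` with its namespace matters once `dplyr` is attached, because `dplyr::filter()` masks the base function.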
![Figure 1. Monthly Stack Overflow answers with ChatGPT (dashed) and AI policy (dotted) markers.](imgs/01_answers_ts.png)
*Figure 1. Monthly answer counts follow a long downward trend that becomes steeper after November 2022.*
![Figure 2. Distribution of monthly answers pre- vs. post-ChatGPT.](imgs/02_box_pre_post.png)
*Figure 2. Box plots highlight the magnitude of the drop between the pre- and post-ChatGPT regimes.*
**Table 1. Descriptive Statistics by Regime (source: `data/answers_summary_period.csv`)**
| period | n_months | mean_answers | median_answers | sd_answers | min_answers | max_answers |
| ------------ | -------- | ------------ | -------------- | ---------- | ----------- | ----------- |
| pre_chatgpt | 59 | 193.0 | 185 | 44.7 | 122 | 313 |
| post_chatgpt | 36 | 90.5 | 88 | 38.0 | 11 | 157 |
A quick comparison of the six months immediately before and after 30 November 2022 shows only a 10.0% change in average answers, suggesting that the full 53% decline in Table 1 unfolded gradually across 2023–2025 rather than occurring instantly. This gradual pattern is one reason for using time-series models instead of treating the policy change as a simple before/after difference.
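
As a quick check on the headline figure, the relative decline can be recomputed from the two regime means in Table 1. The variable names below are my own and only the two means come from the table.

```r
# Regime means taken from Table 1.
mean_pre  <- 193.0
mean_post <- 90.5

# Relative decline of the post-ChatGPT mean against the pre-ChatGPT mean.
pct_decline <- (mean_pre - mean_post) / mean_pre * 100
round(pct_decline, 1)
```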
### 2.2 Stack Overflow Developer Survey (Dataset 2)
* **Source and Scope.**
The second dataset uses the publicly released 2023 and 2024 Stack Overflow Developer Survey microdata (`stack-overflow-developer-survey-2023.zip` and `stack-overflow-developer-survey-2024.zip`, downloaded 19 November 2025). Combined, these files contain 146,676 responses from professional and hobbyist developers worldwide.
* **Schema Harmonization**
Column names differ slightly across years (for example, `SOAI` vs. `AISelect`), so helper functions search for the first matching column for each concept. The harmonized frame retains `year`, `main_branch`, `country`, numeric `age`, `gender`, reported Stack Overflow visit frequency (`so_visit`), and free-text AI assistant preferences (`ai_select`).
* **Feature Engineering**
Two binary indicators are constructed: `frequent_so` (1 if the respondent reports visiting Stack Overflow daily or multiple times per day) and `uses_chatgpt` (1 if the string “ChatGPT” appears anywhere in `ai_select`). Age is grouped into buckets (`<25`, `25-34`, `35-44`, `45+`, `unknown`), and gender is collapsed into a simplified label to absorb inconsistent free-text entries.
* **Sample Considerations**
Only 1,181 respondents in 2023 explicitly mention ChatGPT, and almost none do in 2024, because the 2024 instrument asks about AI search preferences rather than naming specific tools. This change in wording is treated as a measurement artifact and revisited as a source of bias in Sections 3 and 4.
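
The feature-engineering rules above can be illustrated with a small base R sketch. The rows in `toy_survey` are invented and the response strings are only plausible examples, so treat this as an assumption-laden illustration rather than the exact logic in `analysis.r`.

```r
# Hypothetical survey rows; the response strings are illustrative only.
toy_survey <- data.frame(
  so_visit = c("Multiple times per day", "Daily or almost daily",
               "A few times per week", "Less than once per month or monthly"),
  ai_select = c("I mostly use ChatGPT", NA, "chatgpt and Copilot", "No"),
  age = c(22, 31, 40, NA),
  stringsAsFactors = FALSE
)

# 1 when the visit-frequency text signals daily (or more frequent) visits.
toy_survey$frequent_so <- as.integer(
  grepl("multiple times per day|daily", toy_survey$so_visit, ignore.case = TRUE)
)

# 1 when "ChatGPT" appears anywhere in the AI field; missing text counts as 0.
toy_survey$uses_chatgpt <- as.integer(
  !is.na(toy_survey$ai_select) &
    grepl("chatgpt", toy_survey$ai_select, ignore.case = TRUE)
)

# Age buckets; right = FALSE gives [0,25), [25,35), [35,45), [45,Inf).
toy_survey$age_group <- as.character(cut(
  toy_survey$age,
  breaks = c(0, 25, 35, 45, Inf),
  labels = c("<25", "25-34", "35-44", "45+"),
  right = FALSE
))
toy_survey$age_group[is.na(toy_survey$age_group)] <- "unknown"

print(toy_survey)
```

Because the ChatGPT flag is a plain substring match, it can only be as good as the underlying question: a field that never names specific tools will produce zeros regardless of actual usage.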
---
## 3. Exploratory Analysis
### 3.1 Seasonal and Trend Patterns in Answer Volume
The `answers_monthly` series preserves the familiar seasonal dip every December, but the overall level shifts downward after 2022. As Figure 3 shows, even typically slow months such as July now fall below 60 answers, compared with roughly 150–220 answers in earlier years.
![Figure 3. Seasonality of Stack Overflow answers by calendar month and period.](imgs/03_seasonal.png)
*Figure 3. Post-ChatGPT seasons follow a similar seasonal shape but sit on a much lower baseline.*
The 3-month moving average in Figure 4 provides additional context. It peaks near 210 answers in mid-2018, drifts below 150 answers by late 2021, crosses under 100 answers in August 2023, and reaches about 23 answers by November 2025. The timing of the Stack Exchange moderator strike (June–August 2023) aligns with the first extended period below 100 answers per month, hinting at compounding effects from generative AI substitution and reduced moderation capacity.
![Figure 4. Raw answers (faint) vs. 3-month moving average.](imgs/04_ma3.png)
*Figure 4. The smoothed series marks a clear structural break soon after the ChatGPT launch and policy ban.*
### 3.2 Survey Signals on Engagement and AI Adoption
A stacked bar chart (Figure 5) summarizes how daily Stack Overflow visitation relates to ChatGPT usage. In 2023, daily visitation rates are essentially identical for explicit ChatGPT users (39.1%) and non-users (also 39.1%), suggesting that early adopters of ChatGPT continued to visit Stack Overflow at similar rates while experimenting with AI. By 2024, daily visitation among respondents who *do not* mention ChatGPT falls to 37.3%. The near-absence of explicit ChatGPT mentions that year, however, is driven by the different survey question wording rather than a real disappearance of the tool. This reinforces the idea that self-reported tool usage is noisy and needs to be combined with behavioral indicators like monthly answer counts.
![Figure 5. Share of respondents visiting Stack Overflow daily, split by ChatGPT usage, 2023–2024.](imgs/08_survey_bar.png)
*Figure 5. Small differences between groups and across years illustrate how limited the AI usage field is for explaining engagement.*
### 3.3 Sources of Uncertainty and Bias
Several sources of uncertainty shape the analysis:
* **Measurement bias.**
SEDE relies on Stack Overflow's internal logging. Deleted answers can be removed retroactively, so counts for the most recent months remain somewhat fluid.
* **Event alignment.**
The interrupted time-series design treats 30 November 2022 as the breakpoint between regimes, but the 2023 moderator strike and evolving AI policies create overlapping shocks that blur a clean “pre vs. post” distinction.
* **Survey sampling.**
The developer survey is voluntary, conducted in English, and heavily skewed toward respondents in North America and India. Age and tool usage are self-reported, and the 2024 wording change likely undercounts ChatGPT adoption.
* **Missingness.**
“Prefer not to say” responses in age and gender are mapped to `NA` or `Unknown`, which softens demographic differences in downstream models.
These limitations motivated the use of several modeling approaches in Section 4 instead of relying on a single model family.
---
## 4. Model Development and Application of Models
Each model addresses a slightly different question about Stack Overflow activity. All diagnostics and figures are produced directly by `analysis.r` and saved in `imgs/`.
### 4.1 Interrupted Time-Series Linear Regression
* **Specification.**
The primary linear model is
`answers_total ~ time + post_chatgpt + chatgpt_time`,
where `time` is the number of months since January 2018 and `chatgpt_time` resets to 1 in December 2022 to allow the post-ChatGPT slope to differ from the pre-ChatGPT trend.
* **Results.**
The model explains 71.7% of the variance (adjusted R² = 0.708, σ = 35.3). Before ChatGPT, monthly answers were already declining by 0.86 answers per month (p = 0.002). After November 2022, the slope becomes steeper by an additional 2.37 answers per month (p < 0.001). The immediate level change of 18 answers at the breakpoint is not statistically significant (p = 0.24).
* **Interpretation.**
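Rather than a sudden cliff, the data show an acceleration of an existing decline. The post-ChatGPT trend line loses about 2.4 extra answers each month relative to the pre-ChatGPT trajectory (roughly 3.2 per month in total), so after one year the monthly count sits roughly 28 answers below where the pre-ChatGPT trend alone would put it (2.37 × 12 ≈ 28), before counting the level shift.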
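
To make the ITS design concrete, the sketch below builds the three predictors on a synthetic, noise-free series and checks that `lm()` recovers the coefficients used to generate it. The intercept of 220 is an arbitrary assumption and the series is not the real data; only the coefficient magnitudes and the 59/36 month split mirror the report.

```r
# Synthetic, noise-free series built from coefficients of the same size as
# those reported above; the intercept of 220 is an arbitrary assumption.
n_months <- 95          # January 2018 through November 2025
n_pre    <- 59          # months before December 2022

time         <- seq_len(n_months)
post_chatgpt <- as.integer(time > n_pre)
chatgpt_time <- pmax(0, time - n_pre)   # 0 before the break, then 1, 2, 3, ...

answers_total <- 220 - 0.86 * time - 18 * post_chatgpt - 2.37 * chatgpt_time

its_fit <- lm(answers_total ~ time + post_chatgpt + chatgpt_time)
round(coef(its_fit), 2)

# Post-break slope is the sum of the baseline trend and the slope change.
post_slope <- coef(its_fit)[["time"]] + coef(its_fit)[["chatgpt_time"]]
post_slope
```

Keeping `post_chatgpt` (level change) and `chatgpt_time` (slope change) as separate terms is what lets the model distinguish a one-off drop from a steeper ongoing decline.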
![Figure 6. Observed vs. fitted answers under the interrupted time-series model.](imgs/05_lm_fit.png)
*Figure 6. The fitted line captures a gradual erosion in answer volume instead of a single large discontinuity.*
### 4.2 Poisson Regression for Count Data
* **Specification.**
The Poisson model uses the same predictors but applies a log link appropriate for count outcomes.
* **Results.**
The estimated multiplicative time effect before ChatGPT is `exp(time) = 0.996` (p < 0.001), corresponding to a 0.4% monthly contraction. After the release, the effective slope multiplier drops to 0.968 (p ≈ 2.7 × 10⁻⁶⁸), implying a 3.2% shrinkage per month. The residual deviance is 713.8 on 91 degrees of freedom, compared with a null deviance of 2,879.9.
* **Interpretation.**
Expressed in percentage terms, the Poisson model tells a similar story to the linear ITS: by late 2025, the expected answer count decays toward single digits if post-2022 dynamics continue unchanged.
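
The percentage interpretation follows directly from the reported multipliers, and the deviance figures allow a rough over-dispersion check. The sketch below only re-uses numbers from the results above. The ratio of residual deviance to degrees of freedom is a heuristic I am adding here, not a diagnostic reported by `analysis.r`.

```r
# Multipliers and deviance figures copied from the results above.
pre_multiplier  <- 0.996
post_multiplier <- 0.968
resid_deviance  <- 713.8
resid_df        <- 91

# A multiplier m on the count scale corresponds to a (1 - m) * 100 % monthly decline.
pct_decline_pre  <- (1 - pre_multiplier) * 100
pct_decline_post <- (1 - post_multiplier) * 100

# Months needed for the expected count to halve at the post-ChatGPT rate.
half_life_months <- log(0.5) / log(post_multiplier)

# Rough over-dispersion check: values well above 1 suggest the Poisson
# variance assumption is too tight.
dispersion_ratio <- resid_deviance / resid_df

round(c(pct_decline_pre, pct_decline_post, half_life_months, dispersion_ratio), 2)
```

A ratio this far above 1 would usually argue for reporting quasi-Poisson or negative-binomial standard errors alongside the Poisson ones, since the very small p-values depend on the Poisson variance assumption.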
![Figure 7. Poisson regression fit vs. observed counts.](imgs/06_pois_fit.png)
*Figure 7. The Poisson model slightly overestimates the lowest post-2024 points, consistent with over-dispersion in the counts.*
### 4.3 Seasonal ARIMA Forecasting (Pre-ChatGPT Baseline)
* **Specification.**
To estimate what would have happened without ChatGPT and related policy changes, a seasonal ARIMA model, ARIMA(1,1,0)(1,0,0)[12], is fit only to data through October 2022 (`train_ts`). The model then generates forecasts for November 2022–November 2025, which are compared to the actual counts.
* **Results.**
On the training window, fit statistics are solid (RMSE = 32.9, MASE = 0.52). Out-of-sample, however, errors grow large: RMSE = 89.3, MAE = 79.3, MAPE ≈ 172%, and Theil's U = 7.11. Observed counts soon fall below the 80% prediction interval and remain there, indicating that historical seasonality and trends alone cannot explain the post-2022 decline.
* **Interpretation.**
The ARIMA baseline functions as a counterfactual. Its consistent over-prediction of post-ChatGPT activity reinforces the conclusion that a structural break occurred, rather than a continuation of prior dynamics.
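
The out-of-sample comparison comes down to a few error metrics computed on forecast and actual vectors. The helper functions and the three-point toy series below are hypothetical and written in base R. They illustrate how the metrics are defined, not how `analysis.r` or the `forecast` package computes them.

```r
# Accuracy metrics for a counterfactual forecast; `actual` and `forecast`
# below are invented numbers, not the series from this report.
forecast_errors <- function(actual, forecast) {
  err <- actual - forecast
  c(
    rmse = sqrt(mean(err^2)),
    mae  = mean(abs(err)),
    mape = mean(abs(err / actual)) * 100
  )
}

# Share of observations that fall below the lower prediction bound.
share_below_lower <- function(actual, lower) {
  mean(actual < lower)
}

actual   <- c(100, 80, 50)
forecast <- c(110, 100, 100)
lower_80 <- c(95, 85, 80)

forecast_errors(actual, forecast)
share_below_lower(actual, lower_80)
```

Because MAPE divides each error by the actual value, it can exceed 100% when forecasts overshoot small observed counts, which is the situation in the late post-ChatGPT months.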
![Figure 8. ARIMA forecast (trained through Oct 2022) vs. actual counts.](imgs/07_arima_forecast.png)
*Figure 8. Actual activity diverges from the ARIMA forecast almost immediately and never returns to the predicted band.*
### 4.4 Logistic Regression on Survey Engagement
* **Specification.**
The survey-based model predicts `frequent_so` (daily or multiple-times-per-day visitor). Predictors that retain more than one level after cleaning are `uses_chatgpt`, `age_group`, and `year`. The harmonized `gender` variable collapses to a single `Unknown` level and is therefore dropped automatically. The data are split 80/20 into training and test sets with a fixed seed (123). The decision threshold is set to the training positive rate (0.384) to reduce the impact of class imbalance.
* **Results.**
Relative to respondents younger than 25, the odds ratio for the 25–34 group is 1.04 (p = 0.008), while the 35–44 and 45+ groups have odds ratios of 0.81 and 0.71, respectively (both p < 10⁻³²). The ChatGPT usage indicator has an odds ratio of 0.99 (p = 0.92), effectively indistinguishable from 1. The 2024 indicator yields an odds ratio of 0.92 (p ≈ 2.2 × 10⁻¹¹), pointing to a modest overall decline in daily visitation from 2023 to 2024.
On the 29,336-observation test set, the confusion matrix reports 7,560 true negatives, 10,549 false positives, 4,009 false negatives, and 7,218 true positives, giving an accuracy of 50.4%, precision of 40.6%, and recall of 64.3%.
* **Interpretation.**
The weak predictive performance and scarcity of explicit ChatGPT mentions in 2024 both suggest that the current survey instrument is not well suited for isolating the impact of AI usage on Stack Overflow engagement. Age shows a clearer pattern than AI usage: older cohorts are less likely to be daily visitors, while ChatGPT adoption, at least as self-reported in these surveys, does not significantly distinguish frequent users from others.
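
The classification metrics can be re-derived from the four confusion-matrix counts reported above. The sketch uses only those counts, and the variable names are my own.

```r
# Test-set counts copied from the results above.
tn <- 7560
fp <- 10549
fn <- 4009
tp <- 7218

n_test    <- tn + fp + fn + tp
accuracy  <- (tp + tn) / n_test
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

# Share of test respondents the model labels as daily visitors.
predicted_positive_rate <- (tp + fp) / n_test

round(c(accuracy = accuracy, precision = precision, recall = recall,
        predicted_positive_rate = predicted_positive_rate), 3)
```

The model labels well over half of the test set as daily visitors even though only about 38% are, which is what a threshold set at the base rate tends to produce when predicted probabilities cluster tightly around that rate.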
![Figure 9. Predicted probabilities of daily Stack Overflow usage by ChatGPT adoption.](imgs/09_logit_probs.png)
*Figure 9. Predicted probability distributions overlap heavily, mirroring the non-significant odds ratio for `uses_chatgpt`.*
---
## 5. Conclusions and Discussion
The evidence points to a structural break in Stack Overflow answer production beginning in late 2022. Average monthly answers drop from 193.0 in the January 2018–November 2022 period to 90.5 between December 2022 and November 2025. The interrupted time-series model shows the decline becoming steeper by about 2.37 answers per month after ChatGPT's release, and the Poisson model implies a post-ChatGPT decay rate of roughly 3.2% per month. ARIMA forecasts trained only on pre-ChatGPT data substantially overestimate post-2022 activity, which reinforces the conclusion that pre-existing seasonal and secular trends cannot account for the observed collapse.
The survey-based models provide more information about *who* remains active. Despite common assumptions that ChatGPT usage directly crowds out Stack Overflow visits, the current survey data do not show a strong link: the odds ratio for reported ChatGPT usage is essentially 1, and differences in daily visitation are driven more by age and year than by AI adoption. Given the 2024 wording change and the limitations of self-reported tool usage, it would be premature to claim that ChatGPT users as a group have already abandoned Stack Overflow.
Taken together, these findings suggest that any response from Stack Overflow should combine supply-side interventions (such as incentives for high-quality answers and additional moderation support to limit deleted content) with better measurement of how developers actually integrate AI tools and community Q&A into their workflows.
Future work could extend the time-series models with covariates for major product changes (e.g., Collectives, Discussions), incorporate question volume alongside answers, and revisit the survey analysis once the 2025 instrument becomes available. Causal impact methods, such as Bayesian structural time series using the ARIMA forecast as a prior, could offer a more formal estimate of the counterfactual number of answers that would have been produced without the post-2022 shocks.
---
## References
1. Stack Exchange Data Explorer. “New answers (deleted + non-deleted) per month,” query exported 19 Nov 2025 from the Stack Overflow SEDE interface.
2. Stack Overflow. “Stack Overflow Developer Survey 2023” and “Stack Overflow Developer Survey 2024,” datasets accessed 19 Nov 2025 from the Stack Overflow survey site.
3. OpenAI. “Introducing ChatGPT,” OpenAI Blog, 30 Nov 2022.
4. Stack Overflow Meta. “Temporary policy: ChatGPT is banned,” Meta Stack Overflow, 5 Dec 2022.
5. Stack Exchange. “Moderator Strike: Stack Overflow, Stack Exchange Network,” Meta Stack Exchange updates, Jun–Aug 2023.
Binary file not shown.
+171
@@ -0,0 +1,171 @@
### Slide 1 – Title
**Title:** Did Stack Overflow Answers Increase After ChatGPT?
- Changes in Stack Overflow answer activity post-ChatGPT launch
- Impact of related policy events
- Developer behavior balancing Stack Overflow vs. AI tools
---
### Slide 2 – Research Question
**Research Questions:**
1. Volume of answers:
- Did Stack Overflow answers change systematically after ChatGPT launched (late 2022)?
2. Policy/event impact:
- Did AI-answer policies and moderation events create additional shifts?
3. Substitution effect:
- Are heavy ChatGPT users visiting/answering less on Stack Overflow?
**Approach:**
- Look for structural breaks in answer time series
- Link site-level patterns to developer survey data
---
### Slide 3 – Data Sources
**Dataset 1:**
- Monthly new answer counts (2018–2025)
- Pulled from Stack Exchange Data Explorer
- Includes deleted posts
- Provides pre-ChatGPT baseline and post-event window
**Dataset 2:**
- Microdata from Stack Overflow Developer Surveys (2023–2025)
- Focus:
- Visit frequency
- Adoption of AI tools like ChatGPT
**Exploratory Plots:**
- Raw time series
- Pre/post comparisons
- Seasonality
- Moving averages
---
### Slide 4 – Preliminary Patterns
**Key Observations:**
- Long-run time series:
- Downward drift in answers pre-2022
- Sharper drop in level and slope post-ChatGPT launch
- Pre/post comparison:
- Post-ChatGPT period sits lower, even after accounting for seasonal dips (e.g., summer, year-end)
- Seasonal plots:
- 2018–2025 share consistent within-year rhythm
- Confirms changes aren't due to seasonality
---
### Slide 5 – Methodology
**Modelling Strategies:**
1. **Interrupted Time-Series Regression (ITS):**
- Predictors: time trend, level jump (ChatGPT launch), slope change
- Optional indicators: policy/moderation periods
2. **Poisson/Negative-Binomial Count Models:**
- Predictors: same as ITS
- Suitable for count data
- Quantifies percentage changes per month
3. **ARIMA Model:**
- Trained on pre-ChatGPT data
- Forecasts counterfactual trajectory
- Compares observed vs. predicted post-event counts
4. **Survey Logistic Regression:**
- Predicts frequent Stack Overflow visits
- Predictors: ChatGPT usage, demographics
**Diagnostics:**
- Residual checks
- Over-dispersion
- Out-of-sample performance
---
### Slide 6 – Model Fits & Counterfactuals
**Findings:**
- **Interrupted Time-Series Regression:**
- Downward level shift post-2022
- Steeper negative slope post-ChatGPT
- Controls for pre-existing trend
- **Poisson Model:**
- Pre-ChatGPT: mild monthly contraction
- Post-ChatGPT: steeper decline (compounds over time)
- **ARIMA Forecast:**
- Trained on pre-ChatGPT data
- Post-2022 counts fall below 80% prediction interval
- Observed counts never recover
**Takeaway:**
- Structural break in answer supply post-ChatGPT and policy changes
- Changes not explained by trend/seasonality alone
---
### Slide 7 – Survey Results
**Key Insights:**
- **ChatGPT Adoption (2023):**
- Widespread among developers, especially heavy coders
- Daily use common in workflows
- **Visit Frequency (2023–2024):**
- 2023: Heavy ChatGPT users visit Stack Overflow at similar daily rates as non-users
- 2024: Frequent visits drop more for heavy ChatGPT users
- **Logistic Regression:**
- ChatGPT usage alone: weak predictor of visit frequency (low-50% accuracy)
- Combined with cross-tabs: supports partial substitution (marginal questions shifted to ChatGPT)
---
### Slide 8 – Key Findings
**Summary:**
- Monthly answers on Stack Overflow:
- Sharp drop post-ChatGPT release
- Continued lower trend (even after controlling for pre-existing decline)
- Policy/moderation events:
- Additional dips align with governance decisions
- Suggest amplification of ChatGPT effect
- ARIMA counterfactuals:
- Post-2022 counts outside expected range of pre-ChatGPT dynamics
- Substitution effect:
- Heavy ChatGPT users less likely to visit Stack Overflow daily over time
---
### Slide 9 – Limitations
**Caveats:**
1. **Causality:**
- Overlap of ChatGPT, AI policies, moderation strike
- Broader economic/tooling trends also in play
2. **SEDE Data:**
- Doesn't capture moderation queues/private spaces
- Some activity may be invisible
3. **Survey Data:**
- Self-reported
- May under-represent active answerers or certain regions/roles
**Interpretation:**
- Results are **correlational evidence** of shifts in answer supply/usage patterns
- Not a precise causal estimate of “ChatGPT effect”
---
### Slide 10 – Implications & Future Work
**Implications:**
- Answer supply sensitive to:
- Assistance tooling
- Governance decisions
- Platforms should:
- Carefully consider AI policies/moderation capacity
- Explore integration with conversational assistants (e.g., structured answer APIs)
**Future Work:**
- Tag-level/user-cohort analyses
- Stronger quasi-experimental designs (e.g., synthetic controls)
+683
@@ -0,0 +1,683 @@
# install.packages(
# c("tidyverse", "lubridate", "broom", "forecast", "stringr", "dplyr"),
# repos = "http://cran.us.r-project.org"
# )
library(tidyverse)
library(lubridate)
library(broom)
library(forecast)
library(stringr)
library(dplyr)
# directory for data files (adjust if desired)
data_dir <- "data"
if (!dir.exists(data_dir)) {
dir.create(data_dir, recursive = TRUE)
}
# directory for plots
imgs_dir <- "imgs"
if (!dir.exists(imgs_dir)) {
dir.create(imgs_dir, recursive = TRUE)
}
# constants: key event dates related to chatgpt and so policy
# chatgpt public research preview launch
chatgpt_launch_date <- as.Date("2022-11-30") # openai "introducing chatgpt" blog
# stack overflow generative ai ban policy (meta so, 5 dec 2022)
so_ai_policy_date <- as.Date("2022-12-05")
# moderation strike on stack exchange (june-aug 2023) from meta posts
so_mod_strike_start <- as.Date("2023-06-05")
so_mod_strike_end <- as.Date("2023-08-07")
# helper: safe downloader
download_if_missing <- function(url, destfile) {
if (!file.exists(destfile)) {
message("downloading ", basename(destfile), " ...")
download.file(url, destfile, mode = "wb")
message("saved to ", destfile)
} else {
message("file already exists: ", destfile)
}
}
coerce_month_to_date <- function(x) {
if (inherits(x, "Date")) {
return(x)
}
if (inherits(x, "POSIXct")) {
return(lubridate::as_date(x))
}
if (inherits(x, "POSIXlt")) {
return(as.Date(x))
}
if (is.numeric(x)) {
return(as.Date(x, origin = "1970-01-01"))
}
if (is.character(x)) {
parsed <- suppressWarnings(lubridate::ymd_hms(x))
if (all(is.na(parsed))) {
parsed <- suppressWarnings(lubridate::ymd(x))
}
if (all(is.na(parsed))) {
parsed <- suppressWarnings(as.Date(x))
}
# ymd_hms() yields POSIXct, so convert to Date before returning
return(lubridate::as_date(parsed))
}
suppressWarnings(as.Date(x))
}
# 1) load stack overflow monthly answers (dataset 1)
answers_csv_path <- file.path(data_dir, "so_new_answers_per_month_2018_2025.csv")
if (!file.exists(answers_csv_path)) {
stop(
"missing ", answers_csv_path,
"\nrun the sede query in this script and download the csv to that path first."
)
}
answers_raw <- readr::read_csv(answers_csv_path, show_col_types = FALSE) |>
rename(
month = matches("^Date$|Month", ignore.case = TRUE),
status = matches("^Status$", ignore.case = TRUE),
new_answers = matches("NewAnswers|Count", ignore.case = TRUE)
)
answers_raw <- answers_raw |>
mutate(
month = coerce_month_to_date(month),
status = tolower(status)
)
# inspect column names so you can adjust the rename patterns above if sede changes them
# expected after renaming: "month", "status", "new_answers"
print(names(answers_raw))
print(head(answers_raw))
# aggregate deleted vs non-deleted into separate columns per month
answers_monthly <- answers_raw |>
mutate(
month = as.Date(month),
status = tolower(status)
) |>
group_by(month) |>
summarise(
answers_total = sum(new_answers, na.rm = TRUE),
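# plain 0 (not 0L) so the type matches counts that read_csv parses as double
answers_non_deleted = sum(if_else(status == "non-deleted", new_answers, 0), na.rm = TRUE),
answers_deleted = sum(if_else(status == "deleted", new_answers, 0), na.rm = TRUE),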
.groups = "drop"
) |>
arrange(month) |>
mutate(
year = year(month),
month_num = month(month),
time_index = row_number(),
post_chatgpt = month >= chatgpt_launch_date,
post_ai_policy = month >= so_ai_policy_date,
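# months are first-of-month dates, so compare against the strike's start month
# (otherwise june 2023 is excluded even though the strike began on 5 june)
during_mod_strike = month >= floor_date(so_mod_strike_start, "month") & month <= so_mod_strike_end,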
period = case_when(
month < chatgpt_launch_date ~ "pre_chatgpt",
TRUE ~ "post_chatgpt"
)
)
glimpse(answers_monthly)
# 2) download and load stack overflow developer survey 2023/2024 (dataset 2)
# official survey zip files as exposed on survey.stackoverflow.co
# these urls are the same ones behind the "download full data set (csv)" links
# see: https://survey.stackoverflow.co/
survey_2023_url <- "https://survey.stackoverflow.co/datasets/stack-overflow-developer-survey-2023.zip"
survey_2024_url <- "https://survey.stackoverflow.co/datasets/stack-overflow-developer-survey-2024.zip"
survey_2023_zip <- file.path(data_dir, "stack-overflow-developer-survey-2023.zip")
survey_2024_zip <- file.path(data_dir, "stack-overflow-developer-survey-2024.zip")
download_if_missing(survey_2023_url, survey_2023_zip)
download_if_missing(survey_2024_url, survey_2024_zip)
# helper to read the "survey_results_public.csv" inside each zip
read_so_survey_from_zip <- function(zip_path, csv_pattern = "survey_results_public.csv") {
if (!file.exists(zip_path)) {
stop("zip file not found: ", zip_path)
}
# list files inside zip (works even when the CSV is in a subfolder)
zlist <- utils::unzip(zip_path, list = TRUE)
# try to find the csv by exact name or by pattern
csv_name <- zlist$Name[stringr::str_detect(zlist$Name, regex(csv_pattern, ignore_case = TRUE))]
if (length(csv_name) == 0) {
stop("could not find a csv matching ", csv_pattern, " inside ", zip_path)
}
csv_name <- csv_name[1] # take first match
# read it without extracting to disk using unz() connection
# optionally pass col_types = cols(.default = col_character()) for explicit types
df <- readr::read_csv(
unz(zip_path, csv_name),
show_col_types = FALSE
)
df
}
survey2023_raw <- read_so_survey_from_zip(survey_2023_zip)
survey2024_raw <- read_so_survey_from_zip(survey_2024_zip)
# look at column names to locate ai + stackoverflow usage questions
names(survey2023_raw)[1:80]
############################################################
# create a harmonised survey subset focusing on:
# - so visit frequency (column like SOVisitFreq)
# - ai tool usage (column like AISelect or SOAI)
############################################################
find_first_col <- function(df, pattern) {
cols <- names(df)[stringr::str_detect(names(df), regex(pattern, ignore_case = TRUE))]
if (length(cols) == 0) {
return(NA_character_)
}
cols[1]
}
pull_col_or_default <- function(df, col_name, default = NA_character_) {
if (is.na(col_name)) {
return(rep(default, nrow(df)))
}
df[[col_name]]
}
pull_age_numeric <- function(df, col_name) {
vec <- pull_col_or_default(df, col_name, default = NA_real_)
if (is.numeric(vec)) {
return(vec)
}
if (is.factor(vec)) {
vec <- as.character(vec)
}
if (is.character(vec)) {
vec <- stringr::str_trim(vec)
vec[vec == ""] <- NA_character_
vec[stringr::str_detect(vec, regex("prefer not to say", ignore_case = TRUE))] <- NA_character_
# parse_number extracts the leading numeric value (e.g., 25 from "25-34 years old")
return(suppressWarnings(readr::parse_number(vec)))
}
suppressWarnings(as.numeric(vec))
}
main_branch_col_2023 <- find_first_col(survey2023_raw, "^MainBranch$|MainBranch")
country_col_2023 <- find_first_col(survey2023_raw, "^Country$|Country")
age_col_2023 <- find_first_col(survey2023_raw, "^Age$|Age")
gender_col_2023 <- find_first_col(survey2023_raw, "^Gender$|Gender")
so_visit_col_2023 <- find_first_col(survey2023_raw, "SOVisitFreq")
ai_select_col_2023 <- find_first_col(survey2023_raw, "AISelect|SOAI")
main_branch_col_2024 <- find_first_col(survey2024_raw, "^MainBranch$|MainBranch")
country_col_2024 <- find_first_col(survey2024_raw, "^Country$|Country")
age_col_2024 <- find_first_col(survey2024_raw, "^Age$|Age")
gender_col_2024 <- find_first_col(survey2024_raw, "^Gender$|Gender")
so_visit_col_2024 <- find_first_col(survey2024_raw, "SOVisitFreq")
ai_select_col_2024 <- find_first_col(survey2024_raw, "AISelect|SOAI")
message("2023 so visit col: ", so_visit_col_2023)
message("2023 ai col : ", ai_select_col_2023)
message("2024 so visit col: ", so_visit_col_2024)
message("2024 ai col : ", ai_select_col_2024)
# build a clean survey frame for 2023
survey2023 <- survey2023_raw |>
transmute(
year = 2023L,
main_branch = pull_col_or_default(survey2023_raw, main_branch_col_2023),
country = pull_col_or_default(survey2023_raw, country_col_2023),
age = pull_age_numeric(survey2023_raw, age_col_2023),
gender = pull_col_or_default(survey2023_raw, gender_col_2023),
so_visit = pull_col_or_default(survey2023_raw, so_visit_col_2023),
ai_select = pull_col_or_default(survey2023_raw, ai_select_col_2023)
)
# same idea for 2024 (schema is very similar)
survey2024 <- survey2024_raw |>
transmute(
year = 2024L,
main_branch = pull_col_or_default(survey2024_raw, main_branch_col_2024),
country = pull_col_or_default(survey2024_raw, country_col_2024),
age = pull_age_numeric(survey2024_raw, age_col_2024),
gender = pull_col_or_default(survey2024_raw, gender_col_2024),
so_visit = pull_col_or_default(survey2024_raw, so_visit_col_2024),
ai_select = pull_col_or_default(survey2024_raw, ai_select_col_2024)
)
survey_all <- bind_rows(survey2023, survey2024)
# engineer features:
# - binary flag: frequent so visitor
# - binary flag: uses chatgpt as ai tool (from ai_select free text / semicolon list)
# - coarser age groups
survey_model <- survey_all |>
filter(!is.na(so_visit)) |>
mutate(
so_visit = as.character(so_visit),
ai_select = as.character(ai_select),
# frequent so visitor: daily or multiple times per day etc.
frequent_so = dplyr::case_when(
stringr::str_detect(so_visit, regex("multiple times per day", ignore_case = TRUE)) ~ 1L,
stringr::str_detect(so_visit, regex("daily|almost every day", ignore_case = TRUE)) ~ 1L,
TRUE ~ 0L
),
uses_chatgpt = dplyr::case_when(
is.na(ai_select) ~ 0L,
stringr::str_detect(ai_select, regex("chatgpt", ignore_case = TRUE)) ~ 1L,
TRUE ~ 0L
),
age_group = dplyr::case_when(
!is.na(age) & age < 25 ~ "<25",
!is.na(age) & age >= 25 & age < 35 ~ "25-34",
!is.na(age) & age >= 35 & age < 45 ~ "35-44",
!is.na(age) & age >= 45 ~ "45+",
TRUE ~ "unknown"
),
gender = if_else(is.na(gender) | gender == "", "Unknown", gender)
) |>
filter(!is.na(frequent_so)) |>
mutate(
frequent_so = as.integer(frequent_so),
uses_chatgpt = as.integer(uses_chatgpt),
age_group = factor(age_group),
gender = factor(gender),
year = factor(year)
)
glimpse(survey_model)
# SECTION 2: data description + preliminary plots (dataset 1)
# basic time series plot of answers over time (for section 2)
p_answers_ts <- ggplot(answers_monthly, aes(x = month, y = answers_total)) +
geom_line() +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
geom_vline(xintercept = so_ai_policy_date, linetype = "dotted") +
labs(
title = "monthly new answers on stack overflow",
x = "month",
y = "number of answers"
)
print(p_answers_ts)
ggsave(
filename = file.path(imgs_dir, "01_answers_ts.png"),
plot = p_answers_ts,
width = 10, height = 6, units = "in", dpi = 300
)
# boxplot pre vs post chatgpt
p_box_pre_post <- ggplot(answers_monthly, aes(x = period, y = answers_total)) +
geom_boxplot() +
labs(
title = "distribution of monthly answers: pre vs post chatgpt launch",
x = "period",
y = "monthly answers"
)
print(p_box_pre_post)
ggsave(file.path(imgs_dir, "02_box_pre_post.png"), plot = p_box_pre_post, width = 8, height = 6, units = "in", dpi = 300)
# basic summary table
answers_summary_period <- answers_monthly |>
group_by(period) |>
summarise(
n_months = n(),
mean_answers = mean(answers_total),
median_answers = median(answers_total),
sd_answers = sd(answers_total),
min_answers = min(answers_total),
max_answers = max(answers_total),
.groups = "drop"
)
print(answers_summary_period)
# SECTION 3: exploratory analysis
# seasonal pattern: answers by calendar month across years
p_seasonal <- answers_monthly |>
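# explicit levels keep labels aligned even if a calendar month is missing from the data
mutate(month_label = factor(month_num, levels = 1:12, labels = month.abb)) |>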
ggplot(aes(x = month_label, y = answers_total, group = year, color = period)) +
geom_line(alpha = 0.6) +
labs(
title = "seasonality of answers by calendar month and year",
x = "calendar month",
y = "monthly answers"
)
print(p_seasonal)
ggsave(file.path(imgs_dir, "03_seasonal.png"), plot = p_seasonal, width = 10, height = 6, units = "in", dpi = 300)
# rolling 3-month moving average to smooth noise
answers_monthly <- answers_monthly |>
arrange(month) |>
mutate(
answers_ma3 = zoo::rollmean(answers_total, k = 3, fill = NA, align = "right")
)
p_ma3 <- ggplot(answers_monthly, aes(x = month)) +
geom_line(aes(y = answers_total), alpha = 0.3) +
geom_line(aes(y = answers_ma3)) +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
labs(
title = "monthly answers with 3-month moving average",
x = "month",
y = "answers"
)
print(p_ma3)
ggsave(file.path(imgs_dir, "04_ma3.png"), plot = p_ma3, width = 10, height = 6, units = "in", dpi = 300)
# simple percentage change around chatgpt launch
pre_window <- answers_monthly |>
filter(
month >= chatgpt_launch_date - months(6),
month < chatgpt_launch_date
)
post_window <- answers_monthly |>
filter(
month >= chatgpt_launch_date,
month < chatgpt_launch_date + months(6)
)
pre_mean <- mean(pre_window$answers_total)
post_mean <- mean(post_window$answers_total)
pct_change <- (post_mean - pre_mean) / pre_mean * 100
pct_change
# survey exploratory: relation between ai usage and so visit frequency
survey_counts <- survey_model |>
mutate(
uses_chatgpt_label = if_else(uses_chatgpt == 1L, "uses chatgpt", "does not use chatgpt"),
freq_label = if_else(frequent_so == 1L, "visits so daily", "visits so less often")
) |>
count(year, uses_chatgpt_label, freq_label) |>
group_by(year, uses_chatgpt_label) |>
mutate(prop = n / sum(n)) |>
ungroup()
p_survey_bar <- ggplot(survey_counts, aes(x = uses_chatgpt_label, y = prop, fill = freq_label)) +
geom_col(position = "fill") +
facet_wrap(~year) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "relationship between chatgpt use and stack overflow visit frequency (survey)",
x = "ai usage segment",
y = "share of respondents",
fill = "so visit frequency"
)
print(p_survey_bar)
ggsave(file.path(imgs_dir, "08_survey_bar.png"), plot = p_survey_bar, width = 10, height = 6, units = "in", dpi = 300)
# SECTION 4: model development (four different model types)
# MODEL 1: interrupted time series linear regression
# outcome: monthly answers_total
# predictors: time trend, post_chatgpt level change, slope change after chatgpt
its_data <- answers_monthly |>
mutate(
time = time_index,
chatgpt_time = if_else(month >= chatgpt_launch_date,
time_index - min(time_index[month >= chatgpt_launch_date]) + 1L,
0L
)
)
model_lm <- lm(
answers_total ~ time + post_chatgpt + chatgpt_time,
data = its_data
)
summary(model_lm)
tidy(model_lm)
glance(model_lm)
# predictions and plot
its_data <- its_data |>
mutate(
lm_fitted = predict(model_lm)
)
p_lm_fit <- ggplot(its_data, aes(x = month)) +
geom_line(aes(y = answers_total), alpha = 0.4) +
geom_line(aes(y = lm_fitted), color = "blue") +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
labs(
title = "interrupted time series regression: observed vs fitted answers",
x = "month",
y = "answers"
)
print(p_lm_fit)
ggsave(file.path(imgs_dir, "05_lm_fit.png"), plot = p_lm_fit, width = 10, height = 6, units = "in", dpi = 300)
# MODEL 2: poisson regression for count data
model_pois <- glm(
answers_total ~ time + post_chatgpt + chatgpt_time,
data = its_data,
family = poisson(link = "log")
)
summary(model_pois)
tidy(model_pois, exponentiate = TRUE) # exp(coef) ~ multiplicative effect
# compare predicted counts
its_data <- its_data |>
mutate(
pois_fitted = predict(model_pois, type = "response")
)
p_pois_fit <- ggplot(its_data, aes(x = month)) +
geom_line(aes(y = answers_total), alpha = 0.3) +
geom_line(aes(y = pois_fitted), color = "red") +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
labs(
title = "poisson regression: observed vs predicted monthly answers",
x = "month",
y = "answers"
)
print(p_pois_fit)
ggsave(file.path(imgs_dir, "06_pois_fit.png"), plot = p_pois_fit, width = 10, height = 6, units = "in", dpi = 300)
# MODEL 3: arima time series forecast (pre-chatgpt vs actual)
# construct monthly ts object (frequency = 12)
start_year <- year(min(answers_monthly$month))
start_month <- month(min(answers_monthly$month))
answers_ts <- ts(
answers_monthly$answers_total,
start = c(start_year, start_month),
frequency = 12
)
# train through oct 2022 and forecast forward; nov 2022 (chatgpt launched on the 30th) is held out with the post period
train_end <- c(2022, 10) # october 2022
train_ts <- window(answers_ts, end = train_end)
test_ts <- window(answers_ts, start = c(2022, 11))
arima_fit <- auto.arima(train_ts)
summary(arima_fit)
h <- length(test_ts)
fc <- forecast(arima_fit, h = h)
# compare forecast vs actual on the holdout period
fc_df <- data.frame(
month = answers_monthly$month[answers_monthly$month >= as.Date("2022-11-01")],
actual = as.numeric(test_ts),
forecast = as.numeric(fc$mean),
lower_80 = as.numeric(fc$lower[, "80%"]),
upper_80 = as.numeric(fc$upper[, "80%"])
)
p_arima <- ggplot(fc_df, aes(x = month)) +
geom_line(aes(y = actual), alpha = 0.6) +
geom_line(aes(y = forecast), linetype = "dashed") +
geom_ribbon(aes(ymin = lower_80, ymax = upper_80), alpha = 0.2) +
labs(
title = "arima forecast (trained on pre-chatgpt) vs actual answers",
x = "month",
y = "answers"
)
print(p_arima)
ggsave(file.path(imgs_dir, "07_arima_forecast.png"), plot = p_arima, width = 10, height = 6, units = "in", dpi = 300)
# simple accuracy metrics on the holdout
fc_accuracy <- accuracy(fc, test_ts)
print(fc_accuracy)
# MODEL 4: logistic regression: does using chatgpt predict being a frequent stack overflow visitor?
set.seed(123)
survey_model_complete <- survey_model |>
filter(!is.na(uses_chatgpt), !is.na(frequent_so))
candidate_predictors <- c("uses_chatgpt", "age_group", "gender", "year")
valid_predictors <- candidate_predictors[sapply(
candidate_predictors,
function(col) dplyr::n_distinct(survey_model_complete[[col]], na.rm = TRUE) > 1
)]
drop_predictors <- setdiff(candidate_predictors, valid_predictors)
if (length(drop_predictors) > 0) {
message("dropping predictors with <2 levels: ", paste(drop_predictors, collapse = ", "))
}
logit_formula <- if (length(valid_predictors) == 0) {
frequent_so ~ 1
} else {
as.formula(paste("frequent_so ~", paste(valid_predictors, collapse = " + ")))
}
n <- nrow(survey_model_complete)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
survey_train <- survey_model_complete[train_idx, ]
survey_test <- survey_model_complete[-train_idx, ]
positive_rate <- mean(survey_train$frequent_so, na.rm = TRUE)
classification_threshold <- dplyr::case_when(
is.na(positive_rate) ~ 0.5,
positive_rate <= 0 ~ 0.5,
positive_rate >= 1 ~ 0.5,
TRUE ~ positive_rate
)
message(
"classification threshold (training frequent_so share): ",
round(classification_threshold, 3)
)
logit_model <- glm(
formula = logit_formula,
family = binomial(link = "logit"),
data = survey_train
)
summary(logit_model)
tidy(logit_model, exponentiate = TRUE, conf.int = TRUE)
# predict on test set
survey_test <- survey_test |>
mutate(
pred_prob = predict(logit_model, newdata = survey_test, type = "response"),
pred_class = if_else(pred_prob >= classification_threshold, 1L, 0L)
)
# confusion matrix and simple metrics
conf_mat <- table(
truth = factor(survey_test$frequent_so, levels = c(0, 1)),
pred = factor(survey_test$pred_class, levels = c(0, 1))
)
conf_mat
tp <- conf_mat["1", "1"]
tn <- conf_mat["0", "0"]
fp <- conf_mat["0", "1"]
fn <- conf_mat["1", "0"]
# named acc (not accuracy) so the forecast::accuracy() function name is not shadowed
acc <- (tp + tn) / sum(conf_mat)
precision <- if ((tp + fp) > 0) tp / (tp + fp) else NA_real_
recall <- if ((tp + fn) > 0) tp / (tp + fn) else NA_real_
list(
    accuracy = acc,
    precision = precision,
    recall = recall
)
# visual: predicted probability vs ai usage
p_logit_probs <- survey_test |>
mutate(uses_chatgpt_label = if_else(uses_chatgpt == 1L, "uses chatgpt", "does not use chatgpt")) |>
ggplot(aes(x = uses_chatgpt_label, y = pred_prob)) +
geom_boxplot() +
labs(
title = "predicted probability of being a frequent so visitor by chatgpt use",
x = "ai usage segment",
y = "predicted probability (logistic model)"
)
print(p_logit_probs)
ggsave(file.path(imgs_dir, "09_logit_probs.png"), plot = p_logit_probs, width = 8, height = 6, units = "in", dpi = 300)
# save key tables and model outputs to disk for report
write_csv(answers_monthly, file.path(data_dir, "answers_monthly_clean.csv"))
write_csv(answers_summary_period, file.path(data_dir, "answers_summary_period.csv"))
write_csv(survey_counts, file.path(data_dir, "survey_ai_vs_so_visit.csv"))
saveRDS(model_lm, file.path(data_dir, "model_lm_its.rds"))
saveRDS(model_pois, file.path(data_dir, "model_pois.rds"))
saveRDS(arima_fit, file.path(data_dir, "model_arima_prechatgpt.rds"))
saveRDS(logit_model, file.path(data_dir, "model_logit_survey.rds"))
+245
View File
@@ -0,0 +1,245 @@
[Running] Rscript "/home/ion606/Desktop/Homework/Data Analytics/Assignment IV/analysis.r"
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
[1] "month" "status" "new_answers"
# A tibble: 6 × 3
month status new_answers
<date> <chr> <dbl>
1 2018-01-01 deleted 26
2 2018-01-01 non-deleted 159
3 2018-02-01 deleted 20
4 2018-02-01 non-deleted 175
5 2018-03-01 deleted 18
6 2018-03-01 non-deleted 193
Rows: 95
Columns: 11
$ month <date> 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01, 2…
$ answers_total <dbl> 185, 195, 211, 221, 227, 189, 149, 179, 198, 232, …
$ answers_non_deleted <dbl> 159, 175, 193, 191, 203, 172, 133, 154, 170, 198, …
$ answers_deleted <dbl> 26, 20, 18, 30, 24, 17, 16, 25, 28, 34, 20, 45, 33…
$ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 20…
$ month_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,…
$ time_index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ post_chatgpt <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ post_ai_policy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ during_mod_strike <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ period <chr> "pre_chatgpt", "pre_chatgpt", "pre_chatgpt", "pre_…
file already exists: data/stack-overflow-developer-survey-2023.zip
file already exists: data/stack-overflow-developer-survey-2024.zip
[1] "ResponseId" "Q120"
[3] "MainBranch" "Age"
[5] "Employment" "RemoteWork"
[7] "CodingActivities" "EdLevel"
[9] "LearnCode" "LearnCodeOnline"
[11] "LearnCodeCoursesCert" "YearsCode"
[13] "YearsCodePro" "DevType"
[15] "OrgSize" "PurchaseInfluence"
[17] "TechList" "BuyNewTool"
[19] "Country" "Currency"
[21] "CompTotal" "LanguageHaveWorkedWith"
[23] "LanguageWantToWorkWith" "DatabaseHaveWorkedWith"
[25] "DatabaseWantToWorkWith" "PlatformHaveWorkedWith"
[27] "PlatformWantToWorkWith" "WebframeHaveWorkedWith"
[29] "WebframeWantToWorkWith" "MiscTechHaveWorkedWith"
[31] "MiscTechWantToWorkWith" "ToolsTechHaveWorkedWith"
[33] "ToolsTechWantToWorkWith" "NEWCollabToolsHaveWorkedWith"
[35] "NEWCollabToolsWantToWorkWith" "OpSysPersonal use"
[37] "OpSysProfessional use" "OfficeStackAsyncHaveWorkedWith"
[39] "OfficeStackAsyncWantToWorkWith" "OfficeStackSyncHaveWorkedWith"
[41] "OfficeStackSyncWantToWorkWith" "AISearchHaveWorkedWith"
[43] "AISearchWantToWorkWith" "AIDevHaveWorkedWith"
[45] "AIDevWantToWorkWith" "NEWSOSites"
[47] "SOVisitFreq" "SOAccount"
[49] "SOPartFreq" "SOComm"
[51] "SOAI" "AISelect"
[53] "AISent" "AIAcc"
[55] "AIBen" "AIToolInterested in Using"
[57] "AIToolCurrently Using" "AIToolNot interested in Using"
[59] "AINextVery different" "AINextNeither different nor similar"
[61] "AINextSomewhat similar" "AINextVery similar"
[63] "AINextSomewhat different" "TBranch"
[65] "ICorPM" "WorkExp"
[67] "Knowledge_1" "Knowledge_2"
[69] "Knowledge_3" "Knowledge_4"
[71] "Knowledge_5" "Knowledge_6"
[73] "Knowledge_7" "Knowledge_8"
[75] "Frequency_1" "Frequency_2"
[77] "Frequency_3" "TimeSearching"
[79] "TimeAnswering" "ProfessionalTech"
2023 so visit col: SOVisitFreq
2023 ai col : SOAI
2024 so visit col: SOVisitFreq
2024 ai col : AISelect
Rows: 146,676
Columns: 10
$ year <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
$ main_branch <chr> "I am a developer by profession", "I am a developer by pr…
$ country <chr> "United States of America", "United States of America", "…
$ age <dbl> 25, 45, 25, 25, 35, 35, 25, 45, 25, 25, 25, 25, 35, 25, 3…
$ gender <fct> Unknown, Unknown, Unknown, Unknown, Unknown, Unknown, Unk…
$ so_visit <chr> "Daily or almost daily", "A few times per month or weekly…
$ ai_select <chr> "I don't think it's super necessary, but I think improvin…
$ frequent_so <int> 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
$ uses_chatgpt <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, …
$ age_group <fct> 25-34, 45+, 25-34, 25-34, 35-44, 35-44, 25-34, 45+, 25-34…
# A tibble: 2 × 7
period n_months mean_answers median_answers sd_answers min_answers max_answers
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 post_… 36 90.5 88 38.0 11 157
2 pre_c… 59 193. 185 44.7 122 313
Warning message:
Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning message:
Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
[1] -10.02227
Call:
lm(formula = answers_total ~ time + post_chatgpt + chatgpt_time,
data = its_data)
Residuals:
Min 1Q Median 3Q Max
-76.623 -22.914 -3.868 13.431 123.402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 218.8013 9.3214 23.473 < 2e-16 ***
time -0.8589 0.2702 -3.179 0.002022 **
post_chatgptTRUE -17.9635 15.0779 -1.191 0.236601
chatgpt_time -2.3661 0.6282 -3.767 0.000293 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 35.35 on 91 degrees of freedom
Multiple R-squared: 0.717, Adjusted R-squared: 0.7077
F-statistic: 76.86 on 3 and 91 DF, p-value: < 2.2e-16
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 219. 9.32 23.5 2.23e-40
2 time -0.859 0.270 -3.18 2.02e- 3
3 post_chatgptTRUE -18.0 15.1 -1.19 2.37e- 1
4 chatgpt_time -2.37 0.628 -3.77 2.93e- 4
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.717 0.708 35.3 76.9 7.39e-25 3 -471. 953. 966.
# 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Call:
glm(formula = answers_total ~ time + post_chatgpt + chatgpt_time,
family = poisson(link = "log"), data = its_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.3936301 0.0183909 293.277 < 2e-16 ***
time -0.0044547 0.0005512 -8.082 6.38e-16 ***
post_chatgptTRUE -0.0187737 0.0365851 -0.513 0.608
chatgpt_time -0.0322028 0.0018440 -17.464 < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2879.9 on 94 degrees of freedom
Residual deviance: 713.8 on 91 degrees of freedom
AIC: 1363
Number of Fisher Scoring iterations: 4
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 220. 0.0184 293. 0
2 time 0.996 0.000551 -8.08 6.38e-16
3 post_chatgptTRUE 0.981 0.0366 -0.513 6.08e- 1
4 chatgpt_time 0.968 0.00184 -17.5 2.71e-68
Series: train_ts
ARIMA(1,1,0)(1,0,0)[12]
Coefficients:
ar1 sar1
-0.3956 0.3016
s.e. 0.1360 0.1381
sigma^2 = 1142: log likelihood = -281.17
AIC=568.34 AICc=568.8 BIC=574.47
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.1691686 32.90678 26.65938 -1.989033 14.30025 0.5170032
ACF1
Training set 0.03124461
ME RMSE MAE MPE MAPE MASE
Training set -0.1691686 32.90678 26.65938 -1.989033 14.30025 0.5170032
Test set -78.4100374 89.26691 79.26493 -171.518981 171.98870 1.5371782
ACF1 Theil's U
Training set 0.03124461 NA
Test set 0.73383075 7.11443
dropping predictors with <2 levels: gender
classification threshold (training frequent_so share): 0.384
Call:
glm(formula = logit_formula, family = binomial(link = "logit"),
data = survey_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.358743 0.013009 -27.577 < 2e-16 ***
uses_chatgpt -0.006783 0.066977 -0.101 0.91933
age_group25-34 0.040677 0.015439 2.635 0.00842 **
age_group35-44 -0.207571 0.017478 -11.876 < 2e-16 ***
age_group45+ -0.345739 0.020289 -17.041 < 2e-16 ***
age_groupunknown -0.222739 0.096177 -2.316 0.02056 *
year2024 -0.082452 0.012319 -6.693 2.18e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 156271 on 117339 degrees of freedom
Residual deviance: 155647 on 117333 degrees of freedom
AIC: 155661
Number of Fisher Scoring iterations: 4
# A tibble: 7 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.699 0.0130 -27.6 2.10e-167 0.681 0.717
2 uses_chatgpt 0.993 0.0670 -0.101 9.19e- 1 0.870 1.13
3 age_group25-34 1.04 0.0154 2.63 8.42e- 3 1.01 1.07
4 age_group35-44 0.813 0.0175 -11.9 1.57e- 32 0.785 0.841
5 age_group45+ 0.708 0.0203 -17.0 4.09e- 65 0.680 0.736
6 age_groupunknown 0.800 0.0962 -2.32 2.06e- 2 0.662 0.965
7 year2024 0.921 0.0123 -6.69 2.18e- 11 0.899 0.943
pred
truth 0 1
0 7560 10549
1 4009 7218
$accuracy
[1] 0.5037497
$precision
[1] 0.4062588
$recall
[1] 0.6429144
[Done] exited with code=0 in 12.272 seconds
+13
View File
@@ -0,0 +1,13 @@
6. Oral Presentation (5%). Plan for a ~5 minute presentation; slides must cover the
following:
a). Title (with your name)
b). Problem area: what you wanted to explore/solve/predict, and why.
c). The data: where it came from, why it was applicable, and the preliminary assessments
you made.
d). How you conducted your analysis: distribution, pattern/ relationship and model
construction. What techniques did you use/ not use and why? What worked? What did not
work? How did you apply the model? How did you optimize, account for uncertainties?
f). What did you predict and what decisions (prescriptions) were possible. What was the
outcome, conclusions?
@@ -0,0 +1,220 @@
# Did Stack Overflow Answers Increase After ChatGPT? — Term Project Report
## Itamar Oren-Naftalovich
## 1. Abstract and introduction
This project asks whether the number of answers posted on Stack Overflow (SO) increased or decreased after the public launch of ChatGPT on 2022-11-30, and after subsequent community policy events. To study this, we combine (i) site-level activity from the Stack Exchange Data Explorer (SEDE) with (ii) developer sentiment and usage data from the annual Stack Overflow Developer Survey. Together, these data allow us to both measure changes in answer volume and to contextualize those changes using self-reported behavior.
Our core hypothesis is that the arrival of high-quality, conversational code assistance would noticeably change the supply of answers on SO, because developers have a new place to go for immediate help. We further treat moderation policies and community events as additional shocks that may amplify or dampen this effect.
We frame the problem as a quasi-experimental time-series analysis with interrupted trends around several key dates:
* ChatGPT public launch (2022-11-30)
* Initial Stack Overflow policy banning AI-generated answers (policy posted 2022-12-05)
* Later moderation and governance events (including the moderation strike)
Throughout, we pay attention to:
* **Internal validity:** controlling for pre-existing trends and seasonality, rather than treating pre/post averages as independent.
* **External validity:** comparing site-level patterns to changes in developer behavior reported in survey data.
* **Measurement caveats:** handling deleted content, moderation queues, and sampling or survey-response effects.
**Prior work and context.** The accompanying slide deck summarizes the key posts and data sources (OpenAI's announcement, Meta Stack Overflow policy discussions, moderation-strike posts, traffic analyses, and SO survey documentation). These references define the timeline and motivate the research question without being restated in full here.
---
## 2. Data description and preliminary analysis
### Datasets
We use two complementary datasets:
* **Dataset 1 — Site activity.**
Monthly counts of **new answers** on Stack Overflow from the public Stack Exchange Data Explorer (SEDE). The extract includes both non-deleted and deleted answers so that we can separate organic activity from moderation effects. The main analysis window is **2018–2025**, which provides several pre-ChatGPT baseline years and a meaningful post-event period.
* **Dataset 2 — Developer survey.**
Selected questions from the **Stack Overflow Developer Survey (2023 and 2024 releases)**, focusing on visit frequency (e.g., daily vs. weekly) and adoption of AI tools such as ChatGPT. These variables are used to understand shifts in demand for on-site answers and how they correlate with AI usage.
### Criteria and rationale
* Dataset 1 directly measures the outcome of interest: **answer supply on Stack Overflow**.
* Dataset 2 provides **behavioral context**: whether developers who use ChatGPT heavily also report visiting Stack Overflow less frequently.
By combining logs and surveys, we can triangulate between observational activity data and self-reported changes in workflow. The goal is not to claim strict causality, but to see whether the patterns in these sources align.
### Preliminary views
The first set of plots (Figure 1) provides high-level structure for later modelling:
* Time-series plots of monthly answers reveal overall levels and long-run trends.
* Comparisons of pre- and post-ChatGPT periods highlight visible changes in level or slope.
* Seasonal views (e.g., by calendar month) show systematic patterns such as summer slowdowns or end-of-year dips.
These descriptive views inform the later modelling choices such as adding interrupted trend terms and seasonal controls.
![figure 1: 01_answers_ts.png](./imgs/01_answers_ts.png)
*Figure 1. Preliminary monthly trend in new answers on Stack Overflow.*
---
## 3. Exploratory analysis
We first clean and harmonize the SEDE extract, collapsing to **monthly answer counts** and separating deleted from non-deleted answers. We then:
* Scan for structural breaks or anomalies around key event dates.
* Apply short moving averages to highlight medium-run shifts in the time series.
* Plot seasonality by calendar month to visualize recurring within-year patterns.
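
The moving-average step can be sketched in base R as below. This is a minimal illustration rather than the project's exact code: it uses the first eight monthly totals printed in the run log and `stats::filter` instead of the `zoo` package, with `sides = 1` so that each value averages the current month and the two before it.

```r
# First eight monthly answer totals from the run log (Jan-Aug 2018).
answers_total <- c(185, 195, 211, 221, 227, 189, 149, 179)

# Trailing 3-month mean: sides = 1 uses only the current and two previous months,
# so the first two positions have no full window and come back as NA.
answers_ma3 <- as.numeric(stats::filter(answers_total, rep(1 / 3, 3), sides = 1))

print(round(answers_ma3, 1))
```

Calling `stats::filter` with its namespace matters when `dplyr` is attached, because `dplyr::filter` masks it.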
On the survey side, we:
* Construct indicators for **frequent SO visits** (e.g., daily or almost daily) and **ChatGPT usage**.
* Compare distributions across years to detect shifts in visiting behavior and AI adoption.
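
A minimal sketch of how the two survey indicators can be built with pattern matching is shown below. The data frame, the column name `ai_tools`, and the response strings are made up for illustration (only "Daily or almost daily" appears in the run log); the real survey columns and wording may differ.

```r
# Hypothetical survey rows; column names and most strings are illustrative only.
survey <- data.frame(
  so_visit = c("Daily or almost daily", "Multiple times per day",
               "A few times per week", NA),
  ai_tools = c("ChatGPT;Bing AI", NA, "GitHub Copilot", "chatgpt"),
  stringsAsFactors = FALSE
)

# Drop respondents who did not answer the visit-frequency question.
survey <- survey[!is.na(survey$so_visit), ]

# 1 if the visit-frequency answer mentions daily (or more frequent) use, else 0.
survey$frequent_so <- as.integer(
  grepl("daily|multiple times per day", survey$so_visit, ignore.case = TRUE)
)

# 1 if the semicolon-separated tool list mentions ChatGPT; missing lists count as 0.
survey$uses_chatgpt <- as.integer(
  !is.na(survey$ai_tools) & grepl("chatgpt", survey$ai_tools, ignore.case = TRUE)
)

print(survey)
```

Treating a missing tool list as "does not use ChatGPT" is a modelling choice: it keeps the sample large but mixes non-users with non-respondents.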
### Sources of uncertainty and bias
We explicitly track several sources of uncertainty:
* **Policy and moderation effects.**
Deletions and review backlogs can move answers between months or suppress visible counts. To address this, we track **deleted and non-deleted answers separately** and compare them over time.
* **Seasonality and macro conditions.**
Holidays, hiring cycles, and broader market conditions can confound naive pre/post comparisons. We therefore visualize **within-year seasonality** and include time controls in the models.
* **Survey representativeness.**
Survey respondents may not be a random sample of all SO users. Active answerers and enthusiastic AI adopters might be over- or under-represented. For this reason, we treat survey-based findings as **correlational**, not causal.
![figure 2: 02_box_pre_post.png](./imgs/02_box_pre_post.png)
*Figure 2. Distribution of monthly answers before and after the ChatGPT launch.*
![figure 3: 03_seasonal.png](./imgs/03_seasonal.png)
*Figure 3. Monthly answers by calendar month, one line per year, colored by pre/post period.*
---
## 4. Model development and application
To move beyond descriptive plots, we implement four modelling approaches. Together, they test for structural changes in answer volume and connect survey behavior to on-site activity.
1. **Interrupted time-series linear regression (ITS).**
* **Outcome:** monthly new answers.
* **Predictors:** a linear time trend, post-ChatGPT **level change**, and **slope change**, plus optional indicator variables for policy and moderation periods.
* **Goal:** test for discrete jumps and gradual trend shifts relative to the pre-event trajectory.
2. **Poisson / negative-binomial regression for counts.**
* Same predictors as ITS but with a **log link** for count data.
* We compare Poisson and negative-binomial versions to account for over-dispersion and to avoid relying on normal residuals.
3. **ARIMA time-series forecasting.**
* Fit solely on **pre-ChatGPT** data to produce a counterfactual forecast.
* Compare out-of-sample forecasts to observed post-event answer counts.
* Large and sustained deviations beyond forecast bands signal additional shocks beyond trend and seasonality.
4. **Logistic classification on survey microdata.**
* **Target:** whether a respondent is a **“frequent SO visitor”** (daily or almost daily).
* **Predictors:** a ChatGPT-usage indicator plus demographic and role controls.
* **Evaluation:** accuracy, precision/recall, and calibration curves, with a hold-out split for validation.
* **Purpose:** test whether heavy ChatGPT users are **less likely** to report frequent SO visits, even after adjusting for other factors.
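
The ITS specification in item 1 can be illustrated on simulated data. The sketch below is not the project's fit: the series, the segment lengths, and the effect sizes are invented so that the level-change and slope-change terms have known true values to compare against.

```r
set.seed(1)

n_pre <- 48   # months before the intervention (hypothetical)
n_post <- 24  # months after the intervention (hypothetical)

time <- seq_len(n_pre + n_post)      # 1, 2, ..., 72
post <- as.integer(time > n_pre)     # level-change indicator
time_since <- pmax(0, time - n_pre)  # slope-change term: 0 before, then 1, 2, ...

# Simulated outcome with a known level drop (-10) and slope change (-2.5).
y <- 200 - 1.0 * time - 10 * post - 2.5 * time_since +
  rnorm(length(time), sd = 5)

# The coefficient on `post` estimates the jump at the intervention;
# the coefficient on `time_since` estimates the change in monthly slope.
fit <- lm(y ~ time + post + time_since)
print(round(coef(fit), 2))
```

Coding the slope term as months since the intervention (zero beforehand) is what lets the model separate an immediate jump from a gradual change in trend.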
### Validation and diagnostics
For each model family, we run basic diagnostic checks:
* **ITS models:**
* Inspect residuals for autocorrelation and remaining seasonality.
* Re-fit with seasonal terms or alternative specifications where necessary.
* **Count models (Poisson/NB):**
* Check over-dispersion indicators and compare Poisson vs. negative-binomial fits.
* Examine goodness-of-fit plots and residual patterns.
* **ARIMA forecasts:**
* Select model orders using information criteria on the training window.
* Inspect forecast errors and confidence bands to ensure reasonable counterfactual behavior.
* **Classification models:**
* Use a separate hold-out set for evaluation.
* Report confusion matrices and standard performance metrics.
* Inspect calibration to verify that predicted probabilities match observed frequencies.
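
The over-dispersion check for the count models can be sketched as follows. In the run log, the Poisson fit has a residual deviance of 713.8 on 91 degrees of freedom (a ratio of about 7.8), which suggests the Poisson standard errors are too small. The example below uses simulated negative-binomial counts, not the project data, and uses a quasi-Poisson refit as one simple base-R remedy.

```r
set.seed(42)

time <- 1:60
# Negative-binomial draws: the variance exceeds the mean, unlike a true Poisson process.
y <- rnbinom(length(time), mu = exp(5 - 0.01 * time), size = 5)

fit_pois <- glm(y ~ time, family = poisson(link = "log"))

# Pearson dispersion statistic: values well above 1 indicate over-dispersion.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Quasi-Poisson keeps the same coefficients but scales standard errors
# by the square root of the estimated dispersion.
fit_quasi <- glm(y ~ time, family = quasipoisson(link = "log"))
se_ratio <- summary(fit_quasi)$coefficients["time", "Std. Error"] /
  summary(fit_pois)$coefficients["time", "Std. Error"]

print(round(c(dispersion = dispersion, se_ratio = se_ratio), 2))
```

A negative-binomial model (for example via `MASS::glm.nb`) is the alternative named in the text; the quasi-Poisson refit is shown here only because it needs no extra packages.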
![figure 4: 04_ma3.png](./imgs/04_ma3.png)
*Figure 4. Monthly answers with a 3-month moving average; the dashed line marks the ChatGPT launch.*
![figure 5: 05_lm_fit.png](./imgs/05_lm_fit.png)
*Figure 5. Interrupted time-series regression: observed vs. fitted monthly answers.*
![figure 6: 06_pois_fit.png](./imgs/06_pois_fit.png)
*Figure 6. Poisson regression: observed vs. predicted monthly answers.*
![figure 7: 07_arima_forecast.png](./imgs/07_arima_forecast.png)
*Figure 7. ARIMA counterfactual forecast vs. observed post-event answer volumes.*
---
## 5. Conclusions and discussion
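Across the descriptive plots and models, the period after November 2022 shows a clear **slope** change in answer supply on Stack Overflow, while the immediate **level** change is not statistically distinguishable from zero. In the ITS regression, the post-launch slope term is -2.37 answers per month (p < 0.001) on top of a pre-existing decline of -0.86 per month, whereas the level term is -18.0 (p = 0.24); the Poisson model shows the same pattern. These changes coincide with the availability of ChatGPT and closely timed policy and moderation events.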
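
Because the Poisson model uses a log link, its coefficients are easier to read as percentage changes per month. The short calculation below takes the three coefficients printed in the run log and converts them; the variable names are chosen for this illustration.

```r
# Poisson coefficients (log scale) as printed in the run log.
coefs <- c(
  time         = -0.0044547,  # pre-existing monthly trend
  post_chatgpt = -0.0187737,  # level change at launch
  chatgpt_time = -0.0322028   # additional monthly trend after launch
)

# exp(coef) is a multiplicative effect; (exp(coef) - 1) * 100 is a percent change.
pct_change <- (exp(coefs) - 1) * 100

# After launch the two trend terms add on the log scale.
post_slope_pct <- (exp(coefs[["time"]] + coefs[["chatgpt_time"]]) - 1) * 100

print(round(pct_change, 2))
print(round(post_slope_pct, 2))
```

Read this way, the fitted decline is roughly 0.4% per month before launch and roughly 3.6% per month afterwards, under the model's assumptions.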
ARIMA counterfactuals trained on pre-event data give a baseline trajectory. When we compare this baseline to observed post-event values, observed counts fall well below it: on the hold-out period the mean error is -78.4 answers per month (RMSE 89.3), compared with a training RMSE of 32.9. This supports the idea that there was a shock beyond existing trends and seasonality.
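
The counterfactual comparison can be sketched with base R's `arima` and `predict`. This is an illustration on simulated data, not the project's `auto.arima` fit: the series, the ARIMA(1,1,0) order, and the size of the post-period decline are all assumptions made for the example.

```r
set.seed(7)

n_pre <- 60   # training months (hypothetical)
n_post <- 12  # hold-out months (hypothetical)

# Pre-period: random walk with mild downward drift. Post-period: steeper decline.
pre <- 200 + cumsum(rnorm(n_pre, mean = -0.5, sd = 3))
post <- pre[n_pre] + cumsum(rnorm(n_post, mean = -6, sd = 3))

# Fit only on the pre-period, then forecast across the hold-out window.
train_ts <- ts(pre, start = c(2018, 1), frequency = 12)
fit <- arima(train_ts, order = c(1, 1, 0))
fc <- predict(fit, n.ahead = n_post)

# 80% forecast band from the forecast standard errors.
z80 <- qnorm(0.9)
lower_80 <- as.numeric(fc$pred - z80 * fc$se)
upper_80 <- as.numeric(fc$pred + z80 * fc$se)

# Gap between what happened and what the pre-period model expected.
gap <- post - as.numeric(fc$pred)
below_band <- post < lower_80

print(round(mean(gap), 1))
print(sum(below_band))
```

A persistently negative gap, with observed values below the lower band, is the pattern that would indicate a shock beyond the pre-period dynamics.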
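The survey-based classifier does not reinforce this picture. In the logistic regression, the ChatGPT indicator has an odds ratio of 0.99 (95% CI 0.87 to 1.13, p = 0.92), so respondents flagged as ChatGPT users are not detectably less likely to report daily visits once age group and survey year are controlled for. Hold-out accuracy is 0.50 (precision 0.41, recall 0.64), which indicates little predictive signal. The indicator itself is a weak proxy: per the run log it was matched against the `SOAI` column in 2023 and `AISelect` in 2024. The survey therefore neither confirms nor rules out that developers are substituting conversational AI for Stack Overflow visits.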
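
The hold-out metrics can be recomputed directly from the confusion matrix in the run log. The sketch below also adds a majority-class baseline for context; that comparison is an addition for illustration and is not part of the original script.

```r
# Confusion matrix from the run log (rows = truth, columns = prediction).
conf_mat <- matrix(
  c(7560, 4009, 10549, 7218),
  nrow = 2,
  dimnames = list(truth = c("0", "1"), pred = c("0", "1"))
)

tp <- conf_mat["1", "1"]
tn <- conf_mat["0", "0"]
fp <- conf_mat["0", "1"]
fn <- conf_mat["1", "0"]

acc <- (tp + tn) / sum(conf_mat)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)

# Accuracy of always predicting the more common class ("not a frequent visitor").
baseline_acc <- max(rowSums(conf_mat)) / sum(conf_mat)

print(round(c(accuracy = acc, precision = precision,
              recall = recall, baseline = baseline_acc), 4))
```

The baseline of about 0.62 exceeds the model's accuracy of about 0.50, which is expected when the threshold is set to the positive-class share rather than 0.5, but it also shows how little the predictors separate the two groups.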
### Limitations
* **Causality is tentative.**
  * Policy changes and the moderation strike overlap with the ChatGPT rollout, making it difficult to cleanly attribute changes in answer volume to any single event.
  * External shocks (such as labor-market cycles, ecosystem-tooling changes, or shifts in documentation quality) may also contribute.
* **Survey constraints.**
  * Survey responses are self-reported and subject to recall and response biases.
  * The sample may not represent the full SO user base or the most active answerers.
Because of these limitations, we interpret the results as **strong correlational evidence** of a shift in answer supply and usage patterns, not as a sharp causal estimate. Future work should:
* Incorporate richer covariates (e.g., tag-level activity, user cohorts, question complexity).
* Explore quasi-experimental designs (such as synthetic controls) to better isolate the effect of AI tools and platform policies.
### Implications
For knowledge platforms, the analysis suggests that answer supply can be **sensitive to rapid changes in assistance tooling and governance**. In particular:
* Sustainable moderation capacity and clear, transparent AI guidance appear important to avoid destabilizing answer quality and volume.
* As conversational assistants become part of everyday developer workflows, platforms like Stack Overflow may need deeper integration paths (for example, exposing structured answers or metadata that assistants can consume directly).
* Balancing open contribution, quality control, and integration with external AI tools may be key to retaining community participation in an environment where “first-line help” increasingly comes from chatbots.
---
## References
* OpenAI. “Introducing ChatGPT.” OpenAI, 30 Nov. 2022.
* Prasnikar, D. “Policy: Generative AI (e.g., ChatGPT) Is Banned.” Meta Stack Overflow, 5 Dec. 2022.
* Mithical. “Moderation Strike: Stack Overflow, Inc. Cannot Consistently Ignore, Mistreat, and Malign Its Volunteers.” Meta Stack Exchange, 5 June 2023.
* Makyen. “Moderation Strike: Conclusion and the Way Forward.” Meta Stack Exchange, 7 Aug. 2023.
* Carr, D. F. “Stack Overflow Is ChatGPT Casualty: Traffic Down 14% in March.” Similarweb Insights, 19 Apr. 2023.
* “Database Schema Documentation for the Public Data Dump and SEDE.” Meta Stack Exchange (FAQ), 4 Oct. 2022.
* Stack Overflow. *Stack Overflow Developer Survey* (2023–2025).
## Image gallery (additional figures)
![Fig 8](./imgs/08_survey_bar.png)
* *Figure 8.* Additional pre/post comparison plots.
![Fig 9](./imgs/09_logit_probs.png)
* *Figure 9.* Additional figure from the provided results.
@@ -0,0 +1,410 @@
# News Popularity in Multiple Social Media Platforms
This project analyzes the **News Popularity in Multiple Social Media Platforms** dataset from the UCI Machine Learning Repository. The data contains ~93k news items collected between November 2015 and July 2016, with their final popularity on Facebook, Google+ and LinkedIn across four topics: *economy*, *microsoft*, *obama* and *palestine*.
---
## 1. Exploratory Data Analysis
### 1.1 Data overview and cleaning
We work primarily with `Data/News_Final.csv`, which has **93,239** rows and 11 variables:
- `IDLink`: numeric ID of the article
- `Title`, `Headline`: short text fields
- `Source`: news outlet that originally published the story
- `Topic`: one of {economy, microsoft, obama, palestine}
- `PublishDate`: publication timestamp
- `SentimentTitle`, `SentimentHeadline`: numeric sentiment scores derived from title and headline text
- `Facebook`, `GooglePlus`, `LinkedIn`: final popularity on each social media platform
According to the dataset documentation, **-1** in the popularity variables indicates that no final popularity value was observed. In the code, any value `< 0` in `Facebook`, `GooglePlus`, or `LinkedIn` is therefore replaced with `NaN`. Missing popularity values are later dropped on a per-model basis. `PublishDate` is converted to a proper timestamp, and a numeric time feature
```text
DaysSinceEpoch = days since 1970-01-01
```
is created to allow inclusion of temporal trends in the models. We also log-transform Facebook popularity:
```text
log_Facebook = log1p(Facebook)
```
which is used as the target for regression models.
---
### 1.2 Popularity distributions
Figure 1 shows a histogram of Facebook share counts on a **logarithmic x-axis**, after removing missing and zero values.
![distribution of facebook popularity](imgs/eda_facebook_hist.png)
*Figure 1: Distribution of Facebook popularity on a log x-axis.*
The distribution is extremely right-skewed:
- Most articles receive very few shares.
- A small number of “viral” articles receive thousands of shares.
On the cleaned data, summary statistics for Facebook shares are approximately:
- median is approx 8
- mean is approx 129
- 90th percentile is approx 214
- 99th percentile is approx 2,322
- max = 49,211
Google+ and LinkedIn exhibit similar heavy-tailed patterns (with smaller absolute scales), which matches the description of the dataset creators ([arXiv][1]).
Figure 2 shows the distribution of `log1p(Facebook)`.
![distribution of log-transformed facebook popularity](imgs/eda_log_facebook_hist.png)
*Figure 2: Distribution of log-transformed Facebook popularity.*
The log transform compresses the heavy tail and produces a more regular, unimodal distribution. This justifies using `log1p(popularity)` as the regression target: it reduces the influence of rare extreme outliers while keeping them in the data, which is important because viral stories are the phenomena of interest.
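As a small illustration of how the transform compresses the tail, the sketch below applies `log1p` to the approximate summary values quoted above (median, 90th percentile, 99th percentile, maximum) and inverts it with `expm1`. It uses only the standard library, and the variable names are illustrative rather than taken from the project script.

```python
import math

# Approximate Facebook share counts quoted in the summary statistics above.
shares = {"median": 8, "p90": 214, "p99": 2322, "max": 49211}

# log1p(x) = log(1 + x) is defined at 0, so zero-share articles are kept.
log_shares = {name: math.log1p(value) for name, value in shares.items()}

for name, value in log_shares.items():
    print(f"{name:>6}: {shares[name]:>6} shares -> log1p = {value:.2f}")

# expm1 inverts log1p, mapping a log-scale value back to a share count.
recovered_max = math.expm1(log_shares["max"])
```

On the raw scale the maximum is several thousand times the median; on the log scale the two differ by a factor of less than six, which is why extreme stories no longer dominate the fit.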
---
### 1.3 Topic effects
The four topics are not equally represented:
- economy: 33,928 items
- obama: 28,610
- microsoft: 21,858
- palestine: 8,843
Figure 3 shows the mean log-Facebook popularity by topic.
![average facebook popularity by topic](imgs/eda_mean_by_topic.png)
*Figure 3: Mean log-Facebook popularity by topic.*
Key observations:
- **obama** stories clearly have the highest average popularity.
- **microsoft** is slightly above **economy** and **palestine**.
- In original share counts, obama articles average roughly an order of magnitude more shares than economy/microsoft/palestine stories, but all topics remain strongly skewed.
This suggests that topic is an important categorical predictor for popularity, and motivates including it as a one-hot encoded feature in the models.
---
### 1.4 Sentiment and popularity
Sentiment scores from the title and headline are continuous values roughly in the interval [-1, 1]. Their empirical distributions are centered very close to 0 with standard deviations around 0.14, indicating that most titles and headlines are only mildly positive or negative.
Figure 4 plots a 5,000-row sample of `SentimentTitle` against `log_Facebook`.
![title sentiment vs facebook popularity](imgs/eda_sentiment_vs_popularity.png)
*Figure 4: Scatter of title sentiment vs log-Facebook popularity (sample of 5,000 articles).*
The scatter plot shows:
- A dense vertical band near sentiment 0, reflecting many neutral titles.
- Viral and non-viral articles scattered across the full sentiment range, with no obvious linear trend.
Empirically, the correlation between sentiment and Facebook popularity is almost zero (|r| is approx 0.01). This suggests that sentiment alone is a weak predictor of popularity; we still include it in models because it may interact with topic or time, but we do not expect it to explain much variance by itself.
---
### 1.5 EDA conclusions
From the exploratory analysis we conclude:
1. **Popularity variables are non-negative, highly skewed, and heavy-tailed.**
   - Log-transforming shares yields more regular distributions, so regression models should target `log1p(popularity)` instead of raw counts.
2. **Topic has a strong effect on expected popularity.**
   - In particular, obama-related news is more popular on Facebook; microsoft is relatively stronger on LinkedIn (from descriptive statistics, not shown here).
3. **Title/headline sentiment has little linear relationship with popularity.**
   - It should not be expected to drive predictions strongly.
4. **There are many extreme outliers (viral stories), but these are the signal we care about.**
   - We choose *not* to remove them; instead, we rely on robust models and log-transformed targets.
These observations motivate a modeling strategy that combines:
- **Linear models** (to quantify simple topic/sentiment effects on log-popularity).
- **Non-linear tree-based models** (to capture complex relationships and heavy-tailed behaviour).
- **Classification** of viral vs non-viral stories.
- **Clustering** of time-series trajectories to identify typical growth patterns.
The next section formalizes these ideas.
---
## 2. Model Development, Validation and Optimization
We develop **five** models: three regression models (including a dimension-reduced variant), one classification model, and one clustering model. This covers regression, classification and unsupervised learning objectives, and explicitly examines the impact of dimensionality reduction.
All supervised models use:
- Train/test split: **80% training, 20% test**, `random_state=42`.
- Evaluation on the held-out test set only (no peeking).
- Metrics:
  - Regression: R² and RMSE on the log scale (using `root_mean_squared_error`).
  - Classification: accuracy, F1 for the positive class, ROC AUC and confusion matrix.
### 2.1 Common preprocessing
For each model:
1. Replace `-1` in `Facebook`, `GooglePlus`, `LinkedIn` with `NaN`.
2. Drop rows with missing values in the specific target variable.
3. Use `DaysSinceEpoch` as a numeric representation of `PublishDate`.
4. Where appropriate, use `log_Facebook = log1p(Facebook)` as the regression target.
5. Encode `Topic` using one-hot encoding with economy as the reference level (`drop_first=True`).
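To make the reference-level choice in step 5 concrete, here is a minimal sketch on a toy frame (the sentiment values are invented for illustration). With `drop_first=True`, `pd.get_dummies` drops the first level in sorted order, which for these four topics is `economy`.

```python
import pandas as pd

# Toy frame with one row per topic (illustrative values only).
toy = pd.DataFrame({
    "Topic": ["economy", "microsoft", "obama", "palestine"],
    "SentimentTitle": [0.0, 0.1, -0.1, 0.05],
})

# drop_first=True drops the alphabetically first level ("economy"),
# which then acts as the implicit reference category.
encoded = pd.get_dummies(toy, columns=["Topic"], drop_first=True)
print(list(encoded.columns))
```

The economy row ends up with all three dummy columns unset, so each topic coefficient in the linear models is read as a shift relative to economy.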
For time-series models we also use `Data/Facebook_Economy.csv`, which stores Facebook popularity snapshots TS1–TS144 every 20 minutes for economy articles. We join it with `News_Final.csv` on `IDLink` and restrict to:
- `Topic == "economy"`
- Time slices **TS1–TS50** as predictors (roughly the first 16–17 hours)
- Final log-Facebook popularity as the target
Negative TS values are interpreted as “no observed popularity yet” and are set to 0.
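The window lengths quoted above follow directly from the 20-minute sampling interval. A short sketch of the arithmetic (the helper name `slice_to_hours` is hypothetical):

```python
MINUTES_PER_SLICE = 20  # sampling interval stated for the TS columns


def slice_to_hours(n_slices, minutes_per_slice=MINUTES_PER_SLICE):
    """Elapsed hours covered by the first n_slices time slices."""
    return n_slices * minutes_per_slice / 60


early_hours = slice_to_hours(50)    # predictor window, TS1..TS50
full_hours = slice_to_hours(144)    # full tracking window, TS1..TS144
print(f"TS50 ~ {early_hours:.1f} h, TS144 = {full_hours:.0f} h")
```

So the predictors cover roughly the first third of the two-day tracking window.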
---
### 2.2 Regression Model 1 – Linear regression on static features
**Goal.** Predict log-Facebook popularity using only static metadata (no early popularity feedback).
- **Target:** `y = log_Facebook` for all topics.
- **Features:**
  - `SentimentTitle`, `SentimentHeadline`
  - `DaysSinceEpoch` (publication time)
  - Topic one-hot dummies: `Topic_microsoft`, `Topic_obama`, `Topic_palestine` (economy is the implicit baseline).
We fit an ordinary least squares linear regression on the training split and evaluate on the test set.
**Results (test set):**
- **R² is approx 0.157**
- **RMSE is approx 1.86** in log space
Figure 5 plots actual against predicted log-Facebook values.
![model 1: actual vs predicted](imgs/model1_actual_vs_predicted.png)
*Figure 5: Model 1 predictions vs actual log-Facebook values.*
The predictions are compressed into a narrow band, under-predicting viral articles and over-predicting low-popularity ones. Key coefficients:
- `Topic_obama` is approx +1.78 (large positive shift vs economy)
- `Topic_microsoft` is approx +0.10
- `Topic_palestine` is approx +0.02
- `SentimentTitle` is approx -0.38, `SentimentHeadline` is approx -0.06
- `DaysSinceEpoch` is approx -0.0007 (tiny downward trend over time)
Interpretation:
- Topic has a clear effect (especially obama).
- Sentiment effects are small and slightly negative.
- The model explains only ~16% of the variance in log-popularity, confirming that static features alone are weak predictors.
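Because the target is `log1p(Facebook)`, a coefficient acts additively in log space, which corresponds to a multiplicative change in `1 + shares`. The sketch below converts the rounded topic coefficients reported above into such multipliers. It is an interpretation aid under the usual "other features held fixed" reading of a linear model, not an additional fitted result.

```python
import math

# Rounded Model 1 topic coefficients reported above (economy is the baseline).
coefficients = {
    "Topic_obama": 1.78,
    "Topic_microsoft": 0.10,
    "Topic_palestine": 0.02,
}

# exp(beta) is the multiplicative change in (1 + shares) relative to economy,
# holding the other features fixed.
multipliers = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, factor in multipliers.items():
    print(f"{name}: x{factor:.2f} on (1 + shares)")
```

Under this reading, an obama story is predicted to have roughly six times the `1 + shares` of a comparable economy story. This describes the log-scale fit and is not the same quantity as the ratio of raw mean share counts.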
---
### 2.3 Regression Model 2 – Random forest on early time slices
**Goal.** Predict final log-Facebook popularity for **economy** stories using early Facebook popularity time slices and sentiment.
- **Target:** `log_Facebook` for the economy topic, joined with the `Facebook_Economy` time series.
- **Features:**
  - TS1–TS50 (early cumulative popularity counts, cleaned: negative → 0)
  - `SentimentTitle`, `SentimentHeadline`
We fit a `RandomForestRegressor` with:
- 120 estimators,
- `min_samples_leaf=2`,
- `max_depth=None` (trees grow fully),
- `n_jobs=-1`, `random_state=42`.
**Results (test set):**
- **R² is approx 0.746**
- **RMSE is approx 0.86** (log scale)
Feature importances indicate:
- `TS50` alone contributes ~81% of total importance.
- Combined sentiment variables contribute ~17%.
- Earlier TS features each have very small marginal importance.
Thus, knowing an article's popularity after ~17 hours (TS50) is already highly predictive of its final 2-day popularity. Early engagement is a much stronger signal than sentiment or publish time.
---
### 2.4 Regression Model 3 – PCA + random forest (dimension reduction)
Model 3 examines the effect of **dimension reduction** on performance.
Instead of using all 50 TS features directly, we:
1. Standardize TS1–TS50 with `StandardScaler`.
2. Apply PCA with `n_components=10`.
3. Concatenate the 10 PCA components with the two sentiment features (`SentimentTitle`, `SentimentHeadline`).
4. Train the same `RandomForestRegressor` as Model 2 on this 12-dimensional feature space.
PCA results:
- The 1st component explains approximately **93.5%** of variance.
- The first 10 components together explain approximately **99.9%** of variance.
**Results (test set):**
- **R² is approx 0.745**
- **RMSE is approx 0.87**
Compared to Model 2:
- R² decreases only slightly (0.746 → 0.745).
- RMSE increases minimally (0.862 → 0.865).
So PCA reduces dimensionality from 50 TS features to 10 components with **negligible loss of predictive performance**. The first PCA components effectively summarize overall popularity level and growth pattern, which are the dominant signals for final popularity.
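The explained-variance figures can be checked by accumulating the ratios that the script prints for Model 3 (as recorded in the repository's output file). The sketch below uses only the standard library; `components_needed` is a hypothetical helper name.

```python
from itertools import accumulate

# Explained-variance ratios of the 10 PCA components, as printed by the script.
ratios = [
    9.38529911e-01, 3.24317512e-02, 1.76049987e-02, 7.50439628e-03,
    1.90148973e-03, 6.83679307e-04, 3.57135169e-04, 2.12058930e-04,
    1.33577763e-04, 9.66846072e-05,
]

cumulative = list(accumulate(ratios))


def components_needed(target):
    """Smallest number of leading components whose cumulative ratio >= target."""
    for k, total in enumerate(cumulative, start=1):
        if total >= target:
            return k
    return None  # target not reached with the available components


print("total:", round(cumulative[-1], 4))
print("components for 99%:", components_needed(0.99))
```

On these numbers, the first four components already pass 99% of the variance in the standardized TS features, which is consistent with the negligible performance difference between Models 2 and 3.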
---
### 2.5 Classification Model 4 – Logistic regression for viral vs non-viral
**Goal.** Classify whether an article is *viral* on Facebook, defined as being in the top 10% of final popularity.
- **Target:**
  - `viral_fb = 1` if `Facebook ≥ 214` (90th percentile), otherwise 0.
  - Class distribution: ~10% positive, ~90% negative.
- **Features:**
  - `SentimentTitle`, `SentimentHeadline`
  - `DaysSinceEpoch`
  - Topic dummies as before
We intentionally **do not use time-slice features** here to simulate making a decision at or before publication, when no engagement data is available yet.
We fit a `LogisticRegression` with `max_iter=500` and `class_weight="balanced"` to counter class imbalance.
**Results (test set):**
- **Accuracy is approx 0.73**
  - A naive classifier that always predicts “non-viral” would obtain approximately 0.90 accuracy, highlighting that raw accuracy is misleading under imbalance.
- **F1 (viral class) is approx 0.36**
- **ROC AUC is approx 0.75**
The ROC AUC of 0.75 indicates decent **ranking ability**: the model tends to assign higher probabilities to truly viral articles than to non-viral ones. However, at the default 0.5 threshold it generates many false positives; tuning the probability threshold would be necessary in practice depending on the business trade-off between missing viral content and wasting attention on non-viral items.
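The reported accuracy and F1 can be reproduced by hand from the confusion matrix that the script prints for Model 4, which also makes the false-positive issue explicit. The layout assumed here is scikit-learn's default for `confusion_matrix`: rows are actual classes, columns are predicted classes, in label order `[0, 1]`.

```python
# Confusion matrix printed by the script for Model 4
# (rows = actual, columns = predicted; class order [non-viral, viral]).
tn, fp = 10669, 4023
fn, tp = 406, 1230

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted-viral items that are viral
recall = tp / (tp + fn)      # share of viral items that are caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority (non-viral) class.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} baseline={baseline_accuracy:.3f}")
```

Recall is about 0.75 while precision is about 0.23, so roughly three in four items flagged as viral at the 0.5 threshold are not viral. This is the trade-off that threshold tuning would adjust.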
---
### 2.6 Clustering Model 5 – K-means on time-series shapes
To understand typical growth trajectories of popularity, we cluster early time-series patterns.
- **Features:** TS1–TS50, standardized with `StandardScaler`.
- **Sample:** random subset of 5,000 economy+Facebook articles to keep computation manageable.
- **Algorithm:** `KMeans(n_clusters=3, n_init=10, random_state=42)`.
**Results:**
- **Silhouette score is approx 0.97**, indicating well-separated clusters (although partly due to one large cluster vs a few small ones).
- Cluster sizes and mean final Facebook shares:
| cluster | count | mean shares | median | max |
| ------: | ----: | ----------: | -----: | ----: |
| 0 | 4,978 | ~37 | 3 | 7,045 |
| 1 | 1 | 1,886 | 1,886 | 1,886 |
| 2 | 21 | ~2,478 | 1,291 | 8,010 |
Inspecting centroid time series (TS1, TS10, TS25, TS50):
- **Cluster 0:** low TS1 (~0.3), slow growth, TS50 ≈ 17 → “normal/low popularity” baseline; almost all articles.
- **Cluster 2:** TS1 ≈ 23, TS10 ≈ 211, TS50 ≈ 1,388 → early rapid take-off and sustained growth; these are clearly **viral** trajectories.
- **Cluster 1:** single extreme **super-viral** outlier with TS1 ≈ TS50 ≈ 1,886.
Clustering therefore uncovers distinct popularity regimes: ordinary stories, viral stories, and rare super-viral events.
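The caveat about cluster imbalance can be illustrated on synthetic data (not the news dataset). When nearly all points sit in one tight group and a few sit far away, the silhouette score comes out very high even though the clustering says little about structure inside the dominant group. The data, seed, and cluster count below are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the TS features: one large low-activity group
# and a handful of far-away "viral" points (illustrative only).
ordinary = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
viral = rng.normal(loc=100.0, scale=1.0, size=(5, 2))
X = np.vstack([ordinary, viral])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

score = silhouette_score(X, labels)
sizes = np.bincount(labels)
print("cluster sizes:", sorted(sizes.tolist()))
print("silhouette:", round(float(score), 3))
```

A high score in this setting mostly reflects the distance between the outliers and the bulk, so it is worth reading alongside the cluster sizes, as the table above does.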
---
## 3. Decisions and Practical Use
### 3.1 What do the models tell us?
**1. Static metadata is not enough for precise prediction.**
Model 1, using only topic, time and sentiment, explains only about 16% of the variance in log-Facebook popularity. The EDA already indicated weak correlations between sentiment and engagement, and the model confirms that topic is the only strong static predictor. This means:
- Before any user feedback is observed, we can form only a rough guess about popularity (e.g., “obama stories tend to do better”), but detailed predictions are unreliable.
**2. Early engagement is the key signal.**
Models 2 and 3 show that once ~16 hours of Facebook feedback are available:
- Random forests can explain ~75% of the variance in final log-popularity.
- PCA compresses the 50-dimensional TS inputs to 10 components with essentially no loss in performance.
In practice, this means that **monitoring early time series of shares is crucial**. Stories that are already accumulating shares quickly by TS50 are extremely likely to end up as the most popular items after two days.
**3. Logistic regression is useful for ranking, not for definitive labels.**
The viral vs non-viral classifier has:
- Good ranking ability (ROC AUC ~0.75).
- Moderate F1 score and relatively low accuracy compared to the majority baseline.
This makes it better suited as a **priority score** than as a hard decision rule. For example, an editorial team might sort draft stories by predicted viral probability to decide where to invest additional editorial resources, but should not automatically discard stories predicted to be non-viral.
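A minimal sketch of the priority-score idea follows. The story titles and probabilities are hypothetical placeholders standing in for a classifier's `predict_proba` output, and `prioritize` is an illustrative helper, not part of the project script.

```python
# Hypothetical draft stories with predicted viral probabilities,
# e.g. taken from a classifier's predict_proba output.
drafts = [
    {"title": "Story A", "p_viral": 0.62},
    {"title": "Story B", "p_viral": 0.18},
    {"title": "Story C", "p_viral": 0.81},
    {"title": "Story D", "p_viral": 0.44},
]


def prioritize(stories, top_k):
    """Split stories into the top_k by predicted probability and the rest.

    Nothing is discarded: the remaining stories are simply lower priority.
    """
    ranked = sorted(stories, key=lambda s: s["p_viral"], reverse=True)
    return ranked[:top_k], ranked[top_k:]


focus, backlog = prioritize(drafts, top_k=2)
print([s["title"] for s in focus])
```

Ranking uses only the ordering of the probabilities, which is what ROC AUC measures, so it does not depend on choosing a classification threshold.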
**4. Clustering uncovers growth archetypes.**
K-means reveals three typical growth shapes:
1. Slow/low growth (most items).
2. Clearly viral trajectories.
3. A tiny number of super-viral events.
Recognizing that an article's early TS pattern matches the viral or super-viral cluster can trigger decisions such as:
- Featuring the article more prominently on the homepage.
- Allocating budget for promoted posts.
- Producing follow-up content while interest is high.
### 3.2 How useful are these models for real decisions?
A practical decision workflow informed by this analysis could be:
1. **Pre-publication / immediately at publication**
   Use the logistic regression model and static features (topic, sentiment, time) to assign each new article a baseline probability of becoming viral. This can help prioritize which stories to monitor more closely, but should not be the sole basis for publication decisions.
2. **Early post-publication (first few hours)**
   Once some time-slice information is available (TS1–TS10), use clustering to see whether the article's early trajectory resembles known viral patterns. Articles already in the viral cluster are good candidates for early promotion.
3. **Mid-window (around TS50)**
   At ~16–17 hours, feed TS1–TS50 into the PCA + random forest regressor (Model 3) to estimate final reach. This estimate can guide decisions about:
   - How long to keep the story on front pages.
   - Whether to schedule follow-ups or derivative content.
   - Where to allocate marketing/promotional resources.
4. **Limitations**
   - Popularity is still highly stochastic; even with R² ≈ 0.75 in the best case, there is considerable residual uncertainty.
   - Models trained on this dataset focus on four specific topics and a particular time period (2015–2016). Performance may degrade when applied to different domains, languages or time spans. ([arXiv][1])
Overall, these models are best used for **relative ranking and triage**, that is, for deciding which articles deserve extra attention, rather than for exact point predictions of future share counts. Combining static features, early engagement signals, and growth-pattern clustering yields a practical decision support tool for newsrooms and social media teams working with limited resources.
If you actually read this far...nice! :D
[1]: https://arxiv.org/abs/1801.07055 "Multi-Source Social Feedback of Online News Feeds"
Binary file not shown.
+363
View File
@@ -0,0 +1,363 @@
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    r2_score,
    root_mean_squared_error,
    accuracy_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    silhouette_score,
)
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
# ensure imgs dir exists
os.makedirs("imgs", exist_ok=True)
# data loading
zip_path = "news+popularity+in+multiple+social+media+platforms.zip"
with zipfile.ZipFile(zip_path, "r") as zf:
    with zf.open("Data/News_Final.csv") as f:
        news = pd.read_csv(f)

# basic cleaning
pop_cols = ["Facebook", "GooglePlus", "LinkedIn"]

# encode -1 as missing
for col in pop_cols:
    news.loc[news[col] < 0, col] = np.nan

# convert publishdate and add numeric time feature
news["PublishDate"] = pd.to_datetime(news["PublishDate"])
news["DaysSinceEpoch"] = (
    news["PublishDate"] - pd.Timestamp("1970-01-01")
).dt.days

# log transform facebook popularity where available
news["log_Facebook"] = np.log1p(news["Facebook"])
# eda helpers (optional plotting)
def plot_eda():
    # histogram of raw shares on a log x-axis (zeros and missing removed)
    plt.figure()
    vals = news["Facebook"].dropna()
    vals = vals[vals > 0]
    vals.plot.hist(bins=50)
    plt.xlabel("facebook shares")
    plt.ylabel("count")
    plt.title("distribution of facebook popularity")
    plt.xscale("log")
    plt.tight_layout()
    plt.savefig("imgs/eda_facebook_hist.png")
    plt.close()

    # histogram of the log-transformed target
    plt.figure()
    news["log_Facebook"].dropna().plot.hist(bins=50)
    plt.xlabel("log1p(facebook shares)")
    plt.ylabel("count")
    plt.title("distribution of log-transformed facebook popularity")
    plt.tight_layout()
    plt.savefig("imgs/eda_log_facebook_hist.png")
    plt.close()

    # mean log popularity per topic
    mean_by_topic = (
        news.groupby("Topic")["log_Facebook"].mean().sort_values()
    )
    plt.figure()
    mean_by_topic.plot(kind="bar")
    plt.ylabel("mean log1p(facebook shares)")
    plt.title("average facebook popularity by topic")
    plt.tight_layout()
    plt.savefig("imgs/eda_mean_by_topic.png")
    plt.close()

    # sentiment vs popularity on a 5,000-row sample
    sample = news.dropna(
        subset=["log_Facebook", "SentimentTitle"]
    ).sample(5000, random_state=42)
    plt.figure()
    plt.scatter(
        sample["SentimentTitle"],
        sample["log_Facebook"],
        alpha=0.3,
    )
    plt.xlabel("sentimenttitle")
    plt.ylabel("log1p(facebook shares)")
    plt.title("title sentiment vs facebook popularity (sample)")
    plt.tight_layout()
    plt.savefig("imgs/eda_sentiment_vs_popularity.png")
    plt.close()
# model 1: linear regression
def run_model_1():
    df = news.dropna(subset=["log_Facebook"]).copy()
    X = df[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]]
    X = pd.get_dummies(X, columns=["Topic"], drop_first=True)
    y = df["log_Facebook"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)
    print("model 1 linear regression")
    print("r2:", r2)
    print("rmse:", rmse)
    print("coefficients:")
    print(pd.Series(linreg.coef_, index=X.columns))

    # optional diagnostic plot
    plt.figure()
    plt.scatter(y_test, y_pred, alpha=0.3)
    plt.xlabel("actual log1p(facebook)")
    plt.ylabel("predicted log1p(facebook)")
    plt.title("model 1: actual vs predicted")
    plt.tight_layout()
    plt.savefig("imgs/model1_actual_vs_predicted.png")
    plt.close()

    return linreg, (X_test, y_test, y_pred)
# prepare economy + facebook time-slice data
with zipfile.ZipFile(zip_path, "r") as zf:
    with zf.open("Data/Facebook_Economy.csv") as f:
        fb_econ = pd.read_csv(f)

# ensure integer id for join
news["IDLink_int"] = news["IDLink"].astype(int)
news_econ = news[news["Topic"] == "economy"].copy()
news_econ["IDLink_int"] = news_econ["IDLink"].astype(int)
fb_econ_merged = fb_econ.merge(
    news_econ, left_on="IDLink", right_on="IDLink_int", how="inner"
)

# clean time-slice features
ts_cols = [c for c in fb_econ.columns if c.startswith("TS")]
for col in ts_cols:
    fb_econ_merged.loc[fb_econ_merged[col] < 0, col] = 0

# drop rows with missing facebook target
fb_econ_merged = fb_econ_merged[fb_econ_merged["Facebook"].notna()].copy()
fb_econ_merged["log_Facebook"] = np.log1p(fb_econ_merged["Facebook"])
ts_cols_early = ts_cols[:50]
# model 2: random forest on raw early ts
def run_model_2():
    X = fb_econ_merged[ts_cols_early + ["SentimentTitle", "SentimentHeadline"]]
    y = fb_econ_merged["log_Facebook"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    rf = RandomForestRegressor(
        n_estimators=120,
        random_state=42,
        n_jobs=-1,
        max_depth=None,
        min_samples_leaf=2,
    )
    rf.fit(X_train, y_train)

    # evaluate the same forest whose importances are reported below,
    # using the raw (unscaled, non-pca) features
    y_pred = rf.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)
    print("model 2 random forest on raw ts")
    print("r2:", r2)
    print("rmse:", rmse)

    importances = pd.Series(rf.feature_importances_, index=X.columns)
    print("top importances:")
    print(importances.sort_values(ascending=False).head(10))

    return rf, (X_test, y_test, y_pred)
# model 3: pca + random forest
def run_model_3():
    ts = fb_econ_merged[ts_cols_early]
    sent = fb_econ_merged[["SentimentTitle", "SentimentHeadline"]]
    X = pd.concat([ts, sent], axis=1)
    y = fb_econ_merged["log_Facebook"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # scale and project only the ts columns; fit on train, apply to test
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train[ts_cols_early])
    X_test_scaled = scaler.transform(X_test[ts_cols_early])

    pca = PCA(n_components=10, random_state=42)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    # append the two sentiment features to the 10 components
    train_sent = X_train[["SentimentTitle", "SentimentHeadline"]].values
    test_sent = X_test[["SentimentTitle", "SentimentHeadline"]].values
    X_train_final = np.hstack([X_train_pca, train_sent])
    X_test_final = np.hstack([X_test_pca, test_sent])

    rf = RandomForestRegressor(
        n_estimators=120,
        random_state=42,
        n_jobs=-1,
        max_depth=None,
        min_samples_leaf=2,
    )
    rf.fit(X_train_final, y_train)
    y_pred = rf.predict(X_test_final)

    r2 = r2_score(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)
    print("model 3 random forest on pca(ts)")
    print("r2:", r2)
    print("rmse:", rmse)
    print("pca variance explained (first 10):", pca.explained_variance_ratio_)
    print("total variance explained:", pca.explained_variance_ratio_.sum())

    return rf, (X_test, y_test, y_pred), (pca, scaler)
# model 4: logistic regression (viral vs non-viral)
def run_model_4():
    df = news.copy()
    df = df[df["Facebook"].notna()].copy()

    # viral = top 10% of final facebook shares
    threshold = df["Facebook"].quantile(0.9)
    df["viral_fb"] = (df["Facebook"] >= threshold).astype(int)

    X = df[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]]
    X = pd.get_dummies(X, columns=["Topic"], drop_first=True)
    y = df["viral_fb"]

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,
        random_state=42,
        stratify=y,
    )

    clf = LogisticRegression(
        max_iter=500,
        class_weight="balanced",
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)
    cm = confusion_matrix(y_test, y_pred)
    print("model 4 logistic regression (viral vs non-viral)")
    print("threshold (shares):", threshold)
    print("accuracy:", acc)
    print("f1 (positive class):", f1)
    print("roc auc:", auc)
    print("confusion matrix:\n", cm)

    return clf, (X_test, y_test, y_pred, y_proba)
# model 5: k-means clustering on ts shapes
def run_model_5():
    X = fb_econ_merged[ts_cols_early].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # random subset of 5,000 rows to keep computation manageable
    rng = np.random.RandomState(42)
    idx = rng.choice(X_scaled.shape[0], size=5000, replace=False)
    X_sample = X_scaled[idx]
    fb_sample = fb_econ_merged["Facebook"].values[idx]

    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    kmeans.fit(X_sample)
    labels = kmeans.labels_

    sil = silhouette_score(X_sample, labels)
    print("model 5 kmeans on ts shapes")
    print("silhouette score:", sil)

    cluster_df = pd.DataFrame(
        {"cluster": labels, "Facebook": fb_sample}
    )
    print(cluster_df.groupby("cluster")["Facebook"].agg(
        ["count", "mean", "median", "max"]
    ))

    # map centroids back to the original share-count scale
    centers_scaled = kmeans.cluster_centers_
    centers = scaler.inverse_transform(centers_scaled)
    centers_df = pd.DataFrame(centers, columns=ts_cols_early)
    summary = pd.DataFrame({
        "cluster": list(range(centers_df.shape[0])),
        "avg_ts": centers_df.mean(axis=1),
        "ts1": centers_df["TS1"],
        "ts10": centers_df["TS10"],
        "ts25": centers_df["TS25"],
        "ts50": centers_df["TS50"],
    })
    print("cluster centroid summary:\n", summary)

    return kmeans, scaler, summary
if __name__ == "__main__":
    run_model_1()
    run_model_2()
    run_model_3()
    run_model_4()
    run_model_5()
    plot_eda()
Binary file not shown.
+53
View File
@@ -0,0 +1,53 @@
model 1 linear regression
r2: 0.1566089012155698
rmse: 1.8625218879551908
coefficients:
SentimentTitle -0.383499
SentimentHeadline -0.064708
DaysSinceEpoch -0.000678
Topic_microsoft 0.101848
Topic_obama 1.779152
Topic_palestine 0.023738
dtype: float64
model 2 random forest on raw ts
r2: 0.7441325592979975
rmse: 0.8661035218490399
top importances:
TS50 0.810814
SentimentHeadline 0.099992
SentimentTitle 0.067386
TS49 0.001883
TS48 0.000589
TS15 0.000503
TS18 0.000503
TS13 0.000498
TS24 0.000498
TS10 0.000480
dtype: float64
model 3 random forest on pca(ts)
r2: 0.7442278904925559
rmse: 0.8659421602173341
pca variance explained (first 10): [9.38529911e-01 3.24317512e-02 1.76049987e-02 7.50439628e-03
1.90148973e-03 6.83679307e-04 3.57135169e-04 2.12058930e-04
1.33577763e-04 9.66846072e-05]
total variance explained: 0.9994556829781833
model 4 logistic regression (viral vs non-viral)
threshold (shares): 214.0
accuracy: 0.7287481626653601
f1 (positive class): 0.35709101466105386
roc auc: 0.7530964866530827
confusion matrix:
[[10669 4023]
[ 406 1230]]
model 5 kmeans on ts shapes
silhouette score: 0.9732852082508215
count mean median max
cluster
0 4978 36.751708 3.0 7045.0
1 1 1886.000000 1886.0 1886.0
2 21 2477.761905 1291.0 8010.0
cluster centroid summary:
cluster avg_ts ts1 ts10 ts25 ts50
0 0 8.317766 0.297710 2.959221 7.836079 17.221977
1 1 1885.920000 1885.000000 1886.000000 1886.000000 1886.000000
2 2 640.917143 22.761905 211.142857 579.047619 1387.619048
+25
View File
@@ -0,0 +1,25 @@
Conduct the following analysis for the dataset:
1. Exploratory Data Analysis
Explore the statistical aspects of the dataset. Analyze the
distributions and provide summaries of the relevant statistics. Perform any cleaning,
transformations, interpolations, smoothing, outlier detection/ removal, etc. required on the
data. Include figures and descriptions of this exploration and a short description of what
you concluded (e.g. nature of distribution, indication of suitable model approaches you
would try, etc.) Min.1 page text + graphics (required).
2. Model Development, Validation and Optimization
Develop and evaluate three (4000-level) or four (6000-level) or more J models. If possible,
these models should cover more than one objective, i.e. regression, classification,
clustering. Consider the effect of dimension reduction of the dataset on model
performance. Different models means different combinations of an algorithm and a
formula (input and output features). The choice of independent and response variables is
up to you. Explain why you chose them. Construct the models, test/ validate them. Briefly explain the
validation approach. You can use any method(s) covered in the course. Include your code
in your submission. Compare model results if applicable. Report the results of the model
(fits, coefficients, sample trees, other measures of fit/ importance, etc., predictors and
summary statistics). Min. 2 pages of text + graphics (required).
3. Decisions
Describe your conclusions from the model
fits, predictions and how well (or not) it could be used for decisions and why. Min. 1/2 page
of text + graphics.
+9
View File
@@ -0,0 +1,9 @@
{
"[r]": {
// generated automatically? What even....
"editor.wordSeparators": "`~!@#$%^&*()-=+[{]}\\|;:'\",<>/",
"editor.indentSize": "tabSize",
"editor.useTabStops": true,
}
}
+41
View File
@@ -0,0 +1,41 @@
##########################################
### Principal Component Analysis (PCA) ###
##########################################
## load libraries
library(ggplot2)
library(ggfortify)
library(GGally)
library(e1071)
library(class)
library(psych)
library(readr)
## set working directory so that files can be referenced without the full path
setwd("~/Courses/Data Analytics/Fall25/labs/lab 4/")
## read dataset
wine <- read_csv("wine.data", col_names = FALSE)
## set column names
names(wine) <- c("Type","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid Phenols","Proanthocyanins","Color Intensity","Hue","Od280/od315 of diluted wines","Proline")
## inspect data frame
head(wine)
## change the data type of the "Type" column from numeric to factor
## (read_csv parses the 1/2/3 class labels as numbers, not characters)
####
# Factors look like regular strings (characters) but with factors R knows
# that the column is a categorical variable with finite possible values
# e.g. "Type" in the Wine dataset can only be 1, 2, or 3
####
wine$Type <- as.factor(wine$Type)
## visualize variables
pairs.panels(wine[,-1],gap = 0,bg = c("red", "yellow", "blue")[wine$Type],pch=21)
ggpairs(wine, ggplot2::aes(colour = Type))
###
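The script above loads and plots the wine data but stops before the PCA step named in its header. A minimal sketch of that step follows, assuming base R's `prcomp` is an acceptable choice; the original author's intended call is not shown anywhere in the source. To keep the block self-contained it uses nine rows copied from wine.data (three per cultivar) under the hypothetical name `wine_sample`; the full `wine` data frame would normally be used instead.

```r
# Nine rows copied from wine.data (three per cultivar) so the sketch runs on
# its own; in the lab script the full `wine` data frame would be used instead.
wine_rows <- "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520
2,12.33,1.1,2.28,16,101,2.05,1.09,.63,.41,3.27,1.25,1.67,680
2,12.64,1.36,2.02,16.8,100,2.02,1.41,.53,.62,5.75,.98,1.59,450
3,12.86,1.35,2.32,18,122,1.51,1.25,.21,.94,4.1,.76,1.29,630
3,12.88,2.99,2.4,20,104,1.3,1.22,.24,.83,5.4,.74,1.42,530
3,12.81,2.31,2.4,24,98,1.15,1.09,.27,.83,5.7,.66,1.36,560"

wine_sample <- read.csv(text = wine_rows, header = FALSE)
names(wine_sample) <- c("Type", "Alcohol", "Malic acid", "Ash",
                        "Alcalinity of ash", "Magnesium", "Total phenols",
                        "Flavanoids", "Nonflavanoid Phenols", "Proanthocyanins",
                        "Color Intensity", "Hue",
                        "Od280/od315 of diluted wines", "Proline")
wine_sample$Type <- as.factor(wine_sample$Type)

# PCA on the 13 numeric columns. Centering and scaling keeps large-valued
# variables such as Proline from dominating small-valued ones such as Hue.
wine_pca <- prcomp(wine_sample[, -1], center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each principal component
var_explained <- wine_pca$sdev^2 / sum(wine_pca$sdev^2)
print(round(var_explained, 3))

# Scores on the first two components, labelled by cultivar
scores <- data.frame(Type = wine_sample$Type, wine_pca$x[, 1:2])
print(scores)
```

With the full dataset, the `scores` data frame could be passed to `ggplot` to colour the first two components by `Type`, matching the pairs plots earlier in the script.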
Binary file not shown.
BIN
View File
Binary file not shown.
+128
View File
@@ -0,0 +1,128 @@
install.packages(
c("e1071", "caret", "randomForest", "ggplot2", "pROC"),
repos = c("https://cloud.r-project.org/"),
dependencies = TRUE
)
suppressPackageStartupMessages({
library(e1071) # for svm/tune.svm
library(caret) # for metrics
library(randomForest) # alternative classifier
library(ggplot2)
})
set.seed(42)
read_wine <- function() {
df <- read.csv("wine.data", header = FALSE)
colnames(df) <- c(
"Class",
"Alcohol", "Malic.acid", "Ash", "Alcalinity.of.ash", "Magnesium",
"Total.phenols", "Flavanoids", "Nonflavanoid.phenols", "Proanthocyanins",
"Color.intensity", "Hue", "OD280.OD315", "Proline"
)
df$Class <- factor(df$Class)
df
}
df <- read_wine()
# split into train/test
idx <- createDataPartition(df$Class, p = 0.8, list = FALSE)
train <- df[idx, ]
test <- df[-idx, ]
# choose a subset of features based on ANOVA F-test
# I picked this subset before the runs:
# alcohol, flavanoids, color intensity, od280/od315, proline, total phenols
features <- c("Alcohol", "Flavanoids", "Color.intensity", "OD280.OD315", "Proline", "Total.phenols")
x_train <- train[, features]
y_train <- train$Class
x_test <- test[, features]
y_test <- test$Class
# scale features
pp <- preProcess(x_train, method = c("center", "scale"))
x_train_s <- predict(pp, x_train)
x_test_s <- predict(pp, x_test)
# linear kernel svm with hyperparameter tuning (C)
set.seed(42)
lin_grid <- data.frame(cost = c(0.1, 1, 10, 100))
tune_lin <- tune.svm(
x = x_train_s, y = y_train,
kernel = "linear",
cost = lin_grid$cost,
tunecontrol = tune.control(cross = 5)
)
lin_best <- tune_lin$best.model
# rbf kernel svm with tuning (C, gamma)
set.seed(42)
rbf_grid_cost <- c(0.1, 1, 10, 100, 1000)
rbf_grid_gamma <- c(0.001, 0.01, 0.1, 1)
tune_rbf <- tune.svm(
x = x_train_s, y = y_train,
kernel = "radial",
cost = rbf_grid_cost,
gamma = rbf_grid_gamma,
tunecontrol = tune.control(cross = 5)
)
rbf_best <- tune_rbf$best.model
# alt classifier: random forest (same features)
set.seed(42)
rf_fit <- randomForest(x = x_train, y = y_train, ntree = 500, mtry = 2, importance = TRUE)
# evaluation helper
eval_model <- function(model, x_test_s, y_test, name) {
pred <- predict(model, x_test_s)
cm <- confusionMatrix(pred, y_test)
pr <- data.frame(
model = name,
accuracy = cm$overall["Accuracy"],
precision_macro = mean(cm$byClass[, "Precision"], na.rm = TRUE),
recall_macro = mean(cm$byClass[, "Recall"], na.rm = TRUE),
f1_macro = mean(cm$byClass[, "F1"], na.rm = TRUE)
)
list(cm = cm, pr = pr)
}
# eval svm models (use scaled features)
lin_eval <- eval_model(lin_best, x_test_s, y_test, "svm_linear")
rbf_eval <- eval_model(rbf_best, x_test_s, y_test, "svm_rbf")
# evaluate random forest (no scaling)
rf_pred <- predict(rf_fit, x_test)
rf_cm <- confusionMatrix(rf_pred, y_test)
rf_pr <- data.frame(
model = "random_forest",
accuracy = rf_cm$overall["Accuracy"],
precision_macro = mean(rf_cm$byClass[, "Precision"], na.rm = TRUE),
recall_macro = mean(rf_cm$byClass[, "Recall"], na.rm = TRUE),
f1_macro = mean(rf_cm$byClass[, "F1"], na.rm = TRUE)
)
perf <- rbind(lin_eval$pr, rbf_eval$pr, rf_pr)
# print
cat("best params (linear svm): C =", lin_best$cost, "\n")
cat("best params (rbf svm): C =", rbf_best$cost, " gamma =", rbf_best$gamma, "\n\n")
print(perf)
# macro-f1 comparison
ggplot(perf, aes(x = model, y = f1_macro)) +
geom_col() +
labs(title = "macro-F1 by model (wine test set)")
# save outputs
write.table(perf, file = "lab5_performance_table.txt", sep = "\t", row.names = FALSE, quote = FALSE)
sink("lab5_confusion_matrices.txt")
cat("=== svm linear ===\n")
print(lin_eval$cm)
cat("\n=== svm rbf ===\n")
print(rbf_eval$cm)
cat("\n=== random forest ===\n")
print(rf_cm)
sink()
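The comment in the script above says the six features were chosen with an ANOVA F-test, but the ranking itself is not shown. The sketch below is one way such a ranking could be computed in base R; it is an assumption about the procedure, not the author's actual code. The helper `anova_f` and the data frame `df_sample` are hypothetical names, and the nine rows are copied from wine.data only so the block runs on its own.

```r
# Nine rows copied from wine.data (three per class); illustrative only.
wine_rows <- "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520
2,12.33,1.1,2.28,16,101,2.05,1.09,.63,.41,3.27,1.25,1.67,680
2,12.64,1.36,2.02,16.8,100,2.02,1.41,.53,.62,5.75,.98,1.59,450
3,12.86,1.35,2.32,18,122,1.51,1.25,.21,.94,4.1,.76,1.29,630
3,12.88,2.99,2.4,20,104,1.3,1.22,.24,.83,5.4,.74,1.42,530
3,12.81,2.31,2.4,24,98,1.15,1.09,.27,.83,5.7,.66,1.36,560"

df_sample <- read.csv(text = wine_rows, header = FALSE)
colnames(df_sample) <- c(
    "Class",
    "Alcohol", "Malic.acid", "Ash", "Alcalinity.of.ash", "Magnesium",
    "Total.phenols", "Flavanoids", "Nonflavanoid.phenols", "Proanthocyanins",
    "Color.intensity", "Hue", "OD280.OD315", "Proline"
)
df_sample$Class <- factor(df_sample$Class)

# One-way ANOVA F statistic for a single feature against the class label
anova_f <- function(values, groups) {
    fit <- lm(values ~ groups)
    anova(fit)[["F value"]][1]
}

# F statistic per feature, largest (most class-separating) first
f_stats <- sapply(df_sample[, -1], anova_f, groups = df_sample$Class)
ranked <- sort(f_stats, decreasing = TRUE)
print(round(ranked, 2))
```

In the lab workflow this would be run on the training split only (`train` in the script above), so that the test set plays no part in feature selection.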
+95
View File
@@ -0,0 +1,95 @@
=== svm linear ===
Confusion Matrix and Statistics
          Reference
Prediction  1  2  3
         1 11  1  0
         2  0 13  0
         3  0  0  9
Overall Statistics
Accuracy : 0.9706
95% CI : (0.8467, 0.9993)
No Information Rate : 0.4118
P-Value [Acc > NIR] : 3.92e-12
Kappa : 0.9553
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 0.9286 1.0000
Specificity 0.9565 1.0000 1.0000
Pos Pred Value 0.9167 1.0000 1.0000
Neg Pred Value 1.0000 0.9524 1.0000
Prevalence 0.3235 0.4118 0.2647
Detection Rate 0.3235 0.3824 0.2647
Detection Prevalence 0.3529 0.3824 0.2647
Balanced Accuracy 0.9783 0.9643 1.0000
=== svm rbf ===
Confusion Matrix and Statistics
          Reference
Prediction  1  2  3
         1 11  1  0
         2  0 13  0
         3  0  0  9
Overall Statistics
Accuracy : 0.9706
95% CI : (0.8467, 0.9993)
No Information Rate : 0.4118
P-Value [Acc > NIR] : 3.92e-12
Kappa : 0.9553
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 0.9286 1.0000
Specificity 0.9565 1.0000 1.0000
Pos Pred Value 0.9167 1.0000 1.0000
Neg Pred Value 1.0000 0.9524 1.0000
Prevalence 0.3235 0.4118 0.2647
Detection Rate 0.3235 0.3824 0.2647
Detection Prevalence 0.3529 0.3824 0.2647
Balanced Accuracy 0.9783 0.9643 1.0000
=== random forest ===
Confusion Matrix and Statistics
          Reference
Prediction  1  2  3
         1 11  1  0
         2  0 13  0
         3  0  0  9
Overall Statistics
Accuracy : 0.9706
95% CI : (0.8467, 0.9993)
No Information Rate : 0.4118
P-Value [Acc > NIR] : 3.92e-12
Kappa : 0.9553
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 1.0000 0.9286 1.0000
Specificity 0.9565 1.0000 1.0000
Pos Pred Value 0.9167 1.0000 1.0000
Neg Pred Value 1.0000 0.9524 1.0000
Prevalence 0.3235 0.4118 0.2647
Detection Rate 0.3235 0.3824 0.2647
Detection Prevalence 0.3529 0.3824 0.2647
Balanced Accuracy 0.9783 0.9643 1.0000
+4
View File
@@ -0,0 +1,4 @@
model accuracy precision_macro recall_macro f1_macro
svm_linear 0.970588235294118 0.972222222222222 0.976190476190476 0.973161567364466
svm_rbf 0.970588235294118 0.972222222222222 0.976190476190476 0.973161567364466
random_forest 0.970588235294118 0.972222222222222 0.976190476190476 0.973161567364466
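The macro-averaged figures in this table can be reproduced by hand from the confusion matrix reported for each model (rows are predictions, columns are reference classes). The sketch below redoes that arithmetic in base R; the variable names are illustrative and it does not call caret.

```r
# Confusion matrix as reported for all three models
# (rows = predicted class, columns = reference class)
cm <- matrix(c(11,  1, 0,
                0, 13, 0,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("1", "2", "3"),
                             Reference  = c("1", "2", "3")))

tp <- diag(cm)                      # correctly classified samples per class
precision <- tp / rowSums(cm)       # share of each predicted class that is right
recall    <- tp / colSums(cm)       # share of each true class that is found
f1 <- 2 * precision * recall / (precision + recall)

# Macro averages weight every class equally, regardless of class size
macro <- c(
    accuracy        = sum(tp) / sum(cm),
    precision_macro = mean(precision),
    recall_macro    = mean(recall),
    f1_macro        = mean(f1)
)
print(round(macro, 6))
```

The single misclassification (a class 2 wine predicted as class 1) lowers precision for class 1 and recall for class 2, which is why both macro averages sit just below 1 with only 34 test samples.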
+178
View File
@@ -0,0 +1,178 @@
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295
1,14.83,1.64,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,1045
1,13.86,1.35,2.27,16,98,2.98,3.15,.22,1.85,7.22,1.01,3.55,1045
1,14.1,2.16,2.3,18,105,2.95,3.32,.22,2.38,5.75,1.25,3.17,1510
1,14.12,1.48,2.32,16.8,95,2.2,2.43,.26,1.57,5,1.17,2.82,1280
1,13.75,1.73,2.41,16,89,2.6,2.76,.29,1.81,5.6,1.15,2.9,1320
1,14.75,1.73,2.39,11.4,91,3.1,3.69,.43,2.81,5.4,1.25,2.73,1150
1,14.38,1.87,2.38,12,102,3.3,3.64,.29,2.96,7.5,1.2,3,1547
1,13.63,1.81,2.7,17.2,112,2.85,2.91,.3,1.46,7.3,1.28,2.88,1310
1,14.3,1.92,2.72,20,120,2.8,3.14,.33,1.97,6.2,1.07,2.65,1280
1,13.83,1.57,2.62,20,115,2.95,3.4,.4,1.72,6.6,1.13,2.57,1130
1,14.19,1.59,2.48,16.5,108,3.3,3.93,.32,1.86,8.7,1.23,2.82,1680
1,13.64,3.1,2.56,15.2,116,2.7,3.03,.17,1.66,5.1,.96,3.36,845
1,14.06,1.63,2.28,16,126,3,3.17,.24,2.1,5.65,1.09,3.71,780
1,12.93,3.8,2.65,18.6,102,2.41,2.41,.25,1.98,4.5,1.03,3.52,770
1,13.71,1.86,2.36,16.6,101,2.61,2.88,.27,1.69,3.8,1.11,4,1035
1,12.85,1.6,2.52,17.8,95,2.48,2.37,.26,1.46,3.93,1.09,3.63,1015
1,13.5,1.81,2.61,20,96,2.53,2.61,.28,1.66,3.52,1.12,3.82,845
1,13.05,2.05,3.22,25,124,2.63,2.68,.47,1.92,3.58,1.13,3.2,830
1,13.39,1.77,2.62,16.1,93,2.85,2.94,.34,1.45,4.8,.92,3.22,1195
1,13.3,1.72,2.14,17,94,2.4,2.19,.27,1.35,3.95,1.02,2.77,1285
1,13.87,1.9,2.8,19.4,107,2.95,2.97,.37,1.76,4.5,1.25,3.4,915
1,14.02,1.68,2.21,16,96,2.65,2.33,.26,1.98,4.7,1.04,3.59,1035
1,13.73,1.5,2.7,22.5,101,3,3.25,.29,2.38,5.7,1.19,2.71,1285
1,13.58,1.66,2.36,19.1,106,2.86,3.19,.22,1.95,6.9,1.09,2.88,1515
1,13.68,1.83,2.36,17.2,104,2.42,2.69,.42,1.97,3.84,1.23,2.87,990
1,13.76,1.53,2.7,19.5,132,2.95,2.74,.5,1.35,5.4,1.25,3,1235
1,13.51,1.8,2.65,19,110,2.35,2.53,.29,1.54,4.2,1.1,2.87,1095
1,13.48,1.81,2.41,20.5,100,2.7,2.98,.26,1.86,5.1,1.04,3.47,920
1,13.28,1.64,2.84,15.5,110,2.6,2.68,.34,1.36,4.6,1.09,2.78,880
1,13.05,1.65,2.55,18,98,2.45,2.43,.29,1.44,4.25,1.12,2.51,1105
1,13.07,1.5,2.1,15.5,98,2.4,2.64,.28,1.37,3.7,1.18,2.69,1020
1,14.22,3.99,2.51,13.2,128,3,3.04,.2,2.08,5.1,.89,3.53,760
1,13.56,1.71,2.31,16.2,117,3.15,3.29,.34,2.34,6.13,.95,3.38,795
1,13.41,3.84,2.12,18.8,90,2.45,2.68,.27,1.48,4.28,.91,3,1035
1,13.88,1.89,2.59,15,101,3.25,3.56,.17,1.7,5.43,.88,3.56,1095
1,13.24,3.98,2.29,17.5,103,2.64,2.63,.32,1.66,4.36,.82,3,680
1,13.05,1.77,2.1,17,107,3,3,.28,2.03,5.04,.88,3.35,885
1,14.21,4.04,2.44,18.9,111,2.85,2.65,.3,1.25,5.24,.87,3.33,1080
1,14.38,3.59,2.28,16,102,3.25,3.17,.27,2.19,4.9,1.04,3.44,1065
1,13.9,1.68,2.12,16,101,3.1,3.39,.21,2.14,6.1,.91,3.33,985
1,14.1,2.02,2.4,18.8,103,2.75,2.92,.32,2.38,6.2,1.07,2.75,1060
1,13.94,1.73,2.27,17.4,108,2.88,3.54,.32,2.08,8.90,1.12,3.1,1260
1,13.05,1.73,2.04,12.4,92,2.72,3.27,.17,2.91,7.2,1.12,2.91,1150
1,13.83,1.65,2.6,17.2,94,2.45,2.99,.22,2.29,5.6,1.24,3.37,1265
1,13.82,1.75,2.42,14,111,3.88,3.74,.32,1.87,7.05,1.01,3.26,1190
1,13.77,1.9,2.68,17.1,115,3,2.79,.39,1.68,6.3,1.13,2.93,1375
1,13.74,1.67,2.25,16.4,118,2.6,2.9,.21,1.62,5.85,.92,3.2,1060
1,13.56,1.73,2.46,20.5,116,2.96,2.78,.2,2.45,6.25,.98,3.03,1120
1,14.22,1.7,2.3,16.3,118,3.2,3,.26,2.03,6.38,.94,3.31,970
1,13.29,1.97,2.68,16.8,102,3,3.23,.31,1.66,6,1.07,2.84,1270
1,13.72,1.43,2.5,16.7,108,3.4,3.67,.19,2.04,6.8,.89,2.87,1285
2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520
2,12.33,1.1,2.28,16,101,2.05,1.09,.63,.41,3.27,1.25,1.67,680
2,12.64,1.36,2.02,16.8,100,2.02,1.41,.53,.62,5.75,.98,1.59,450
2,13.67,1.25,1.92,18,94,2.1,1.79,.32,.73,3.8,1.23,2.46,630
2,12.37,1.13,2.16,19,87,3.5,3.1,.19,1.87,4.45,1.22,2.87,420
2,12.17,1.45,2.53,19,104,1.89,1.75,.45,1.03,2.95,1.45,2.23,355
2,12.37,1.21,2.56,18.1,98,2.42,2.65,.37,2.08,4.6,1.19,2.3,678
2,13.11,1.01,1.7,15,78,2.98,3.18,.26,2.28,5.3,1.12,3.18,502
2,12.37,1.17,1.92,19.6,78,2.11,2,.27,1.04,4.68,1.12,3.48,510
2,13.34,.94,2.36,17,110,2.53,1.3,.55,.42,3.17,1.02,1.93,750
2,12.21,1.19,1.75,16.8,151,1.85,1.28,.14,2.5,2.85,1.28,3.07,718
2,12.29,1.61,2.21,20.4,103,1.1,1.02,.37,1.46,3.05,.906,1.82,870
2,13.86,1.51,2.67,25,86,2.95,2.86,.21,1.87,3.38,1.36,3.16,410
2,13.49,1.66,2.24,24,87,1.88,1.84,.27,1.03,3.74,.98,2.78,472
2,12.99,1.67,2.6,30,139,3.3,2.89,.21,1.96,3.35,1.31,3.5,985
2,11.96,1.09,2.3,21,101,3.38,2.14,.13,1.65,3.21,.99,3.13,886
2,11.66,1.88,1.92,16,97,1.61,1.57,.34,1.15,3.8,1.23,2.14,428
2,13.03,.9,1.71,16,86,1.95,2.03,.24,1.46,4.6,1.19,2.48,392
2,11.84,2.89,2.23,18,112,1.72,1.32,.43,.95,2.65,.96,2.52,500
2,12.33,.99,1.95,14.8,136,1.9,1.85,.35,2.76,3.4,1.06,2.31,750
2,12.7,3.87,2.4,23,101,2.83,2.55,.43,1.95,2.57,1.19,3.13,463
2,12,.92,2,19,86,2.42,2.26,.3,1.43,2.5,1.38,3.12,278
2,12.72,1.81,2.2,18.8,86,2.2,2.53,.26,1.77,3.9,1.16,3.14,714
2,12.08,1.13,2.51,24,78,2,1.58,.4,1.4,2.2,1.31,2.72,630
2,13.05,3.86,2.32,22.5,85,1.65,1.59,.61,1.62,4.8,.84,2.01,515
2,11.84,.89,2.58,18,94,2.2,2.21,.22,2.35,3.05,.79,3.08,520
2,12.67,.98,2.24,18,99,2.2,1.94,.3,1.46,2.62,1.23,3.16,450
2,12.16,1.61,2.31,22.8,90,1.78,1.69,.43,1.56,2.45,1.33,2.26,495
2,11.65,1.67,2.62,26,88,1.92,1.61,.4,1.34,2.6,1.36,3.21,562
2,11.64,2.06,2.46,21.6,84,1.95,1.69,.48,1.35,2.8,1,2.75,680
2,12.08,1.33,2.3,23.6,70,2.2,1.59,.42,1.38,1.74,1.07,3.21,625
2,12.08,1.83,2.32,18.5,81,1.6,1.5,.52,1.64,2.4,1.08,2.27,480
2,12,1.51,2.42,22,86,1.45,1.25,.5,1.63,3.6,1.05,2.65,450
2,12.69,1.53,2.26,20.7,80,1.38,1.46,.58,1.62,3.05,.96,2.06,495
2,12.29,2.83,2.22,18,88,2.45,2.25,.25,1.99,2.15,1.15,3.3,290
2,11.62,1.99,2.28,18,98,3.02,2.26,.17,1.35,3.25,1.16,2.96,345
2,12.47,1.52,2.2,19,162,2.5,2.27,.32,3.28,2.6,1.16,2.63,937
2,11.81,2.12,2.74,21.5,134,1.6,.99,.14,1.56,2.5,.95,2.26,625
2,12.29,1.41,1.98,16,85,2.55,2.5,.29,1.77,2.9,1.23,2.74,428
2,12.37,1.07,2.1,18.5,88,3.52,3.75,.24,1.95,4.5,1.04,2.77,660
2,12.29,3.17,2.21,18,88,2.85,2.99,.45,2.81,2.3,1.42,2.83,406
2,12.08,2.08,1.7,17.5,97,2.23,2.17,.26,1.4,3.3,1.27,2.96,710
2,12.6,1.34,1.9,18.5,88,1.45,1.36,.29,1.35,2.45,1.04,2.77,562
2,12.34,2.45,2.46,21,98,2.56,2.11,.34,1.31,2.8,.8,3.38,438
2,11.82,1.72,1.88,19.5,86,2.5,1.64,.37,1.42,2.06,.94,2.44,415
2,12.51,1.73,1.98,20.5,85,2.2,1.92,.32,1.48,2.94,1.04,3.57,672
2,12.42,2.55,2.27,22,90,1.68,1.84,.66,1.42,2.7,.86,3.3,315
2,12.25,1.73,2.12,19,80,1.65,2.03,.37,1.63,3.4,1,3.17,510
2,12.72,1.75,2.28,22.5,84,1.38,1.76,.48,1.63,3.3,.88,2.42,488
2,12.22,1.29,1.94,19,92,2.36,2.04,.39,2.08,2.7,.86,3.02,312
2,11.61,1.35,2.7,20,94,2.74,2.92,.29,2.49,2.65,.96,3.26,680
2,11.46,3.74,1.82,19.5,107,3.18,2.58,.24,3.58,2.9,.75,2.81,562
2,12.52,2.43,2.17,21,88,2.55,2.27,.26,1.22,2,.9,2.78,325
2,11.76,2.68,2.92,20,103,1.75,2.03,.6,1.05,3.8,1.23,2.5,607
2,11.41,.74,2.5,21,88,2.48,2.01,.42,1.44,3.08,1.1,2.31,434
2,12.08,1.39,2.5,22.5,84,2.56,2.29,.43,1.04,2.9,.93,3.19,385
2,11.03,1.51,2.2,21.5,85,2.46,2.17,.52,2.01,1.9,1.71,2.87,407
2,11.82,1.47,1.99,20.8,86,1.98,1.6,.3,1.53,1.95,.95,3.33,495
2,12.42,1.61,2.19,22.5,108,2,2.09,.34,1.61,2.06,1.06,2.96,345
2,12.77,3.43,1.98,16,80,1.63,1.25,.43,.83,3.4,.7,2.12,372
2,12,3.43,2,19,87,2,1.64,.37,1.87,1.28,.93,3.05,564
2,11.45,2.4,2.42,20,96,2.9,2.79,.32,1.83,3.25,.8,3.39,625
2,11.56,2.05,3.23,28.5,119,3.18,5.08,.47,1.87,6,.93,3.69,465
2,12.42,4.43,2.73,26.5,102,2.2,2.13,.43,1.71,2.08,.92,3.12,365
2,13.05,5.8,2.13,21.5,86,2.62,2.65,.3,2.01,2.6,.73,3.1,380
2,11.87,4.31,2.39,21,82,2.86,3.03,.21,2.91,2.8,.75,3.64,380
2,12.07,2.16,2.17,21,85,2.6,2.65,.37,1.35,2.76,.86,3.28,378
2,12.43,1.53,2.29,21.5,86,2.74,3.15,.39,1.77,3.94,.69,2.84,352
2,11.79,2.13,2.78,28.5,92,2.13,2.24,.58,1.76,3,.97,2.44,466
2,12.37,1.63,2.3,24.5,88,2.22,2.45,.4,1.9,2.12,.89,2.78,342
2,12.04,4.3,2.38,22,80,2.1,1.75,.42,1.35,2.6,.79,2.57,580
3,12.86,1.35,2.32,18,122,1.51,1.25,.21,.94,4.1,.76,1.29,630
3,12.88,2.99,2.4,20,104,1.3,1.22,.24,.83,5.4,.74,1.42,530
3,12.81,2.31,2.4,24,98,1.15,1.09,.27,.83,5.7,.66,1.36,560
3,12.7,3.55,2.36,21.5,106,1.7,1.2,.17,.84,5,.78,1.29,600
3,12.51,1.24,2.25,17.5,85,2,.58,.6,1.25,5.45,.75,1.51,650
3,12.6,2.46,2.2,18.5,94,1.62,.66,.63,.94,7.1,.73,1.58,695
3,12.25,4.72,2.54,21,89,1.38,.47,.53,.8,3.85,.75,1.27,720
3,12.53,5.51,2.64,25,96,1.79,.6,.63,1.1,5,.82,1.69,515
3,13.49,3.59,2.19,19.5,88,1.62,.48,.58,.88,5.7,.81,1.82,580
3,12.84,2.96,2.61,24,101,2.32,.6,.53,.81,4.92,.89,2.15,590
3,12.93,2.81,2.7,21,96,1.54,.5,.53,.75,4.6,.77,2.31,600
3,13.36,2.56,2.35,20,89,1.4,.5,.37,.64,5.6,.7,2.47,780
3,13.52,3.17,2.72,23.5,97,1.55,.52,.5,.55,4.35,.89,2.06,520
3,13.62,4.95,2.35,20,92,2,.8,.47,1.02,4.4,.91,2.05,550
3,12.25,3.88,2.2,18.5,112,1.38,.78,.29,1.14,8.21,.65,2,855
3,13.16,3.57,2.15,21,102,1.5,.55,.43,1.3,4,.6,1.68,830
3,13.88,5.04,2.23,20,80,.98,.34,.4,.68,4.9,.58,1.33,415
3,12.87,4.61,2.48,21.5,86,1.7,.65,.47,.86,7.65,.54,1.86,625
3,13.32,3.24,2.38,21.5,92,1.93,.76,.45,1.25,8.42,.55,1.62,650
3,13.08,3.9,2.36,21.5,113,1.41,1.39,.34,1.14,9.40,.57,1.33,550
3,13.5,3.12,2.62,24,123,1.4,1.57,.22,1.25,8.60,.59,1.3,500
3,12.79,2.67,2.48,22,112,1.48,1.36,.24,1.26,10.8,.48,1.47,480
3,13.11,1.9,2.75,25.5,116,2.2,1.28,.26,1.56,7.1,.61,1.33,425
3,13.23,3.3,2.28,18.5,98,1.8,.83,.61,1.87,10.52,.56,1.51,675
3,12.58,1.29,2.1,20,103,1.48,.58,.53,1.4,7.6,.58,1.55,640
3,13.17,5.19,2.32,22,93,1.74,.63,.61,1.55,7.9,.6,1.48,725
3,13.84,4.12,2.38,19.5,89,1.8,.83,.48,1.56,9.01,.57,1.64,480
3,12.45,3.03,2.64,27,97,1.9,.58,.63,1.14,7.5,.67,1.73,880
3,14.34,1.68,2.7,25,98,2.8,1.31,.53,2.7,13,.57,1.96,660
3,13.48,1.67,2.64,22.5,89,2.6,1.1,.52,2.29,11.75,.57,1.78,620
3,12.36,3.83,2.38,21,88,2.3,.92,.5,1.04,7.65,.56,1.58,520
3,13.69,3.26,2.54,20,107,1.83,.56,.5,.8,5.88,.96,1.82,680
3,12.85,3.27,2.58,22,106,1.65,.6,.6,.96,5.58,.87,2.11,570
3,12.96,3.45,2.35,18.5,106,1.39,.7,.4,.94,5.28,.68,1.75,675
3,13.78,2.76,2.3,22,90,1.35,.68,.41,1.03,9.58,.7,1.68,615
3,13.73,4.36,2.26,22.5,88,1.28,.47,.52,1.15,6.62,.78,1.75,520
3,13.45,3.7,2.6,23,111,1.7,.92,.43,1.46,10.68,.85,1.56,695
3,12.82,3.37,2.3,19.5,88,1.48,.66,.4,.97,10.26,.72,1.75,685
3,13.58,2.58,2.69,24.5,105,1.55,.84,.39,1.54,8.66,.74,1.8,750
3,13.4,4.6,2.86,25,112,1.98,.96,.27,1.11,8.5,.67,1.92,630
3,12.2,3.03,2.32,19,96,1.25,.49,.4,.73,5.5,.66,1.83,510
3,12.77,2.39,2.28,19.5,86,1.39,.51,.48,.64,9.899999,.57,1.63,470
3,14.16,2.51,2.48,20,91,1.68,.7,.44,1.24,9.7,.62,1.71,660
3,13.71,5.65,2.45,20.5,95,1.68,.61,.52,1.06,7.7,.64,1.74,740
3,13.4,3.91,2.48,23,102,1.8,.75,.43,1.41,7.3,.7,1.56,750
3,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835
3,13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840
3,14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560
+100
View File
@@ -0,0 +1,100 @@
1. Title of Database: Wine recognition data
Updated Sept 21, 1998 by C.Blake : Added attribute information
2. Sources:
(a) Forina, M. et al, PARVUS - An Extendible Package for Data
Exploration, Classification and Correlation. Institute of Pharmaceutical
and Food Analysis and Technologies, Via Brigata Salerno,
16147 Genoa, Italy.
(b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au
(c) July 1991
3. Past Usage:
(1)
S. Aeberhard, D. Coomans and O. de Vel,
Comparison of Classifiers in High Dimensional Settings,
Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Technometrics).
The data was used with many others for comparing various
classifiers. The classes are separable, though only RDA
has achieved 100% correct classification.
(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
(All results using the leave-one-out technique)
In a classification context, this is a well posed problem
with "well behaved" class structures. A good data set
for first testing of a new classifier, but not very
challenging.
(2)
S. Aeberhard, D. Coomans and O. de Vel,
"THE CLASSIFICATION PERFORMANCE OF RDA"
Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Journal of Chemometrics).
Here, the data was used to illustrate the superior performance of
the use of a new appreciation function with RDA.
4. Relevant Information:
-- These data are the results of a chemical analysis of
wines grown in the same region in Italy but derived from three
different cultivars.
The analysis determined the quantities of 13 constituents
found in each of the three types of wines.
-- I think that the initial data set had around 30 variables, but
for some reason I only have the 13 dimensional version.
I had a list of what the 30 or so variables were, but a.)
I lost it, and b.), I would not know which 13 variables
are included in the set.
-- The attributes are (donated by Riccardo Leardi,
riclea@anchem.unige.it )
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10)Color intensity
11)Hue
12)OD280/OD315 of diluted wines
13)Proline
5. Number of Instances
class 1 59
class 2 71
class 3 48
6. Number of Attributes
13
7. For Each Attribute:
All attributes are continuous
No statistics available, but suggest to standardise
variables for certain uses (e.g. for use with classifiers
which are NOT scale invariant)
NOTE: 1st attribute is class identifier (1-3)
8. Missing Attribute Values:
None
9. Class Distribution: number of instances per class
class 1 59
class 2 71
class 3 48
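The standardisation advice in section 7 can be illustrated with a short R sketch. It assumes z-score standardisation via base R's `scale`, which is one common reading of "standardise" rather than something the file specifies, and it uses nine rows copied from wine.data so the block runs on its own.

```r
# Nine rows copied from wine.data (three per class); illustrative only.
wine_rows <- "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520
2,12.33,1.1,2.28,16,101,2.05,1.09,.63,.41,3.27,1.25,1.67,680
2,12.64,1.36,2.02,16.8,100,2.02,1.41,.53,.62,5.75,.98,1.59,450
3,12.86,1.35,2.32,18,122,1.51,1.25,.21,.94,4.1,.76,1.29,630
3,12.88,2.99,2.4,20,104,1.3,1.22,.24,.83,5.4,.74,1.42,530
3,12.81,2.31,2.4,24,98,1.15,1.09,.27,.83,5.7,.66,1.36,560"

raw <- read.csv(text = wine_rows, header = FALSE)
x <- raw[, -1]                      # drop the class identifier (1st attribute)

# Raw spreads differ by orders of magnitude (the last column is Proline),
# so distance-based classifiers would be dominated by the largest columns.
print(round(sapply(x, sd), 2))

# z-score standardisation: subtract the column mean, divide by the column sd
x_std <- scale(x, center = TRUE, scale = TRUE)
print(round(apply(x_std, 2, sd), 2))   # every column now has sd 1
```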
Binary file not shown.
File diff suppressed because it is too large Load Diff