# Data Analytics Fall 2025 – Assignment IV
**Project Title:** Measuring How Generative AI Adoption Reshaped Stack Overflow Participation 2018–2025

**Author:** Itamar Oren-Naftalovich

**Course:** Data Analytics (Fall 2025)

**Repository Artifacts:** `analysis.r`, `data/*.csv`, `imgs/*.png`, `out.log` (model console output)

---

## 1. Abstract and Introduction
On 30 November 2022, ChatGPT became publicly available. Within days, the Stack Overflow community faced two major shocks: developers suddenly had a new source of code-specific answers, and Stack Overflow introduced a temporary ban on AI-generated content on 5 December 2022 while already struggling with limited moderation capacity. This project asks whether the combination of generative AI adoption and these policy changes produced a statistically detectable regime shift in Stack Overflow content creation, and whether developers who say they use ChatGPT still treat Stack Overflow as a daily resource.

I hypothesized that monthly answer counts would show a structural break after the ChatGPT launch and AI policy ban, even after controlling for the pre-2022 downward trend. I also expected that respondents who explicitly name ChatGPT as an AI assistant would be less likely to visit Stack Overflow daily. To test these ideas, I built two complementary datasets:

1. Stack Overflow Data Explorer (SEDE) exports of monthly deleted and non-deleted answers from January 2018 through November 2025
2. Microdata from the 2023 and 2024 Stack Overflow Developer Surveys, which record both visit frequency and generative AI usage

The script `analysis.r` cleans both sources, engineers indicators for key policy dates, and generates the plots and models discussed in this report.

The analysis relies on four modeling strategies: an interrupted time-series (ITS) linear regression, a Poisson regression for counts, a seasonal ARIMA model trained only on pre-ChatGPT data, and a logistic regression relating survey-reported AI usage to daily Stack Overflow visitation. Together, these models indicate that Stack Overflow answer production fell by more than 53% in the post-ChatGPT period (mean 90.5 vs. 193.0 answers per month). At the same time, daily visitors are increasingly concentrated in older age cohorts, and survey respondents who explicitly mention ChatGPT do not differ meaningfully from others in how often they visit the site. The sections that follow describe the datasets, exploratory patterns, modeling choices, and implications for the community.
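As a preview of the ITS design, the specification can be sketched in a few lines of R. The column names, break point, and simulated numbers below are illustrative assumptions, not output from `analysis.r`:

```r
# Interrupted time-series sketch: an indicator for the post-ChatGPT regime
# interacts with the month index, so the interaction coefficient estimates
# how much steeper the decline becomes after the break. Data are simulated.
set.seed(1)
months <- 1:95                                   # Jan 2018 .. Nov 2025
post_chatgpt <- as.integer(months >= 60)         # illustrative break point
answers <- 220 - 0.5 * months - 60 * post_chatgpt -
  2.0 * post_chatgpt * (months - 60) + rnorm(95, sd = 8)
its_fit <- lm(answers ~ months * post_chatgpt)
coef(its_fit)["months:post_chatgpt"]             # change in slope after the break
```

In the real model, this interaction term corresponds to the slope-change quantity reported in the conclusions.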

---

### 2.1 Stack Overflow Answer Volume (Dataset 1)

* **Source and Scope**
`data/so_new_answers_per_month_2018_2025.csv` is a SEDE export of every new answer (deleted and non-deleted) by month from January 2018 through November 2025 (95 monthly observations). The script standardizes month formats, aggregates across deletion statuses, and adds indicators for the ChatGPT release (30 Nov 2022), the AI policy ban (5 Dec 2022), and the Stack Exchange moderator strike (5 Jun–7 Aug 2023).

* **Variables**
After cleaning, the main table `answers_monthly` contains `answers_total`, `answers_non_deleted`, `answers_deleted`, calendar year and month, a sequential `time_index`, binary indicators for the events listed above, and a categorical `period` flagging pre- vs. post-ChatGPT months. A 3-month moving average (`answers_ma3`) is computed to smooth short-term noise for exploratory plots.

* **Quality Checks**
Duplicate rows were removed by grouping on `month`, and all transformations are recorded in `out.log`. The only missing values arise in the first two moving-average entries, which plotting functions simply omit. Because SEDE distinguishes deleted from non-deleted answers, the analysis keeps both so that any changes in moderation are visible in the time series.
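The cleaning steps above (event indicators, aggregation by month, and the 3-month moving average) can be sketched in base R. The column names and toy values are assumptions for illustration rather than excerpts from `analysis.r`:

```r
# Toy month table covering Jan 2018 .. Nov 2025 (95 rows)
answers_monthly <- data.frame(
  month = seq(as.Date("2018-01-01"), as.Date("2025-11-01"), by = "month")
)
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# Event indicators for the ChatGPT release, the AI policy ban, and the strike
answers_monthly$chatgpt_released <- as.integer(
  answers_monthly$month >= first_of_month(as.Date("2022-11-30")))
answers_monthly$ai_policy_ban <- as.integer(
  answers_monthly$month >= first_of_month(as.Date("2022-12-05")))
answers_monthly$mod_strike <- as.integer(
  answers_monthly$month >= first_of_month(as.Date("2023-06-05")) &
  answers_monthly$month <= first_of_month(as.Date("2023-08-07")))

# De-duplication by grouping on month: summing across deletion statuses
# yields the total per month
raw <- data.frame(month = c("2022-11", "2022-11", "2022-12"),
                  answers = c(120, 30, 95))
dedup <- aggregate(answers ~ month, data = raw, FUN = sum)

# Trailing 3-month moving average; its first two entries are NA,
# matching the missing values noted above
ma3 <- function(x) as.numeric(stats::filter(x, rep(1/3, 3), sides = 1))
```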


*Figure 2. Box plots highlight the magnitude of the drop between the pre- and post-ChatGPT regimes.*

**Table 1. Descriptive Statistics by Regime (source: `data/answers_summary_period.csv`)**

| period | n_months | mean_answers | median_answers | sd_answers | min_answers | max_answers |
| ------------ | -------- | ------------ | -------------- | ---------- | ----------- | ----------- |
### 2.2 Stack Overflow Developer Survey (Dataset 2)

* **Source and Scope**
The second dataset uses the publicly released 2023 and 2024 Stack Overflow Developer Survey microdata (`stack-overflow-developer-survey-2023.zip` and `stack-overflow-developer-survey-2024.zip`, downloaded 19 November 2025). Combined, these files contain 146,676 responses from professional and hobbyist developers worldwide.

* **Schema Harmonization**
Column names differ slightly across years (for example, `SOAI` vs. `AISelect`), so helper functions search for the first matching column for each concept. The harmonized frame retains `year`, `main_branch`, `country`, numeric `age`, `gender`, reported Stack Overflow visit frequency (`so_visit`), and free-text AI assistant preferences (`ai_select`).

* **Feature Engineering**
Two binary indicators are constructed: `frequent_so` (1 if the respondent reports visiting Stack Overflow daily or multiple times per day) and `uses_chatgpt` (1 if the string “ChatGPT” appears anywhere in `ai_select`). Age is grouped into buckets (`<25`, `25–34`, `35–44`, `45+`, `unknown`), and gender is collapsed into a simplified label to absorb inconsistent free-text entries.

* **Sample Considerations**
Because the 2024 instrument asks about AI search preferences rather than naming specific tools, only 1,181 respondents in 2023 explicitly mention ChatGPT and almost none do in 2024. This change in wording is treated as a measurement artifact and revisited as a source of bias in Sections 3 and 4.
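The harmonization and feature-engineering steps can be sketched as follows. The helper name, the candidate column lists, and the answer labels are assumptions for illustration, not verbatim from the survey schema:

```r
# Return the first candidate column present in a survey frame, so the same
# concept can be pulled from differently named columns across years
first_match <- function(df, candidates) {
  hit <- candidates[candidates %in% names(df)]
  if (length(hit) == 0) NA_character_ else hit[[1]]
}
survey_2023 <- data.frame(AISelect = "ChatGPT;GitHub Copilot",
                          SOVisitFreq = "Daily or almost daily")
first_match(survey_2023, c("SOAI", "AISelect"))   # picks "AISelect"

# Binary indicators: any mention of "ChatGPT", and daily-or-more visitation
ai_select <- c("ChatGPT;GitHub Copilot", "Bing AI", NA)
so_visit  <- c("Daily or almost daily", "A few times per week",
               "Multiple times per day")
uses_chatgpt <- as.integer(!is.na(ai_select) & grepl("ChatGPT", ai_select))
frequent_so  <- as.integer(so_visit %in% c("Daily or almost daily",
                                           "Multiple times per day"))
```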

---

## 5. Conclusions and Discussion

The evidence points to a genuine structural break in Stack Overflow answer production beginning in late 2022. Average monthly answers drop from 193.0 in the 2018–October 2022 period to 90.5 between December 2022 and November 2025. The interrupted time-series model shows the slope of decline becoming steeper by about −2.37 answers per month after ChatGPT’s release, and the Poisson model implies a post-ChatGPT decay rate of roughly 3.2% per month. ARIMA forecasts trained only on pre-ChatGPT data substantially overestimate post-2022 activity, reinforcing the conclusion that pre-existing seasonal and secular trends cannot account for the observed collapse.

The survey-based models tell a more nuanced story about *who* remains active. Despite common assumptions that ChatGPT usage directly crowds out Stack Overflow visits, the current survey data do not show a strong link: the odds ratio for reported ChatGPT usage is essentially 1, and differences in daily visitation are driven more by age and year than by AI adoption. Given the 2024 wording change and the limitations of self-reported tool usage, it would be premature to claim that ChatGPT users as a group have already abandoned Stack Overflow.

Taken together, these findings suggest that any response from Stack Overflow should combine supply-side interventions (such as incentives for high-quality answers and additional moderation support to limit deleted content) with better measurement of how developers actually integrate AI tools and community Q&A into their workflows.

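The reported Poisson decay rate can be recovered from the fitted slope with a one-line transformation. The coefficient below is an assumed round number near the reported fit, not the exact estimate from `analysis.r`:

```r
# A log-linear (Poisson) slope of beta per month implies a multiplicative
# change of exp(beta), i.e. a monthly decline of 1 - exp(beta)
beta <- -0.0325                      # assumed slope close to the fitted value
monthly_decay <- 1 - exp(beta)
round(100 * monthly_decay, 1)        # approximately 3.2 (% per month)
```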
Future work could extend the time-series models with covariates for major product changes (e.g., Collectives, Discussions), incorporate question volume alongside answers, and revisit the survey analysis once the 2025 instrument becomes available. Causal impact methods, such as Bayesian structural time series using the ARIMA forecast as a prior, could offer a more formal estimate of the counterfactual number of answers that would have been produced without the post-2022 shocks.
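The counterfactual-forecast idea (treating a pre-period forecast as the no-shock baseline) can be sketched with base R's `arima`. The toy series, model order, and trend regressor below are assumptions, with seasonal terms omitted for brevity:

```r
# Fit on the 58 pre-ChatGPT months only, then project the next 36 months;
# comparing fc$pred with the observed series gives the forecast gap
set.seed(2)
pre_months <- 1:58
pre <- 200 - 0.4 * pre_months + rnorm(58, sd = 6)   # simulated pre-period data
fit <- arima(pre, order = c(1, 0, 0), xreg = pre_months)
fc  <- predict(fit, n.ahead = 36, newxreg = 59:94)
length(fc$pred)                                     # 36 forecast months
```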

3. OpenAI. “Introducing ChatGPT,” OpenAI Blog, 30 Nov 2022.
4. Stack Overflow Meta. “Temporary policy: ChatGPT is banned,” Meta Stack Overflow, 5 Dec 2022.
5. Stack Exchange. “Moderator Strike: Stack Overflow, Stack Exchange Network,” Meta Stack Exchange updates, Jun–Aug 2023.

---

**Code and Reproducibility:**

All data acquisition, cleaning, plotting, and model fitting steps are implemented in `analysis.r`. Running `Rscript analysis.r` recreates each figure in `imgs/` and regenerates the tidy datasets referenced throughout this report, with a console transcript saved to `out.log`.