Did Stack Overflow Answers Increase After ChatGPT? — Term Project Report
Itamar Oren-Naftalovich
1. Abstract and introduction
This project asks whether the number of answers posted on Stack Overflow (SO) increased or decreased after the public launch of ChatGPT on 2022-11-30, and after subsequent community policy events. To study this, we combine (i) site-level activity from the Stack Exchange Data Explorer (SEDE) with (ii) developer sentiment and usage data from the annual Stack Overflow Developer Survey. Together, these data allow us both to measure changes in answer volume and to contextualize those changes using self-reported behavior.
Our core hypothesis is that the arrival of high-quality, conversational code assistance would noticeably change the supply of answers on SO, because developers have a new place to go for immediate help. We further treat moderation policies and community events as additional shocks that may amplify or dampen this effect.
We frame the problem as a quasi-experimental time-series analysis with interrupted trends around several key dates:
- ChatGPT public launch (2022-11-30)
- Initial Stack Overflow policy banning AI-generated answers (policy posted 2022-12-05)
- Later moderation and governance events (including the moderation strike)
Throughout, we pay attention to:
- Internal validity: controlling for pre-existing trends and seasonality, rather than treating pre/post averages as independent.
- External validity: comparing site-level patterns to changes in developer behavior reported in survey data.
- Measurement caveats: handling deleted content, moderation queues, and sampling or survey-response effects.
Prior work and context. The accompanying slide deck summarizes the key posts and data sources (OpenAI’s announcement, Meta Stack Overflow policy discussions, moderation-strike posts, traffic analyses, and SO survey documentation). These references define the timeline and motivate the research question, so they are not re-stated in full here.
2. Data description and preliminary analysis
Datasets
We use two complementary datasets:
- Dataset 1 — Site activity. Monthly counts of new answers on Stack Overflow from the public Stack Exchange Data Explorer (SEDE). The extract includes both non-deleted and deleted answers so that we can separate organic activity from moderation effects. The main analysis window is 2018–2025, which provides several pre-ChatGPT baseline years and a meaningful post-event period. A sketch of the extraction query appears after this list.
- Dataset 2 — Developer survey. Selected questions from the Stack Overflow Developer Survey (2023–2025), focusing on visit frequency (e.g., daily vs. weekly) and adoption of AI tools such as ChatGPT. These variables are used to understand shifts in demand for on-site answers and how they correlate with AI usage.
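To make the extraction concrete, the sketch below shows the rough shape of the SEDE query and the load into pandas. It follows the public schema documentation cited in the references (answers have PostTypeId = 2; the PostsWithDeleted table carries a DeletionDate for removed posts), but the query text and the CSV file name are illustrative placeholders rather than the exact extract used.

```python
import pandas as pd

# Illustrative SEDE query (T-SQL), to be run at data.stackexchange.com.
# PostsWithDeleted and DeletionDate follow the public schema docs cited
# in the references; verify against the current schema before running.
SEDE_QUERY = """
SELECT
    DATEFROMPARTS(YEAR(CreationDate), MONTH(CreationDate), 1) AS Month,
    CASE WHEN DeletionDate IS NULL THEN 0 ELSE 1 END          AS IsDeleted,
    COUNT(*)                                                  AS Answers
FROM PostsWithDeleted
WHERE PostTypeId = 2                  -- answers only
  AND CreationDate >= '2018-01-01'
GROUP BY DATEFROMPARTS(YEAR(CreationDate), MONTH(CreationDate), 1),
         CASE WHEN DeletionDate IS NULL THEN 0 ELSE 1 END
ORDER BY Month, IsDeleted;
"""

# "sede_monthly_answers.csv" is a placeholder for the exported result set.
answers = pd.read_csv("sede_monthly_answers.csv", parse_dates=["Month"])
monthly = (
    answers.pivot(index="Month", columns="IsDeleted", values="Answers")
           .fillna(0)
           .rename(columns={0: "kept", 1: "deleted"})
)
```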
Criteria and rationale
- Dataset 1 directly measures the outcome of interest: answer supply on Stack Overflow.
- Dataset 2 provides behavioral context: whether developers who use ChatGPT heavily also report visiting Stack Overflow less frequently.
By combining logs and surveys, we can triangulate between observational activity data and self-reported changes in workflow. The goal is not to claim strict causality, but to see whether the patterns in these sources align.
Preliminary views
The first set of plots (Figure 1) provides high-level structure for later modelling:
- Time-series plots of monthly answers reveal overall levels and long-run trends.
- Comparisons of pre- and post-ChatGPT periods highlight visible changes in level or slope.
- Seasonal views (e.g., by calendar month) show systematic patterns such as summer slowdowns or end-of-year dips.
These descriptive views inform the later modelling choices such as adding interrupted trend terms and seasonal controls.
Figure 1. Preliminary monthly trend in new answers on Stack Overflow.
3. Exploratory analysis
We first clean and harmonize the SEDE extract, collapsing to monthly answer counts and separating deleted from non-deleted answers. We then:
- Scan for structural breaks or anomalies around key event dates.
- Apply short moving averages to highlight medium-run shifts in the time series.
- Plot seasonality by calendar month to visualize recurring within-year patterns (a pandas sketch of these steps follows the list).
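A minimal pandas sketch of these steps, assuming the `monthly` frame built in the extraction sketch of Section 2; the 15% anomaly threshold is an arbitrary illustration rather than a tuned value.

```python
# Non-deleted answers as a regular monthly series (month-start frequency).
s = monthly["kept"].asfreq("MS")

# Short moving averages to surface medium-run shifts.
ma3 = s.rolling(3, center=True).mean()
ma12 = s.rolling(12, center=True).mean()

# Seasonal profile by calendar month, computed on pre-ChatGPT data only
# so the post-2022 shift does not distort the within-year pattern.
pre = s[:"2022-10"]
seasonal_profile = pre.groupby(pre.index.month).mean()

# Crude anomaly scan: months sitting far from the 12-month average.
resid = (s - ma12) / ma12
anomalies = resid[resid.abs() > 0.15]   # 0.15 is an arbitrary cutoff
print(anomalies)
```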
On the survey side, we:
- Construct indicators for frequent SO visits (e.g., daily or almost daily) and ChatGPT usage.
- Compare distributions across years to detect shifts in visiting behavior and AI adoption (the indicator construction is sketched below).
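A sketch of the indicator construction for one survey year follows. The column names (SOVisitFreq, AISelect) and answer strings match the published survey schema files as we understand them, but they should be re-checked against each year's codebook and are best read as assumptions.

```python
import pandas as pd

# File and column names follow the public survey releases; verify both
# against the year's codebook before relying on them.
survey = pd.read_csv("survey_results_public_2023.csv",
                     usecols=["SOVisitFreq", "AISelect"])

# Frequent visitor: daily-or-more visit frequency (the option strings
# are assumptions to confirm against the codebook).
frequent_levels = ["Daily or almost daily", "Multiple times per day"]
survey["frequent_visitor"] = survey["SOVisitFreq"].isin(frequent_levels).astype(int)

# AI adoption: affirmative responses to the AI-usage question start with "Yes".
survey["uses_ai"] = survey["AISelect"].str.startswith("Yes", na=False).astype(int)

# Share of frequent visitors among AI users vs. non-users.
print(survey.groupby("uses_ai")["frequent_visitor"].mean())
```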
Sources of uncertainty and bias
We explicitly track several sources of uncertainty:
- Policy and moderation effects. Deletions and review backlogs can move answers between months or suppress visible counts. To address this, we track deleted and non-deleted answers separately and compare them over time.
- Seasonality and macro conditions. Holidays, hiring cycles, and broader market conditions can confound naive pre/post comparisons. We therefore visualize within-year seasonality and include time controls in the models.
- Survey representativeness. Survey respondents may not be a random sample of all SO users. Active answerers and enthusiastic AI adopters might be over- or under-represented. For this reason, we treat survey-based findings as correlational, not causal.
Figure 2. Example exploratory views: seasonal patterns, moving averages, and distributional summaries.
Figure 3. Additional exploratory views, emphasizing seasonality and pre/post differences.
4. Model development and application
To move beyond descriptive plots, we implement four modelling approaches. Together, they test for structural changes in answer volume and connect survey behavior to on-site activity.
- Interrupted time-series linear regression (ITS); a code sketch follows this list.
- Outcome: monthly new answers.
- Predictors: a linear time trend, post-ChatGPT level change, and slope change, plus optional indicator variables for policy and moderation periods.
- Goal: test for discrete jumps and gradual trend shifts relative to the pre-event trajectory.
- Poisson / negative-binomial regression for counts.
- Same predictors as ITS but with a log link for count data.
- We compare Poisson and negative-binomial versions to account for over-dispersion and to avoid relying on normal residuals.
- ARIMA time-series forecasting.
- Fit solely on pre-ChatGPT data to produce a counterfactual forecast.
- Compare out-of-sample forecasts to observed post-event answer counts.
- Large and sustained deviations beyond forecast bands signal additional shocks beyond trend and seasonality.
- Logistic classification on survey microdata.
- Target: whether a respondent is a “frequent SO visitor” (daily or almost daily).
- Predictors: a ChatGPT-usage indicator plus demographic and role controls.
- Evaluation: accuracy, precision/recall, and calibration curves, with a hold-out split for validation.
- Purpose: test whether heavy ChatGPT users are less likely to report frequent SO visits, even after adjusting for other factors.
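To make the ITS specification concrete, below is a minimal statsmodels sketch with a level change, a slope change, and calendar-month dummies; the variable names are ours, and the Newey-West (HAC) standard errors anticipate the autocorrelation diagnostics discussed next.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# `monthly` is the frame from Section 2 with a month-start index and a
# "kept" column of non-deleted answer counts.
df = monthly.reset_index().rename(columns={"Month": "month"})
df["t"] = np.arange(len(df))                      # linear time trend
launch = pd.Timestamp("2022-12-01")               # first full post-launch month
df["post"] = (df["month"] >= launch).astype(int)  # discrete level change
first_post = df.loc[df["post"] == 1, "t"].min()
df["post_t"] = df["post"] * (df["t"] - first_post)  # post-event slope change
df["cal_month"] = df["month"].dt.month            # seasonal dummies

its = smf.ols("kept ~ t + post + post_t + C(cal_month)", data=df).fit(
    cov_type="HAC", cov_kwds={"maxlags": 12}      # Newey-West errors
)
print(its.summary())
```

The count-model variants keep the same design matrix but replace the OLS call with a log-link GLM (statsmodels provides Poisson and negative-binomial families), which is how we compare the two specifications for over-dispersion.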
Validation and diagnostics
For each model family, we run basic diagnostic checks:
- ITS models:
- Inspect residuals for autocorrelation and remaining seasonality.
- Re-fit with seasonal terms or alternative specifications where necessary.
- Count models (Poisson/NB):
- Check over-dispersion indicators and compare Poisson vs. negative-binomial fits.
- Examine goodness-of-fit plots and residual patterns.
- ARIMA forecasts (see the counterfactual sketch after this list):
- Select model orders using information criteria on the training window.
- Inspect forecast errors and confidence bands to ensure reasonable counterfactual behavior.
- Classification models:
- Use a separate hold-out set for evaluation.
- Report confusion matrices and standard performance metrics.
- Inspect calibration to verify that predicted probabilities match observed frequencies.
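As an illustration of the counterfactual step, the sketch below trains a seasonal ARIMA only on pre-ChatGPT months and flags post-event months that escape the 95% forecast band; the (1,1,1)(1,1,1,12) order is a placeholder for the information-criterion selection described above.

```python
import statsmodels.api as sm

# Pre-event training window: everything through November 2022.
s = monthly["kept"].asfreq("MS")
train = s[:"2022-11"]

# Placeholder order; in practice chosen by AIC/BIC on the training window.
model = sm.tsa.SARIMAX(train, order=(1, 1, 1),
                       seasonal_order=(1, 1, 1, 12)).fit(disp=False)

# Forecast over the observed post-event months and pull the 95% band.
h = len(s) - len(train)
fc = model.get_forecast(steps=h)
bands = fc.conf_int(alpha=0.05)

lower = bands.iloc[:, 0].to_numpy()
upper = bands.iloc[:, 1].to_numpy()
observed = s.iloc[len(train):].to_numpy()

# Months outside the band point to a shock beyond trend and seasonality.
outside = (observed < lower) | (observed > upper)
print(f"{outside.sum()} of {h} post-event months fall outside the 95% band")
```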
Figure 4. Example model fits (ITS) and moving-average smoothed trends around intervention dates.
Figure 5. Illustrative Poisson / negative-binomial fits versus observed counts.
Figure 6. Additional count-model diagnostics and fit comparisons.
Figure 7. ARIMA counterfactual forecast vs. observed post-event answer volumes.
5. Conclusions and discussion
Across the descriptive plots and models, the period after November 2022 shows both level and slope changes that are consistent with a structural shift in answer supply on Stack Overflow. These changes coincide with the availability of ChatGPT and closely timed policy and moderation events.
ARIMA counterfactuals trained on pre-event data give a baseline trajectory. When we compare this baseline to observed post-event values, we see deviations that fall outside typical forecast bands, supporting the idea that there was a shock beyond existing trends and seasonality.
The survey-based classifiers reinforce this picture: heavy ChatGPT adoption is associated with lower self-reported visit frequency, even after controlling for observable demographics and roles. This pattern lines up with the site-level decline in new answers and suggests that some developers are partially substituting conversational AI for Stack Overflow visits.
Limitations
- Causality is tentative.
- Policy changes and the moderation strike overlap with the ChatGPT rollout, making it difficult to cleanly attribute changes in answer volume to any single event.
- External shocks—such as labor-market cycles, ecosystem-tooling changes, or shifts in documentation quality—may also contribute.
- Survey constraints.
- Survey responses are self-reported and subject to recall and response biases.
- The sample may not represent the full SO user base or the most active answerers.
Because of these limitations, we interpret the results as strong correlational evidence of a shift in answer supply and usage patterns, not as a sharp causal estimate. Future work should:
- Incorporate richer covariates (e.g., tag-level activity, user cohorts, question complexity).
- Explore quasi-experimental designs (such as synthetic controls) to better isolate the effect of AI tools and platform policies.
Implications
For knowledge platforms, the analysis suggests that answer supply can be sensitive to rapid changes in assistance tooling and governance. In particular:
- Sustainable moderation capacity and clear, transparent AI guidance appear important to avoid destabilizing answer quality and volume.
- As conversational assistants become part of everyday developer workflows, platforms like Stack Overflow may need deeper integration paths (for example, exposing structured answers or metadata that assistants can consume directly).
- Balancing open contribution, quality control, and integration with external AI tools may be key to retaining community participation in an environment where “first-line help” increasingly comes from chatbots.
References
- OpenAI. “Introducing ChatGPT.” OpenAI, 30 Nov. 2022.
- Prasnikar, D. “Policy: Generative AI (e.g., ChatGPT) Is Banned.” Meta Stack Overflow, 5 Dec. 2022.
- Mithical. “Moderation Strike: Stack Overflow, Inc. Cannot Consistently Ignore, Mistreat, and Malign Its Volunteers.” Meta Stack Exchange, 5 June 2023.
- Makyen. “Moderation Strike: Conclusion and the Way Forward.” Meta Stack Exchange, 7 Aug. 2023.
- Carr, D. F. “Stack Overflow Is ChatGPT Casualty: Traffic Down 14% in March.” Similarweb Insights, 19 Apr. 2023.
- “Database Schema Documentation for the Public Data Dump and SEDE.” Meta Stack Exchange (FAQ), 4 Oct. 2022.
- Stack Overflow. Stack Overflow Developer Survey (2023–2025).
Image gallery (additional figures)
- Figure 8. Additional pre/post comparison plots.
- Figure 9. Additional figure from the results.