added Assignment IV

This commit is contained in:
2025-11-22 15:25:17 -05:00
parent 18a911f9d3
commit fa9a358415
17 changed files with 1144 additions and 0 deletions
@@ -0,0 +1,2 @@
data/
1.json
@@ -0,0 +1,4 @@
{
"r.linting.lineLength": false,
"r.editor.tabSize": 4
}
@@ -0,0 +1,197 @@
# Data Analytics Fall 2025 Assignment IV
**Project Title:** Measuring How Generative AI Adoption Reshaped Stack Overflow Participation, 2018–2025
**Author:** Itamar Oren-Naftalovich
**Course:** Data Analytics (Fall 2025)
**Repository Artifacts:** `analysis.r`, `data/*.csv`, `imgs/*.png`, `out.log` (model console output)
---
## 1. Abstract and Introduction
On 30 November 2022, ChatGPT became publicly available. Within days, the Stack Overflow community faced two major shocks: developers suddenly had a new source of code-specific answers, and Stack Overflow introduced a temporary ban on AI-generated content on 5 December 2022 while already struggling with limited moderation capacity. This project asks whether the combination of generative AI adoption and these policy changes produced a statistically detectable regime shift in Stack Overflow content creation, and whether developers who say they use ChatGPT still treat Stack Overflow as a daily resource.
I hypothesized that monthly answer counts would show a structural break after the ChatGPT launch and AI policy ban, even after controlling for the pre-2022 downward trend. I also expected that respondents who explicitly name ChatGPT as an AI assistant would be less likely to visit Stack Overflow daily. To test these ideas, I built two complementary datasets: (a) Stack Overflow Data Explorer (SEDE) exports of monthly deleted and non-deleted answers from January 2018 through November 2025, and (b) microdata from the 2023 and 2024 Stack Overflow Developer Surveys, which record both visit frequency and generative AI usage. The script `analysis.r` cleans both sources, engineers indicators for key policy dates, and generates the plots and models discussed in this report.
The analysis relies on four modeling strategies: an interrupted time-series (ITS) linear regression, a Poisson regression for counts, a seasonal ARIMA model trained only on pre-ChatGPT data, and a logistic regression relating survey-reported AI usage to daily Stack Overflow visitation. Together, these models indicate that Stack Overflow answer production fell by more than 53% in the post-ChatGPT period (mean 90.5 vs. 193.0 answers per month). At the same time, daily visitors are increasingly concentrated in older age cohorts, and survey respondents who explicitly mention ChatGPT do not differ meaningfully from others in how often they visit the site. The sections that follow describe the datasets, exploratory patterns, modeling choices, and implications for the community.
---
## 2. Data Description and Preliminary Analysis
### 2.1 Stack Overflow Answer Volume (Dataset 1)
* **Source & scope.**
`data/so_new_answers_per_month_2018_2025.csv` is a SEDE export of every new answer (deleted and non-deleted) by month from January 2018 through November 2025 (95 monthly observations). The script standardizes month formats, aggregates across deletion statuses, and adds indicators for the ChatGPT release (30 Nov 2022), the AI policy ban (5 Dec 2022), and the Stack Exchange moderator strike (5 Jun–7 Aug 2023).
* **Variables.**
After cleaning, the main table `answers_monthly` contains `answers_total`, `answers_non_deleted`, `answers_deleted`, calendar year and month, a sequential `time_index`, binary indicators for the events listed above, and a categorical `period` flagging pre- vs. post-ChatGPT months. A 3-month moving average (`answers_ma3`) is computed to smooth short-term noise for exploratory plots.
* **Quality checks.**
Duplicate rows were removed by grouping on `month`, and all transformations are recorded in `out.log`. The only missing values arise in the first two moving-average entries, which plotting functions simply omit. Because SEDE distinguishes deleted from non-deleted answers, the analysis keeps both so that any changes in moderation are visible in the time series.
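The trailing moving average behind `answers_ma3` can be sketched with `zoo::rollmean`, the same call used in `analysis.r`; on a toy vector, right alignment shows why exactly the first two entries come back `NA`:

```r
# 3-month trailing moving average on toy data; the first k-1 = 2 slots are NA.
library(zoo)
x   <- c(10, 12, 14, 13, 15)
ma3 <- rollmean(x, k = 3, fill = NA, align = "right")
ma3  # → NA NA 12 13 14
```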
![Figure 1. Monthly Stack Overflow answers with ChatGPT (dashed) and AI policy (dotted) markers.](imgs/01_answers_ts.png)
*Figure 1. Monthly answer counts follow a long downward trend that becomes steeper after November 2022.*
![Figure 2. Distribution of monthly answers pre- vs. post-ChatGPT.](imgs/02_box_pre_post.png)
*Figure 2. Box plots highlight the magnitude of the drop between the pre- and post-ChatGPT regimes.*
**Table 1. Descriptive statistics by regime (source: `data/answers_summary_period.csv`).**
| period | n_months | mean_answers | median_answers | sd_answers | min_answers | max_answers |
| ------------ | -------- | ------------ | -------------- | ---------- | ----------- | ----------- |
| pre_chatgpt | 59 | 193.0 | 185 | 44.7 | 122 | 313 |
| post_chatgpt | 36 | 90.5 | 88 | 38.0 | 11 | 157 |
A quick comparison of the six months immediately before and after 30 November 2022 shows only a 10.0% decline in average answers, suggesting that the full 53% drop in Table 1 unfolded gradually across 2023–2025 rather than occurring instantly. This gradual pattern is one reason for using time-series models instead of treating the policy change as a simple before/after difference.
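The windowed comparison amounts to two six-month means; a base-R sketch on toy values (not the SEDE counts) shows the calculation:

```r
# Six-month pre/post window comparison around a break date (toy values).
launch  <- as.Date("2022-11-30")
month   <- seq(as.Date("2022-06-01"), as.Date("2023-05-01"), by = "month")
answers <- c(150, 148, 152, 149, 151, 150,   # six months before the launch
             140, 138, 135, 133, 131, 130)   # six months after
pre  <- answers[month <  launch]
post <- answers[month >= launch]
pct_change <- (mean(post) - mean(pre)) / mean(pre) * 100
round(pct_change, 1)  # → -10.3
```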
### 2.2 Stack Overflow Developer Survey (Dataset 2)
* **Source & scope.**
The second dataset uses the publicly released 2023 and 2024 Stack Overflow Developer Survey microdata (`stack-overflow-developer-survey-2023.zip` and `stack-overflow-developer-survey-2024.zip`, downloaded 19 November 2025). Combined, these files contain 146,676 responses from professional and hobbyist developers worldwide.
* **Schema harmonization.**
Column names differ slightly across years (for example, `SOAI` vs. `AISelect`), so helper functions search for the first matching column for each concept. The harmonized frame retains `year`, `main_branch`, `country`, numeric `age`, `gender`, reported Stack Overflow visit frequency (`so_visit`), and free-text AI assistant preferences (`ai_select`).
* **Feature engineering.**
Two binary indicators are constructed: `frequent_so` (1 if the respondent reports visiting Stack Overflow daily or multiple times per day) and `uses_chatgpt` (1 if the string “ChatGPT” appears anywhere in `ai_select`). Age is grouped into buckets (`<25`, `25-34`, `35-44`, `45+`, `unknown`), and gender is collapsed into a simplified label to absorb inconsistent free-text entries.
* **Sample considerations.**
Because the 2024 instrument asks about AI search preferences rather than naming specific tools, only 1,181 respondents in 2023 explicitly mention ChatGPT and almost none do in 2024. This change in wording is treated as a measurement artifact and revisited as a source of bias in Sections 3 and 4.
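On toy rows, the two indicators reduce to a pair of case-insensitive pattern matches (simplified from the `case_when` logic in `analysis.r`; the rows here are made up):

```r
# Survey feature engineering sketch: two binary flags from free-text columns.
survey <- data.frame(
  so_visit  = c("Daily or almost daily", "A few times per week", "Multiple times per day"),
  ai_select = c("ChatGPT;GitHub Copilot", NA, "Bing AI"),
  stringsAsFactors = FALSE
)
survey$frequent_so  <- as.integer(grepl("daily|multiple times per day",
                                        survey$so_visit, ignore.case = TRUE))
# guard against NA before grepl so missing answers map to 0, not NA
survey$uses_chatgpt <- as.integer(!is.na(survey$ai_select) &
                                  grepl("chatgpt", survey$ai_select, ignore.case = TRUE))
survey$frequent_so   # → 1 0 1
survey$uses_chatgpt  # → 1 0 0
```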
---
## 3. Exploratory Analysis
### 3.1 Seasonal and Trend Patterns in Answer Volume
The `answers_monthly` series preserves the familiar seasonal dip every December, but the overall level shifts downward after 2022. As Figure 3 shows, even typically slow months such as July now fall below 60 answers, compared with roughly 150–220 answers in earlier years.
![Figure 3. Seasonality of Stack Overflow answers by calendar month and period.](imgs/03_seasonal.png)
*Figure 3. Post-ChatGPT seasons follow a similar seasonal shape but sit on a much lower baseline.*
The 3-month moving average in Figure 4 provides additional context. It peaks near 210 answers in mid-2018, drifts below 150 answers by late 2021, crosses under 100 answers in August 2023, and reaches about 23 answers by November 2025. The timing of the Stack Exchange moderator strike (June–August 2023) aligns with the first extended period below 100 answers per month, hinting at compounding effects from generative AI substitution and reduced moderation capacity.
![Figure 4. Raw answers (faint) vs. 3-month moving average.](imgs/04_ma3.png)
*Figure 4. The smoothed series marks a clear structural break soon after the ChatGPT launch and policy ban.*
### 3.2 Survey Signals on Engagement and AI Adoption
A stacked bar chart (Figure 5) summarizes how daily Stack Overflow visitation relates to ChatGPT usage. In 2023, daily visitation rates are essentially identical for explicit ChatGPT users (39.1%) and non-users (also 39.1%), suggesting that early adopters of ChatGPT continued to visit Stack Overflow at similar rates while experimenting with AI. By 2024, daily visitation among respondents who *do not* mention ChatGPT falls to 37.3%. The near-absence of explicit ChatGPT mentions that year, however, is driven by the different survey question wording rather than a real disappearance of the tool. This reinforces the idea that self-reported tool usage is noisy and needs to be combined with behavioral indicators like monthly answer counts.
![Figure 5. Share of respondents visiting Stack Overflow daily, split by ChatGPT usage, 2023–2024.](imgs/08_survey_bar.png)
*Figure 5. Small differences between groups and across years illustrate how limited the AI usage field is for explaining engagement.*
### 3.3 Sources of Uncertainty and Bias
Several sources of uncertainty shape the analysis:
* **Measurement bias.**
SEDE relies on Stack Overflow's internal logging. Deleted answers can be removed retroactively, so counts for the most recent months remain somewhat fluid.
* **Event alignment.**
The interrupted time-series design treats 30 November 2022 as the breakpoint between regimes, but the 2023 moderator strike and evolving AI policies create overlapping shocks that blur a clean “pre vs. post” distinction.
* **Survey sampling.**
The developer survey is voluntary, conducted in English, and heavily skewed toward respondents in North America and India. Age and tool usage are self-reported, and the 2024 wording change likely undercounts ChatGPT adoption.
* **Missingness.**
“Prefer not to say” responses in age and gender are mapped to `NA` or `Unknown`, which softens demographic differences in downstream models.
These limitations motivated the use of several modeling approaches in Section 4 instead of relying on a single model family.
---
## 4. Model Development and Application of Models
Each model addresses a slightly different question about Stack Overflow activity. All diagnostics and figures are produced directly by `analysis.r` and saved in `imgs/`.
### 4.1 Interrupted Time-Series Linear Regression
* **Specification.**
The primary linear model is
`answers_total ~ time + post_chatgpt + chatgpt_time`,
where `time` is the number of months since January 2018 and `chatgpt_time` counts months since the break, starting at 1 in December 2022 (0 beforehand), so the post-ChatGPT slope can differ from the pre-ChatGPT trend.
* **Results.**
The model explains 71.7% of the variance (adjusted R² = 0.708, σ = 35.3). Before ChatGPT, monthly answers were already declining by 0.86 answers per month (p = 0.002). After November 2022, the slope becomes steeper by an additional 2.37 answers per month (p < 0.001). The immediate level change of 18 answers at the breakpoint is not statistically significant (p = 0.24).
* **Interpretation.**
Rather than a sudden cliff, the data show an acceleration of an existing decline. After the break the fitted trend sheds about 3.2 answers per month in total (the pre-existing 0.86 plus an additional 2.37), so the gap relative to the pre-2023 trajectory widens by roughly 28 answers per year.
![Figure 6. Observed vs. fitted answers under the interrupted time-series model.](imgs/05_lm_fit.png)
*Figure 6. The fitted line captures a gradual erosion in answer volume instead of a single large discontinuity.*
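The segmented predictors behind this specification can be checked on simulated data. The sketch below is illustrative only: it plants an extra post-break slope of -2.5 in a fake monthly series and shows the `post_time` term recovering it.

```r
# Interrupted time-series sketch: level-shift dummy plus a restarted clock.
set.seed(1)
n <- 60; break_at <- 40
time      <- seq_len(n)
post      <- as.integer(time >= break_at)           # level change at the break
post_time <- pmax(time - break_at + 1L, 0L) * post  # months since the break, else 0
y <- 200 - 0.8 * time - 2.5 * post_time + rnorm(n, sd = 2)
fit <- lm(y ~ time + post + post_time)
coef(fit)[["post_time"]]  # should sit near the simulated extra slope of -2.5
```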
### 4.2 Poisson Regression for Count Data
* **Specification.**
The Poisson model uses the same predictors but applies a log link appropriate for count outcomes.
* **Results.**
The estimated multiplicative time effect before ChatGPT is `exp(β_time) = 0.996` (p < 0.001), corresponding to a 0.4% contraction per month. After the release, the combined monthly multiplier drops to 0.968 (p ≈ 2.7 × 10⁻⁶⁸), implying a 3.2% shrinkage per month. The residual deviance is 713.8 on 91 degrees of freedom, compared with a null deviance of 2,879.9; the deviance remaining well above the degrees of freedom signals overdispersion.
* **Interpretation.**
Expressed in percentage terms, the Poisson model tells a similar story to the linear ITS: by late 2025, the expected answer count decays toward single digits if post-2022 dynamics continue unchanged.
![Figure 7. Poisson regression fit vs. observed counts.](imgs/06_pois_fit.png)
*Figure 7. The Poisson model slightly overestimates the lowest post-2024 points, consistent with some dispersion in the counts.*
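The multiplicative reading of the coefficients can be verified on simulated counts; this sketch plants a true monthly multiplier of exp(-0.004) ≈ 0.996 and checks that the log-link model recovers it (toy data, not the SEDE series):

```r
# Poisson count-model sketch: exponentiated slope = multiplicative monthly effect.
set.seed(42)
time <- 1:80
mu   <- exp(5.3 - 0.004 * time)          # built-in ~0.4% contraction per month
y    <- rpois(length(time), mu)
fit  <- glm(y ~ time, family = poisson(link = "log"))
exp(coef(fit)[["time"]])                 # recovers a multiplier close to 0.996
```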
### 4.3 Seasonal ARIMA Forecasting (Pre-ChatGPT Baseline)
* **Specification.**
To estimate what would have happened without ChatGPT and related policy changes, a seasonal ARIMA(1,1,0)(1,0,0)[12] model is fit only to data through October 2022 (`train_ts`). The model then generates forecasts for November 2022–November 2025, which are compared to the actual counts.
* **Results.**
On the training window, fit statistics are solid (RMSE = 32.9, MASE = 0.52). Out-of-sample, however, errors grow large: RMSE = 89.3, MAE = 79.3, MAPE ≈ 172%, and Theil's U = 7.11. Observed counts soon fall below the 80% prediction interval and remain there, indicating that historical seasonality and trend alone cannot explain the post-2022 decline.
* **Interpretation.**
The ARIMA baseline functions as a counterfactual. Its consistent over-prediction of post-ChatGPT activity reinforces the conclusion that a structural break occurred, rather than a continuation of prior dynamics.
![Figure 8. ARIMA forecast (trained through Oct 2022) vs. actual counts.](imgs/07_arima_forecast.png)
*Figure 8. Actual activity diverges from the ARIMA forecast almost immediately and never returns to the predicted band.*
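The counterfactual logic can be reproduced in miniature with the same `forecast` package. This sketch uses simulated data and hard-codes the ARIMA order quoted above; it illustrates the design, not the fitted model from `analysis.r`.

```r
# Pre-break-only ARIMA counterfactual sketch on simulated monthly data.
library(forecast)
set.seed(7)
pre <- ts(200 - 0.8 * (1:48) + rnorm(48, sd = 8), frequency = 12)
fit <- Arima(pre, order = c(1, 1, 0), seasonal = c(1, 0, 0))  # ARIMA(1,1,0)(1,0,0)[12]
fc  <- forecast(fit, h = 12, level = 80)
actual <- 200 - 0.8 * (49:60) - 6 * (1:12)  # post-break values dropping faster than trend
sum(actual < fc$lower[, 1])                 # months falling below the 80% band
```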
### 4.4 Logistic Regression on Survey Engagement
* **Specification.**
The survey-based model predicts `frequent_so` (daily or multiple-times-per-day visitor). Predictors that retain more than one level after cleaning are `uses_chatgpt`, `age_group`, and `year`. The harmonized `gender` variable collapses to a single `Unknown` level and is therefore dropped automatically. The data are split 80/20 into training and test sets with a fixed seed (123). The decision threshold is set to the training positive rate (0.384) to reduce the impact of class imbalance.
* **Results.**
Relative to respondents younger than 25, the odds ratio for the 25–34 group is 1.04 (p = 0.008), while the 35–44 and 45+ groups have odds ratios of 0.81 and 0.71, respectively (both p < 10⁻³²). The ChatGPT usage indicator has an odds ratio of 0.99 (p = 0.92), effectively indistinguishable from 1. The 2024 indicator yields an odds ratio of 0.92 (p ≈ 2.2 × 10⁻¹¹), pointing to a modest overall decline in daily visitation from 2023 to 2024.
On the 29,336-observation test set, the confusion matrix reports 7,560 true negatives, 10,549 false positives, 4,009 false negatives, and 7,218 true positives, giving an accuracy of 50.4%, precision of 40.6%, and recall of 64.3%.
* **Interpretation.**
The weak predictive performance and scarcity of explicit ChatGPT mentions in 2024 both suggest that the current survey instrument is not well suited for isolating the impact of AI usage on Stack Overflow engagement. Age shows a clearer pattern than AI usage: older cohorts are less likely to be daily visitors, while ChatGPT adoption, at least as self-reported in these surveys, does not significantly distinguish frequent users from others.
![Figure 9. Predicted probabilities of daily Stack Overflow usage by ChatGPT adoption.](imgs/09_logit_probs.png)
*Figure 9. Predicted probability distributions overlap heavily, mirroring the non-significant odds ratio for `uses_chatgpt`.*
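The base-rate thresholding choice is easy to demonstrate on simulated imbalanced data; everything below is toy, with a single predictor `x` standing in for the age, year, and AI dummies:

```r
# Classify at the training positive rate instead of 0.5 to counter imbalance.
set.seed(123)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))     # imbalanced outcome, ~30% positives
fit <- glm(y ~ x, family = binomial)
threshold <- mean(y)                         # training positive rate
pred <- as.integer(predict(fit, type = "response") >= threshold)
c(base_rate = mean(y), predicted_positive_share = mean(pred))
```

With a 0.5 cutoff almost nothing would be flagged positive; thresholding at the base rate trades precision for recall, which matches the reported 40.6% precision and 64.3% recall.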
---
## 5. Conclusions and Discussion
The evidence points to a genuine structural break in Stack Overflow answer production beginning in late 2022. Average monthly answers drop from 193.0 in the January 2018–November 2022 period to 90.5 between December 2022 and November 2025. The interrupted time-series model shows the slope of decline steepening by about 2.37 answers per month after ChatGPT's release, and the Poisson model implies a post-ChatGPT decay rate of roughly 3.2% per month. ARIMA forecasts trained only on pre-ChatGPT data substantially overestimate post-2022 activity, reinforcing the conclusion that pre-existing seasonal and secular trends cannot account for the observed collapse.
The survey-based models tell a more nuanced story about *who* remains active. Despite common assumptions that ChatGPT usage directly crowds out Stack Overflow visits, the current survey data do not show a strong link: the odds ratio for reported ChatGPT usage is essentially 1, and differences in daily visitation are driven more by age and year than by AI adoption. Given the 2024 wording change and the limitations of self-reported tool usage, it would be premature to claim that ChatGPT users as a group have already abandoned Stack Overflow.
Taken together, these findings suggest that any response from Stack Overflow should combine supply-side interventions—such as incentives for high-quality answers and additional moderation support to limit deleted content—with better measurement of how developers actually integrate AI tools and community Q&A into their workflows.
Future work could extend the time-series models with covariates for major product changes (e.g., Collectives, Discussions), incorporate question volume alongside answers, and revisit the survey analysis once the 2025 instrument becomes available. Causal impact methods, such as Bayesian structural time series using the ARIMA forecast as a prior, could offer a more formal estimate of the counterfactual number of answers that would have been produced without the post-2022 shocks.
---
## References
1. Stack Exchange Data Explorer. “New answers (deleted + non-deleted) per month,” query exported 19 Nov 2025 from the Stack Overflow SEDE interface.
2. Stack Overflow. “Stack Overflow Developer Survey 2023” and “Stack Overflow Developer Survey 2024,” datasets accessed 19 Nov 2025 from the Stack Overflow survey site.
3. OpenAI. “Introducing ChatGPT,” OpenAI Blog, 30 Nov 2022.
4. Stack Overflow Meta. “Temporary policy: ChatGPT is banned,” Meta Stack Overflow, 5 Dec 2022.
5. Stack Exchange. “Moderator Strike: Stack Overflow, Stack Exchange Network,” Meta Stack Exchange updates, Jun–Aug 2023.
---
**Code and Reproducibility:**
All data acquisition, cleaning, plotting, and model fitting steps are implemented in `analysis.r`. Running `Rscript analysis.r` recreates each figure in `imgs/` and regenerates the tidy datasets referenced throughout this report, with a console transcript saved to `out.log`.
Binary file not shown.
@@ -0,0 +1,683 @@
# install.packages(
#   c("tidyverse", "lubridate", "broom", "forecast",
#     "stringr", "dplyr", "zoo", "scales"),  # zoo and scales are called via :: below
#   repos = "http://cran.us.r-project.org"
# )
library(tidyverse)
library(lubridate)
library(broom)
library(forecast)
library(stringr)
library(dplyr)
# directory for data files (adjust if desired)
data_dir <- "data"
if (!dir.exists(data_dir)) {
dir.create(data_dir, recursive = TRUE)
}
# directory for plots
imgs_dir <- "imgs"
if (!dir.exists(imgs_dir)) {
dir.create(imgs_dir, recursive = TRUE)
}
# constants: key event dates related to chatgpt and so policy
# chatgpt public research preview launch
chatgpt_launch_date <- as.Date("2022-11-30") # openai "introducing chatgpt" blog
# stack overflow generative ai ban policy (meta so, 5 dec 2022)
so_ai_policy_date <- as.Date("2022-12-05")
# moderation strike on stack exchange (june–aug 2023) from meta posts
so_mod_strike_start <- as.Date("2023-06-05")
so_mod_strike_end <- as.Date("2023-08-07")
# helper: safe downloader
download_if_missing <- function(url, destfile) {
if (!file.exists(destfile)) {
message("downloading ", basename(destfile), " ...")
download.file(url, destfile, mode = "wb")
message("saved to ", destfile)
} else {
message("file already exists: ", destfile)
}
}
coerce_month_to_date <- function(x) {
if (inherits(x, "Date")) {
return(x)
}
if (inherits(x, "POSIXct")) {
return(lubridate::as_date(x))
}
if (inherits(x, "POSIXlt")) {
return(as.Date(x))
}
if (is.numeric(x)) {
return(as.Date(x, origin = "1970-01-01"))
}
if (is.character(x)) {
parsed <- suppressWarnings(lubridate::ymd_hms(x))
if (all(is.na(parsed))) {
parsed <- suppressWarnings(lubridate::ymd(x))
}
if (all(is.na(parsed))) {
parsed <- suppressWarnings(as.Date(x))
}
return(parsed)
}
suppressWarnings(as.Date(x))
}
# 1) load stack overflow monthly answers (dataset 1)
answers_csv_path <- file.path(data_dir, "so_new_answers_per_month_2018_2025.csv")
if (!file.exists(answers_csv_path)) {
stop(
"missing ", answers_csv_path,
"\nrun the sede query in this script and download the csv to that path first."
)
}
answers_raw <- readr::read_csv(answers_csv_path, show_col_types = FALSE) |>
rename(
month = matches("^Date$|Month", ignore.case = TRUE),
status = matches("^Status$", ignore.case = TRUE),
new_answers = matches("NewAnswers|Count", ignore.case = TRUE)
)
answers_raw <- answers_raw |>
mutate(
month = coerce_month_to_date(month),
status = tolower(status)
)
# inspect column names so you can adjust if sede changes them
print(names(answers_raw))
# expected columns after the rename above: "month", "status", "new_answers"
print(head(answers_raw))
# aggregate deleted vs non-deleted into separate columns per month
answers_monthly <- answers_raw |>
mutate(
month = as.Date(month),
status = tolower(status)
) |>
group_by(month) |>
summarise(
answers_total = sum(new_answers, na.rm = TRUE),
# numeric 0 (not 0L): readr usually parses counts as doubles, and keeping both
# if_else branches the same type avoids errors on older dplyr versions
answers_non_deleted = sum(if_else(status == "non-deleted", new_answers, 0), na.rm = TRUE),
answers_deleted = sum(if_else(status == "deleted", new_answers, 0), na.rm = TRUE),
.groups = "drop"
) |>
arrange(month) |>
mutate(
year = year(month),
month_num = month(month),
time_index = row_number(),
post_chatgpt = month >= chatgpt_launch_date,
post_ai_policy = month >= so_ai_policy_date,
during_mod_strike = month >= so_mod_strike_start & month <= so_mod_strike_end,
period = case_when(
month < chatgpt_launch_date ~ "pre_chatgpt",
TRUE ~ "post_chatgpt"
)
)
glimpse(answers_monthly)
# 2) download and load stack overflow developer survey 2023/2024 (dataset 2)
# official survey zip files as exposed on survey.stackoverflow.co
# these urls are the same ones behind the "download full data set (csv)" links
# see: https://survey.stackoverflow.co/
survey_2023_url <- "https://survey.stackoverflow.co/datasets/stack-overflow-developer-survey-2023.zip"
survey_2024_url <- "https://survey.stackoverflow.co/datasets/stack-overflow-developer-survey-2024.zip"
survey_2023_zip <- file.path(data_dir, "stack-overflow-developer-survey-2023.zip")
survey_2024_zip <- file.path(data_dir, "stack-overflow-developer-survey-2024.zip")
download_if_missing(survey_2023_url, survey_2023_zip)
download_if_missing(survey_2024_url, survey_2024_zip)
# helper to read the "survey_results_public.csv" inside each zip
read_so_survey_from_zip <- function(zip_path, csv_pattern = "survey_results_public.csv") {
if (!file.exists(zip_path)) {
stop("zip file not found: ", zip_path)
}
# list files inside zip (works even when the CSV is in a subfolder)
zlist <- utils::unzip(zip_path, list = TRUE)
# try to find the csv by exact name or by pattern
csv_name <- zlist$Name[stringr::str_detect(zlist$Name, regex(csv_pattern, ignore_case = TRUE))]
if (length(csv_name) == 0) {
stop("could not find a csv matching ", csv_pattern, " inside ", zip_path)
}
csv_name <- csv_name[1] # take first match
# read it without extracting to disk using unz() connection
# optionally supply col_types to speed parsing
df <- readr::read_csv(
unz(zip_path, csv_name),
show_col_types = FALSE,
# col_types = cols(.default = col_character()) # uncomment & customize if you want explicit types
)
df
}
survey2023_raw <- read_so_survey_from_zip(survey_2023_zip)
survey2024_raw <- read_so_survey_from_zip(survey_2024_zip)
# look at column names to locate ai + stackoverflow usage questions
names(survey2023_raw)[1:80]
############################################################
# create a harmonised survey subset focusing on:
# - so visit frequency (column like SOVisitFreq)
# - ai tool usage (column like AISelect or SOAI)
############################################################
find_first_col <- function(df, pattern) {
cols <- names(df)[stringr::str_detect(names(df), regex(pattern, ignore_case = TRUE))]
if (length(cols) == 0) {
return(NA_character_)
}
cols[1]
}
pull_col_or_default <- function(df, col_name, default = NA_character_) {
if (is.na(col_name)) {
return(rep(default, nrow(df)))
}
df[[col_name]]
}
pull_age_numeric <- function(df, col_name) {
vec <- pull_col_or_default(df, col_name, default = NA_real_)
if (is.numeric(vec)) {
return(vec)
}
if (is.factor(vec)) {
vec <- as.character(vec)
}
if (is.character(vec)) {
vec <- stringr::str_trim(vec)
vec[vec == ""] <- NA_character_
vec[stringr::str_detect(vec, regex("prefer not to say", ignore_case = TRUE))] <- NA_character_
# parse_number extracts the leading numeric value (e.g., 25 from "25-34 years old")
return(suppressWarnings(readr::parse_number(vec)))
}
suppressWarnings(as.numeric(vec))
}
main_branch_col_2023 <- find_first_col(survey2023_raw, "^MainBranch$|MainBranch")
country_col_2023 <- find_first_col(survey2023_raw, "^Country$|Country")
age_col_2023 <- find_first_col(survey2023_raw, "^Age$|Age")
gender_col_2023 <- find_first_col(survey2023_raw, "^Gender$|Gender")
so_visit_col_2023 <- find_first_col(survey2023_raw, "SOVisitFreq")
ai_select_col_2023 <- find_first_col(survey2023_raw, "AISelect|SOAI")
main_branch_col_2024 <- find_first_col(survey2024_raw, "^MainBranch$|MainBranch")
country_col_2024 <- find_first_col(survey2024_raw, "^Country$|Country")
age_col_2024 <- find_first_col(survey2024_raw, "^Age$|Age")
gender_col_2024 <- find_first_col(survey2024_raw, "^Gender$|Gender")
so_visit_col_2024 <- find_first_col(survey2024_raw, "SOVisitFreq")
ai_select_col_2024 <- find_first_col(survey2024_raw, "AISelect|SOAI")
message("2023 so visit col: ", so_visit_col_2023)
message("2023 ai col : ", ai_select_col_2023)
message("2024 so visit col: ", so_visit_col_2024)
message("2024 ai col : ", ai_select_col_2024)
# build a clean survey frame for 2023
survey2023 <- survey2023_raw |>
transmute(
year = 2023L,
main_branch = pull_col_or_default(survey2023_raw, main_branch_col_2023),
country = pull_col_or_default(survey2023_raw, country_col_2023),
age = pull_age_numeric(survey2023_raw, age_col_2023),
gender = pull_col_or_default(survey2023_raw, gender_col_2023),
so_visit = pull_col_or_default(survey2023_raw, so_visit_col_2023),
ai_select = pull_col_or_default(survey2023_raw, ai_select_col_2023)
)
# same idea for 2024 (schema is very similar)
survey2024 <- survey2024_raw |>
transmute(
year = 2024L,
main_branch = pull_col_or_default(survey2024_raw, main_branch_col_2024),
country = pull_col_or_default(survey2024_raw, country_col_2024),
age = pull_age_numeric(survey2024_raw, age_col_2024),
gender = pull_col_or_default(survey2024_raw, gender_col_2024),
so_visit = pull_col_or_default(survey2024_raw, so_visit_col_2024),
ai_select = pull_col_or_default(survey2024_raw, ai_select_col_2024)
)
survey_all <- bind_rows(survey2023, survey2024)
# engineer features:
# - binary flag: frequent so visitor
# - binary flag: uses chatgpt as ai tool (from ai_select free text / semicolon list)
# - coarser age groups
survey_model <- survey_all |>
filter(!is.na(so_visit)) |>
mutate(
so_visit = as.character(so_visit),
ai_select = as.character(ai_select),
# frequent so visitor: daily or multiple times per day etc.
frequent_so = dplyr::case_when(
stringr::str_detect(so_visit, regex("multiple times per day", ignore_case = TRUE)) ~ 1L,
stringr::str_detect(so_visit, regex("daily|almost every day", ignore_case = TRUE)) ~ 1L,
TRUE ~ 0L
),
uses_chatgpt = dplyr::case_when(
is.na(ai_select) ~ 0L,
stringr::str_detect(ai_select, regex("chatgpt", ignore_case = TRUE)) ~ 1L,
TRUE ~ 0L
),
age_group = dplyr::case_when(
!is.na(age) & age < 25 ~ "<25",
!is.na(age) & age >= 25 & age < 35 ~ "25-34",
!is.na(age) & age >= 35 & age < 45 ~ "35-44",
!is.na(age) & age >= 45 ~ "45+",
TRUE ~ "unknown"
),
gender = if_else(is.na(gender) | gender == "", "Unknown", gender)
) |>
filter(!is.na(frequent_so)) |>
mutate(
frequent_so = as.integer(frequent_so),
uses_chatgpt = as.integer(uses_chatgpt),
age_group = factor(age_group),
gender = factor(gender),
year = factor(year)
)
glimpse(survey_model)
# SECTION 2: data description + preliminary plots (dataset 1)
# basic time series plot of answers over time (for section 2)
p_answers_ts <- ggplot(answers_monthly, aes(x = month, y = answers_total)) +
geom_line() +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
geom_vline(xintercept = so_ai_policy_date, linetype = "dotted") +
labs(
title = "monthly new answers on stack overflow",
x = "month",
y = "number of answers"
)
print(p_answers_ts)
ggsave(
filename = file.path(imgs_dir, "01_answers_ts.png"),
plot = p_answers_ts,
width = 10, height = 6, units = "in", dpi = 300
)
# boxplot pre vs post chatgpt
p_box_pre_post <- ggplot(answers_monthly, aes(x = period, y = answers_total)) +
geom_boxplot() +
labs(
title = "distribution of monthly answers: pre vs post chatgpt launch",
x = "period",
y = "monthly answers"
)
print(p_box_pre_post)
ggsave(file.path(imgs_dir, "02_box_pre_post.png"), plot = p_box_pre_post, width = 8, height = 6, units = "in", dpi = 300)
# basic summary table
answers_summary_period <- answers_monthly |>
group_by(period) |>
summarise(
n_months = n(),
mean_answers = mean(answers_total),
median_answers = median(answers_total),
sd_answers = sd(answers_total),
min_answers = min(answers_total),
max_answers = max(answers_total),
.groups = "drop"
)
print(answers_summary_period)
# SECTION 3: exploratory analysis
# seasonal pattern: answers by calendar month across years
p_seasonal <- answers_monthly |>
mutate(month_label = factor(month_num, labels = month.abb)) |>
ggplot(aes(x = month_label, y = answers_total, group = year, color = period)) +
geom_line(alpha = 0.6) +
labs(
title = "seasonality of answers by calendar month and year",
x = "calendar month",
y = "monthly answers"
)
print(p_seasonal)
ggsave(file.path(imgs_dir, "03_seasonal.png"), plot = p_seasonal, width = 10, height = 6, units = "in", dpi = 300)
# rolling 3-month moving average to smooth noise
answers_monthly <- answers_monthly |>
arrange(month) |>
mutate(
answers_ma3 = zoo::rollmean(answers_total, k = 3, fill = NA, align = "right")
)
p_ma3 <- ggplot(answers_monthly, aes(x = month)) +
geom_line(aes(y = answers_total), alpha = 0.3) +
geom_line(aes(y = answers_ma3)) +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
labs(
title = "monthly answers with 3-month moving average",
x = "month",
y = "answers"
)
print(p_ma3)
ggsave(file.path(imgs_dir, "04_ma3.png"), plot = p_ma3, width = 10, height = 6, units = "in", dpi = 300)
# simple percentage change around chatgpt launch
pre_window <- answers_monthly |>
filter(
month >= chatgpt_launch_date - months(6),
month < chatgpt_launch_date
)
post_window <- answers_monthly |>
filter(
month >= chatgpt_launch_date,
month < chatgpt_launch_date + months(6)
)
pre_mean <- mean(pre_window$answers_total)
post_mean <- mean(post_window$answers_total)
pct_change <- (post_mean - pre_mean) / pre_mean * 100
pct_change
# survey exploratory: relation between ai usage and so visit frequency
survey_counts <- survey_model |>
    mutate(
        uses_chatgpt_label = if_else(uses_chatgpt == 1L, "uses chatgpt", "does not use chatgpt"),
        freq_label = if_else(frequent_so == 1L, "visits so daily", "visits so less often")
    ) |>
    count(year, uses_chatgpt_label, freq_label) |>
    group_by(year, uses_chatgpt_label) |>
    mutate(prop = n / sum(n)) |>
    ungroup()
p_survey_bar <- ggplot(survey_counts, aes(x = uses_chatgpt_label, y = prop, fill = freq_label)) +
    geom_col(position = "fill") +
    facet_wrap(~year) +
    scale_y_continuous(labels = scales::percent_format()) +
    labs(
        title = "relationship between chatgpt use and stack overflow visit frequency (survey)",
        x = "ai usage segment",
        y = "share of respondents",
        fill = "so visit frequency"
    )
print(p_survey_bar)
ggsave(file.path(imgs_dir, "08_survey_bar.png"), plot = p_survey_bar, width = 10, height = 6, units = "in", dpi = 300)
# SECTION 4: model development (four different model types)

# MODEL 1: interrupted time series linear regression
# outcome: monthly answers_total
# predictors: time trend, post_chatgpt level change, slope change after chatgpt
its_data <- answers_monthly |>
    mutate(
        time = time_index,
        chatgpt_time = if_else(
            month >= chatgpt_launch_date,
            time_index - min(time_index[month >= chatgpt_launch_date]) + 1L,
            0L
        )
    )
model_lm <- lm(
    answers_total ~ time + post_chatgpt + chatgpt_time,
    data = its_data
)
summary(model_lm)
tidy(model_lm)
glance(model_lm)

# predictions and plot
its_data <- its_data |>
    mutate(
        lm_fitted = predict(model_lm)
    )
p_lm_fit <- ggplot(its_data, aes(x = month)) +
    geom_line(aes(y = answers_total), alpha = 0.4) +
    geom_line(aes(y = lm_fitted), color = "blue") +
    geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
    labs(
        title = "interrupted time series regression: observed vs fitted answers",
        x = "month",
        y = "answers"
    )
print(p_lm_fit)
ggsave(file.path(imgs_dir, "05_lm_fit.png"), plot = p_lm_fit, width = 10, height = 6, units = "in", dpi = 300)
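A common follow-up to an interrupted time series fit, not included in the script, is a counterfactual projection: re-predict with the intervention terms zeroed out so the pre-break trend extends through the post period. A minimal self-contained sketch on simulated data (every name and coefficient below is hypothetical, not from the project):

```r
set.seed(1)
n <- 60
toy <- data.frame(time = 1:n, post = as.integer(1:n > 40))
toy$post_time <- ifelse(toy$post == 1L, toy$time - 40L, 0L)
# simulate a series with a level drop and a slope change at t = 41
toy$y <- 200 - 0.8 * toy$time - 15 * toy$post - 2 * toy$post_time + rnorm(n, sd = 5)
fit <- lm(y ~ time + post + post_time, data = toy)
# counterfactual: same time trend, intervention terms set to zero
cf <- transform(toy, post = 0L, post_time = 0L)
toy$counterfactual <- predict(fit, newdata = cf)
# the counterfactual sits above the fitted post-break values here,
# because both simulated intervention effects are negative
mean(toy$counterfactual[41:60] - fitted(fit)[41:60])
```

The gap between the counterfactual line and the fitted post-break line is one way to express the estimated cumulative effect of the intervention.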
# MODEL 2: poisson regression for count data
model_pois <- glm(
    answers_total ~ time + post_chatgpt + chatgpt_time,
    data = its_data,
    family = poisson(link = "log")
)
summary(model_pois)
tidy(model_pois, exponentiate = TRUE) # exp(coef) ~ multiplicative effect

# compare predicted counts
its_data <- its_data |>
    mutate(
        pois_fitted = predict(model_pois, type = "response")
    )
p_pois_fit <- ggplot(its_data, aes(x = month)) +
    geom_line(aes(y = answers_total), alpha = 0.3) +
    geom_line(aes(y = pois_fitted), color = "red") +
    geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
    labs(
        title = "poisson regression: observed vs predicted monthly answers",
        x = "month",
        y = "answers"
    )
print(p_pois_fit)
ggsave(file.path(imgs_dir, "06_pois_fit.png"), plot = p_pois_fit, width = 10, height = 6, units = "in", dpi = 300)
# MODEL 3: arima time series forecast (pre-chatgpt vs actual)
# construct monthly ts object (frequency = 12)
start_year <- year(min(answers_monthly$month))
start_month <- month(min(answers_monthly$month))
answers_ts <- ts(
    answers_monthly$answers_total,
    start = c(start_year, start_month),
    frequency = 12
)

# train on pre-chatgpt data (through october 2022) and forecast forward
train_end <- c(2022, 10)
train_ts <- window(answers_ts, end = train_end)
test_ts <- window(answers_ts, start = c(2022, 11))
arima_fit <- auto.arima(train_ts)
summary(arima_fit)
h <- length(test_ts)
fc <- forecast(arima_fit, h = h)

# compare forecast vs actual on the holdout period
fc_df <- data.frame(
    month = answers_monthly$month[answers_monthly$month >= as.Date("2022-11-01")],
    actual = as.numeric(test_ts),
    forecast = as.numeric(fc$mean),
    lower_80 = as.numeric(fc$lower[, "80%"]),
    upper_80 = as.numeric(fc$upper[, "80%"])
)
p_arima <- ggplot(fc_df, aes(x = month)) +
    geom_ribbon(aes(ymin = lower_80, ymax = upper_80), alpha = 0.2) + # drawn first so the lines stay visible
    geom_line(aes(y = actual), alpha = 0.6) +
    geom_line(aes(y = forecast), linetype = "dashed") +
    labs(
        title = "arima forecast (trained on pre-chatgpt) vs actual answers",
        x = "month",
        y = "answers"
    )
print(p_arima)
ggsave(file.path(imgs_dir, "07_arima_forecast.png"), plot = p_arima, width = 10, height = 6, units = "in", dpi = 300)

# simple accuracy metrics on the holdout
fc_accuracy <- accuracy(fc, test_ts)
print(fc_accuracy)
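To make explicit what `accuracy()` reports, the holdout RMSE can be hand-computed; a toy example with made-up actual/forecast vectors (not the project's values):

```r
# hand-computed RMSE on hypothetical toy vectors
actual_toy <- c(100, 90, 80)
forecast_toy <- c(110, 95, 70)
err <- actual_toy - forecast_toy
rmse_toy <- sqrt(mean(err^2)) # sqrt((100 + 25 + 100) / 3) = sqrt(75)
stopifnot(abs(rmse_toy - sqrt(75)) < 1e-9)
```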
# MODEL 4: logistic regression (does using chatgpt predict being a frequent stack overflow visitor?)
set.seed(123)
survey_model_complete <- survey_model |>
    filter(!is.na(uses_chatgpt), !is.na(frequent_so))
candidate_predictors <- c("uses_chatgpt", "age_group", "gender", "year")
valid_predictors <- candidate_predictors[sapply(
    candidate_predictors,
    function(col) dplyr::n_distinct(survey_model_complete[[col]], na.rm = TRUE) > 1
)]
drop_predictors <- setdiff(candidate_predictors, valid_predictors)
if (length(drop_predictors) > 0) {
    message("dropping predictors with <2 levels: ", paste(drop_predictors, collapse = ", "))
}
logit_formula <- if (length(valid_predictors) == 0) {
    frequent_so ~ 1
} else {
    as.formula(paste("frequent_so ~", paste(valid_predictors, collapse = " + ")))
}
n <- nrow(survey_model_complete)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
survey_train <- survey_model_complete[train_idx, ]
survey_test <- survey_model_complete[-train_idx, ]
positive_rate <- mean(survey_train$frequent_so, na.rm = TRUE)
# use the training prevalence as the classification cutoff, falling back to 0.5
classification_threshold <- dplyr::case_when(
    is.na(positive_rate) ~ 0.5,
    positive_rate <= 0 ~ 0.5,
    positive_rate >= 1 ~ 0.5,
    TRUE ~ positive_rate
)
message(
    "classification threshold (training frequent_so share): ",
    round(classification_threshold, 3)
)
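The prevalence-based cutoff above trades precision for recall relative to the conventional 0.5 threshold: with an imbalanced outcome, more borderline cases get flagged positive. A toy illustration with hypothetical predicted probabilities (the 0.384 base rate matches the training share reported in the console log):

```r
# hypothetical predicted probabilities for five respondents
probs <- c(0.30, 0.35, 0.40, 0.45, 0.55)
base_rate <- 0.384 # training share of frequent visitors, per the log
pred_half <- as.integer(probs >= 0.5)       # flags 1 respondent
pred_base <- as.integer(probs >= base_rate) # flags 3 respondents
stopifnot(sum(pred_half) == 1L, sum(pred_base) == 3L)
```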
logit_model <- glm(
    formula = logit_formula,
    family = binomial(link = "logit"),
    data = survey_train
)
summary(logit_model)
tidy(logit_model, exponentiate = TRUE, conf.int = TRUE)

# predict on test set
survey_test <- survey_test |>
    mutate(
        pred_prob = predict(logit_model, newdata = survey_test, type = "response"),
        pred_class = if_else(pred_prob >= classification_threshold, 1L, 0L)
    )

# confusion matrix and simple metrics
conf_mat <- table(
    truth = factor(survey_test$frequent_so, levels = c(0, 1)),
    pred = factor(survey_test$pred_class, levels = c(0, 1))
)
print(conf_mat)
tp <- conf_mat["1", "1"]
tn <- conf_mat["0", "0"]
fp <- conf_mat["0", "1"]
fn <- conf_mat["1", "0"]
accuracy <- (tp + tn) / sum(conf_mat)
precision <- if ((tp + fp) > 0) tp / (tp + fp) else NA_real_
recall <- if ((tp + fn) > 0) tp / (tp + fn) else NA_real_
print(list(
    accuracy = accuracy,
    precision = precision,
    recall = recall
))
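The precision and recall above can be folded into an F1 score, which the script does not compute. Plugging in the test-set confusion-matrix cells recorded in the console log:

```r
# confusion-matrix cells taken from out.log (tp/tn/fp/fn as defined above)
tp <- 7218; tn <- 7560; fp <- 10549; fn <- 4009
precision_f <- tp / (tp + fp) # ~0.406, matching the logged precision
recall_f <- tp / (tp + fn)    # ~0.643, matching the logged recall
f1 <- 2 * precision_f * recall_f / (precision_f + recall_f)
round(f1, 3) # ~0.498
```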
# visual: predicted probability vs ai usage
p_logit_probs <- survey_test |>
    mutate(uses_chatgpt_label = if_else(uses_chatgpt == 1L, "uses chatgpt", "does not use chatgpt")) |>
    ggplot(aes(x = uses_chatgpt_label, y = pred_prob)) +
    geom_boxplot() +
    labs(
        title = "predicted probability of being a frequent so visitor by chatgpt use",
        x = "ai usage segment",
        y = "predicted probability (logistic model)"
    )
print(p_logit_probs)
ggsave(file.path(imgs_dir, "09_logit_probs.png"), plot = p_logit_probs, width = 8, height = 6, units = "in", dpi = 300)

# save key tables and model outputs to disk for the report
write_csv(answers_monthly, file.path(data_dir, "answers_monthly_clean.csv"))
write_csv(answers_summary_period, file.path(data_dir, "answers_summary_period.csv"))
write_csv(survey_counts, file.path(data_dir, "survey_ai_vs_so_visit.csv"))
saveRDS(model_lm, file.path(data_dir, "model_lm_its.rds"))
saveRDS(model_pois, file.path(data_dir, "model_pois.rds"))
saveRDS(arima_fit, file.path(data_dir, "model_arima_prechatgpt.rds"))
saveRDS(logit_model, file.path(data_dir, "model_logit_survey.rds"))
[Running] Rscript "/home/ion606/Desktop/Homework/Data Analytics/Assignment IV/analysis.r"
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
[1] "month" "status" "new_answers"
# A tibble: 6 × 3
month status new_answers
<date> <chr> <dbl>
1 2018-01-01 deleted 26
2 2018-01-01 non-deleted 159
3 2018-02-01 deleted 20
4 2018-02-01 non-deleted 175
5 2018-03-01 deleted 18
6 2018-03-01 non-deleted 193
Rows: 95
Columns: 11
$ month <date> 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01, 2…
$ answers_total <dbl> 185, 195, 211, 221, 227, 189, 149, 179, 198, 232, …
$ answers_non_deleted <dbl> 159, 175, 193, 191, 203, 172, 133, 154, 170, 198, …
$ answers_deleted <dbl> 26, 20, 18, 30, 24, 17, 16, 25, 28, 34, 20, 45, 33…
$ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 20…
$ month_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,…
$ time_index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ post_chatgpt <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ post_ai_policy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ during_mod_strike <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ period <chr> "pre_chatgpt", "pre_chatgpt", "pre_chatgpt", "pre_…
file already exists: data/stack-overflow-developer-survey-2023.zip
file already exists: data/stack-overflow-developer-survey-2024.zip
[1] "ResponseId" "Q120"
[3] "MainBranch" "Age"
[5] "Employment" "RemoteWork"
[7] "CodingActivities" "EdLevel"
[9] "LearnCode" "LearnCodeOnline"
[11] "LearnCodeCoursesCert" "YearsCode"
[13] "YearsCodePro" "DevType"
[15] "OrgSize" "PurchaseInfluence"
[17] "TechList" "BuyNewTool"
[19] "Country" "Currency"
[21] "CompTotal" "LanguageHaveWorkedWith"
[23] "LanguageWantToWorkWith" "DatabaseHaveWorkedWith"
[25] "DatabaseWantToWorkWith" "PlatformHaveWorkedWith"
[27] "PlatformWantToWorkWith" "WebframeHaveWorkedWith"
[29] "WebframeWantToWorkWith" "MiscTechHaveWorkedWith"
[31] "MiscTechWantToWorkWith" "ToolsTechHaveWorkedWith"
[33] "ToolsTechWantToWorkWith" "NEWCollabToolsHaveWorkedWith"
[35] "NEWCollabToolsWantToWorkWith" "OpSysPersonal use"
[37] "OpSysProfessional use" "OfficeStackAsyncHaveWorkedWith"
[39] "OfficeStackAsyncWantToWorkWith" "OfficeStackSyncHaveWorkedWith"
[41] "OfficeStackSyncWantToWorkWith" "AISearchHaveWorkedWith"
[43] "AISearchWantToWorkWith" "AIDevHaveWorkedWith"
[45] "AIDevWantToWorkWith" "NEWSOSites"
[47] "SOVisitFreq" "SOAccount"
[49] "SOPartFreq" "SOComm"
[51] "SOAI" "AISelect"
[53] "AISent" "AIAcc"
[55] "AIBen" "AIToolInterested in Using"
[57] "AIToolCurrently Using" "AIToolNot interested in Using"
[59] "AINextVery different" "AINextNeither different nor similar"
[61] "AINextSomewhat similar" "AINextVery similar"
[63] "AINextSomewhat different" "TBranch"
[65] "ICorPM" "WorkExp"
[67] "Knowledge_1" "Knowledge_2"
[69] "Knowledge_3" "Knowledge_4"
[71] "Knowledge_5" "Knowledge_6"
[73] "Knowledge_7" "Knowledge_8"
[75] "Frequency_1" "Frequency_2"
[77] "Frequency_3" "TimeSearching"
[79] "TimeAnswering" "ProfessionalTech"
2023 so visit col: SOVisitFreq
2023 ai col : SOAI
2024 so visit col: SOVisitFreq
2024 ai col : AISelect
Rows: 146,676
Columns: 10
$ year <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
$ main_branch <chr> "I am a developer by profession", "I am a developer by pr…
$ country <chr> "United States of America", "United States of America", "…
$ age <dbl> 25, 45, 25, 25, 35, 35, 25, 45, 25, 25, 25, 25, 35, 25, 3…
$ gender <fct> Unknown, Unknown, Unknown, Unknown, Unknown, Unknown, Unk…
$ so_visit <chr> "Daily or almost daily", "A few times per month or weekly…
$ ai_select <chr> "I don't think it's super necessary, but I think improvin…
$ frequent_so <int> 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
$ uses_chatgpt <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, …
$ age_group <fct> 25-34, 45+, 25-34, 25-34, 35-44, 35-44, 25-34, 45+, 25-34…
# A tibble: 2 × 7
period n_months mean_answers median_answers sd_answers min_answers max_answers
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 post_… 36 90.5 88 38.0 11 157
2 pre_c… 59 193. 185 44.7 122 313
Warning message:
Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning message:
Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
[1] -10.02227
Call:
lm(formula = answers_total ~ time + post_chatgpt + chatgpt_time,
data = its_data)
Residuals:
Min 1Q Median 3Q Max
-76.623 -22.914 -3.868 13.431 123.402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 218.8013 9.3214 23.473 < 2e-16 ***
time -0.8589 0.2702 -3.179 0.002022 **
post_chatgptTRUE -17.9635 15.0779 -1.191 0.236601
chatgpt_time -2.3661 0.6282 -3.767 0.000293 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 35.35 on 91 degrees of freedom
Multiple R-squared: 0.717, Adjusted R-squared: 0.7077
F-statistic: 76.86 on 3 and 91 DF, p-value: < 2.2e-16
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 219. 9.32 23.5 2.23e-40
2 time -0.859 0.270 -3.18 2.02e- 3
3 post_chatgptTRUE -18.0 15.1 -1.19 2.37e- 1
4 chatgpt_time -2.37 0.628 -3.77 2.93e- 4
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.717 0.708 35.3 76.9 7.39e-25 3 -471. 953. 966.
# 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Call:
glm(formula = answers_total ~ time + post_chatgpt + chatgpt_time,
family = poisson(link = "log"), data = its_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.3936301 0.0183909 293.277 < 2e-16 ***
time -0.0044547 0.0005512 -8.082 6.38e-16 ***
post_chatgptTRUE -0.0187737 0.0365851 -0.513 0.608
chatgpt_time -0.0322028 0.0018440 -17.464 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2879.9 on 94 degrees of freedom
Residual deviance: 713.8 on 91 degrees of freedom
AIC: 1363
Number of Fisher Scoring iterations: 4
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 220. 0.0184 293. 0
2 time 0.996 0.000551 -8.08 6.38e-16
3 post_chatgptTRUE 0.981 0.0366 -0.513 6.08e- 1
4 chatgpt_time 0.968 0.00184 -17.5 2.71e-68
Series: train_ts
ARIMA(1,1,0)(1,0,0)[12]
Coefficients:
ar1 sar1
-0.3956 0.3016
s.e. 0.1360 0.1381
sigma^2 = 1142: log likelihood = -281.17
AIC=568.34 AICc=568.8 BIC=574.47
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.1691686 32.90678 26.65938 -1.989033 14.30025 0.5170032
ACF1
Training set 0.03124461
ME RMSE MAE MPE MAPE MASE
Training set -0.1691686 32.90678 26.65938 -1.989033 14.30025 0.5170032
Test set -78.4100374 89.26691 79.26493 -171.518981 171.98870 1.5371782
ACF1 Theil's U
Training set 0.03124461 NA
Test set 0.73383075 7.11443
dropping predictors with <2 levels: gender
classification threshold (training frequent_so share): 0.384
Call:
glm(formula = logit_formula, family = binomial(link = "logit"),
data = survey_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.358743 0.013009 -27.577 < 2e-16 ***
uses_chatgpt -0.006783 0.066977 -0.101 0.91933
age_group25-34 0.040677 0.015439 2.635 0.00842 **
age_group35-44 -0.207571 0.017478 -11.876 < 2e-16 ***
age_group45+ -0.345739 0.020289 -17.041 < 2e-16 ***
age_groupunknown -0.222739 0.096177 -2.316 0.02056 *
year2024 -0.082452 0.012319 -6.693 2.18e-11 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 156271 on 117339 degrees of freedom
Residual deviance: 155647 on 117333 degrees of freedom
AIC: 155661
Number of Fisher Scoring iterations: 4
# A tibble: 7 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.699 0.0130 -27.6 2.10e-167 0.681 0.717
2 uses_chatgpt 0.993 0.0670 -0.101 9.19e- 1 0.870 1.13
3 age_group25-34 1.04 0.0154 2.63 8.42e- 3 1.01 1.07
4 age_group35-44 0.813 0.0175 -11.9 1.57e- 32 0.785 0.841
5 age_group45+ 0.708 0.0203 -17.0 4.09e- 65 0.680 0.736
6 age_groupunknown 0.800 0.0962 -2.32 2.06e- 2 0.662 0.965
7 year2024 0.921 0.0123 -6.69 2.18e- 11 0.899 0.943
pred
truth 0 1
0 7560 10549
1 4009 7218
$accuracy
[1] 0.5037497
$precision
[1] 0.4062588
$recall
[1] 0.6429144
[Done] exited with code=0 in 12.272 seconds
6. Oral Presentation (5%). Plan for a ~5 minute presentation; slides must cover the
following:
a). Title (with your name)
b). Problem area: what you wanted to explore, solve, or predict, and why.
c). The data: where it came from, why it was applicable, and the preliminary assessments
you made.
d). How you conducted your analysis: distribution, pattern/relationship, and model
construction. What techniques did you use or not use, and why? What worked? What did not
work? How did you apply the model? How did you optimize and account for uncertainties?
e). What you predicted and what decisions (prescriptions) were possible. What was the
outcome, and what conclusions did you draw?