2 Commits

Author SHA1 Message Date
ION606 2806905ccf Merge branch 'transfer' of https://git.ion606.com/ION606/Data-Analytics into transfer 2025-11-14 15:54:16 -05:00
ION606 8ef9cb2d6e added assignment 5 2025-11-14 15:53:48 -05:00
44 changed files with 1697 additions and 1992 deletions
-2
@@ -1,2 +0,0 @@
data/
1.json
-4
@@ -1,4 +0,0 @@
{
"r.linting.lineLength": false,
"r.editor.tabSize": 4,
}
@@ -1,194 +0,0 @@
# Data Analytics Fall 2025 Assignment IV
## Measuring How Generative AI Adoption Reshaped Stack Overflow Participation, 2018–2025
Itamar Oren-Naftalovich
<!-- **Repository Artifacts:** `analysis.r`, `data/*.csv`, `imgs/*.png`, `out.log` (model console output) -->
---
## 1. Abstract and Introduction
On 30 November 2022, ChatGPT became publicly available. Within days, the Stack Overflow community faced two major shocks: developers suddenly had a new source of code-specific answers, and Stack Overflow introduced a temporary ban on AI-generated content on 5 December 2022 while already struggling with limited (and widely criticized) moderation capacity. In this project I examine whether the combination of generative AI adoption and these policy changes produced a statistically detectable shift in Stack Overflow content creation, and whether developers who say they use ChatGPT still treat Stack Overflow as a daily resource.
My initial hypothesis was that monthly answer counts would show a break after the ChatGPT launch and AI policy ban, even after controlling for the pre-2022 downward trend. I also expected that respondents who explicitly name ChatGPT as an AI assistant would be less likely to visit Stack Overflow daily. To test these ideas, I built two complementary datasets:
1. A Stack Overflow Data Explorer (SEDE) export of monthly deleted and non-deleted answer counts from January 2018 through November 2025
2. Microdata from the 2023 and 2024 Stack Overflow Developer Surveys, which record both visit frequency and generative AI usage
If you want to see *how* I did this (the code), see `analysis.r`.
The analysis relies on four modeling strategies: an interrupted time-series (ITS) linear regression, a Poisson regression for counts, a seasonal ARIMA model trained only on pre-ChatGPT data, and a logistic regression relating survey-reported AI usage to daily Stack Overflow visitation. Taken together, these models indicate that Stack Overflow answer production fell by roughly 53% in the post-ChatGPT period (mean 90.5 vs. 193.0 answers per month). At the same time, daily visitors are increasingly concentrated in older age cohorts, and survey respondents who explicitly mention ChatGPT do not differ meaningfully from others in how often they visit the site. The following sections describe the datasets, exploratory patterns, modeling choices, and implications for the community.
---
## 2. Data Description and Preliminary Analysis
### 2.1 Stack Overflow Answer Volume (Dataset 1)
* **Source and Scope**
`data/so_new_answers_per_month_2018_2025.csv` is a SEDE export of every new answer (deleted and non-deleted) by month from January 2018 through November 2025 (95 monthly observations). The script standardizes month formats, aggregates across deletion statuses, and adds indicators for the ChatGPT release (30 Nov 2022), the AI policy ban (5 Dec 2022), and the Stack Exchange moderator strike (5 Jun–7 Aug 2023).
* **Variables**
After cleaning, the main table `answers_monthly` contains `answers_total`, `answers_non_deleted`, `answers_deleted`, calendar year and month, a sequential `time_index`, binary indicators for the events listed above, and a categorical `period` flagging pre- vs. post-ChatGPT months. A 3-month moving average (`answers_ma3`) is computed to smooth short-term noise for exploratory plots. The construction of these derived columns is excerpted after this list.
* **Quality Checks**
Duplicate rows were removed by grouping on `month`, and all transformations are recorded in `out.log`. The only missing values arise in the first two moving-average entries, which plotting functions simply omit. Because SEDE distinguishes deleted from non-deleted answers, the analysis keeps both so that any changes in moderation are visible in the time series.
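For reference, the derived columns are constructed in `analysis.r` roughly as follows (abridged):

```r
# abridged from analysis.r: monthly aggregation plus event indicators and the period flag
answers_monthly <- answers_raw |>
  group_by(month) |>
  summarise(
    answers_total       = sum(new_answers, na.rm = TRUE),
    answers_non_deleted = sum(if_else(status == "non-deleted", new_answers, 0L)),
    answers_deleted     = sum(if_else(status == "deleted", new_answers, 0L)),
    .groups = "drop"
  ) |>
  arrange(month) |>
  mutate(
    time_index        = row_number(),
    post_chatgpt      = month >= chatgpt_launch_date,    # 2022-11-30
    post_ai_policy    = month >= so_ai_policy_date,      # 2022-12-05
    during_mod_strike = month >= so_mod_strike_start & month <= so_mod_strike_end,
    period            = case_when(
      month < chatgpt_launch_date ~ "pre_chatgpt",
      TRUE                        ~ "post_chatgpt"
    )
  )
```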
![Figure 1. Monthly Stack Overflow answers with ChatGPT (dashed) and AI policy (dotted) markers.](imgs/01_answers_ts.png)
*Figure 1. Monthly answer counts follow a long downward trend that becomes steeper after November 2022.*
![Figure 2. Distribution of monthly answers pre- vs. post-ChatGPT.](imgs/02_box_pre_post.png)
*Figure 2. Box plots highlight the magnitude of the drop between the pre- and post-ChatGPT regimes.*
**Table 1. Descriptive Statistics by Regime (source: `data/answers_summary_period.csv`)**
| period | n_months | mean_answers | median_answers | sd_answers | min_answers | max_answers |
| ------------ | -------- | ------------ | -------------- | ---------- | ----------- | ----------- |
| pre_chatgpt | 59 | 193.0 | 185 | 44.7 | 122 | 313 |
| post_chatgpt | 36 | 90.5 | 88 | 38.0 | 11 | 157 |
A quick comparison of the six months immediately before and after 30 November 2022 shows only a 10.0% decline in average answers, suggesting that the full 53% decline in Table 1 unfolded gradually across 2023–2025 rather than occurring instantly. This gradual pattern is one reason for using time-series models instead of treating the policy change as a simple before/after difference.
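The window comparison is computed in `analysis.r` as:

```r
# from analysis.r: mean answers in the six months before vs. after the ChatGPT launch
pre_window  <- filter(answers_monthly, month >= chatgpt_launch_date - months(6), month < chatgpt_launch_date)
post_window <- filter(answers_monthly, month >= chatgpt_launch_date, month < chatgpt_launch_date + months(6))
(mean(post_window$answers_total) - mean(pre_window$answers_total)) / mean(pre_window$answers_total) * 100
#> -10.0 (see out.log)
```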
### 2.2 Stack Overflow Developer Survey (Dataset 2)
* **Source and Scope.**
The second dataset uses the publicly released 2023 and 2024 Stack Overflow Developer Survey microdata (`stack-overflow-developer-survey-2023.zip` and `stack-overflow-developer-survey-2024.zip`, downloaded 19 November 2025). Combined, these files contain 146,676 responses from professional and hobbyist developers worldwide.
* **Schema Harmonization.**
Column names differ slightly across years (for example, `SOAI` vs. `AISelect`), so helper functions search for the first matching column for each concept. The harmonized frame retains `year`, `main_branch`, `country`, numeric `age`, `gender`, reported Stack Overflow visit frequency (`so_visit`), and free-text AI assistant preferences (`ai_select`).
* **Feature Engineering**
Two binary indicators are constructed: `frequent_so` (1 if the respondent reports visiting Stack Overflow daily or multiple times per day) and `uses_chatgpt` (1 if the string “ChatGPT” appears anywhere in `ai_select`). Age is grouped into buckets (`<25`, `25-34`, `35-44`, `45+`, `unknown`), and gender is collapsed into a simplified label to absorb inconsistent free-text entries. The construction of these indicators is excerpted after this list.
* **Sample Considerations**
Because the 2024 instrument asks about AI search preferences rather than naming specific tools, only 1,181 respondents in 2023 explicitly mention ChatGPT and almost none do in 2024. This change in wording is treated as a measurement artifact and revisited as a source of bias in Sections 3 and 4.
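The two indicators and the age buckets are built in `analysis.r` roughly as follows (abridged):

```r
# abridged from analysis.r: engineered survey indicators
survey_model <- survey_all |>
  filter(!is.na(so_visit)) |>
  mutate(
    frequent_so = case_when(
      str_detect(so_visit, regex("multiple times per day", ignore_case = TRUE)) ~ 1L,
      str_detect(so_visit, regex("daily|almost every day", ignore_case = TRUE)) ~ 1L,
      TRUE ~ 0L
    ),
    uses_chatgpt = case_when(
      is.na(ai_select)                                            ~ 0L,
      str_detect(ai_select, regex("chatgpt", ignore_case = TRUE)) ~ 1L,
      TRUE                                                        ~ 0L
    ),
    age_group = case_when(
      !is.na(age) & age < 25 ~ "<25",
      !is.na(age) & age < 35 ~ "25-34",
      !is.na(age) & age < 45 ~ "35-44",
      !is.na(age)            ~ "45+",
      TRUE                   ~ "unknown"
    )
  )
```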
---
## 3. Exploratory Analysis
### 3.1 Seasonal and Trend Patterns in Answer Volume
The `answers_monthly` series preserves the familiar seasonal dip every December, but the overall level shifts downward after 2022. As Figure 3 shows, even typically slow months such as July now fall below 60 answers, compared with roughly 150–220 answers in earlier years.
![Figure 3. Seasonality of Stack Overflow answers by calendar month and period.](imgs/03_seasonal.png)
*Figure 3. Post-ChatGPT seasons follow a similar seasonal shape but sit on a much lower baseline.*
The 3-month moving average in Figure 4 provides additional context. It peaks near 210 answers in mid-2018, drifts below 150 answers by late 2021, crosses under 100 answers in August 2023, and reaches about 23 answers by November 2025. The timing of the Stack Exchange moderator strike (June–August 2023) aligns with the first extended period below 100 answers per month, hinting at compounding effects from generative AI substitution and reduced moderation capacity.
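The smoothing step comes straight from `analysis.r`:

```r
# from analysis.r: right-aligned 3-month moving average plotted in Figure 4
answers_monthly <- answers_monthly |>
  arrange(month) |>
  mutate(answers_ma3 = zoo::rollmean(answers_total, k = 3, fill = NA, align = "right"))
```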
![Figure 4. Raw answers (faint) vs. 3-month moving average.](imgs/04_ma3.png)
*Figure 4. The smoothed series marks a clear structural break soon after the ChatGPT launch and policy ban.*
### 3.2 Survey Signals on Engagement and AI Adoption
A stacked bar chart (Figure 5) summarizes how daily Stack Overflow visitation relates to ChatGPT usage. In 2023, daily visitation rates are essentially identical for explicit ChatGPT users (39.1%) and non-users (also 39.1%), suggesting that early adopters of ChatGPT continued to visit Stack Overflow at similar rates while experimenting with AI. By 2024, daily visitation among respondents who *do not* mention ChatGPT falls to 37.3%. The near-absence of explicit ChatGPT mentions that year, however, is driven by the different survey question wording rather than a real disappearance of the tool. This reinforces the idea that self-reported tool usage is noisy and needs to be combined with behavioral indicators like monthly answer counts.
![Figure 5. Share of respondents visiting Stack Overflow daily, split by ChatGPT usage, 2023–2024.](imgs/08_survey_bar.png)
*Figure 5. Small differences between groups and across years illustrate how limited the AI usage field is for explaining engagement.*
### 3.3 Sources of Uncertainty and Bias
Several sources of uncertainty shape the analysis:
* **Measurement bias.**
SEDE relies on Stack Overflow's internal logging. Deleted answers can be removed retroactively, so counts for the most recent months remain somewhat fluid.
* **Event alignment.**
The interrupted time-series design treats 30 November 2022 as the breakpoint between regimes, but the 2023 moderator strike and evolving AI policies create overlapping shocks that blur a clean “pre vs. post” distinction.
* **Survey sampling.**
The developer survey is voluntary, conducted in English, and heavily skewed toward respondents in North America and India. Age and tool usage are self-reported, and the 2024 wording change likely undercounts ChatGPT adoption.
* **Missingness.**
“Prefer not to say” responses in age and gender are mapped to `NA` or `Unknown`, which softens demographic differences in downstream models.
These limitations motivated the use of several modeling approaches in Section 4 instead of relying on a single model family.
---
## 4. Model Development and Application of Models
Each model addresses a slightly different question about Stack Overflow activity. All diagnostics and figures are produced directly by `analysis.r` and saved in `imgs/`.
### 4.1 Interrupted Time-Series Linear Regression
* **Specification.**
The primary linear model is
`answers_total ~ time + post_chatgpt + chatgpt_time`,
where `time` is the number of months since January 2018 and `chatgpt_time` starts counting from 1 in December 2022, allowing the post-ChatGPT slope to differ from the pre-ChatGPT trend (the construction is excerpted below).
* **Results.**
The model explains 71.7% of the variance (adjusted R² = 0.708, σ = 35.3). Before ChatGPT, monthly answers were already declining by 0.86 answers per month (p = 0.002). After November 2022, the slope becomes steeper by an additional 2.37 answers per month (p < 0.001). The immediate level drop of roughly 18 answers at the breakpoint is not statistically significant (p = 0.24).
* **Interpretation.**
Rather than a sudden cliff, the data show an acceleration of an existing decline. The post-ChatGPT trend sheds an additional 2.37 answers per month relative to the pre-2023 trajectory, for a total slope of roughly 3.2 answers lost per month; the gap between the two trend lines therefore widens by about 28 answers over the course of a year.
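The segmented-trend variable and the model call, as implemented in `analysis.r`:

```r
# from analysis.r: interrupted time-series design
its_data <- answers_monthly |>
  mutate(
    time = time_index,
    chatgpt_time = if_else(month >= chatgpt_launch_date,
                           time_index - min(time_index[month >= chatgpt_launch_date]) + 1L,
                           0L)
  )
model_lm <- lm(answers_total ~ time + post_chatgpt + chatgpt_time, data = its_data)
summary(model_lm)
```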
![Figure 6. Observed vs. fitted answers under the interrupted time-series model.](imgs/05_lm_fit.png)
*Figure 6. The fitted line captures a gradual erosion in answer volume instead of a single large discontinuity.*
### 4.2 Poisson Regression for Count Data
* **Specification.**
The Poisson model uses the same predictors but applies a log link appropriate for count outcomes; the call is excerpted below.
* **Results.**
The exponentiated time coefficient before ChatGPT is 0.996 (p < 0.001), corresponding to a 0.4% monthly contraction. After the release, the additional post-ChatGPT slope multiplier is 0.968 (p ≈ 2.7 × 10⁻⁶⁸), implying a further 3.2% shrinkage per month. The residual deviance is 713.8 on 91 degrees of freedom, compared with a null deviance of 2,879.9.
* **Interpretation.**
Expressed in percentage terms, the Poisson model tells a similar story to the linear ITS: by late 2025, the expected answer count decays toward single digits if post-2022 dynamics continue unchanged.
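The corresponding call in `analysis.r`:

```r
# from analysis.r: same predictors, log-link Poisson family
model_pois <- glm(answers_total ~ time + post_chatgpt + chatgpt_time,
                  data = its_data, family = poisson(link = "log"))
tidy(model_pois, exponentiate = TRUE)  # exp(coef) = multiplicative monthly effect
```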
![Figure 7. Poisson regression fit vs. observed counts.](imgs/06_pois_fit.png)
*Figure 7. The Poisson model slightly overestimates the lowest post-2024 points, consistent with some dispersion in the counts.*
### 4.3 Seasonal ARIMA Forecasting (Pre-ChatGPT Baseline)
* **Specification.**
To estimate what would have happened without ChatGPT and related policy changes, a seasonal ARIMA model, ARIMA(1,1,0)(1,0,0)[12], is fit only to data through October 2022 (`train_ts`). The model then generates forecasts for November 2022–November 2025, which are compared to the actual counts (the fitting and forecasting steps are excerpted below).
* **Results.**
On the training window, fit statistics are solid (RMSE = 32.9, MASE = 0.52). Out-of-sample, however, errors grow large: RMSE = 89.3, MAE = 79.3, MAPE ≈ 172%, and Theil's U = 7.11. Observed counts soon fall below the 80% prediction interval and remain there, indicating that historical seasonality and trends alone cannot explain the post-2022 decline.
* **Interpretation.**
The ARIMA baseline functions as a counterfactual. Its consistent over-prediction of post-ChatGPT activity reinforces the conclusion that a structural break occurred, rather than a continuation of prior dynamics.
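The counterfactual fit and holdout evaluation, as implemented in `analysis.r`:

```r
# from analysis.r: fit on pre-ChatGPT months only, then score the post-launch holdout
answers_ts <- ts(answers_monthly$answers_total, start = c(2018, 1), frequency = 12)
train_ts   <- window(answers_ts, end = c(2022, 10))
test_ts    <- window(answers_ts, start = c(2022, 11))
arima_fit  <- auto.arima(train_ts)               # selects ARIMA(1,1,0)(1,0,0)[12]
fc         <- forecast(arima_fit, h = length(test_ts))
accuracy(fc, test_ts)                            # training vs. test error measures
```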
![Figure 8. ARIMA forecast (trained through Oct 2022) vs. actual counts.](imgs/07_arima_forecast.png)
*Figure 8. Actual activity diverges from the ARIMA forecast almost immediately and never returns to the predicted band.*
### 4.4 Logistic Regression on Survey Engagement
* **Specification.**
The survey-based model predicts `frequent_so` (daily or multiple-times-per-day visitor). Predictors that retain more than one level after cleaning are `uses_chatgpt`, `age_group`, and `year`. The harmonized `gender` variable collapses to a single `Unknown` level and is therefore dropped automatically. The data are split 80/20 into training and test sets with a fixed seed (123). The decision threshold is set to the training positive rate (0.384) to reduce the impact of class imbalance. The split, threshold, and model call are excerpted below.
* **Results.**
Relative to respondents younger than 25, the odds ratio for the 25–34 group is 1.04 (p = 0.008), while the 35–44 and 45+ groups have odds ratios of 0.81 and 0.71, respectively (both p < 10⁻³²). The ChatGPT usage indicator has an odds ratio of 0.99 (p = 0.92), effectively indistinguishable from 1. The 2024 indicator yields an odds ratio of 0.92 (p ≈ 2.2 × 10⁻¹¹), pointing to a modest overall decline in daily visitation from 2023 to 2024.
On the 29,336-observation test set, the confusion matrix reports 7,560 true negatives, 10,549 false positives, 4,009 false negatives, and 7,218 true positives, giving an accuracy of 50.4%, precision of 40.6%, and recall of 64.3%.
* **Interpretation.**
The weak predictive performance and scarcity of explicit ChatGPT mentions in 2024 both suggest that the current survey instrument is not well suited for isolating the impact of AI usage on Stack Overflow engagement. Age shows a clearer pattern than AI usage: older cohorts are less likely to be daily visitors, while ChatGPT adoption, at least as self-reported in these surveys, does not significantly distinguish frequent users from others.
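An abridged version of the survey model pipeline from `analysis.r` (the script builds the formula dynamically and drops `gender` automatically; the retained predictors are written out explicitly here):

```r
# abridged from analysis.r: 80/20 split, data-driven threshold, logistic fit
set.seed(123)
train_idx    <- sample(seq_len(nrow(survey_model)), size = floor(0.8 * nrow(survey_model)))
survey_train <- survey_model[train_idx, ]
survey_test  <- survey_model[-train_idx, ]
threshold    <- mean(survey_train$frequent_so)   # 0.384 on the training split
logit_model  <- glm(frequent_so ~ uses_chatgpt + age_group + year,
                    family = binomial(link = "logit"), data = survey_train)
pred_prob    <- predict(logit_model, newdata = survey_test, type = "response")
pred_class   <- if_else(pred_prob >= threshold, 1L, 0L)
table(truth = survey_test$frequent_so, pred = pred_class)
```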
![Figure 9. Predicted probabilities of daily Stack Overflow usage by ChatGPT adoption.](imgs/09_logit_probs.png)
*Figure 9. Predicted probability distributions overlap heavily, mirroring the non-significant odds ratio for `uses_chatgpt`.*
---
## 5. Conclusions and Discussion
The evidence points to a structural break in Stack Overflow answer production beginning in late 2022. Average monthly answers drop from 193.0 in the January 2018–November 2022 period to 90.5 between December 2022 and November 2025. The interrupted time-series model shows the decline steepening by about 2.37 answers per month after ChatGPT's release, and the Poisson model implies a post-ChatGPT decay rate of roughly 3.2% per month. ARIMA forecasts trained only on pre-ChatGPT data substantially overestimate post-2022 activity, which reinforces the conclusion that pre-existing seasonal and secular trends cannot account for the observed collapse.
The survey-based models say more about *who* remains active. Despite common assumptions that ChatGPT usage directly crowds out Stack Overflow visits, the current survey data do not show a strong link: the odds ratio for reported ChatGPT usage is essentially 1, and differences in daily visitation are driven more by age and year than by AI adoption. Given the 2024 wording change and the limitations of self-reported tool usage, it would be premature to claim that ChatGPT users as a group have already abandoned Stack Overflow.
Taken together, these findings suggest that any response from Stack Overflow should combine supply-side interventions (such as incentives for high-quality answers and additional moderation support to limit deleted content) with better measurement of how developers actually integrate AI tools and community Q&A into their workflows.
Future work could extend the time-series models with covariates for major product changes (e.g., Collectives, Discussions), incorporate question volume alongside answers, and revisit the survey analysis once the 2025 instrument becomes available. Causal impact methods, such as Bayesian structural time series using the ARIMA forecast as a prior, could offer a more formal estimate of the counterfactual number of answers that would have been produced without the post-2022 shocks.
---
## References
1. Stack Exchange Data Explorer. “New answers (deleted + non-deleted) per month,” query exported 19 Nov 2025 from the Stack Overflow SEDE interface.
2. Stack Overflow. “Stack Overflow Developer Survey 2023” and “Stack Overflow Developer Survey 2024,” datasets accessed 19 Nov 2025 from the Stack Overflow survey site.
3. OpenAI. “Introducing ChatGPT,” OpenAI Blog, 30 Nov 2022.
4. Stack Overflow Meta. “Temporary policy: ChatGPT is banned,” Meta Stack Overflow, 5 Dec 2022.
5. Stack Exchange. “Moderator Strike: Stack Overflow, Stack Exchange Network,” Meta Stack Exchange updates, Jun–Aug 2023.
Binary file not shown.
-683
@@ -1,683 +0,0 @@
# install.packages(
#   c("tidyverse", "lubridate", "broom", "forecast", "stringr", "dplyr", "zoo", "scales"),
#   repos = "http://cran.us.r-project.org"
# )
library(tidyverse)
library(lubridate)
library(broom)
library(forecast)
library(stringr)
library(dplyr)
# directory for data files (adjust if desired)
data_dir <- "data"
if (!dir.exists(data_dir)) {
dir.create(data_dir, recursive = TRUE)
}
# directory for plots
imgs_dir <- "imgs"
if (!dir.exists(imgs_dir)) {
dir.create(imgs_dir, recursive = TRUE)
}
# constants: key event dates related to chatgpt and so policy
# chatgpt public research preview launch
chatgpt_launch_date <- as.Date("2022-11-30") # openai "introducing chatgpt" blog
# stack overflow generative ai ban policy (meta so, 5 dec 2022)
so_ai_policy_date <- as.Date("2022-12-05")
# moderation strike on stack exchange (june–aug 2023) from meta posts
so_mod_strike_start <- as.Date("2023-06-05")
so_mod_strike_end <- as.Date("2023-08-07")
# helper: safe downloader
download_if_missing <- function(url, destfile) {
if (!file.exists(destfile)) {
message("downloading ", basename(destfile), " ...")
download.file(url, destfile, mode = "wb")
message("saved to ", destfile)
} else {
message("file already exists: ", destfile)
}
}
coerce_month_to_date <- function(x) {
if (inherits(x, "Date")) {
return(x)
}
if (inherits(x, "POSIXct")) {
return(lubridate::as_date(x))
}
if (inherits(x, "POSIXlt")) {
return(as.Date(x))
}
if (is.numeric(x)) {
return(as.Date(x, origin = "1970-01-01"))
}
if (is.character(x)) {
parsed <- suppressWarnings(lubridate::ymd_hms(x))
if (all(is.na(parsed))) {
parsed <- suppressWarnings(lubridate::ymd(x))
}
if (all(is.na(parsed))) {
parsed <- suppressWarnings(as.Date(x))
}
return(parsed)
}
suppressWarnings(as.Date(x))
}
# 1) load stack overflow monthly answers (dataset 1)
answers_csv_path <- file.path(data_dir, "so_new_answers_per_month_2018_2025.csv")
if (!file.exists(answers_csv_path)) {
stop(
"missing ", answers_csv_path,
"\nrun the sede query in this script and download the csv to that path first."
)
}
answers_raw <- readr::read_csv(answers_csv_path, show_col_types = FALSE) |>
rename(
month = matches("^Date$|Month", ignore.case = TRUE),
status = matches("^Status$", ignore.case = TRUE),
new_answers = matches("NewAnswers|Count", ignore.case = TRUE)
)
answers_raw <- answers_raw |>
mutate(
month = coerce_month_to_date(month),
status = tolower(status)
)
# inspect column names so you can adjust if sede changes them
print(names(answers_raw))
# expected columns: "Month", "Status", "NewAnswers"
# normalise to lower snake case just in case
answers_raw <- answers_raw |>
rename(
month = matches("Month", ignore.case = TRUE),
status = matches("Status", ignore.case = TRUE),
new_answers = matches("NewAnswers|Count", ignore.case = TRUE)
)
print(head(answers_raw))
# aggregate deleted vs non-deleted into separate columns per month
answers_monthly <- answers_raw |>
mutate(
month = as.Date(month),
status = tolower(status)
) |>
group_by(month) |>
summarise(
answers_total = sum(new_answers, na.rm = TRUE),
answers_non_deleted = sum(if_else(status == "non-deleted", new_answers, 0L)),
answers_deleted = sum(if_else(status == "deleted", new_answers, 0L)),
.groups = "drop"
) |>
arrange(month) |>
mutate(
year = year(month),
month_num = month(month),
time_index = row_number(),
post_chatgpt = month >= chatgpt_launch_date,
post_ai_policy = month >= so_ai_policy_date,
during_mod_strike = month >= so_mod_strike_start & month <= so_mod_strike_end,
period = case_when(
month < chatgpt_launch_date ~ "pre_chatgpt",
TRUE ~ "post_chatgpt"
)
)
glimpse(answers_monthly)
# 2) download and load stack overflow developer survey 2023/2024 (dataset 2)
# official survey zip files as exposed on survey.stackoverflow.co
# these urls are the same ones behind the "download full data set (csv)" links
# see: https://survey.stackoverflow.co/
survey_2023_url <- "https://survey.stackoverflow.co/datasets/stack-overflow-developer-survey-2023.zip"
survey_2024_url <- "https://survey.stackoverflow.co/datasets/stack-overflow-developer-survey-2024.zip"
survey_2023_zip <- file.path(data_dir, "stack-overflow-developer-survey-2023.zip")
survey_2024_zip <- file.path(data_dir, "stack-overflow-developer-survey-2024.zip")
download_if_missing(survey_2023_url, survey_2023_zip)
download_if_missing(survey_2024_url, survey_2024_zip)
# helper to read the "survey_results_public.csv" inside each zip
read_so_survey_from_zip <- function(zip_path, csv_pattern = "survey_results_public.csv") {
if (!file.exists(zip_path)) {
stop("zip file not found: ", zip_path)
}
# list files inside zip (works even when the CSV is in a subfolder)
zlist <- utils::unzip(zip_path, list = TRUE)
# try to find the csv by exact name or by pattern
csv_name <- zlist$Name[stringr::str_detect(zlist$Name, regex(csv_pattern, ignore_case = TRUE))]
if (length(csv_name) == 0) {
stop("could not find a csv matching ", csv_pattern, " inside ", zip_path)
}
csv_name <- csv_name[1] # take first match
# read it without extracting to disk using unz() connection
# optionally supply col_types to speed parsing
df <- readr::read_csv(
unz(zip_path, csv_name),
show_col_types = FALSE,
# col_types = cols(.default = col_character()) # uncomment & customize if you want explicit types
)
df
}
survey2023_raw <- read_so_survey_from_zip(survey_2023_zip)
survey2024_raw <- read_so_survey_from_zip(survey_2024_zip)
# look at column names to locate ai + stackoverflow usage questions
names(survey2023_raw)[1:80]
############################################################
# create a harmonised survey subset focusing on:
# - so visit frequency (column like SOVisitFreq)
# - ai tool usage (column like AISelect or SOAI)
############################################################
find_first_col <- function(df, pattern) {
cols <- names(df)[stringr::str_detect(names(df), regex(pattern, ignore_case = TRUE))]
if (length(cols) == 0) {
return(NA_character_)
}
cols[1]
}
pull_col_or_default <- function(df, col_name, default = NA_character_) {
if (is.na(col_name)) {
return(rep(default, nrow(df)))
}
df[[col_name]]
}
pull_age_numeric <- function(df, col_name) {
vec <- pull_col_or_default(df, col_name, default = NA_real_)
if (is.numeric(vec)) {
return(vec)
}
if (is.factor(vec)) {
vec <- as.character(vec)
}
if (is.character(vec)) {
vec <- stringr::str_trim(vec)
vec[vec == ""] <- NA_character_
vec[stringr::str_detect(vec, regex("prefer not to say", ignore_case = TRUE))] <- NA_character_
# parse_number extracts the leading numeric value (e.g., 25 from "25-34 years old")
return(suppressWarnings(readr::parse_number(vec)))
}
suppressWarnings(as.numeric(vec))
}
main_branch_col_2023 <- find_first_col(survey2023_raw, "^MainBranch$|MainBranch")
country_col_2023 <- find_first_col(survey2023_raw, "^Country$|Country")
age_col_2023 <- find_first_col(survey2023_raw, "^Age$|Age")
gender_col_2023 <- find_first_col(survey2023_raw, "^Gender$|Gender")
so_visit_col_2023 <- find_first_col(survey2023_raw, "SOVisitFreq")
ai_select_col_2023 <- find_first_col(survey2023_raw, "AISelect|SOAI")
main_branch_col_2024 <- find_first_col(survey2024_raw, "^MainBranch$|MainBranch")
country_col_2024 <- find_first_col(survey2024_raw, "^Country$|Country")
age_col_2024 <- find_first_col(survey2024_raw, "^Age$|Age")
gender_col_2024 <- find_first_col(survey2024_raw, "^Gender$|Gender")
so_visit_col_2024 <- find_first_col(survey2024_raw, "SOVisitFreq")
ai_select_col_2024 <- find_first_col(survey2024_raw, "AISelect|SOAI")
message("2023 so visit col: ", so_visit_col_2023)
message("2023 ai col : ", ai_select_col_2023)
message("2024 so visit col: ", so_visit_col_2024)
message("2024 ai col : ", ai_select_col_2024)
# build a clean survey frame for 2023
survey2023 <- survey2023_raw |>
transmute(
year = 2023L,
main_branch = pull_col_or_default(survey2023_raw, main_branch_col_2023),
country = pull_col_or_default(survey2023_raw, country_col_2023),
age = pull_age_numeric(survey2023_raw, age_col_2023),
gender = pull_col_or_default(survey2023_raw, gender_col_2023),
so_visit = pull_col_or_default(survey2023_raw, so_visit_col_2023),
ai_select = pull_col_or_default(survey2023_raw, ai_select_col_2023)
)
# same idea for 2024 (schema is very similar)
survey2024 <- survey2024_raw |>
transmute(
year = 2024L,
main_branch = pull_col_or_default(survey2024_raw, main_branch_col_2024),
country = pull_col_or_default(survey2024_raw, country_col_2024),
age = pull_age_numeric(survey2024_raw, age_col_2024),
gender = pull_col_or_default(survey2024_raw, gender_col_2024),
so_visit = pull_col_or_default(survey2024_raw, so_visit_col_2024),
ai_select = pull_col_or_default(survey2024_raw, ai_select_col_2024)
)
survey_all <- bind_rows(survey2023, survey2024)
# engineer features:
# - binary flag: frequent so visitor
# - binary flag: uses chatgpt as ai tool (from ai_select free text / semicolon list)
# - coarser age groups
survey_model <- survey_all |>
filter(!is.na(so_visit)) |>
mutate(
so_visit = as.character(so_visit),
ai_select = as.character(ai_select),
# frequent so visitor: daily or multiple times per day etc.
frequent_so = dplyr::case_when(
stringr::str_detect(so_visit, regex("multiple times per day", ignore_case = TRUE)) ~ 1L,
stringr::str_detect(so_visit, regex("daily|almost every day", ignore_case = TRUE)) ~ 1L,
TRUE ~ 0L
),
uses_chatgpt = dplyr::case_when(
is.na(ai_select) ~ 0L,
stringr::str_detect(ai_select, regex("chatgpt", ignore_case = TRUE)) ~ 1L,
TRUE ~ 0L
),
age_group = dplyr::case_when(
!is.na(age) & age < 25 ~ "<25",
!is.na(age) & age >= 25 & age < 35 ~ "25-34",
!is.na(age) & age >= 35 & age < 45 ~ "35-44",
!is.na(age) & age >= 45 ~ "45+",
TRUE ~ "unknown"
),
gender = if_else(is.na(gender) | gender == "", "Unknown", gender)
) |>
filter(!is.na(frequent_so)) |>
mutate(
frequent_so = as.integer(frequent_so),
uses_chatgpt = as.integer(uses_chatgpt),
age_group = factor(age_group),
gender = factor(gender),
year = factor(year)
)
glimpse(survey_model)
# SECTION 2: data description + preliminary plots (dataset 1)
# basic time series plot of answers over time (for section 2)
p_answers_ts <- ggplot(answers_monthly, aes(x = month, y = answers_total)) +
geom_line() +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
geom_vline(xintercept = so_ai_policy_date, linetype = "dotted") +
labs(
title = "monthly new answers on stack overflow",
x = "month",
y = "number of answers"
)
print(p_answers_ts)
ggsave(
filename = file.path(imgs_dir, "01_answers_ts.png"),
plot = p_answers_ts,
width = 10, height = 6, units = "in", dpi = 300
)
# boxplot pre vs post chatgpt
p_box_pre_post <- ggplot(answers_monthly, aes(x = period, y = answers_total)) +
geom_boxplot() +
labs(
title = "distribution of monthly answers: pre vs post chatgpt launch",
x = "period",
y = "monthly answers"
)
print(p_box_pre_post)
ggsave(file.path(imgs_dir, "02_box_pre_post.png"), plot = p_box_pre_post, width = 8, height = 6, units = "in", dpi = 300)
# basic summary table
answers_summary_period <- answers_monthly |>
group_by(period) |>
summarise(
n_months = n(),
mean_answers = mean(answers_total),
median_answers = median(answers_total),
sd_answers = sd(answers_total),
min_answers = min(answers_total),
max_answers = max(answers_total),
.groups = "drop"
)
print(answers_summary_period)
# SECTION 3: exploratory analysis
# seasonal pattern: answers by calendar month across years
p_seasonal <- answers_monthly |>
mutate(month_label = factor(month_num, labels = month.abb)) |>
ggplot(aes(x = month_label, y = answers_total, group = year, color = period)) +
geom_line(alpha = 0.6) +
labs(
title = "seasonality of answers by calendar month and year",
x = "calendar month",
y = "monthly answers"
)
print(p_seasonal)
ggsave(file.path(imgs_dir, "03_seasonal.png"), plot = p_seasonal, width = 10, height = 6, units = "in", dpi = 300)
# rolling 3-month moving average to smooth noise
answers_monthly <- answers_monthly |>
arrange(month) |>
mutate(
answers_ma3 = zoo::rollmean(answers_total, k = 3, fill = NA, align = "right")
)
p_ma3 <- ggplot(answers_monthly, aes(x = month)) +
geom_line(aes(y = answers_total), alpha = 0.3) +
geom_line(aes(y = answers_ma3)) +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
labs(
title = "monthly answers with 3-month moving average",
x = "month",
y = "answers"
)
print(p_ma3)
ggsave(file.path(imgs_dir, "04_ma3.png"), plot = p_ma3, width = 10, height = 6, units = "in", dpi = 300)
# simple percentage change around chatgpt launch
pre_window <- answers_monthly |>
filter(
month >= chatgpt_launch_date - months(6),
month < chatgpt_launch_date
)
post_window <- answers_monthly |>
filter(
month >= chatgpt_launch_date,
month < chatgpt_launch_date + months(6)
)
pre_mean <- mean(pre_window$answers_total)
post_mean <- mean(post_window$answers_total)
pct_change <- (post_mean - pre_mean) / pre_mean * 100
pct_change
# survey exploratory: relation between ai usage and so visit frequency
survey_counts <- survey_model |>
mutate(
uses_chatgpt_label = if_else(uses_chatgpt == 1L, "uses chatgpt", "does not use chatgpt"),
freq_label = if_else(frequent_so == 1L, "visits so daily", "visits so less often")
) |>
count(year, uses_chatgpt_label, freq_label) |>
group_by(year, uses_chatgpt_label) |>
mutate(prop = n / sum(n)) |>
ungroup()
p_survey_bar <- ggplot(survey_counts, aes(x = uses_chatgpt_label, y = prop, fill = freq_label)) +
geom_col(position = "fill") +
facet_wrap(~year) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "relationship between chatgpt use and stack overflow visit frequency (survey)",
x = "ai usage segment",
y = "share of respondents",
fill = "so visit frequency"
)
print(p_survey_bar)
ggsave(file.path(imgs_dir, "08_survey_bar.png"), plot = p_survey_bar, width = 10, height = 6, units = "in", dpi = 300)
# SECTION 4: model development (four different model types)
# MODEL 1: interrupted time series linear regression
# outcome: monthly answers_total
# predictors: time trend, post_chatgpt level change, slope change after chatgpt
its_data <- answers_monthly |>
mutate(
time = time_index,
chatgpt_time = if_else(month >= chatgpt_launch_date,
time_index - min(time_index[month >= chatgpt_launch_date]) + 1L,
0L
)
)
model_lm <- lm(
answers_total ~ time + post_chatgpt + chatgpt_time,
data = its_data
)
summary(model_lm)
tidy(model_lm)
glance(model_lm)
# predictions and plot
its_data <- its_data |>
mutate(
lm_fitted = predict(model_lm)
)
p_lm_fit <- ggplot(its_data, aes(x = month)) +
geom_line(aes(y = answers_total), alpha = 0.4) +
geom_line(aes(y = lm_fitted), color = "blue") +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
labs(
title = "interrupted time series regression: observed vs fitted answers",
x = "month",
y = "answers"
)
print(p_lm_fit)
ggsave(file.path(imgs_dir, "05_lm_fit.png"), plot = p_lm_fit, width = 10, height = 6, units = "in", dpi = 300)
# MODEL 2: poisson regression for count data
model_pois <- glm(
answers_total ~ time + post_chatgpt + chatgpt_time,
data = its_data,
family = poisson(link = "log")
)
summary(model_pois)
tidy(model_pois, exponentiate = TRUE) # exp(coef) ~ multiplicative effect
# compare predicted counts
its_data <- its_data |>
mutate(
pois_fitted = predict(model_pois, type = "response")
)
p_pois_fit <- ggplot(its_data, aes(x = month)) +
geom_line(aes(y = answers_total), alpha = 0.3) +
geom_line(aes(y = pois_fitted), color = "red") +
geom_vline(xintercept = chatgpt_launch_date, linetype = "dashed") +
labs(
title = "poisson regression: observed vs predicted monthly answers",
x = "month",
y = "answers"
)
print(p_pois_fit)
ggsave(file.path(imgs_dir, "06_pois_fit.png"), plot = p_pois_fit, width = 10, height = 6, units = "in", dpi = 300)
# MODEL 3: arima time series forecast (pre-chatgpt vs actual)
# construct monthly ts object (frequency = 12)
start_year <- year(min(answers_monthly$month))
start_month <- month(min(answers_monthly$month))
answers_ts <- ts(
answers_monthly$answers_total,
start = c(start_year, start_month),
frequency = 12
)
# train on pre-chatgpt data (up to oct 2022) and forecast forward
train_end <- c(2022, 10) # october 2022
train_ts <- window(answers_ts, end = train_end)
test_ts <- window(answers_ts, start = c(2022, 11))
arima_fit <- auto.arima(train_ts)
summary(arima_fit)
h <- length(test_ts)
fc <- forecast(arima_fit, h = h)
# compare forecast vs actual on the holdout period
fc_df <- data.frame(
month = answers_monthly$month[answers_monthly$month >= as.Date("2022-11-01")],
actual = as.numeric(test_ts),
forecast = as.numeric(fc$mean),
lower_80 = as.numeric(fc$lower[, "80%"]),
upper_80 = as.numeric(fc$upper[, "80%"])
)
p_arima <- ggplot(fc_df, aes(x = month)) +
geom_line(aes(y = actual), alpha = 0.6) +
geom_line(aes(y = forecast), linetype = "dashed") +
geom_ribbon(aes(ymin = lower_80, ymax = upper_80), alpha = 0.2) +
labs(
title = "arima forecast (trained on pre-chatgpt) vs actual answers",
x = "month",
y = "answers"
)
print(p_arima)
ggsave(file.path(imgs_dir, "07_arima_forecast.png"), plot = p_arima, width = 10, height = 6, units = "in", dpi = 300)
# simple accuracy metrics on the holdout
fc_accuracy <- accuracy(fc, test_ts)
print(fc_accuracy)
# MODEL 4: logistic regression – does using chatgpt predict being a frequent stack overflow visitor?
set.seed(123)
survey_model_complete <- survey_model |>
filter(!is.na(uses_chatgpt), !is.na(frequent_so))
candidate_predictors <- c("uses_chatgpt", "age_group", "gender", "year")
valid_predictors <- candidate_predictors[sapply(
candidate_predictors,
function(col) dplyr::n_distinct(survey_model_complete[[col]], na.rm = TRUE) > 1
)]
drop_predictors <- setdiff(candidate_predictors, valid_predictors)
if (length(drop_predictors) > 0) {
message("dropping predictors with <2 levels: ", paste(drop_predictors, collapse = ", "))
}
logit_formula <- if (length(valid_predictors) == 0) {
frequent_so ~ 1
} else {
as.formula(paste("frequent_so ~", paste(valid_predictors, collapse = " + ")))
}
n <- nrow(survey_model_complete)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
survey_train <- survey_model_complete[train_idx, ]
survey_test <- survey_model_complete[-train_idx, ]
positive_rate <- mean(survey_train$frequent_so, na.rm = TRUE)
classification_threshold <- dplyr::case_when(
is.na(positive_rate) ~ 0.5,
positive_rate <= 0 ~ 0.5,
positive_rate >= 1 ~ 0.5,
TRUE ~ positive_rate
)
message(
"classification threshold (training frequent_so share): ",
round(classification_threshold, 3)
)
logit_model <- glm(
formula = logit_formula,
family = binomial(link = "logit"),
data = survey_train
)
summary(logit_model)
tidy(logit_model, exponentiate = TRUE, conf.int = TRUE)
# predict on test set
survey_test <- survey_test |>
mutate(
pred_prob = predict(logit_model, newdata = survey_test, type = "response"),
pred_class = if_else(pred_prob >= classification_threshold, 1L, 0L)
)
# confusion matrix and simple metrics
conf_mat <- table(
truth = factor(survey_test$frequent_so, levels = c(0, 1)),
pred = factor(survey_test$pred_class, levels = c(0, 1))
)
conf_mat
tp <- conf_mat["1", "1"]
tn <- conf_mat["0", "0"]
fp <- conf_mat["0", "1"]
fn <- conf_mat["1", "0"]
accuracy <- (tp + tn) / sum(conf_mat)
precision <- if ((tp + fp) > 0) tp / (tp + fp) else NA_real_
recall <- if ((tp + fn) > 0) tp / (tp + fn) else NA_real_
list(
accuracy = accuracy,
precision = precision,
recall = recall
)
# visual: predicted probability vs ai usage
p_logit_probs <- survey_test |>
mutate(uses_chatgpt_label = if_else(uses_chatgpt == 1L, "uses chatgpt", "does not use chatgpt")) |>
ggplot(aes(x = uses_chatgpt_label, y = pred_prob)) +
geom_boxplot() +
labs(
title = "predicted probability of being a frequent so visitor by chatgpt use",
x = "ai usage segment",
y = "predicted probability (logistic model)"
)
print(p_logit_probs)
ggsave(file.path(imgs_dir, "09_logit_probs.png"), plot = p_logit_probs, width = 8, height = 6, units = "in", dpi = 300)
# save key tables and model outputs to disk for report
write_csv(answers_monthly, file.path(data_dir, "answers_monthly_clean.csv"))
write_csv(answers_summary_period, file.path(data_dir, "answers_summary_period.csv"))
write_csv(survey_counts, file.path(data_dir, "survey_ai_vs_so_visit.csv"))
saveRDS(model_lm, file.path(data_dir, "model_lm_its.rds"))
saveRDS(model_pois, file.path(data_dir, "model_pois.rds"))
saveRDS(arima_fit, file.path(data_dir, "model_arima_prechatgpt.rds"))
saveRDS(logit_model, file.path(data_dir, "model_logit_survey.rds"))
9 binary image files removed (not shown; sizes 54–327 KiB).
-245
@@ -1,245 +0,0 @@
[Running] Rscript "/home/ion606/Desktop/Homework/Data Analytics/Assignment IV/analysis.r"
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
[1] "month" "status" "new_answers"
# A tibble: 6 × 3
month status new_answers
<date> <chr> <dbl>
1 2018-01-01 deleted 26
2 2018-01-01 non-deleted 159
3 2018-02-01 deleted 20
4 2018-02-01 non-deleted 175
5 2018-03-01 deleted 18
6 2018-03-01 non-deleted 193
Rows: 95
Columns: 11
$ month <date> 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01, 2…
$ answers_total <dbl> 185, 195, 211, 221, 227, 189, 149, 179, 198, 232, …
$ answers_non_deleted <dbl> 159, 175, 193, 191, 203, 172, 133, 154, 170, 198, …
$ answers_deleted <dbl> 26, 20, 18, 30, 24, 17, 16, 25, 28, 34, 20, 45, 33…
$ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 20…
$ month_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,…
$ time_index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ post_chatgpt <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ post_ai_policy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ during_mod_strike <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ period <chr> "pre_chatgpt", "pre_chatgpt", "pre_chatgpt", "pre_…
file already exists: data/stack-overflow-developer-survey-2023.zip
file already exists: data/stack-overflow-developer-survey-2024.zip
[1] "ResponseId" "Q120"
[3] "MainBranch" "Age"
[5] "Employment" "RemoteWork"
[7] "CodingActivities" "EdLevel"
[9] "LearnCode" "LearnCodeOnline"
[11] "LearnCodeCoursesCert" "YearsCode"
[13] "YearsCodePro" "DevType"
[15] "OrgSize" "PurchaseInfluence"
[17] "TechList" "BuyNewTool"
[19] "Country" "Currency"
[21] "CompTotal" "LanguageHaveWorkedWith"
[23] "LanguageWantToWorkWith" "DatabaseHaveWorkedWith"
[25] "DatabaseWantToWorkWith" "PlatformHaveWorkedWith"
[27] "PlatformWantToWorkWith" "WebframeHaveWorkedWith"
[29] "WebframeWantToWorkWith" "MiscTechHaveWorkedWith"
[31] "MiscTechWantToWorkWith" "ToolsTechHaveWorkedWith"
[33] "ToolsTechWantToWorkWith" "NEWCollabToolsHaveWorkedWith"
[35] "NEWCollabToolsWantToWorkWith" "OpSysPersonal use"
[37] "OpSysProfessional use" "OfficeStackAsyncHaveWorkedWith"
[39] "OfficeStackAsyncWantToWorkWith" "OfficeStackSyncHaveWorkedWith"
[41] "OfficeStackSyncWantToWorkWith" "AISearchHaveWorkedWith"
[43] "AISearchWantToWorkWith" "AIDevHaveWorkedWith"
[45] "AIDevWantToWorkWith" "NEWSOSites"
[47] "SOVisitFreq" "SOAccount"
[49] "SOPartFreq" "SOComm"
[51] "SOAI" "AISelect"
[53] "AISent" "AIAcc"
[55] "AIBen" "AIToolInterested in Using"
[57] "AIToolCurrently Using" "AIToolNot interested in Using"
[59] "AINextVery different" "AINextNeither different nor similar"
[61] "AINextSomewhat similar" "AINextVery similar"
[63] "AINextSomewhat different" "TBranch"
[65] "ICorPM" "WorkExp"
[67] "Knowledge_1" "Knowledge_2"
[69] "Knowledge_3" "Knowledge_4"
[71] "Knowledge_5" "Knowledge_6"
[73] "Knowledge_7" "Knowledge_8"
[75] "Frequency_1" "Frequency_2"
[77] "Frequency_3" "TimeSearching"
[79] "TimeAnswering" "ProfessionalTech"
2023 so visit col: SOVisitFreq
2023 ai col : SOAI
2024 so visit col: SOVisitFreq
2024 ai col : AISelect
Rows: 146,676
Columns: 10
$ year <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
$ main_branch <chr> "I am a developer by profession", "I am a developer by pr…
$ country <chr> "United States of America", "United States of America", "…
$ age <dbl> 25, 45, 25, 25, 35, 35, 25, 45, 25, 25, 25, 25, 35, 25, 3…
$ gender <fct> Unknown, Unknown, Unknown, Unknown, Unknown, Unknown, Unk…
$ so_visit <chr> "Daily or almost daily", "A few times per month or weekly…
$ ai_select <chr> "I don't think it's super necessary, but I think improvin…
$ frequent_so <int> 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
$ uses_chatgpt <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, …
$ age_group <fct> 25-34, 45+, 25-34, 25-34, 35-44, 35-44, 25-34, 45+, 25-34…
# A tibble: 2 × 7
period n_months mean_answers median_answers sd_answers min_answers max_answers
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 post_… 36 90.5 88 38.0 11 157
2 pre_c… 59 193. 185 44.7 122 313
Warning message:
Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning message:
Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
[1] -10.02227
Call:
lm(formula = answers_total ~ time + post_chatgpt + chatgpt_time,
data = its_data)
Residuals:
Min 1Q Median 3Q Max
-76.623 -22.914 -3.868 13.431 123.402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 218.8013 9.3214 23.473 < 2e-16 ***
time -0.8589 0.2702 -3.179 0.002022 **
post_chatgptTRUE -17.9635 15.0779 -1.191 0.236601
chatgpt_time -2.3661 0.6282 -3.767 0.000293 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 35.35 on 91 degrees of freedom
Multiple R-squared: 0.717, Adjusted R-squared: 0.7077
F-statistic: 76.86 on 3 and 91 DF, p-value: < 2.2e-16
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 219. 9.32 23.5 2.23e-40
2 time -0.859 0.270 -3.18 2.02e- 3
3 post_chatgptTRUE -18.0 15.1 -1.19 2.37e- 1
4 chatgpt_time -2.37 0.628 -3.77 2.93e- 4
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.717 0.708 35.3 76.9 7.39e-25 3 -471. 953. 966.
# 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Call:
glm(formula = answers_total ~ time + post_chatgpt + chatgpt_time,
family = poisson(link = "log"), data = its_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.3936301 0.0183909 293.277 < 2e-16 ***
time -0.0044547 0.0005512 -8.082 6.38e-16 ***
post_chatgptTRUE -0.0187737 0.0365851 -0.513 0.608
chatgpt_time -0.0322028 0.0018440 -17.464 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2879.9 on 94 degrees of freedom
Residual deviance: 713.8 on 91 degrees of freedom
AIC: 1363
Number of Fisher Scoring iterations: 4
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 220. 0.0184 293. 0
2 time 0.996 0.000551 -8.08 6.38e-16
3 post_chatgptTRUE 0.981 0.0366 -0.513 6.08e- 1
4 chatgpt_time 0.968 0.00184 -17.5 2.71e-68
Series: train_ts
ARIMA(1,1,0)(1,0,0)[12]
Coefficients:
ar1 sar1
-0.3956 0.3016
s.e. 0.1360 0.1381
sigma^2 = 1142: log likelihood = -281.17
AIC=568.34 AICc=568.8 BIC=574.47
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.1691686 32.90678 26.65938 -1.989033 14.30025 0.5170032
ACF1
Training set 0.03124461
ME RMSE MAE MPE MAPE MASE
Training set -0.1691686 32.90678 26.65938 -1.989033 14.30025 0.5170032
Test set -78.4100374 89.26691 79.26493 -171.518981 171.98870 1.5371782
ACF1 Theil's U
Training set 0.03124461 NA
Test set 0.73383075 7.11443
dropping predictors with <2 levels: gender
classification threshold (training frequent_so share): 0.384
Call:
glm(formula = logit_formula, family = binomial(link = "logit"),
data = survey_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.358743 0.013009 -27.577 < 2e-16 ***
uses_chatgpt -0.006783 0.066977 -0.101 0.91933
age_group25-34 0.040677 0.015439 2.635 0.00842 **
age_group35-44 -0.207571 0.017478 -11.876 < 2e-16 ***
age_group45+ -0.345739 0.020289 -17.041 < 2e-16 ***
age_groupunknown -0.222739 0.096177 -2.316 0.02056 *
year2024 -0.082452 0.012319 -6.693 2.18e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 156271 on 117339 degrees of freedom
Residual deviance: 155647 on 117333 degrees of freedom
AIC: 155661
Number of Fisher Scoring iterations: 4
# A tibble: 7 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.699 0.0130 -27.6 2.10e-167 0.681 0.717
2 uses_chatgpt 0.993 0.0670 -0.101 9.19e- 1 0.870 1.13
3 age_group25-34 1.04 0.0154 2.63 8.42e- 3 1.01 1.07
4 age_group35-44 0.813 0.0175 -11.9 1.57e- 32 0.785 0.841
5 age_group45+ 0.708 0.0203 -17.0 4.09e- 65 0.680 0.736
6 age_groupunknown 0.800 0.0962 -2.32 2.06e- 2 0.662 0.965
7 year2024 0.921 0.0123 -6.69 2.18e- 11 0.899 0.943
pred
truth 0 1
0 7560 10549
1 4009 7218
$accuracy
[1] 0.5037497
$precision
[1] 0.4062588
$recall
[1] 0.6429144
[Done] exited with code=0 in 12.272 seconds
-13
@@ -1,13 +0,0 @@
6. Oral Presentation (5%). Plan for a ~5 minute presentation; slides must cover the
following:
a). Title (with your name)
b). Problem area – what you wanted to explore/solve/predict and why, and what you wanted to predict?
c). The data – where it came from, why it was applicable and the preliminary assessments you made.
d). How you conducted your analysis: distribution, pattern/ relationship and model
construction. What techniques did you use/ not use and why? What worked? What did not
work? How did you apply the model? How did you optimize, account for uncertainties?
f). What did you predict and what decisions (prescriptions) were possible. What was the
outcome, conclusions?
Binary file not shown.
+164
@@ -0,0 +1,164 @@
I used Manhattan (BOROUGH code = 1) for Question 1 and Brooklyn (BOROUGH code = 3) for Question 2.
NYC Dept. of Finance / BBL conventions:
- 1 = Manhattan
- 2 = Bronx
- 3 = Brooklyn
- 4 = Queens
- 5 = Staten Island
The dataset itself is NYC's annualized file of residential property sales across all five boroughs.
## Loading and Cleaning the Data
- `manhattan_clean` ended up with **6,313** sales.
- `brooklyn_clean` ended up with **40,921** sales.
---
# 1. One-borough analysis (Manhattan, BOROUGH = 1)
---
## 1(a) Patterns, trends, and modeling plan
For Manhattan, I'm interested first in how **SALE PRICE** is distributed and how it relates to building characteristics like **GROSS SQUARE FEET**, **LAND SQUARE FEET**, **YEAR BUILT**, and **unit counts**. Because this is a citywide residential sales file, I expect the price distribution to be extremely right-skewed, with a small number of ultra-expensive transactions and many more moderately priced ones.
I'd start with **univariate** distributions of price, square footage, and year built, and then move to **bivariate** relationships (scatter plots of price vs. size, boxplots of price by neighborhood) and **correlation matrices**. For modeling, I'd use **log-transformed sale price** as the response to stabilize variance and compare a baseline **linear regression** to a nonlinear **Random Forest** regressor. Finally, I'd treat **NEIGHBORHOOD** as a label and build supervised classifiers (kNN, Random Forest, logistic regression) using quantitative features (price, areas, units, year) to see how well location can be inferred from physical characteristics and price alone.
---
## 1(b) Exploratory Data Analysis and Outliers
### Findings
1. **Distribution of Sale Price**
For Manhattan, after cleaning, `SALE PRICE` is extremely right-skewed:
- Count is approx **6,313**.
- Median is approx **$3.86M**.
- Mean is approx **$15.7M**.
- 75th percentile is approx **$9.25M**.
- 90th percentile is approx **$21.8M**.
- 99th percentile is approx **$228.9M**.
- Maximum is approx **$2.40B**.
That huge gap between the median and max confirms a heavy upper tail.
2. **Outliers**
Using the 1.5×IQR rule, the upper outlier threshold is around **$21.1M**; there are about **650** outlier sales above this bound. The most extreme sale (approx. $2.4B) is orders of magnitude larger than a typical sale and shows up as a solitary point far above the rest in the boxplot. On the lower side, I also saw many sales at or near zero in the raw data, which is why I filtered out sales ≤ $10,000 as likely non-arms-length transactions or data errors. A sketch of this screen appears after this list.
3. **Effect of log transformation.**
When I plotted a histogram of `log(1 + SALE PRICE)`, the distribution became much closer to symmetric: the bulk of log-prices fell roughly between 14 and 17 (about $1.2M to $24M), with a long but much more manageable upper tail. This supported using the log scale for regression.
4. **Relationships with size & other features.**
The correlation of `SALE PRICE` with **GROSS SQUARE FEET** was about **0.49**, substantially higher than any other feature. Correlations with **COMMERCIAL UNITS**, **LAND SQUARE FEET**, and **TOTAL UNITS** were modestly positive (~0.16–0.21), and the correlation with **YEAR BUILT** was essentially zero. This suggests that building size is the main driver captured in these quantitative covariates, while age and unit counts are relatively weak predictors by themselves.
5. **Scatter plots & heteroskedasticity.**
The scatter plot of price vs. gross square feet (with a log scale on price) showed a clear upward trend but with wide vertical spread, especially for larger buildings. High-end buildings with similar square footage can sell at very different prices, which is consistent with neighborhood effects, building quality, and other unobserved characteristics. Overall, the EDA shows strong skewness, many high-end outliers, and a moderate but noisy link between size and price.
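The Assignment V code is not included in this commit, so the following is only a minimal R sketch of the outlier screen and log transform described above; the data frame `manhattan_clean` and the snake_case column name `sale_price` are illustrative assumptions, not the actual implementation.

```r
# Hedged sketch (illustrative names): 1.5 x IQR upper bound and log transform on Manhattan sales
q           <- quantile(manhattan_clean$sale_price, c(0.25, 0.75), na.rm = TRUE)
upper_bound <- q[2] + 1.5 * (q[2] - q[1])                                   # report: ~ $21.1M
n_outliers  <- sum(manhattan_clean$sale_price > upper_bound, na.rm = TRUE)  # report: ~ 650 sales
manhattan_clean$log_price <- log1p(manhattan_clean$sale_price)              # log(1 + SALE PRICE)
```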
---
## 1(c) Regression analysis to predict sale price
### Findings
1. **Cleaning for regression.**
In addition to the general cleaning above, I required that observations have non-missing **GROSS SQUARE FEET**, **LAND SQUARE FEET**, **YEAR BUILT**, and unit counts, and that the areas be strictly positive. I also removed sales at or below $10,000 as non-market outliers. I did **not** remove very high prices; instead I relied on the log transformation and the tree-based model to reduce their influence.
2. **Linear regression (baseline).**
The standardized linear model on log-price achieved only about **R² ≈ 0.13** on the test set, with RMSE ≈ **1.71** and MAE ≈ **1.24** in log units. This means a purely linear relationship between size, units, year built, and log price is a poor approximation, which is unsurprising given the complexity of Manhattan's housing market and the missing effects of location and building quality.
3. **Random Forest regression (nonlinear).**
The Random Forest model performed much better, with **R² ≈ 0.75**, RMSE ≈ **0.92**, and MAE ≈ **0.59** on log-price. Interpreting roughly, an MAE of 0.59 in log units corresponds to prediction errors on the order of ±80% in price (because exp(0.59) ≈ 1.8), which is not great for individual deals but decent for a coarse citywide model based only on a few structural features. A sketch of this comparison appears after this list.
4. **Interpretation of predictors.**
Based on the earlier correlations and typical real-estate patterns, most of the predictive power comes from **GROSS SQUARE FEET**, with **LAND SQUARE FEET** and unit counts adding secondary information. Year built contributes little signal by itself, consistent with its near-zero correlation with sale price and the fact that historic vs. modern buildings can command premiums or discounts depending on context.
5. **Model choice.**
Because the Random Forest explains substantially more variance in log price and is more robust to nonlinearity and heteroskedasticity than linear regression, I treated it as the **best Manhattan regression model** and used it as the model to generalize to Brooklyn in Question 2.
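A minimal R sketch of the baseline-vs-Random-Forest comparison described above, assuming the illustrative `manhattan_clean` frame and `log_price` column from the previous sketch and the `randomForest` package; the reported metrics come from the original analysis, not from this code.

```r
# Hedged sketch (illustrative names): linear baseline vs. random forest on log price
library(randomForest)

features <- c("gross_square_feet", "land_square_feet", "year_built",
              "residential_units", "commercial_units", "total_units")
set.seed(42)
idx   <- sample(seq_len(nrow(manhattan_clean)), size = floor(0.8 * nrow(manhattan_clean)))
train <- manhattan_clean[idx, ]
test  <- manhattan_clean[-idx, ]

fmla   <- reformulate(features, response = "log_price")
lm_fit <- lm(fmla, data = train)
rf_fit <- randomForest(fmla, data = train, ntree = 500)

r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
r2(test$log_price, predict(lm_fit, newdata = test))   # linear baseline (report: ~0.13)
r2(test$log_price, predict(rf_fit, newdata = test))   # random forest   (report: ~0.75)
```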
---
## 1(d) Classification: predicting neighborhood from quantitative variables
### Findings
1. **Cleaning for classification.**
I restricted to Manhattan neighborhoods with at least **100** sales to avoid tiny classes. That left **5,476** observations and **23 neighborhoods** (e.g., Midtown West, multiple Upper East/West Side segments, several Harlem neighborhoods, Chelsea, Lower East Side, etc.). I also dropped any rows with missing numeric features.
2. **kNN classifier.**
With standardized features and k = 7, kNN achieved **accuracy ≈ 0.50**, macro **F1 ≈ 0.45**, and weighted **F1 ≈ 0.50**. The confusion matrix showed that it often confused nearby or similar neighborhoods (e.g., different segments of the Upper East/West Side), which is intuitive because those areas have similar price/size profiles. A sketch of the classifiers appears after this list.
3. **Random Forest classifier (best).**
The Random Forest neighborhood classifier performed best, with **accuracy ≈ 0.61**, macro **precision ≈ 0.59**, macro **recall ≈ 0.55**, and macro **F1 ≈ 0.57** (weighted F1 ≈ 0.61). The confusion matrix had a reasonably strong diagonal for major neighborhoods like Midtown West and Central Harlem, though there were still frequent misclassifications between adjacent, similar market segments.
4. **Logistic regression.**
Multinomial logistic regression performed poorly on this feature set, with **accuracy ≈ 0.27** and macro **F1 ≈ 0.12**. This suggests that the decision boundaries between neighborhoods in this feature space are highly nonlinear, and a simple linear model in the original feature space is not expressive enough.
5. **Overall assessment.**
Even the best classifier (Random Forest) makes many mistakes, which is expected: we are trying to reconstruct a very fine-grained location label (neighborhood) from crude variables (square feet, units, year, plus price) and ignoring explicit spatial coordinates. The contingency tables show that neighborhoods with similar densities and price levels systematically get confused, highlighting the limits of using only structural attributes and price to infer location.
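For reference, the macro vs. weighted F1 values quoted in point 3 can be illustrated with a small, purely hypothetical example (the labels below are invented for illustration and are not taken from the data):

```python
from sklearn.metrics import f1_score

# hypothetical predictions: one dominant class, one small class missed entirely
y_true = ["MIDTOWN WEST"] * 8 + ["CHINATOWN"] * 2 + ["SOHO"] * 2
y_pred = ["MIDTOWN WEST"] * 10 + ["SOHO"] * 2

# macro F1 averages per-class F1 equally, so the missed small class drags it down;
# weighted F1 weights classes by frequency, so the dominant class masks the miss
print("macro F1:   ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted", zero_division=0))
```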
---
# 2. Second borough (Brooklyn, BOROUGH = 3)
For Question 2, I used **Brooklyn** as the second borough, cleaned with the same logic as Manhattan (BOROUGH=3, same min price, same handling of missing areas and year built).
---
## 2(a) Applying the Manhattan regression model to Brooklyn
### Findings
1. **Performance metrics.**
When the Manhattan Random Forest regression model was applied directly to the cleaned Brooklyn data (no retraining), I got:
- **R² is approx -0.77** on log-price.
- **RMSE is approx 1.21**.
- **MAE is approx 0.95**.
The negative R² means the model does **worse** than simply predicting the mean log price for every Brooklyn sale.
2. **Predicted vs actual plot.**
The predicted vs. actual log-price scatter for Brooklyn is very diffuse and does not cluster around the 45° line. The model tends to **systematically mis-scale Brooklyn prices**: for some segments it overpredicts (especially cheaper properties) and for more expensive Brooklyn neighborhoods it underpredicts compared to their actual sale prices.
3. **Residual diagnostics.**
The residual-vs-prediction plot shows large, structured patterns rather than random noise; the mean residual is substantially negative (indicating systematic underprediction on the log scale). This indicates poor generalization: the relationships between square footage, units, year built and price that the model learned in Manhattan do not transfer well to Brooklyn.
4. **Interpretation.**
Brooklyn has a very different mix of housing types, neighborhood price levels, and land availability compared to Manhattan. Without explicit location variables and more detailed building characteristics, a model calibrated on Manhattan cannot capture Brooklyn's pricing structure, so it generalizes poorly even though the algorithms are fairly powerful.
---
## 2(b) Applying Manhattan neighborhood classifiers to Brooklyn
### Findings
1. **Label-space mismatch.**
The classifiers trained on Manhattan were all trained to predict **Manhattan neighborhoods** (e.g., “MIDTOWN WEST”, “UPPER EAST SIDE (79-96)”) as labels. When evaluated on Brooklyn, the **true labels** are Brooklyn neighborhoods (“BAY RIDGE”, “WILLIAMSBURG”, etc.), which **do not overlap at all** with the Manhattan label set. As a result, the Manhattan models will *never* predict the correct Brooklyn neighborhood name.
2. **Metrics.**
As expected, for all three models (kNN, Random Forest, logistic regression), I got **accuracy approx 0.0** and macro/weighted **F1 approx 0.0** when using Brooklyn neighborhood names as the true labels. In other words, the models made zero correct predictions across all observations (a tiny sketch after this list shows why this holds by construction).
3. **Contingency tables.**
The resulting “confusion matrices” are degenerate: for each Brooklyn neighborhood, all counts are off-diagonal because the classifier is outputting Manhattan labels that never equal the Brooklyn labels on the y-axis. This still technically produces a contingency table, but it visually demonstrates that the model is fundamentally mis-specified for this task when moved to a different borough.
4. **Interpretation.**
This exercise highlights a key point: **classification models can't generalize across domains when the label space itself changes**. Because Brooklyn neighborhoods are a completely different set of categories, a Manhattan neighborhood classifier cannot be expected to perform well without re-training on Brooklyn labels. At best, you might interpret the predictions as a “closest Manhattan analog,” but they have no predictive validity for the true Brooklyn neighborhood names.
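A minimal sketch of the point above, using invented labels, shows that accuracy is zero by construction whenever the predicted and true label sets are disjoint:

```python
from sklearn.metrics import accuracy_score

# hypothetical labels: the classifier can only emit Manhattan neighborhoods,
# while the ground truth contains only Brooklyn neighborhoods
y_true = ["BAY RIDGE", "WILLIAMSBURG", "BAY RIDGE"]
y_pred = ["MIDTOWN WEST", "SOHO", "UPPER EAST SIDE (79-96)"]

# no predicted label can ever match a true label, so accuracy is exactly 0
print(accuracy_score(y_true, y_pred))  # 0.0
```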
---
## 2(c) General observations & confidence
The datasets for Manhattan and Brooklyn are both large but quite **noisy**, with many missing or inconsistent values for square footage and units and with a lot of non-arms-length sales at very low or zero prices. Even after cleaning, high-end outliers still exert influence and reflect market segments (e.g., trophy assets) that behave differently from the bulk of the distribution. Across both boroughs, models based only on size, units, and age of the building capture some signal but miss many important drivers like exact location, building quality, and amenities. My confidence in the **relative** conclusions (e.g., Random Forest beats linear regression; Manhattan-trained models generalize poorly to Brooklyn) is high, but I would not rely on these models for **precise valuation** of individual properties.
---
# 3. 6000-level: Conclusions about model types & suitability
Across this study, **nonlinear tree-based models (Random Forests)** consistently outperformed simpler linear models for both regression and classification. In the Manhattan regression, linear regression on log-price captured only a small fraction of the variance, while the Random Forest achieved a much higher R² and more realistic error levels, indicating that the relationship between size/units and price is nonlinear, with interactions and thresholds that linear models cannot represent. For classification, the Random Forest neighborhood model again beat kNN and especially logistic regression, suggesting that neighborhood decision boundaries in this feature space are complex and benefit from hierarchical splits rather than a single global linear separator.
However, these stronger models are still limited by **feature quality and domain shift**. When the Manhattan regression model is applied to Brooklyn, its performance collapses (negative R²), showing that even a flexible model trained on one borough's distribution cannot simply be transplanted to another borough with different price levels and housing stock. The classification models fail even more dramatically when moved across boroughs, because the label space itself changes; this is a reminder that good predictive performance is inherently **domain-specific** and that models must be re-trained or at least adapted when the domain or label space shifts.
Methodologically, what “worked” was combining **sensible cleaning (dropping non-arms-length sales, fixing square-footage fields, filtering small classes), log transformations, and nonlinear models**; what did not work was assuming that limited structural variables alone could fully explain prices or that a model trained on Manhattan would generalize to Brooklyn without explicit location features. In a production setting, I would layer on richer covariates (latitude/longitude, transit accessibility, building quality proxies, zoning, etc.) and likely move to gradient-boosted trees or other ensemble methods (sketched below), but the main lessons about nonlinearity, outliers, and domain dependence would remain the same.
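As a hedged illustration of that last point, a gradient-boosted regressor could be swapped in with essentially the same interface as the random forest used here. The sketch below assumes a cleaned dataframe shaped like the Manhattan/Brooklyn tables built in the accompanying scripts; it was not run as part of this assignment:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

REG_VARS = ["land_sqft", "gross_sqft", "year_built",
            "res_units", "comm_units", "total_units"]

def fit_gbt(df: pd.DataFrame) -> HistGradientBoostingRegressor:
    """Fit a gradient-boosted regressor on log1p(sale_price); sketch only."""
    data = df[REG_VARS + ["sale_price"]].dropna()
    X, y = data[REG_VARS], np.log1p(data["sale_price"])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
    model = HistGradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)
    print("held-out R^2:", r2_score(y_te, model.predict(X_te)))
    return model
```

Location covariates (e.g., latitude/longitude), once available, would simply be appended to `REG_VARS`; the lessons about domain shift would still apply.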
+135
View File
@@ -0,0 +1,135 @@
## 1. One-Borough Analysis: Manhattan
I filtered the dataset to Manhattan by only keeping rows where `BOROUGH == 1`. After cleaning (parsing numeric fields, dropping implausible or missing values, etc.), I had 7,294 usable sales out of 96,088 raw Manhattan rows, along with 42,743 cleaned Brooklyn rows for later comparison.
### 1(a) Planned patterns, trends, and modeling approach
For Manhattan, the overall big-picture trends I wanted to look at were: how sale price changes with building size (gross and land square footage), intensity of use (number of residential and commercial units), and building age (year built). Because NYC housing prices are known to be highly skewed with extreme luxury outliers - particularly in Manhattan's condo and co-op markets - I wanted to examine both raw prices and log-transformed prices to better see the central bulk of transactions.
I also expected nonlinear relationships, such as price per square foot decreasing for very large buildings, and thresholds beyond which additional units change value in a non-additive way. To detect these, I compared a simple linear regression on log sale price against a more flexible model, a random forest regression, which is well suited to modeling nonlinear relationships and interactions among predictors.
For the classification, I treated "neighborhood" as the target and price/size variables as predictors to see whether basic quantitative attributes are enough to distinguish markets like "Upper East Side", "Harlem", and "SoHo". Then I compared three supervised models: k-NN, random forest, and multinomial logistic regression, using accuracy and macro-F1 as evaluation metrics.
---
### 1(b) Exploratory data analysis and outliers
I first cleaned the raw NYC file by converting the strings to numeric for `SALE PRICE`, `LAND SQUARE FEET`, `GROSS SQUARE FEET`, and `YEAR BUILT`, dropping obvious invalid numeric codes (e.g., “0”, “.”), removing very small sales below $10,000 to exclude non-arms-length transactions and recording artifacts, treating years before 1800 as missing, and requiring positive, non-missing values for land area, gross area, and year built. I set the remaining missing unit counts to 0 for residential, commercial, and total units.
In Manhattan, sale prices are extremely right-skewed: the minimum valid sale is about $10,050, the median $4.0M, the mean $16.3M, the 75th percentile $9.58M, and the maximum an enormous $2.40B. This means a relatively small number of ultra-high-value transactions pull the mean far above the median, which is fairly typical of high-end markets like Manhattan.
Using the quartiles, I then created a rule for outliers based on the inter-quartile range: with Q1 approx $1.37M and Q3 approx $9.58M, the IQR is about $8.21M, so the upper “fence” is Q3 + 1.5 \* IQR approx $21.9M. Any sale above that is classified as an outlier; this 1.5 \* IQR rule is the usual box-plot definition of outliers. With that rule I found 756 outliers in Manhattan, while the maximum sale price exceeds $2.39B.
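For concreteness, the fence described above is only a couple of lines in pandas; the `sale_price` column name matches the cleaned field used throughout, and the commented usage assumes the Manhattan dataframe built in the accompanying scripts:

```python
import pandas as pd

def iqr_upper_fence(prices: pd.Series) -> float:
    """Return the usual box-plot upper fence, Q3 + 1.5 * IQR."""
    q1, q3 = prices.quantile([0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)

# usage sketch:
# upper = iqr_upper_fence(manhattan["sale_price"])      # ~ $21.9M for Manhattan
# n_outliers = (manhattan["sale_price"] > upper).sum()  # 756 in this run
```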
The histogram of raw sale prices, Figure 1, has almost all sales piled up near the left-hand side, with a long low-frequency tail extending out toward billions of dollars; this visually confirms heavy skew and extreme outliers.
![Figure 1: manhattan sale price distribution (raw) ](./plots/manhattan_hist_raw.png)
Once I applied `log1p(sale_price)`, the histogram on the log scale, Fig. 2, is much more symmetric and interpretable, compressing the ultra-luxury outliers into a reasonable range while still preserving their relative ordering.
Figure 2: manhattan sale price distribution (log scale)
![Figure 2](./plots/manhattan_hist_log.png)
Figure 3: manhattan sale price with outliers.
![Figure 3](./plots/manhattan_box.png)
Finally, the scatterplot of gross square feet vs. sale price (with price on a log10 scale; Figure 4) displays a strong positive association (larger buildings sell for more) but also a lot of vertical spread, indicating that factors beyond size drive price differences as well, namely location, building class, and quality. This is supported by the correlation analysis: sale price is moderately correlated with gross square feet (about 0.49), weakly with land area (about 0.16) and total units (about 0.17), and almost uncorrelated with year built (about 0.02).
Below is the scatterplot of sale price vs. gross square feet for Manhattan.
Figure 4: manhattan sale price vs gross square feet
![Figure 4](./plots/manhattan_scatter.png)
---
### 1(c) Regression analysis for sale price
I then created a modeling dataset with predictors for regression:
* `land_sqft`, `gross_sqft`,
* `year_built`,
`res_units`, `comm_units`, and `total_units`
and the target `log_price = log1p(sale_price)`, removing any rows with missing values in those fields. I then split the data into a 75% training set and 25% test set using `createDataPartition` to preserve the distribution of `log_price`.
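Because `createDataPartition` (from caret) stratifies a continuous outcome by binning it, a rough Python analog of the 75/25 split would bin `log_price` into quantiles and stratify on those bins. This is an assumed equivalent for illustration, not the exact caret procedure:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_like_caret(df: pd.DataFrame, target: str = "log_price",
                     test_size: float = 0.25, n_bins: int = 10):
    """75/25 split stratified on quantile bins of the target (caret-like sketch)."""
    bins = pd.qcut(df[target], q=n_bins, labels=False, duplicates="drop")
    return train_test_split(df, test_size=test_size, stratify=bins, random_state=42)

# usage sketch (assumes reg_df already has a log_price column, as in the report):
# train_reg, test_reg = split_like_caret(reg_df)
```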
The baseline model was a multiple linear regression of `log_price` on all six predictors. On the held-out Manhattan test set this model achieved approximately R² = 0.19, RMSE = 1.69, and MAE = 1.26 on the log scale, meaning it explains only about 19% of the variance in log sale price and leaves relatively large residual errors. This suggests that linear relationships in these basic structural variables alone are not sufficient to capture the complexity of Manhattan real estate pricing.
I then trained a random forest regressor on the same predictors and target. The random forest model did considerably better on the Manhattan test set, with approximately R² = 0.70, RMSE = 1.03, and MAE = 0.66 on log price, more than tripling the explained variance relative to the linear model. This performance gain is consistent with the idea that random forests are effective at modeling nonlinear relationships and interactions without requiring me to specify them by hand.
The predicted-vs-actual plot for Manhattan's random forest (Figure 5) has points clustered around the 45° line, particularly in the log-price mid-range, indicating generally accurate predictions, though with more scatter at the extremes.
Figure 5: random forest predicted vs actual log(1 + sale price) (manhattan)
![Figure 5](./plots/rf_pred_vs_actual_manhattan.png)
Figure 6: random forest residuals (manhattan)
![Figure 6](./plots/rf_resid_manhattan.png)
Overall, following fairly aggressive cleaning (dropping non-arm's-length sales, invalid years, and missing areas) and a log transformation, the random forest regression yields a reasonably strong model for Manhattan sale prices, whereas the linear model substantially underfits.
---
### 1(d) Classification: predicting neighborhood in Manhattan
For neighborhood classification I created a subset in which `neighborhood` is not missing and only neighborhoods with 100 or more sales are retained in order to avoid tiny, noisy classes. After this filtering I had 6,792 records spanning 28 neighborhoods. I then selected the quantitative predictors
* `sale_price`, `land_sqft`, `gross_sqft`,
* `year_built`, `res_units`, `comm_units`, `total_units`
and dropped any rows with remaining missing values.
First, I fit a k-NN classifier with k = 7 and standardization (center/scale) applied to all predictors. On the Manhattan test set, k-NN achieved an accuracy of about 0.48 and macro-F1 of about 0.42, indicating that it correctly predicts the neighborhood for just under half of the held-out sales and gives moderate average F1 across classes. Next, I trained a random forest classifier using the same predictors with 300 trees and `mtry = 3`. This performed the best of the three, with accuracy of about 0.58 and macro-F1 of about 0.53.
Lastly, I fit a multinomial logistic regression model, whose performance, despite its interpretability, was substantially worse (accuracy about 0.25 and macro-F1 about 0.30); linear decision boundaries in this feature space are evidently not expressive enough to separate 28 neighborhoods. The confusion matrix of the random forest (a 28 \* 28 table) reveals the most common confusions: geographically close or socio-economically similar neighborhoods, like different slices of the Upper East/West Side or adjacent parts of Harlem, which makes sense given that their price and size profiles overlap heavily.
This cleaning for the classification task was meant to remove neighborhoods with too few samples that would make F1 metrics unstable, enforce complete predictor data, and restrict attention to clearly labeled neighborhoods. Even after cleaning, the modest accuracy and macro-F1 remind me that neighborhood captures many qualitative aspects (exact location, school zones, views, amenities) that are not fully represented by simple size and unit counts.
---
## 2. Second-borough analysis: Brooklyn, using Manhattan models
For this second derived dataset I focused on Brooklyn (`BOROUGH == 3`). I repeated all of the same cleaning and feature engineering steps: numeric parsing, filtering on sale price and physical characteristics, and construction of the same predictor variables (`land_sqft`, `gross_sqft`, `year_built`, `res_units`, `comm_units`, `total_units`, `sale_price`). This yielded 42,743 cleaned Brooklyn records, a far larger sample than Manhattan's 7,294 cleaned rows.
### 2(a) Applying Manhattan regression to Brooklyn
I tested how well the Manhattan-trained random forest regressor generalizes by applying it directly to the Brooklyn dataset. On Brooklyn, the model's performance fell to approximately R² = 0.24, RMSE = 1.31, and MAE = 1.08 on log price, far worse than its performance of R² = 0.70 and MAE = 0.66 on Manhattan.
The predicted-vs-actual plot for Brooklyn (Figure 7) displays a broad cloud of points with a strong linear trend but much more scatter around the 45° line than in Manhattan, which suggests that the model systematically under- and over-predicts across different price ranges.
![Figure 7: random forest manhattan model on brooklyn (predicted vs actual log price)](./plots/rf_pred_vs_actual_brooklyn.png)
The residual plot in Figure 8 shows residuals that are not symmetrically centered on zero; certain ranges of predicted price produce clusters of large negative residuals, indicating consistent over-prediction in parts of the Brooklyn market.
![Figure 8: residuals on brooklyn using manhattan random forest](./plots/rf_resid_brooklyn.png)
This poor generalization makes sense: Manhattan and Brooklyn have different mixes of property types (e.g., ultra-luxury high-rise condos vs. more rowhouses and small multi-family homes) and different price levels, even as Brooklyn has become increasingly expensive. A model trained only on Manhattan data learns relationships calibrated to its particular price structure and building stock, so when it is applied to Brooklyn it picks up some broad patterns (e.g., larger buildings are more expensive) but misses borough-specific effects, leading to a large drop in R².
---
### 2(b) Manhattan neighborhood classifiers applied to Brooklyn
For classification, I built a Brooklyn subset with the same methodology as Manhattan, keeping only neighborhoods with at least 100 sales and removing rows with missing predictors, which gave me 42,533 records across 56 Brooklyn neighborhoods. I then applied the three Manhattan-trained classifiers (k-NN, random forest, and multinomial logistic regression) to predict Brooklyn neighborhoods.
The basic problem is a fundamental label mismatch: the Manhattan models' output classes are 28 Manhattan neighborhoods, while the true labels in the Brooklyn data are 56 *different* Brooklyn neighborhoods. The resulting contingency tables are thus 56 \* 28, and every nonzero count represents a mismatch: Brooklyn neighborhoods systematically get mapped to whichever Manhattan label best matches their numeric features, but there is no notion of a “correct” prediction in terms of neighborhood identity.
Because of this mismatch, usual metrics like accuracy, precision, recall, and F1 are not meaningful: every prediction is technically wrong with respect to the true Brooklyn neighborhood labels. The contingency tables are still useful descriptively: they show which Manhattan neighborhoods Brooklyn neighborhoods *look like* in terms of price and size (for example, some high-end Brooklyn areas may be frequently mapped to Upper East/West Side or SoHo/Tribeca). But as a generalization test of a neighborhood classifier, these results show that a model trained in one borough does not transfer to another when the class labels themselves change.
The Manhattan neighborhood classifiers therefore do not generalize to Brooklyn, in the sense of predicting *Brooklyn* neighborhood names; they act more like a rough borough-agnostic similarity mapping.
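To make that “similarity mapping” reading concrete, each Brooklyn neighborhood's most frequent Manhattan prediction can be pulled straight out of the contingency table. The `tab_rf_brook` name refers to the crosstab built in the accompanying scripts; the reduction itself is my own illustration rather than part of the assignment outputs:

```python
import pandas as pd

def closest_manhattan_analog(contingency: pd.DataFrame) -> pd.Series:
    """For each true (Brooklyn) neighborhood (rows), return the Manhattan label
    (columns) it is most often mapped to by the classifier."""
    return contingency.idxmax(axis=1)

# usage sketch:
# analog_map = closest_manhattan_analog(tab_rf_brook)
# print(analog_map.head())
```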
### 2(c) Further remarks and confidence
One thing that stood out right away is how much more filtering Manhattan needed. Only about 7.6% of its original rows survived the cleaning process, while Brooklyn kept a far larger share. That points to either more missing or odd entries in the Manhattan data, or simply stricter criteria knocking out extreme cases there. Even after cleaning, both boroughs still have very skewed price distributions with plenty of outliers, which makes modeling tough in a market where a handful of ultra-expensive properties dominate the landscape.
Within Manhattan, I'm fairly confident in the regression results for mid-range homes (random forest handles that part of the market well), but the model struggles with luxury listings and doesn't transfer cleanly across boroughs. The weak performance of multinomial logistic regression, along with the only-okay results from k-NN and random forest for neighborhood prediction, makes it clear that numeric features alone aren't enough. Getting neighborhood right would require richer location details and stronger categorical features.
### Overall conclusions about model types and suitability
A log transform on sale price and basic cleaning (removing outliers, invalid years, missing area data, and tiny non-arms-length transactions) are the minimum steps needed before modeling NYC housing data. Without them, extreme values and inconsistent records overwhelm everything and drag down model performance. The contrast between the raw and log-scaled price histograms, and between linear regression and random forest, shows how strongly the distribution's shape affects model behavior.
Using only structural features, linear regression reaches an R² of about 0.19 for Manhattan, which reflects the strong nonlinearities and missing variables at play. Random forest, on the other hand, captures most of the variation with an R² around 0.70 and far smaller errors, highlighting the advantage of flexible ensemble methods in a market as messy as this one.
But even the best Manhattan model doesn't travel well: applying it to Brooklyn drops performance to roughly R² = 0.24. That makes it obvious that models trained in one borough don't work elsewhere without accounting for differences in market structure and price levels. Purely structural, cross-sectional models miss the spatial and neighborhood effects needed for transferability.
For neighborhood classification, random forest again does better than k-NN or multinomial logistic regression, but overall accuracy is still modest (about 0.58, macro-F1 around 0.53). This reinforces that simple numerical features (price, area, and the like) aren't enough to reliably identify neighborhoods. More detailed spatial information, building class categories, and possibly time-related features would likely improve results.
Overall, random forests are the strongest of the models I tested. They handle nonlinear, heavy-tailed relationships and deliver solid within-borough predictions, though they're harder to interpret and struggle when applied to a different borough. Linear and logistic models are easy to explain but miss too much of the structure here. Taken together, the results point toward flexible nonlinear models for within-borough price predictions, with careful feature engineering and borough-specific training needed to generalize across NYC.
+553
View File
@@ -0,0 +1,553 @@
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
r2_score,
mean_squared_error,
mean_absolute_error,
accuracy_score,
precision_recall_fscore_support,
confusion_matrix,
)
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
file_path = "Given/NYC_Citywide_Annualized_Calendar_Sales_Update_20241107.csv"
cols_needed = [
"BOROUGH", "NEIGHBORHOOD", "BUILDING CLASS CATEGORY",
"TAX CLASS AS OF FINAL ROLL", "BLOCK", "LOT",
"BUILDING CLASS AS OF FINAL ROLL", "ZIP CODE",
"RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS",
"LAND SQUARE FEET", "GROSS SQUARE FEET", "YEAR BUILT",
"TAX CLASS AT TIME OF SALE", "BUILDING CLASS AT TIME OF SALE",
"SALE PRICE", "SALE DATE",
]
# loading...
nyc = pd.read_csv(file_path, dtype=str, low_memory=False)
cols_present = [c for c in cols_needed if c in nyc.columns]
nyc = nyc[cols_present]
# force borough numeric
nyc["BOROUGH"] = pd.to_numeric(nyc["BOROUGH"], errors="coerce")
manhattan_raw = nyc[nyc["BOROUGH"] == 1].copy()
brooklyn_raw = nyc[nyc["BOROUGH"] == 3].copy()
print(f"raw manhattan rows: {len(manhattan_raw)}")
print(f"raw brooklyn rows: {len(brooklyn_raw)}")
def clean_borough(df: pd.DataFrame, min_price: float = 10000.0) -> pd.DataFrame:
"""clean borough-level dataframe similarly to the r function."""
df = df.copy()
def parse_numeric(series: pd.Series) -> pd.Series:
"""numeric from char / factor with commas and junk values."""
s = series.astype(str).str.strip()
s = s.replace(
{
"0": np.nan,
"0.0": np.nan,
"- 0": np.nan,
"": np.nan,
".": np.nan,
"NA": np.nan,
"NaN": np.nan,
}
)
s = s.str.replace(",", "", regex=False)
return pd.to_numeric(s, errors="coerce")
# convert numeric columns
if "SALE PRICE" in df.columns:
df["SALE PRICE"] = parse_numeric(df["SALE PRICE"])
if "LAND SQUARE FEET" in df.columns:
df["LAND SQUARE FEET"] = parse_numeric(df["LAND SQUARE FEET"])
if "GROSS SQUARE FEET" in df.columns:
df["GROSS SQUARE FEET"] = parse_numeric(df["GROSS SQUARE FEET"])
if "YEAR BUILT" in df.columns:
df["YEAR BUILT"] = parse_numeric(df["YEAR BUILT"])
unit_cols = ["RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS"]
for col in unit_cols:
if col in df.columns:
df[col] = parse_numeric(df[col])
# drop non-arms-length / tiny sales
if "SALE PRICE" in df.columns:
df = df[df["SALE PRICE"].notna()]
df = df[df["SALE PRICE"] > min_price]
# very old or zero years --> missing
if "YEAR BUILT" in df.columns:
df.loc[df["YEAR BUILT"] < 1800, "YEAR BUILT"] = np.nan
# need usable size / year
required_cols = ["GROSS SQUARE FEET", "LAND SQUARE FEET", "YEAR BUILT"]
if all(c in df.columns for c in required_cols):
df = df[
df["GROSS SQUARE FEET"].notna()
& df["LAND SQUARE FEET"].notna()
& df["YEAR BUILT"].notna()
& (df["GROSS SQUARE FEET"] > 0)
& (df["LAND SQUARE FEET"] > 0)
]
# fill missing units with 0
for col in unit_cols:
if col in df.columns:
df[col] = df[col].fillna(0)
# rename neighborhood
if "NEIGHBORHOOD" in df.columns:
df = df.rename(columns={"NEIGHBORHOOD": "neighborhood"})
# create new columns
if "LAND SQUARE FEET" in df.columns:
df["land_sqft"] = df["LAND SQUARE FEET"]
if "GROSS SQUARE FEET" in df.columns:
df["gross_sqft"] = df["GROSS SQUARE FEET"]
if "YEAR BUILT" in df.columns:
df["year_built"] = df["YEAR BUILT"]
if "RESIDENTIAL UNITS" in df.columns:
df["res_units"] = df["RESIDENTIAL UNITS"]
else:
df["res_units"] = 0
if "COMMERCIAL UNITS" in df.columns:
df["comm_units"] = df["COMMERCIAL UNITS"]
else:
df["comm_units"] = 0
if "TOTAL UNITS" in df.columns:
df["total_units"] = df["TOTAL UNITS"]
else:
df["total_units"] = 0
if "SALE PRICE" in df.columns:
df["sale_price"] = df["SALE PRICE"]
return df
manhattan = clean_borough(manhattan_raw)
brooklyn = clean_borough(brooklyn_raw)
print(f"clean manhattan rows: {len(manhattan)}")
print(f"clean brooklyn rows: {len(brooklyn)}")
# manhattan exploratory data analysis
# summary stats for sale price
summary_manhattan_price = manhattan["sale_price"].describe()
print("summary of manhattan sale_price:")
print(summary_manhattan_price)
quantiles_manhattan = manhattan["sale_price"].quantile(
[0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
)
print("selected quantiles for manhattan sale_price:")
print(quantiles_manhattan)
q1 = quantiles_manhattan.loc[0.25]
q3 = quantiles_manhattan.loc[0.75]
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(f"manhattan iqr upper bound: {upper_bound}")
print(
f"manhattan max sale price: {manhattan['sale_price'].max(skipna=True)}"
)
outlier_mask = (manhattan["sale_price"] < lower_bound) | (
manhattan["sale_price"] > upper_bound
)
print(f"number of sale price outliers: {outlier_mask.sum()}")
# correlation with other numeric vars
num_cols = [
"sale_price", "gross_sqft", "land_sqft",
"year_built", "res_units", "comm_units", "total_units",
]
corr_manhattan = manhattan[num_cols].corr()
print("correlation with sale_price:")
print(corr_manhattan["sale_price"])
# formatter for comma-separated tick labels
def comma_format(x, pos):
try:
return f"{int(x):,}"
except Exception:
return str(x)
# histogram of raw sale prices
fig, ax = plt.subplots()
ax.hist(manhattan["sale_price"], bins=50)
ax.set_title("manhattan sale price distribution (raw)")
ax.set_xlabel("sale price (usd)")
ax.set_ylabel("count of sales")
ax.xaxis.set_major_formatter(FuncFormatter(comma_format))
plt.tight_layout()
plt.show()
# histogram of log(1 + sale price)
fig, ax = plt.subplots()
ax.hist(np.log1p(manhattan["sale_price"]), bins=50)
ax.set_title("manhattan sale price distribution (log scale)")
ax.set_xlabel("log(1 + sale price)")
ax.set_ylabel("count of sales")
plt.tight_layout()
plt.show()
# boxplot of sale price (for outliers)
fig, ax = plt.subplots()
ax.boxplot(manhattan["sale_price"].values, vert=True)
ax.set_title("manhattan sale price with outliers")
ax.set_ylabel("sale price (usd)")
ax.set_xticks([])
ax.yaxis.set_major_formatter(FuncFormatter(comma_format))
plt.tight_layout()
plt.show()
# scatter: gross sqft vs sale price (log y)
fig, ax = plt.subplots()
ax.scatter(manhattan["gross_sqft"], manhattan["sale_price"], alpha=0.3)
ax.set_yscale("log")
ax.set_title("manhattan sale price vs gross square feet")
ax.set_xlabel("gross square feet")
ax.set_ylabel("sale price (log10-ish scale)")
plt.tight_layout()
plt.show()
# manhattan regression analysis
reg_vars = [
"land_sqft", "gross_sqft", "year_built",
"res_units", "comm_units", "total_units",
]
reg_df = manhattan[reg_vars + ["sale_price"]].dropna()
reg_df["log_price"] = np.log1p(reg_df["sale_price"])
X = reg_df[reg_vars]
y = reg_df["log_price"]
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
X, y, test_size=0.25, random_state=RANDOM_STATE
)
# linear regression on log_price
lm = LinearRegression()
lm.fit(X_train_reg, y_train_reg)
lm_pred = lm.predict(X_test_reg)
r2_lm = r2_score(y_test_reg, lm_pred)
rmse_lm = np.sqrt(mean_squared_error(y_test_reg, lm_pred))
mae_lm = mean_absolute_error(y_test_reg, lm_pred)
print("\nlinear model (log price) metrics on manhattan:")
print(f"r2: {r2_lm:.4f} rmse: {rmse_lm:.4f} mae: {mae_lm:.4f}")
# random forest regression on log_price
rf_reg = RandomForestRegressor(
n_estimators=200,
max_features=3,
max_leaf_nodes=100,
random_state=RANDOM_STATE,
n_jobs=-1,
)
rf_reg.fit(X_train_reg, y_train_reg)
rf_pred = rf_reg.predict(X_test_reg)
r2_rf = r2_score(y_test_reg, rf_pred)
rmse_rf = np.sqrt(mean_squared_error(y_test_reg, rf_pred))
mae_rf = mean_absolute_error(y_test_reg, rf_pred)
print("\nrandom forest (log price) metrics on manhattan:")
print(f"r2: {r2_rf:.4f} rmse: {rmse_rf:.4f} mae: {mae_rf:.4f}")
# manhattan predicted vs actual plot
rf_diag_df = pd.DataFrame(
{
"actual": y_test_reg.values,
"predicted": rf_pred,
}
)
fig, ax = plt.subplots()
ax.scatter(rf_diag_df["actual"], rf_diag_df["predicted"], alpha=0.3)
min_val = min(rf_diag_df["actual"].min(), rf_diag_df["predicted"].min())
max_val = max(rf_diag_df["actual"].max(), rf_diag_df["predicted"].max())
ax.plot([min_val, max_val], [min_val, max_val], linestyle="--")
ax.set_title(
"random forest: predicted vs actual log(1 + sale price) (manhattan)"
)
ax.set_xlabel("actual log(1 + sale price)")
ax.set_ylabel("predicted log(1 + sale price)")
plt.tight_layout()
plt.show()
# manhattan residuals vs predicted plot
rf_diag_df["residual"] = (
rf_diag_df["actual"] - rf_diag_df["predicted"]
)
fig, ax = plt.subplots()
ax.scatter(rf_diag_df["predicted"], rf_diag_df["residual"], alpha=0.3)
ax.axhline(0.0, linestyle="--")
ax.set_title("random forest residuals (manhattan)")
ax.set_xlabel("predicted log(1 + sale price)")
ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
# use manhattan regression model on brooklyn
brook_reg_df = brooklyn[reg_vars + ["sale_price"]].dropna()
brook_reg_df["log_price"] = np.log1p(brook_reg_df["sale_price"])
X_brook = brook_reg_df[reg_vars]
y_brook = brook_reg_df["log_price"]
rf_pred_brook = rf_reg.predict(X_brook)
r2_rf_brook = r2_score(y_brook, rf_pred_brook)
rmse_rf_brook = np.sqrt(mean_squared_error(y_brook, rf_pred_brook))
mae_rf_brook = mean_absolute_error(y_brook, rf_pred_brook)
print(
"\nrandom forest (log price) metrics on brooklyn "
"(trained on manhattan):"
)
print(
f"r2: {r2_rf_brook:.4f} rmse: {rmse_rf_brook:.4f} "
f"mae: {mae_rf_brook:.4f}"
)
brook_diag_df = pd.DataFrame(
{
"actual": y_brook.values,
"predicted": rf_pred_brook,
}
)
brook_diag_df["residual"] = (
brook_diag_df["actual"] - brook_diag_df["predicted"]
)
fig, ax = plt.subplots()
ax.scatter(brook_diag_df["actual"], brook_diag_df["predicted"], alpha=0.3)
min_val = min(brook_diag_df["actual"].min(), brook_diag_df["predicted"].min())
max_val = max(brook_diag_df["actual"].max(), brook_diag_df["predicted"].max())
ax.plot([min_val, max_val], [min_val, max_val], linestyle="--")
ax.set_title(
"random forest: manhattan model on brooklyn (log price)"
)
ax.set_xlabel("actual log(1 + sale price) (brooklyn)")
ax.set_ylabel("predicted log(1 + sale price)")
plt.tight_layout()
plt.show()
fig, ax = plt.subplots()
ax.scatter(brook_diag_df["predicted"], brook_diag_df["residual"], alpha=0.3)
ax.axhline(0.0, linestyle="--")
ax.set_title(
"residuals on brooklyn using manhattan random forest"
)
ax.set_xlabel("predicted log(1 + sale price)")
ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
# classification: manhattan predict neighborhood
clf_vars = [
"sale_price", "land_sqft", "gross_sqft",
"year_built", "res_units", "comm_units", "total_units",
]
def prepare_classification_df(
df: pd.DataFrame, min_per_class: int = 100
) -> pd.DataFrame:
"""prepare classification df similar to r code."""
tmp = df.copy()
if "neighborhood" not in tmp.columns:
raise ValueError("neighborhood column missing")
tmp = tmp[tmp["neighborhood"].notna()]
counts = tmp["neighborhood"].value_counts()
keep = counts[counts >= min_per_class].index
tmp = tmp[tmp["neighborhood"].isin(keep)]
cols = ["neighborhood"] + clf_vars
tmp = tmp[cols].dropna()
tmp["neighborhood"] = tmp["neighborhood"].astype("category")
return tmp
manhattan_clf = prepare_classification_df(
manhattan, min_per_class=100
)
print(
f"\nmanhattan classification subset rows: {len(manhattan_clf)}"
)
print(
"manhattan neighborhoods: "
f"{manhattan_clf['neighborhood'].nunique()}"
)
X_clf = manhattan_clf[clf_vars]
y_clf = manhattan_clf["neighborhood"]
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
X_clf,
y_clf,
test_size=0.25,
stratify=y_clf,
random_state=RANDOM_STATE,
)
def macro_f1_score(y_true, y_pred) -> float:
"""macro f1 similar to caret::confusionMatrix byClass averaging."""
_, _, f1, _ = precision_recall_fscore_support(
y_true, y_pred, average="macro", zero_division=0
)
return float(f1)
# k-nn classifier with scaling
knn_pipeline = Pipeline(
[
("scaler", StandardScaler()),
("knn", KNeighborsClassifier(n_neighbors=7)),
]
)
knn_pipeline.fit(X_train_clf, y_train_clf)
knn_pred = knn_pipeline.predict(X_test_clf)
acc_knn = accuracy_score(y_test_clf, knn_pred)
macro_f1_knn = macro_f1_score(y_test_clf, knn_pred)
print(
f"\nknn (manhattan) accuracy: {acc_knn:.4f} "
f"macro f1: {macro_f1_knn:.4f}"
)
# random forest classifier
rf_clf = RandomForestClassifier(
n_estimators=300,
max_features=3,
random_state=RANDOM_STATE,
n_jobs=-1,
)
rf_clf.fit(X_train_clf, y_train_clf)
rf_clf_pred = rf_clf.predict(X_test_clf)
acc_rf_clf = accuracy_score(y_test_clf, rf_clf_pred)
macro_f1_rf_clf = macro_f1_score(y_test_clf, rf_clf_pred)
print(
"\nrandom forest classifier (manhattan) accuracy: "
f"{acc_rf_clf:.4f} macro f1: {macro_f1_rf_clf:.4f}"
)
# contingency table (rf example)
labels = sorted(y_clf.unique())
rf_cm_table = confusion_matrix(
y_test_clf, rf_clf_pred, labels=labels
)
print("rf confusion matrix shape:", rf_cm_table.shape)
print("rf confusion matrix (rows=true, cols=pred):")
print(rf_cm_table)
# multinomial logistic regression
logit_clf = LogisticRegression(
multi_class="multinomial",
max_iter=2000,
solver="lbfgs",
n_jobs=-1,
)
logit_clf.fit(X_train_clf, y_train_clf)
logit_pred = logit_clf.predict(X_test_clf)
acc_logit = accuracy_score(y_test_clf, logit_pred)
macro_f1_logit = macro_f1_score(y_test_clf, logit_pred)
print(
"\nmultinomial logistic regression (manhattan) accuracy: "
f"{acc_logit:.4f} macro f1: {macro_f1_logit:.4f}"
)
# use manhattan classifiers on brooklyn
brooklyn_clf = prepare_classification_df(
brooklyn, min_per_class=100
)
print(
f"\nbrooklyn classification subset rows: {len(brooklyn_clf)}"
)
print(
"brooklyn neighborhoods: "
f"{brooklyn_clf['neighborhood'].nunique()}"
)
X_brook_clf = brooklyn_clf[clf_vars]
y_brook_clf = brooklyn_clf["neighborhood"]
# predictions from manhattan-trained models on brooklyn data
knn_pred_brook = knn_pipeline.predict(X_brook_clf)
rf_pred_brook_clf = rf_clf.predict(X_brook_clf)
logit_pred_brook = logit_clf.predict(X_brook_clf)
# contingency tables (true brooklyn neigh vs predicted manhattan neigh)
tab_knn_brook = pd.crosstab(
y_brook_clf, knn_pred_brook,
rownames=["true"], colnames=["pred"]
)
tab_rf_brook = pd.crosstab(
y_brook_clf, rf_pred_brook_clf,
rownames=["true"], colnames=["pred"]
)
tab_logit_brook = pd.crosstab(
y_brook_clf, logit_pred_brook,
rownames=["true"], colnames=["pred"]
)
print("\ncontingency table dimensions (knn):", tab_knn_brook.shape)
print(tab_knn_brook)
print(
"contingency table dimensions (random forest):",
tab_rf_brook.shape,
)
print(tab_rf_brook)
print("contingency table dimensions (logit):", tab_logit_brook.shape)
print(tab_logit_brook)
+504
View File
@@ -0,0 +1,504 @@
# install.packages(
# c("dplyr", "ggplot2", "randomForest", "caret", "nnet", "e1071", "scales"),
# repos = "https://cloud.r-project.org"
# )
library(dplyr)
library(ggplot2)
library(randomForest)
library(caret)
library(nnet)
library(e1071)
library(scales)
# load data / basic subsets
options(stringsAsFactors = FALSE)
set.seed(42L)
file_path <- "Given/NYC_Citywide_Annualized_Calendar_Sales_Update_20241107.csv"
# columns we actually need
cols_needed <- c(
"BOROUGH", "NEIGHBORHOOD", "BUILDING CLASS CATEGORY",
"TAX CLASS AS OF FINAL ROLL", "BLOCK", "LOT",
"BUILDING CLASS AS OF FINAL ROLL", "ZIP CODE",
"RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS",
"LAND SQUARE FEET", "GROSS SQUARE FEET", "YEAR BUILT",
"TAX CLASS AT TIME OF SALE", "BUILDING CLASS AT TIME OF SALE",
"SALE PRICE", "SALE DATE"
)
nyc <- read.csv(file_path, stringsAsFactors = FALSE, check.names = FALSE)
nyc <- nyc[, intersect(cols_needed, colnames(nyc))]
# force borough numeric
nyc$BOROUGH <- suppressWarnings(as.numeric(nyc$BOROUGH))
manhattan_raw <- nyc %>% filter(BOROUGH == 1)
brooklyn_raw <- nyc %>% filter(BOROUGH == 3)
cat("raw manhattan rows:", nrow(manhattan_raw), "\n")
cat("raw brooklyn rows:", nrow(brooklyn_raw), "\n")
clean_borough <- function(df, min_price = 10000) {
# numeric from char / factor with commas
parse_numeric <- function(x) {
x <- as.character(x)
x <- trimws(x)
x[x %in% c("0", "0.0", "- 0", "", ".", "NA", "NaN")] <- NA
x <- gsub(",", "", x, fixed = TRUE)
suppressWarnings(as.numeric(x))
}
# convert price, area, and year columns to numeric
df$`SALE PRICE` <- parse_numeric(df$`SALE PRICE`)
df$`LAND SQUARE FEET` <- parse_numeric(df$`LAND SQUARE FEET`)
df$`GROSS SQUARE FEET` <- parse_numeric(df$`GROSS SQUARE FEET`)
df$`YEAR BUILT` <- parse_numeric(df$`YEAR BUILT`)
unit_cols <- c("RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS")
for (col in unit_cols) {
if (col %in% names(df)) {
df[[col]] <- parse_numeric(df[[col]])
}
}
# drop non-arms-length / tiny sales
df <- df %>%
filter(!is.na(`SALE PRICE`)) %>%
filter(`SALE PRICE` > min_price)
# very old or zero years --> missing
df$`YEAR BUILT`[df$`YEAR BUILT` < 1800] <- NA
# need usable size / year
df <- df %>%
filter(
!is.na(`GROSS SQUARE FEET`),
!is.na(`LAND SQUARE FEET`),
!is.na(`YEAR BUILT`),
`GROSS SQUARE FEET` > 0,
`LAND SQUARE FEET` > 0
)
# fill missing unit counts with 0
for (col in unit_cols) {
if (col %in% names(df)) {
df[[col]][is.na(df[[col]])] <- 0
}
}
# rename neighborhood and add lowercase working columns
df <- df %>%
rename(
neighborhood = NEIGHBORHOOD
) %>%
mutate(
land_sqft = `LAND SQUARE FEET`,
gross_sqft = `GROSS SQUARE FEET`,
year_built = `YEAR BUILT`,
res_units = `RESIDENTIAL UNITS`,
comm_units = `COMMERCIAL UNITS`,
total_units = `TOTAL UNITS`,
sale_price = `SALE PRICE`
)
df
}
manhattan <- clean_borough(manhattan_raw)
brooklyn <- clean_borough(brooklyn_raw)
cat("clean manhattan rows:", nrow(manhattan), "\n")
cat("clean brooklyn rows:", nrow(brooklyn), "\n")
# manhattan exploratory data analysis
# summary stats for sale price
summary_manhattan_price <- summary(manhattan$sale_price)
print(summary_manhattan_price)
quantiles_manhattan <- quantile(
manhattan$sale_price,
probs = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99),
na.rm = TRUE
)
print(quantiles_manhattan)
# iqr-based outlier bounds
q1 <- quantiles_manhattan[1]
q3 <- quantiles_manhattan[3]
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
cat("manhattan iqr upper bound:", upper_bound, "\n")
cat("manhattan max sale price:", max(manhattan$sale_price, na.rm = TRUE), "\n")
outlier_mask <- (manhattan$sale_price < lower_bound) |
(manhattan$sale_price > upper_bound)
cat("number of sale price outliers:", sum(outlier_mask, na.rm = TRUE), "\n")
# correlation with other numeric vars
num_cols <- c(
"sale_price", "gross_sqft", "land_sqft",
"year_built", "res_units", "comm_units", "total_units"
)
corr_manhattan <- cor(manhattan[, num_cols], use = "complete.obs")
print(corr_manhattan[, "sale_price"])
# histogram of raw sale prices
p_hist_raw <- ggplot(manhattan, aes(x = sale_price)) +
geom_histogram(bins = 50, color = "black", fill = NA) +
scale_x_continuous(labels = comma) +
labs(
title = "manhattan sale price distribution (raw)",
x = "sale price (usd)",
y = "count of sales"
)
# histogram of log(1 + sale price)
p_hist_log <- ggplot(manhattan, aes(x = log1p(sale_price))) +
geom_histogram(bins = 50, color = "black", fill = NA) +
labs(
title = "manhattan sale price distribution (log scale)",
x = "log(1 + sale price)",
y = "count of sales"
)
# boxplot of sale price (to show outliers)
p_box <- ggplot(manhattan, aes(y = sale_price)) +
geom_boxplot(outlier.alpha = 0.4) +
scale_y_continuous(labels = comma) +
labs(
title = "manhattan sale price with outliers",
y = "sale price (usd)",
x = ""
)
# scatter: gross sqft vs sale price (log y)
p_scatter <- ggplot(manhattan, aes(x = gross_sqft, y = sale_price)) +
geom_point(alpha = 0.3) +
scale_y_continuous(trans = "log10", labels = comma) +
labs(
title = "manhattan sale price vs gross square feet",
x = "gross square feet",
y = "sale price (log10 scale)"
)
# print or save plots as needed
print(p_hist_raw)
print(p_hist_log)
print(p_box)
print(p_scatter)
# regression analysis (manhattan)
reg_vars <- c(
"land_sqft", "gross_sqft", "year_built",
"res_units", "comm_units", "total_units"
)
reg_df <- manhattan %>%
select(all_of(reg_vars), sale_price) %>%
tidyr::drop_na()
reg_df$log_price <- log1p(reg_df$sale_price)
set.seed(42L)
train_idx_reg <- createDataPartition(reg_df$log_price, p = 0.75, list = FALSE)
train_reg <- reg_df[train_idx_reg, ]
test_reg <- reg_df[-train_idx_reg, ]
# linear regression
lm_fit <- lm(
log_price ~ land_sqft + gross_sqft + year_built +
res_units + comm_units + total_units,
data = train_reg
)
lm_pred <- predict(lm_fit, newdata = test_reg)
r2_lm <- cor(test_reg$log_price, lm_pred)^2
rmse_lm <- sqrt(mean((test_reg$log_price - lm_pred)^2))
mae_lm <- mean(abs(test_reg$log_price - lm_pred))
cat("\nlinear model (log price) metrics on manhattan:\n")
cat("r2:", r2_lm, " rmse:", rmse_lm, " mae:", mae_lm, "\n")
# random forest regression on log price
set.seed(42L)
rf_fit <- randomForest(
x = train_reg[, reg_vars],
y = train_reg$log_price,
ntree = 200,
mtry = 3,
maxnodes = 100,
importance = TRUE
)
rf_pred <- predict(rf_fit, newdata = test_reg[, reg_vars])
r2_rf <- cor(test_reg$log_price, rf_pred)^2
rmse_rf <- sqrt(mean((test_reg$log_price - rf_pred)^2))
mae_rf <- mean(abs(test_reg$log_price - rf_pred))
cat("\nrandom forest (log price) metrics on manhattan:\n")
cat("r2:", r2_rf, " rmse:", rmse_rf, " mae:", mae_rf, "\n")
# predicted vs actual plot (manhattan)
rf_diag_df <- data.frame(
actual = test_reg$log_price,
predicted = rf_pred
)
p_rf_pred_vs_actual <- ggplot(rf_diag_df, aes(x = actual, y = predicted)) +
geom_point(alpha = 0.3) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "random forest: predicted vs actual log(1 + sale price) (manhattan)",
x = "actual log(1 + sale price)",
y = "predicted log(1 + sale price)"
)
# residuals vs predicted plot (manhattan)
rf_diag_df$residual <- rf_diag_df$actual - rf_diag_df$predicted
p_rf_resid <- ggplot(rf_diag_df, aes(x = predicted, y = residual)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "random forest residuals (manhattan)",
x = "predicted log(1 + sale price)",
y = "residual"
)
print(p_rf_pred_vs_actual)
print(p_rf_resid)
# apply manhattan regression model to brooklyn
brook_reg_df <- brooklyn %>%
select(all_of(reg_vars), sale_price) %>%
tidyr::drop_na()
brook_reg_df$log_price <- log1p(brook_reg_df$sale_price)
rf_pred_brook <- predict(rf_fit, newdata = brook_reg_df[, reg_vars])
r2_rf_brook <- cor(brook_reg_df$log_price, rf_pred_brook)^2
rmse_rf_brook <- sqrt(mean((brook_reg_df$log_price - rf_pred_brook)^2))
mae_rf_brook <- mean(abs(brook_reg_df$log_price - rf_pred_brook))
cat("\nrandom forest (log price) metrics on brooklyn (trained on manhattan):\n")
cat("r2:", r2_rf_brook, " rmse:", rmse_rf_brook, " mae:", mae_rf_brook, "\n")
brook_diag_df <- data.frame(
actual = brook_reg_df$log_price,
predicted = rf_pred_brook
)
p_brook_pred_vs_actual <- ggplot(brook_diag_df, aes(x = actual, y = predicted)) +
geom_point(alpha = 0.3) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "random forest: manhattan model on brooklyn (log price)",
x = "actual log(1 + sale price) (brooklyn)",
y = "predicted log(1 + sale price)"
)
brook_diag_df$residual <- brook_diag_df$actual - brook_diag_df$predicted
p_brook_resid <- ggplot(brook_diag_df, aes(x = predicted, y = residual)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "residuals on brooklyn using manhattan random forest",
x = "predicted log(1 + sale price)",
y = "residual"
)
print(p_brook_pred_vs_actual)
print(p_brook_resid)
# classification: manhattan predict neighborhood
clf_vars <- c(
"sale_price", "land_sqft", "gross_sqft",
"year_built", "res_units", "comm_units", "total_units"
)
prepare_classification_df <- function(df, min_per_class = 100L) {
df <- df %>%
filter(!is.na(neighborhood))
counts <- table(df$neighborhood)
keep <- names(counts[counts >= min_per_class])
df <- df %>%
filter(neighborhood %in% keep) %>%
mutate(neighborhood = factor(neighborhood)) %>%
select(neighborhood, all_of(clf_vars)) %>%
tidyr::drop_na()
df
}
manhattan_clf <- prepare_classification_df(manhattan, min_per_class = 100L)
cat("\nmanhattan classification subset rows:", nrow(manhattan_clf), "\n")
cat("manhattan neighborhoods:", nlevels(manhattan_clf$neighborhood), "\n")
set.seed(42L)
train_idx_clf <- createDataPartition(manhattan_clf$neighborhood, p = 0.75, list = FALSE)
train_clf <- manhattan_clf[train_idx_clf, ]
test_clf <- manhattan_clf[-train_idx_clf, ]
# helper function for macro f1 from a confusionMatrix object
macro_f1_from_cm <- function(cm_obj) {
byc <- cm_obj$byClass
if (!is.matrix(byc)) {
precision <- byc["Pos Pred Value"]
recall <- byc["Sensitivity"]
return(2 * precision * recall / (precision + recall))
} else {
precision <- byc[, "Pos Pred Value"]
recall <- byc[, "Sensitivity"]
f1 <- 2 * precision * recall / (precision + recall)
mean(f1, na.rm = TRUE)
}
}
# k-nn classifier (standardized features, k = 7)
ctrl_none <- trainControl(method = "none")
set.seed(42L)
knn_fit <- train(
neighborhood ~ sale_price + land_sqft + gross_sqft + year_built +
res_units + comm_units + total_units,
data = train_clf,
method = "knn",
preProcess = c("center", "scale"),
tuneGrid = data.frame(k = 7),
trControl = ctrl_none
)
knn_pred <- predict(knn_fit, newdata = test_clf)
cm_knn <- confusionMatrix(knn_pred, test_clf$neighborhood)
macro_f1_knn <- macro_f1_from_cm(cm_knn)
cat(
"\nknn (manhattan) accuracy:", cm_knn$overall["Accuracy"],
" macro f1:", macro_f1_knn, "\n"
)
# random forest classifier
set.seed(42L)
rf_clf_fit <- randomForest(
neighborhood ~ sale_price + land_sqft + gross_sqft + year_built +
res_units + comm_units + total_units,
data = train_clf,
ntree = 300,
mtry = 3
)
rf_clf_pred <- predict(rf_clf_fit, newdata = test_clf)
cm_rf_clf <- confusionMatrix(rf_clf_pred, test_clf$neighborhood)
macro_f1_rf_clf <- macro_f1_from_cm(cm_rf_clf)
cat(
"\nrandom forest classifier (manhattan) accuracy:",
cm_rf_clf$overall["Accuracy"],
" macro f1:", macro_f1_rf_clf, "\n"
)
# ex contingency table
rf_cm_table <- cm_rf_clf$table
print(dim(rf_cm_table))
# num neighborhoods x num neighborhoods
print(rf_cm_table)
# multinomial logistic regression
set.seed(42L)
logit_fit <- multinom(
neighborhood ~ sale_price + land_sqft + gross_sqft + year_built +
res_units + comm_units + total_units,
data = train_clf,
MaxNWts = 10000,
maxit = 2000,
trace = FALSE
)
logit_pred <- predict(logit_fit, newdata = test_clf)
cm_logit <- confusionMatrix(logit_pred, test_clf$neighborhood)
macro_f1_logit <- macro_f1_from_cm(cm_logit)
cat(
"\nmultinomial logistic regression (manhattan) accuracy:",
cm_logit$overall["Accuracy"],
" macro f1:", macro_f1_logit, "\n"
)
# use manhattan classifiers on brooklyn
brooklyn_clf <- prepare_classification_df(brooklyn, min_per_class = 100L)
cat("\nbrooklyn classification subset rows:", nrow(brooklyn_clf), "\n")
cat("brooklyn neighborhoods:", nlevels(brooklyn_clf$neighborhood), "\n")
# predictions from manhattan-trained models on brooklyn data
knn_pred_brook <- predict(knn_fit, newdata = brooklyn_clf)
rf_pred_brook <- predict(rf_clf_fit, newdata = brooklyn_clf)
logit_pred_brook <- predict(logit_fit, newdata = brooklyn_clf)
# contingency tables (true brooklyn neigh vs predicted manhattan neigh)
# every count is a mismatch because the Manhattan and Brooklyn label sets are disjoint
tab_knn_brook <- table(true = brooklyn_clf$neighborhood, pred = knn_pred_brook)
tab_rf_brook <- table(true = brooklyn_clf$neighborhood, pred = rf_pred_brook)
tab_logit_brook <- table(true = brooklyn_clf$neighborhood, pred = logit_pred_brook)
cat("\ncontingency table dimensions (knn):", dim(tab_knn_brook), "\n")
cat("contingency table dimensions (random forest):", dim(tab_rf_brook), "\n")
cat("contingency table dimensions (logit):", dim(tab_logit_brook), "\n")
plots <- list(
manhattan_hist_raw = p_hist_raw,
manhattan_hist_log = p_hist_log,
manhattan_box = p_box,
manhattan_scatter = p_scatter,
rf_pred_vs_actual_manhattan = p_rf_pred_vs_actual,
rf_resid_manhattan = p_rf_resid,
rf_pred_vs_actual_brooklyn = p_brook_pred_vs_actual,
rf_resid_brooklyn = p_brook_resid
)
dir.create("plots", showWarnings = FALSE)
for (nm in names(plots)) {
ggsave(
filename = file.path("plots", paste0(nm, ".png")),
plot = plots[[nm]],
width = 7,
height = 5,
dpi = 300
)
}
+341
View File
@@ -0,0 +1,341 @@
raw manhattan rows: 96088
raw brooklyn rows: 123813
clean manhattan rows: 7294
clean brooklyn rows: 42743
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.005e+04 1.368e+06 4.000e+06 1.633e+07 9.575e+06 2.398e+09
25% 50% 75% 90% 95% 99%
1367750 4000000 9575000 22736850 50618132 266148692
manhattan iqr upper bound: 21885875
manhattan max sale price: 2397501899
number of sale price outliers: 756
sale_price gross_sqft land_sqft year_built res_units comm_units
1.00000000 0.49144986 0.16392585 0.02117231 0.11589681 0.20912401
total_units
0.16618138
linear model (log price) metrics on manhattan:
r2: 0.1867558 rmse: 1.689562 mae: 1.260938
random forest (log price) metrics on manhattan:
r2: 0.6978428 rmse: 1.031439 mae: 0.661946
random forest (log price) metrics on brooklyn (trained on manhattan):
r2: 0.2442946 rmse: 1.305062 mae: 1.084852
manhattan classification subset rows: 6792
manhattan neighborhoods: 28
knn (manhattan) accuracy: 0.4774882 macro f1: 0.4174939
random forest classifier (manhattan) accuracy: 0.5841232 macro f1: 0.5328558
[1] 28 28
Reference
Prediction ALPHABET CITY CHELSEA CHINATOWN CLINTON
ALPHABET CITY 9 1 0 0
CHELSEA 0 23 2 1
CHINATOWN 0 0 10 0
CLINTON 0 2 0 17
EAST VILLAGE 2 2 2 0
FASHION 0 4 0 0
GRAMERCY 0 0 0 0
GREENWICH VILLAGE-CENTRAL 0 2 1 0
GREENWICH VILLAGE-WEST 1 3 0 1
HARLEM-CENTRAL 3 4 3 5
HARLEM-EAST 4 1 0 0
HARLEM-UPPER 1 1 0 1
KIPS BAY 0 2 0 0
LOWER EAST SIDE 1 3 2 0
MANHATTAN VALLEY 0 0 0 0
MIDTOWN CBD 0 1 1 0
MIDTOWN EAST 1 2 0 1
MIDTOWN WEST 0 3 1 1
MURRAY HILL 0 3 1 0
SOHO 0 2 1 0
TRIBECA 0 1 0 0
UPPER EAST SIDE (59-79) 1 3 0 2
UPPER EAST SIDE (79-96) 0 2 1 0
UPPER WEST SIDE (59-79) 1 1 0 1
UPPER WEST SIDE (79-96) 1 3 0 0
UPPER WEST SIDE (96-116) 0 1 0 1
WASHINGTON HEIGHTS LOWER 0 0 0 0
WASHINGTON HEIGHTS UPPER 0 0 0 1
Reference
Prediction EAST VILLAGE FASHION GRAMERCY
ALPHABET CITY 2 0 0
CHELSEA 1 1 0
CHINATOWN 0 1 0
CLINTON 1 0 0
EAST VILLAGE 33 0 0
FASHION 0 10 0
GRAMERCY 1 0 29
GREENWICH VILLAGE-CENTRAL 2 1 0
GREENWICH VILLAGE-WEST 2 0 1
HARLEM-CENTRAL 4 2 2
HARLEM-EAST 0 2 0
HARLEM-UPPER 0 0 0
KIPS BAY 0 0 0
LOWER EAST SIDE 0 1 0
MANHATTAN VALLEY 0 0 1
MIDTOWN CBD 0 3 2
MIDTOWN EAST 0 1 0
MIDTOWN WEST 0 2 0
MURRAY HILL 0 2 0
SOHO 0 0 0
TRIBECA 1 1 0
UPPER EAST SIDE (59-79) 1 0 1
UPPER EAST SIDE (79-96) 0 0 0
UPPER WEST SIDE (59-79) 0 1 0
UPPER WEST SIDE (79-96) 2 0 2
UPPER WEST SIDE (96-116) 1 0 0
WASHINGTON HEIGHTS LOWER 1 0 0
WASHINGTON HEIGHTS UPPER 0 0 0
Reference
Prediction GREENWICH VILLAGE-CENTRAL GREENWICH VILLAGE-WEST
ALPHABET CITY 0 0
CHELSEA 1 5
CHINATOWN 1 1
CLINTON 0 0
EAST VILLAGE 3 1
FASHION 0 0
GRAMERCY 0 0
GREENWICH VILLAGE-CENTRAL 14 3
GREENWICH VILLAGE-WEST 5 42
HARLEM-CENTRAL 1 4
HARLEM-EAST 1 1
HARLEM-UPPER 0 0
KIPS BAY 0 1
LOWER EAST SIDE 1 0
MANHATTAN VALLEY 0 0
MIDTOWN CBD 0 0
MIDTOWN EAST 1 1
MIDTOWN WEST 0 1
MURRAY HILL 0 0
SOHO 1 0
TRIBECA 0 0
UPPER EAST SIDE (59-79) 1 6
UPPER EAST SIDE (79-96) 1 6
UPPER WEST SIDE (59-79) 0 0
UPPER WEST SIDE (79-96) 2 3
UPPER WEST SIDE (96-116) 1 0
WASHINGTON HEIGHTS LOWER 0 0
WASHINGTON HEIGHTS UPPER 0 0
Reference
Prediction HARLEM-CENTRAL HARLEM-EAST HARLEM-UPPER KIPS BAY
ALPHABET CITY 2 0 0 0
CHELSEA 0 0 0 0
CHINATOWN 2 0 0 0
CLINTON 0 0 1 0
EAST VILLAGE 0 0 0 0
FASHION 0 0 0 0
GRAMERCY 0 0 0 0
GREENWICH VILLAGE-CENTRAL 0 0 0 0
GREENWICH VILLAGE-WEST 0 0 2 0
HARLEM-CENTRAL 158 15 21 0
HARLEM-EAST 9 35 2 0
HARLEM-UPPER 9 2 19 0
KIPS BAY 2 0 1 27
LOWER EAST SIDE 3 0 0 0
MANHATTAN VALLEY 1 4 0 0
MIDTOWN CBD 0 0 0 0
MIDTOWN EAST 2 0 0 2
MIDTOWN WEST 2 0 0 0
MURRAY HILL 2 1 0 0
SOHO 1 1 0 0
TRIBECA 0 0 0 0
UPPER EAST SIDE (59-79) 2 0 1 0
UPPER EAST SIDE (79-96) 2 4 0 1
UPPER WEST SIDE (59-79) 0 0 0 0
UPPER WEST SIDE (79-96) 0 0 0 0
UPPER WEST SIDE (96-116) 0 0 4 0
WASHINGTON HEIGHTS LOWER 7 3 1 0
WASHINGTON HEIGHTS UPPER 1 1 1 0
Reference
Prediction LOWER EAST SIDE MANHATTAN VALLEY MIDTOWN CBD
ALPHABET CITY 0 0 0
CHELSEA 4 0 0
CHINATOWN 0 1 0
CLINTON 0 0 0
EAST VILLAGE 3 0 0
FASHION 2 0 1
GRAMERCY 0 1 0
GREENWICH VILLAGE-CENTRAL 0 0 0
GREENWICH VILLAGE-WEST 1 0 0
HARLEM-CENTRAL 4 7 2
HARLEM-EAST 1 0 0
HARLEM-UPPER 0 0 0
KIPS BAY 0 0 0
LOWER EAST SIDE 47 2 0
MANHATTAN VALLEY 0 11 0
MIDTOWN CBD 0 0 12
MIDTOWN EAST 0 0 3
MIDTOWN WEST 0 2 1
MURRAY HILL 2 0 0
SOHO 0 0 1
TRIBECA 0 0 0
UPPER EAST SIDE (59-79) 1 0 2
UPPER EAST SIDE (79-96) 1 1 1
UPPER WEST SIDE (59-79) 0 0 2
UPPER WEST SIDE (79-96) 0 0 1
UPPER WEST SIDE (96-116) 0 1 0
WASHINGTON HEIGHTS LOWER 0 1 0
WASHINGTON HEIGHTS UPPER 0 1 0
Reference
Prediction MIDTOWN EAST MIDTOWN WEST MURRAY HILL SOHO TRIBECA
ALPHABET CITY 0 0 1 1 0
CHELSEA 3 1 2 1 0
CHINATOWN 0 1 2 0 0
CLINTON 0 0 1 2 0
EAST VILLAGE 1 0 0 2 0
FASHION 0 3 3 1 0
GRAMERCY 1 0 0 0 1
GREENWICH VILLAGE-CENTRAL 0 1 0 3 1
GREENWICH VILLAGE-WEST 1 0 1 3 1
HARLEM-CENTRAL 4 5 5 0 0
HARLEM-EAST 0 0 1 0 0
HARLEM-UPPER 0 1 2 0 0
KIPS BAY 0 0 0 0 0
LOWER EAST SIDE 0 2 1 2 1
MANHATTAN VALLEY 0 0 0 1 0
MIDTOWN CBD 0 6 0 1 0
MIDTOWN EAST 24 1 1 0 0
MIDTOWN WEST 0 177 0 2 0
MURRAY HILL 3 0 16 1 0
SOHO 1 1 1 16 1
TRIBECA 0 3 0 0 36
UPPER EAST SIDE (59-79) 1 0 2 2 2
UPPER EAST SIDE (79-96) 3 2 2 1 1
UPPER WEST SIDE (59-79) 2 0 0 0 0
UPPER WEST SIDE (79-96) 3 0 2 0 2
UPPER WEST SIDE (96-116) 1 0 0 0 0
WASHINGTON HEIGHTS LOWER 0 0 0 0 0
WASHINGTON HEIGHTS UPPER 0 0 0 0 0
Reference
Prediction UPPER EAST SIDE (59-79) UPPER EAST SIDE (79-96)
ALPHABET CITY 0 0
CHELSEA 1 3
CHINATOWN 1 4
CLINTON 1 1
EAST VILLAGE 2 1
FASHION 2 1
GRAMERCY 0 0
GREENWICH VILLAGE-CENTRAL 0 0
GREENWICH VILLAGE-WEST 2 7
HARLEM-CENTRAL 6 5
HARLEM-EAST 0 0
HARLEM-UPPER 1 2
KIPS BAY 0 0
LOWER EAST SIDE 1 1
MANHATTAN VALLEY 0 0
MIDTOWN CBD 2 2
MIDTOWN EAST 4 2
MIDTOWN WEST 1 0
MURRAY HILL 0 1
SOHO 2 2
TRIBECA 1 0
UPPER EAST SIDE (59-79) 47 14
UPPER EAST SIDE (79-96) 20 53
UPPER WEST SIDE (59-79) 2 1
UPPER WEST SIDE (79-96) 3 2
UPPER WEST SIDE (96-116) 0 3
WASHINGTON HEIGHTS LOWER 1 0
WASHINGTON HEIGHTS UPPER 0 1
Reference
Prediction UPPER WEST SIDE (59-79) UPPER WEST SIDE (79-96)
ALPHABET CITY 1 1
CHELSEA 0 2
CHINATOWN 0 0
CLINTON 0 1
EAST VILLAGE 1 1
FASHION 0 0
GRAMERCY 0 0
GREENWICH VILLAGE-CENTRAL 1 1
GREENWICH VILLAGE-WEST 0 1
HARLEM-CENTRAL 4 3
HARLEM-EAST 1 1
HARLEM-UPPER 0 1
KIPS BAY 0 0
LOWER EAST SIDE 0 0
MANHATTAN VALLEY 0 2
MIDTOWN CBD 0 0
MIDTOWN EAST 0 0
MIDTOWN WEST 3 1
MURRAY HILL 0 0
SOHO 0 0
TRIBECA 1 1
UPPER EAST SIDE (59-79) 4 2
UPPER EAST SIDE (79-96) 4 8
UPPER WEST SIDE (59-79) 46 2
UPPER WEST SIDE (79-96) 7 31
UPPER WEST SIDE (96-116) 1 0
WASHINGTON HEIGHTS LOWER 0 0
WASHINGTON HEIGHTS UPPER 0 1
Reference
Prediction UPPER WEST SIDE (96-116) WASHINGTON HEIGHTS LOWER
ALPHABET CITY 0 0
CHELSEA 1 0
CHINATOWN 0 0
CLINTON 0 0
EAST VILLAGE 0 0
FASHION 0 0
GRAMERCY 0 0
GREENWICH VILLAGE-CENTRAL 1 0
GREENWICH VILLAGE-WEST 0 0
HARLEM-CENTRAL 6 18
HARLEM-EAST 1 1
HARLEM-UPPER 1 5
KIPS BAY 1 0
LOWER EAST SIDE 1 1
MANHATTAN VALLEY 0 0
MIDTOWN CBD 0 0
MIDTOWN EAST 0 0
MIDTOWN WEST 0 0
MURRAY HILL 0 0
SOHO 0 0
TRIBECA 1 0
UPPER EAST SIDE (59-79) 1 0
UPPER EAST SIDE (79-96) 2 1
UPPER WEST SIDE (59-79) 1 0
UPPER WEST SIDE (79-96) 2 0
UPPER WEST SIDE (96-116) 16 0
WASHINGTON HEIGHTS LOWER 0 23
WASHINGTON HEIGHTS UPPER 1 2
Reference
Prediction WASHINGTON HEIGHTS UPPER
ALPHABET CITY 0
CHELSEA 0
CHINATOWN 0
CLINTON 0
EAST VILLAGE 0
FASHION 0
GRAMERCY 0
GREENWICH VILLAGE-CENTRAL 0
GREENWICH VILLAGE-WEST 0
HARLEM-CENTRAL 7
HARLEM-EAST 3
HARLEM-UPPER 4
KIPS BAY 0
LOWER EAST SIDE 0
MANHATTAN VALLEY 0
MIDTOWN CBD 0
MIDTOWN EAST 0
MIDTOWN WEST 0
MURRAY HILL 0
SOHO 1
TRIBECA 0
UPPER EAST SIDE (59-79) 0
UPPER EAST SIDE (79-96) 0
UPPER WEST SIDE (59-79) 0
UPPER WEST SIDE (79-96) 1
UPPER WEST SIDE (96-116) 0
WASHINGTON HEIGHTS LOWER 7
WASHINGTON HEIGHTS UPPER 5
multinomial logistic regression (manhattan) accuracy: 0.2517773 macro f1: 0.2957543
brooklyn classification subset rows: 42533
brooklyn neighborhoods: 56
contingency table dimensions (knn): 56 28
contingency table dimensions (random forest): 56 28
contingency table dimensions (logit): 56 28
[8 binary image files added in this commit; previews not shown (sizes 44–366 KiB)]
@@ -1,410 +0,0 @@
# News Popularity in Multiple Social Media Platforms
This project analyzes the **News Popularity in Multiple Social Media Platforms** dataset from the UCI Machine Learning Repository. The data contains ~93k news items collected between November 2015 and July 2016, with their final popularity on Facebook, Google+ and LinkedIn across four topics: *economy*, *microsoft*, *obama* and *palestine*.
---
## 1. Exploratory Data Analysis
### 1.1 Data overview and cleaning
We work primarily with `Data/News_Final.csv`, which has **93,239** rows and 11 variables:
- `IDLink`: numeric id of the article
- `Title`, `Headline`: short text fields
- `Source`: news outlet that originally published the story
- `Topic`: one of {economy, microsoft, obama, palestine}
- `PublishDate`: publication timestamp
- `SentimentTitle`, `SentimentHeadline`: numeric sentiment scores derived from title and headline text
- `Facebook`, `GooglePlus`, `LinkedIn`: final popularity on each social media platform
According to the dataset documentation, **-1** in the popularity variables indicates that no final popularity value was observed. In the code, any value `< 0` in `Facebook`, `GooglePlus`, or `LinkedIn` is therefore replaced with `NaN`. Missing popularity values are later dropped on a per-model basis. `PublishDate` is converted to a proper timestamp, and a numeric time feature
```text
DaysSinceEpoch = days since 1970-01-01
```
is created to allow inclusion of temporal trends in the models. We also log-transform Facebook popularity:
```text
log_Facebook = log1p(Facebook)
```
which is used as the target for regression models.
---
### 1.2 Popularity distributions
Below is a histogram of Facebook share counts on a **logarithmic x-axis**, after removing missing and zero values:
![distribution of facebook popularity](imgs/eda_facebook_hist.png)
*Figure 1: Distribution of Facebook popularity on a log x-axis.*
The distribution is extremely right-skewed:
- Most articles receive very few shares.
- A small number of “viral” articles receive thousands of shares.
On the cleaned data, summary statistics for Facebook shares are approximately:
- median ≈ 8
- mean ≈ 129
- 90th percentile ≈ 214
- 99th percentile ≈ 2,322
- max = 49,211
Google+ and LinkedIn exhibit similar heavy-tailed patterns (with smaller absolute scales), which matches the description given by the dataset creators ([arXiv][1]).
The distribution of `log1p(Facebook)` is shown below:
![distribution of log-transformed facebook popularity](imgs/eda_log_facebook_hist.png)
*Figure 2: Distribution of log-transformed Facebook popularity.*
The log transform compresses the heavy tail and produces a more regular, unimodal distribution. This justifies using `log1p(popularity)` as the regression target: it reduces the influence of rare extreme outliers while keeping them in the data, which is important because viral stories are the phenomena of interest.
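Because the models are trained on `log1p(popularity)`, predictions have to be sent back through `expm1` before they can be reported as share counts. A minimal, self-contained sketch of that round trip (illustrative only, not part of the analysis script):
```python
import numpy as np
# forward transform used as the regression target
shares = np.array([0.0, 3.0, 8.0, 214.0, 49_211.0])
log_target = np.log1p(shares)             # log(1 + shares); defined at zero shares
# back-transform log-scale predictions to interpretable share counts
predicted_shares = np.expm1(log_target)   # here log_target stands in for model output
print(np.round(predicted_shares))         # recovers the original counts (up to rounding)
```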
---
### 1.3 Topic effects
The four topics are not equally represented:
- economy: 33,928 items
- obama: 28,610
- microsoft: 21,858
- palestine: 8,843
The mean log-Facebook popularity by topic is shown below:
![average facebook popularity by topic](imgs/eda_mean_by_topic.png)
*Figure 3: Mean log-Facebook popularity by topic.*
Key observations:
- **obama** stories clearly have the highest average popularity.
- **microsoft** is slightly above **economy** and **palestine**.
- In original share counts, obama articles average roughly an order of magnitude more shares than economy/microsoft/palestine stories, but all topics remain strongly skewed.
This suggests that topic is an important categorical predictor for popularity, and motivates including it as a one-hot encoded feature in the models.
---
### 1.4 Sentiment and popularity
Sentiment scores from the title and headline are continuous values roughly in the interval [-1, 1]. Their empirical distributions are centered very close to 0 with standard deviations around 0.14, indicating that most titles and headlines are only mildly positive or negative.
A 5,000-row sample of `SentimentTitle` vs `log_Facebook` is plotted below:
![title sentiment vs facebook popularity](imgs/eda_sentiment_vs_popularity.png)
*Figure 4: Scatter of title sentiment vs log-Facebook popularity (sample of 5,000 articles).*
The scatter plot shows:
- A dense vertical band near sentiment 0, reflecting many neutral titles.
- Viral and non-viral articles scattered across the full sentiment range, with no obvious linear trend.
Empirically, the correlation between sentiment and Facebook popularity is almost zero (|r| ≈ 0.01). This suggests that sentiment alone is a weak predictor of popularity; we still include it in models because it may interact with topic or time, but we do not expect it to explain much variance by itself.
---
### 1.5 EDA conclusions
From the exploratory analysis we conclude:
1. **Popularity variables are non-negative, highly skewed, and heavy-tailed.**
- Log-transforming shares yields more regular distributions, so regression models should target `log1p(popularity)` instead of raw counts.
2. **Topic has a strong effect on expected popularity.**
- In particular, obama-related news is more popular on Facebook; microsoft is relatively stronger on LinkedIn (from descriptive statistics, not shown here).
3. **Title/headline sentiment has little linear relationship with popularity.**
- It should not be expected to drive predictions strongly.
4. **There are many extreme outliers (viral stories), but these are the signal we care about.**
- We choose *not* to remove them; instead, we rely on robust models and logtransformed targets.
These observations motivate a modeling strategy that combines:
- **Linear models** (to quantify simple topic/sentiment effects on log-popularity).
- **Non-linear tree-based models** (to capture complex relationships and heavy-tailed behaviour).
- **Classification** of viral vs non-viral stories.
- **Clustering** of time-series trajectories to identify typical growth patterns.
The next section formalizes these ideas.
---
## 2. Model Development, Validation and Optimization
We develop **five** models: three regression models (including a dimension-reduced variant), one classification model, and one clustering model. This covers regression, classification and unsupervised learning objectives, and explicitly examines the impact of dimensionality reduction.
All supervised models use:
- Train/test split: **80% training, 20% test**, `random_state=42`.
- Evaluation on the heldout test set only (no peeking).
- Metrics:
- Regression: R² and RMSE on the log scale (using `root_mean_squared_error`).
- Classification: accuracy, F1 for the positive class, ROC AUC and confusion matrix.
### 2.1 Common preprocessing
For each model:
1. Replace `-1` in `Facebook`, `GooglePlus`, `LinkedIn` with `NaN`.
2. Drop rows with missing values in the specific target variable.
3. Use `DaysSinceEpoch` as a numeric representation of `PublishDate`.
4. Where appropriate, use `log_Facebook = log1p(Facebook)` as the regression target.
5. Encode `Topic` using one-hot encoding with economy as the reference level (`drop_first=True`).
For time-series models we also use `Data/Facebook_Economy.csv`, which stores Facebook popularity snapshots TS1–TS144 every 20 minutes for economy articles. We join it with `News_Final.csv` on `IDLink` and restrict to:
- `Topic == "economy"`
- Time slices **TS1–TS50** as predictors (roughly the first 16–17 hours)
- Final log-Facebook popularity as the target
Negative TS values are interpreted as “no observed popularity yet” and are set to 0.
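For reference, the cleaning and feature steps above can be condensed into a single helper. This is a hedged sketch assuming the column names from Section 1.1, not a verbatim excerpt of the script:
```python
import numpy as np
import pandas as pd

def preprocess_news(news: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning/feature steps described above to News_Final.csv."""
    df = news.copy()
    # -1 encodes "no final popularity observed" -> treat as missing
    for col in ["Facebook", "GooglePlus", "LinkedIn"]:
        df.loc[df[col] < 0, col] = np.nan
    # numeric time feature and log-scale regression target
    df["PublishDate"] = pd.to_datetime(df["PublishDate"])
    df["DaysSinceEpoch"] = (df["PublishDate"] - pd.Timestamp("1970-01-01")).dt.days
    df["log_Facebook"] = np.log1p(df["Facebook"])
    # one-hot encode Topic, with economy as the implicit baseline level
    return pd.get_dummies(df, columns=["Topic"], drop_first=True)
```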
---
### 2.2 Regression Model 1: Linear regression on static features
**Goal.** Predict log-Facebook popularity using only static metadata (no early popularity feedback).
- **Target:** `y = log_Facebook` for all topics.
- **Features:**
- `SentimentTitle`, `SentimentHeadline`
- `DaysSinceEpoch` (publication time)
- Topic one-hot dummies: `Topic_microsoft`, `Topic_obama`, `Topic_palestine` (economy is the implicit baseline).
We fit an ordinary least squares linear regression on the training split and evaluate on the test set.
**Results (test set):**
- **R² ≈ 0.157**
- **RMSE ≈ 1.86** in log space
Actual vs predicted log-Facebook values:
![model 1: actual vs predicted](imgs/model1_actual_vs_predicted.png)
*Figure 5: Model 1 predictions vs actual log-Facebook values.*
The predictions are compressed into a narrow band, under-predicting viral articles and over-predicting low-popularity ones. Key coefficients:
- `Topic_obama` ≈ +1.78 (large positive shift vs economy)
- `Topic_microsoft` ≈ +0.10
- `Topic_palestine` ≈ +0.02
- `SentimentTitle` ≈ -0.38, `SentimentHeadline` ≈ -0.06
- `DaysSinceEpoch` ≈ -0.0007 (tiny downward trend over time)
Interpretation:
- Topic has a clear effect (especially obama).
- Sentiment effects are small and slightly negative.
- The model explains only ~16% of the variance in log-popularity, confirming that static features alone are weak predictors.
---
### 2.3 Regression Model 2: Random forest on early time slices
**Goal.** Predict final log-Facebook popularity for **economy** stories using early Facebook popularity time slices and sentiment.
- **Target:** `log_Facebook` for the economy topic, joined with the `Facebook_Economy` time series.
- **Features:**
- TS1–TS50 (early cumulative popularity counts, cleaned: negative → 0)
- `SentimentTitle`, `SentimentHeadline`
We fit a `RandomForestRegressor` with:
- 120 estimators,
- `min_samples_leaf=2`,
- `max_depth=None` (trees grow fully),
- `n_jobs=-1`, `random_state=42`.
**Results (test set):**
- **R² ≈ 0.746**
- **RMSE ≈ 0.86** (log scale)
Feature importances indicate:
- `TS50` alone contributes ~81% of total importance.
- Combined sentiment variables contribute ~17%.
- Earlier TS features each have very small marginal importance.
Thus, knowing an article's popularity after ~17 hours (TS50) is already highly predictive of its final 2-day popularity. Early engagement is a much stronger signal than sentiment or publish time.
---
### 2.4 Regression Model 3: PCA + random forest (dimension reduction)
Model 3 examines the effect of **dimension reduction** on performance.
Instead of using all 50 TS features directly, we:
1. Standardize TS1–TS50 with `StandardScaler`.
2. Apply PCA with `n_components=10`.
3. Concatenate the 10 PCA components with the two sentiment features (`SentimentTitle`, `SentimentHeadline`).
4. Train the same `RandomForestRegressor` as Model 2 on this 12-dimensional feature space.
PCA results:
- The first component explains ≈ 93.9% of the variance.
- The first 10 components together explain ≈ 99.9% of the variance.
**Results (test set):**
- **R² ≈ 0.745**
- **RMSE ≈ 0.87**
Compared to Model 2:
- R² and RMSE are essentially unchanged (both metrics differ by well under 0.01).
So PCA reduces dimensionality from 50 TS features to 10 components with **negligible loss of predictive performance**. The first PCA components effectively summarize overall popularity level and growth pattern, which are the dominant signals for final popularity.
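The cut-off of 10 components was fixed in advance; as an illustration, the same choice could be read off the cumulative explained-variance curve. A small sketch, assuming `X_ts` is a matrix holding the TS1–TS50 columns:
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def n_components_for(X_ts, var_target: float = 0.999) -> int:
    """Smallest number of principal components whose cumulative explained
    variance reaches var_target."""
    X_scaled = StandardScaler().fit_transform(X_ts)
    pca = PCA().fit(X_scaled)  # keep all components for the diagnostic
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.searchsorted(cumulative, var_target)) + 1
    return min(k, len(cumulative))

# e.g. n_components_for(fb_econ_merged[ts_cols_early].values)
```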
---
### 2.5 Classification Model 4: Logistic regression for viral vs non-viral
**Goal.** Classify whether an article is *viral* on Facebook, defined as being in the top 10% of final popularity.
- **Target:**
- `viral_fb = 1` if `Facebook ≥ 214` (90th percentile), otherwise 0.
- Class distribution: ~10% positive, ~90% negative.
- **Features:**
- `SentimentTitle`, `SentimentHeadline`
- `DaysSinceEpoch`
- Topic dummies as before
We intentionally **do not use time-slice features** here to simulate making a decision at or before publication, when no engagement data is available yet.
We fit a `LogisticRegression` with `max_iter=500` and `class_weight="balanced"` to counter class imbalance.
**Results (test set):**
- **Accuracy ≈ 0.73**
- A naive classifier that always predicts "non-viral" would obtain ≈ 0.90 accuracy, highlighting that raw accuracy is misleading under imbalance.
- **F1 (viral class) ≈ 0.36**
- **ROC AUC ≈ 0.75**
The ROC AUC of 0.75 indicates decent **ranking ability**: the model tends to assign higher probabilities to truly viral articles than to non-viral ones. However, at the default 0.5 threshold it generates many false positives; tuning the probability threshold would be necessary in practice, depending on the business trade-off between missing viral content and wasting attention on non-viral items.
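One hedged way to do that tuning is to sweep candidate cut-offs on a separate validation split (reusing the test set would be optimistic) and keep the threshold that maximizes F1 for the viral class. A sketch, assuming `y_val` and `y_proba_val` hold the labels and predicted probabilities on such a split:
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_val, y_proba_val) -> float:
    """Probability cut-off that maximizes F1 for the positive (viral) class."""
    precision, recall, thresholds = precision_recall_curve(y_val, y_proba_val)
    # precision/recall have one more entry than thresholds; drop the final point
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    return float(thresholds[int(np.argmax(f1))])

# usage: thr = best_f1_threshold(y_val, y_proba_val)
#        y_hat = (y_proba >= thr).astype(int)
```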
---
### 2.6 Clustering Model 5: K-means on time-series shapes
To understand typical growth trajectories of popularity, we cluster early time-series patterns.
- **Features:** TS1–TS50, standardized with `StandardScaler`.
- **Sample:** random subset of 5,000 economy+Facebook articles to keep computation manageable.
- **Algorithm:** `KMeans(n_clusters=3, n_init=10, random_state=42)`.
**Results:**
- **Silhouette score ≈ 0.97**, indicating well-separated clusters (although partly due to one large cluster vs a few small ones).
- Cluster sizes and mean final Facebook shares:
| cluster | count | mean shares | median | max |
| ------: | ----: | ----------: | -----: | ----: |
| 0 | 4,978 | ~37 | 3 | 7,045 |
| 1 | 1 | 1,886 | 1,886 | 1,886 |
| 2 | 21 | ~2,478 | 1,291 | 8,010 |
Inspecting the centroid time series (TS1, TS10, TS25, TS50):
- **Cluster 0:** low TS1 (~0.3), slow growth, TS50 ≈ 17 → "normal/low popularity" baseline; almost all articles.
- **Cluster 2:** TS1 ≈ 23, TS10 ≈ 211, TS50 ≈ 1,388 → early rapid take-off and sustained growth; these are clearly **viral** trajectories.
- **Cluster 1:** single extreme **super-viral** outlier with TS1 ≈ TS50 ≈ 1,886.
Clustering therefore uncovers distinct popularity regimes: ordinary stories, viral stories, and rare super-viral events.
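Because the silhouette score is inflated by the single dominant cluster, it is worth checking how the score and the cluster sizes behave for other values of k before settling on k = 3. A small diagnostic sketch, assuming `X_sample` is the standardized 5,000-row subset described above:
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_diagnostics(X_sample, k_values=(2, 3, 4, 5, 6)) -> None:
    """Print silhouette score and cluster sizes for several choices of k."""
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_sample)
        sizes = np.bincount(km.labels_, minlength=k).tolist()
        sil = silhouette_score(X_sample, km.labels_)
        print(f"k={k}: silhouette={sil:.3f}, cluster sizes={sizes}")
```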
---
## 3. Decisions and Practical Use
### 3.1 What do the models tell us?
**1. Static metadata is not enough for precise prediction.**
Model 1, using only topic, time and sentiment, explains only about 16% of the variance in log-Facebook popularity. The EDA already indicated weak correlations between sentiment and engagement, and the model confirms that topic is the only strong static predictor. This means:
- Before any user feedback is observed, we can form only a rough guess about popularity (e.g., “obama stories tend to do better”), but detailed predictions are unreliable.
**2. Early engagement is the key signal.**
Models 2 and 3 show that once ~16 hours of Facebook feedback are available:
- Random forests can explain ~75% of the variance in final log-popularity.
- PCA compresses the 50-dimensional TS inputs to 10 components with essentially no loss in performance.
In practice, this means that **monitoring the early time series of shares is crucial**. Stories that are already accumulating shares quickly by TS50 are extremely likely to end up as the most popular items after two days.
**3. Logistic regression is useful for ranking, not for definitive labels.**
The viral vs non-viral classifier has:
- Good ranking ability (ROC AUC ~0.75).
- Moderate F1 score and relatively low accuracy compared to the majority baseline.
This makes it better suited as a **priority score** than as a hard decision rule. For example, an editorial team might sort draft stories by predicted viral probability to decide where to invest additional editorial resources, but should not automatically discard stories predicted to be nonviral.
**4. Clustering uncovers growth archetypes.**
K-means reveals three typical growth shapes:
1. Slow/low growth (most items).
2. Clearly viral trajectories.
3. A tiny number of super-viral events.
Recognizing that an article's early TS pattern matches the viral or super-viral cluster can trigger decisions such as:
- Featuring the article more prominently on the homepage.
- Allocating budget for promoted posts.
- Producing followup content while interest is high.
### 3.2 How useful are these models for real decisions?
A practical decision workflow informed by this analysis could be:
1. **Pre-publication / immediately at publication**
Use the logistic regression model and static features (topic, sentiment, time) to assign each new article a baseline probability of becoming viral. This can help prioritize which stories to monitor more closely, but should not be the sole basis for publication decisions.
2. **Early post-publication (first few hours)**
Once some time-slice information is available (TS1–TS10), use clustering to see whether the article's early trajectory resembles known viral patterns. Articles already in the viral cluster are good candidates for early promotion.
3. **Mid-window (around TS50)**
At ~16–17 hours, feed TS1–TS50 into the PCA + random forest regressor (Model 3) to estimate final reach. This estimate can guide decisions about:
- How long to keep the story on front pages.
- Whether to schedule followups or derivative content.
- Where to allocate marketing/promotional resources.
4. **Limitations**
- Popularity is still highly stochastic; even with R² ≈ 0.75 in the best case, there is considerable residual uncertainty.
- Models trained on this dataset focus on four specific topics and a particular time period (2015–2016). Performance may degrade when applied to different domains, languages or time spans. ([arXiv][1])
Overall, these models are best used for **relative ranking and triage**, helping decide which articles deserve extra attention, rather than for exact point predictions of future share counts. Combining static features, early engagement signals, and growth-pattern clustering yields a practical decision-support tool for newsrooms and social media teams working with limited resources.
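As a purely illustrative sketch of this triage idea (not part of the submitted script), the pre-publication viral probability could be blended with the TS50-based reach estimate into a single ranking score; the weights and normalization below are assumptions, not fitted values:
```python
import numpy as np

MAX_LOG_SHARES = np.log1p(49_211)  # largest Facebook count observed in the data

def triage_score(p_viral: float, predicted_log_shares: float,
                 w_prob: float = 0.5, w_reach: float = 0.5) -> float:
    """Blend Model 4's viral probability with Model 3's log-share estimate
    into one score in [0, 1]; higher means the story deserves more attention."""
    reach = min(predicted_log_shares / MAX_LOG_SHARES, 1.0)
    return w_prob * p_viral + w_reach * reach

# e.g. triage_score(0.8, np.log1p(2_500)) is roughly 0.76
```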
If you actually read this far...nice! :D
[1]: https://arxiv.org/abs/1801.07055 "Multi-Source Social Feedback of Online News Feeds"
Binary file not shown.
-363
View File
@@ -1,363 +0,0 @@
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
r2_score,
root_mean_squared_error,
accuracy_score,
f1_score,
roc_auc_score,
confusion_matrix,
silhouette_score,
)
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
# ensure imgs dir exists
os.makedirs("imgs", exist_ok=True)
# data loading
zip_path = "news+popularity+in+multiple+social+media+platforms.zip"
with zipfile.ZipFile(zip_path, "r") as zf:
with zf.open("Data/News_Final.csv") as f:
news = pd.read_csv(f)
# basic cleaning
pop_cols = ["Facebook", "GooglePlus", "LinkedIn"]
# encode -1 as missing
for col in pop_cols:
news.loc[news[col] < 0, col] = np.nan
# convert publishdate and add numeric time feature
news["PublishDate"] = pd.to_datetime(news["PublishDate"])
news["DaysSinceEpoch"] = (
news["PublishDate"] - pd.Timestamp("1970-01-01")
).dt.days
# log transform facebook popularity where available
news["log_Facebook"] = np.log1p(news["Facebook"])
# eda helpers (optional plotting)
def plot_eda():
plt.figure()
vals = news["Facebook"].dropna()
vals = vals[vals > 0]
vals.plot.hist(bins=50)
plt.xlabel("facebook shares")
plt.ylabel("count")
plt.title("distribution of facebook popularity")
plt.xscale("log")
plt.tight_layout()
plt.savefig("imgs/eda_facebook_hist.png")
plt.close()
plt.figure()
news["log_Facebook"].dropna().plot.hist(bins=50)
plt.xlabel("log1p(facebook shares)")
plt.ylabel("count")
plt.title("distribution of log-transformed facebook popularity")
plt.tight_layout()
plt.savefig("imgs/eda_log_facebook_hist.png")
plt.close()
mean_by_topic = (
news.groupby("Topic")["log_Facebook"].mean().sort_values()
)
plt.figure()
mean_by_topic.plot(kind="bar")
plt.ylabel("mean log1p(facebook shares)")
plt.title("average facebook popularity by topic")
plt.tight_layout()
plt.savefig("imgs/eda_mean_by_topic.png")
plt.close()
sample = news.dropna(
subset=["log_Facebook", "SentimentTitle"]
).sample(5000, random_state=42)
plt.figure()
plt.scatter(
sample["SentimentTitle"],
sample["log_Facebook"],
alpha=0.3,
)
plt.xlabel("sentimenttitle")
plt.ylabel("log1p(facebook shares)")
plt.title("title sentiment vs facebook popularity (sample)")
plt.tight_layout()
plt.savefig("imgs/eda_sentiment_vs_popularity.png")
plt.close()
# model 1: linear regression
def run_model_1():
df = news.dropna(subset=["log_Facebook"]).copy()
X = df[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]]
X = pd.get_dummies(X, columns=["Topic"], drop_first=True)
y = df["log_Facebook"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
print("model 1 linear regression")
print("r2:", r2)
print("rmse:", rmse)
print("coefficients:")
print(pd.Series(linreg.coef_, index=X.columns))
# optional diagnostic plot
plt.figure()
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel("actual log1p(facebook)")
plt.ylabel("predicted log1p(facebook)")
plt.title("model 1: actual vs predicted")
plt.tight_layout()
plt.savefig("imgs/model1_actual_vs_predicted.png")
plt.close()
return linreg, (X_test, y_test, y_pred)
# prepare economy + facebook time-slice data
with zipfile.ZipFile(zip_path, "r") as zf:
with zf.open("Data/Facebook_Economy.csv") as f:
fb_econ = pd.read_csv(f)
# ensure integer id for join
news["IDLink_int"] = news["IDLink"].astype(int)
news_econ = news[news["Topic"] == "economy"].copy()
news_econ["IDLink_int"] = news_econ["IDLink"].astype(int)
fb_econ_merged = fb_econ.merge(
news_econ, left_on="IDLink", right_on="IDLink_int", how="inner"
)
# clean time-slice features
ts_cols = [c for c in fb_econ.columns if c.startswith("TS")]
for col in ts_cols:
fb_econ_merged.loc[fb_econ_merged[col] < 0, col] = 0
# drop rows with missing facebook target
fb_econ_merged = fb_econ_merged[fb_econ_merged["Facebook"].notna()].copy()
fb_econ_merged["log_Facebook"] = np.log1p(fb_econ_merged["Facebook"])
ts_cols_early = ts_cols[:50]
# model 2: random forest on raw early ts
def run_model_2():
X = fb_econ_merged[ts_cols_early + ["SentimentTitle", "SentimentHeadline"]]
y = fb_econ_merged["log_Facebook"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
rf = RandomForestRegressor(
n_estimators=120,
random_state=42,
n_jobs=-1,
max_depth=None,
min_samples_leaf=2,
)
rf.fit(X_train, y_train)
    # evaluate the raw-TS forest directly; the PCA variant is covered separately by model 3
    y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
print("model 2 random forest on raw ts")
print("r2:", r2)
print("rmse:", rmse)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print("top importances:")
print(importances.sort_values(ascending=False).head(10))
return rf, (X_test, y_test, y_pred)
# model 3: pca + random forest
def run_model_3():
ts = fb_econ_merged[ts_cols_early]
sent = fb_econ_merged[["SentimentTitle", "SentimentHeadline"]]
X = pd.concat([ts, sent], axis=1)
y = fb_econ_merged["log_Facebook"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[ts_cols_early])
X_test_scaled = scaler.transform(X_test[ts_cols_early])
pca = PCA(n_components=10, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
train_sent = X_train[["SentimentTitle", "SentimentHeadline"]].values
test_sent = X_test[["SentimentTitle", "SentimentHeadline"]].values
X_train_final = np.hstack([X_train_pca, train_sent])
X_test_final = np.hstack([X_test_pca, test_sent])
rf = RandomForestRegressor(
n_estimators=120,
random_state=42,
n_jobs=-1,
max_depth=None,
min_samples_leaf=2,
)
rf.fit(X_train_final, y_train)
y_pred = rf.predict(X_test_final)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
print("model 3 random forest on pca(ts)")
print("r2:", r2)
print("rmse:", rmse)
print("pca variance explained (first 10):", pca.explained_variance_ratio_)
print("total variance explained:", pca.explained_variance_ratio_.sum())
return rf, (X_test, y_test, y_pred), (pca, scaler)
# model 4: logistic regression (viral vs non-viral)
def run_model_4():
df = news.copy()
df = df[df["Facebook"].notna()].copy()
threshold = df["Facebook"].quantile(0.9)
df["viral_fb"] = (df["Facebook"] >= threshold).astype(int)
X = df[["SentimentTitle", "SentimentHeadline", "DaysSinceEpoch", "Topic"]]
X = pd.get_dummies(X, columns=["Topic"], drop_first=True)
y = df["viral_fb"]
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
stratify=y,
)
clf = LogisticRegression(
max_iter=500,
class_weight="balanced",
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)
cm = confusion_matrix(y_test, y_pred)
print("model 4 logistic regression (viral vs non-viral)")
print("threshold (shares):", threshold)
print("accuracy:", acc)
print("f1 (positive class):", f1)
print("roc auc:", auc)
print("confusion matrix:\n", cm)
return clf, (X_test, y_test, y_pred, y_proba)
# model 5: k-means clustering on ts shapes
def run_model_5():
X = fb_econ_merged[ts_cols_early].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
rng = np.random.RandomState(42)
idx = rng.choice(X_scaled.shape[0], size=5000, replace=False)
X_sample = X_scaled[idx]
fb_sample = fb_econ_merged["Facebook"].values[idx]
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_sample)
labels = kmeans.labels_
sil = silhouette_score(X_sample, labels)
print("model 5 kmeans on ts shapes")
print("silhouette score:", sil)
cluster_df = pd.DataFrame(
{"cluster": labels, "Facebook": fb_sample}
)
print(cluster_df.groupby("cluster")["Facebook"].agg(
["count", "mean", "median", "max"]
))
centers_scaled = kmeans.cluster_centers_
centers = scaler.inverse_transform(centers_scaled)
centers_df = pd.DataFrame(centers, columns=ts_cols_early)
summary = pd.DataFrame({
"cluster": list(range(centers_df.shape[0])),
"avg_ts": centers_df.mean(axis=1),
"ts1": centers_df["TS1"],
"ts10": centers_df["TS10"],
"ts25": centers_df["TS25"],
"ts50": centers_df["TS50"],
})
print("cluster centroid summary:\n", summary)
return kmeans, scaler, summary
if __name__ == "__main__":
run_model_1()
run_model_2()
run_model_3()
run_model_4()
run_model_5()
plot_eda()
[5 binary image files deleted in this commit; previews not shown (sizes 19–130 KiB)]
-53
View File
@@ -1,53 +0,0 @@
model 1 linear regression
r2: 0.1566089012155698
rmse: 1.8625218879551908
coefficients:
SentimentTitle -0.383499
SentimentHeadline -0.064708
DaysSinceEpoch -0.000678
Topic_microsoft 0.101848
Topic_obama 1.779152
Topic_palestine 0.023738
dtype: float64
model 2 random forest on raw ts
r2: 0.7441325592979975
rmse: 0.8661035218490399
top importances:
TS50 0.810814
SentimentHeadline 0.099992
SentimentTitle 0.067386
TS49 0.001883
TS48 0.000589
TS15 0.000503
TS18 0.000503
TS13 0.000498
TS24 0.000498
TS10 0.000480
dtype: float64
model 3 random forest on pca(ts)
r2: 0.7442278904925559
rmse: 0.8659421602173341
pca variance explained (first 10): [9.38529911e-01 3.24317512e-02 1.76049987e-02 7.50439628e-03
1.90148973e-03 6.83679307e-04 3.57135169e-04 2.12058930e-04
1.33577763e-04 9.66846072e-05]
total variance explained: 0.9994556829781833
model 4 logistic regression (viral vs non-viral)
threshold (shares): 214.0
accuracy: 0.7287481626653601
f1 (positive class): 0.35709101466105386
roc auc: 0.7530964866530827
confusion matrix:
[[10669 4023]
[ 406 1230]]
model 5 kmeans on ts shapes
silhouette score: 0.9732852082508215
count mean median max
cluster
0 4978 36.751708 3.0 7045.0
1 1 1886.000000 1886.0 1886.0
2 21 2477.761905 1291.0 8010.0
cluster centroid summary:
cluster avg_ts ts1 ts10 ts25 ts50
0 0 8.317766 0.297710 2.959221 7.836079 17.221977
1 1 1885.920000 1885.000000 1886.000000 1886.000000 1886.000000
2 2 640.917143 22.761905 211.142857 579.047619 1387.619048
-25
View File
@@ -1,25 +0,0 @@
Conduct the following analysis for the dataset:
1. Exploratory Data Analysis
Explore the statistical aspects of the dataset. Analyze the
distributions and provide summaries of the relevant statistics. Perform any cleaning,
transformations, interpolations, smoothing, outlier detection/ removal, etc. required on the
data. Include figures and descriptions of this exploration and a short description of what
you concluded (e.g. nature of distribution, indication of suitable model approaches you
would try, etc.) Min. 1 page text + graphics (required).
2. Model Development, Validation and Optimization
Develop and evaluate three (4000-level) or four (6000-level) or more models. If possible,
these models should cover more than one objective, i.e. regression, classification,
clustering. Consider the effect of dimension reduction of the dataset on model
performance. Different models mean different combinations of an algorithm and a
formula (input and output features). The choice of independent and response variables is
up to you. Explain why you chose them. Construct the models, test/validate them. Briefly explain the
validation approach. You can use any method(s) covered in the course. Include your code
in your submission. Compare model results if applicable. Report the results of the model
(fits, coefficients, sample trees, other measures of fit/importance, etc., predictors and
summary statistics). Min. 2 pages of text + graphics (required).
3. Decisions
Describe your conclusions from the model
fits, predictions and how well (or not) it could be used for decisions and why. Min. 1/2 page
of text + graphics.
Binary file not shown.