I used Manhattan (BOROUGH code = 1) for Question 1 and Brooklyn (BOROUGH code = 3) for Question 2.

NYC Dept. of Finance / BBL conventions:

- 1 = Manhattan
- 2 = Bronx
- 3 = Brooklyn
- 4 = Queens
- 5 = Staten Island

The dataset itself is NYC's annualized file of residential property sales across all five boroughs.
## Loading and Cleaning the Data

- `manhattan_clean` ended up with **6,313** sales.
- `brooklyn_clean` ended up with **40,921** sales.
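The cleaning logic described throughout this report (filter one borough, drop sales at or below $10,000 as non-arm's-length, require non-missing and positive area fields) can be sketched as below. Column names follow the NYC Rolling Sales file; `nyc_sales.csv` is a placeholder path, and the function name is mine, not from the original code.

```python
import pandas as pd

def clean_borough(df: pd.DataFrame, borough_code: int, min_price: float = 10_000) -> pd.DataFrame:
    """Keep one borough, drop likely non-arm's-length sales and bad area fields."""
    out = df[df["BOROUGH"] == borough_code].copy()
    out = out[out["SALE PRICE"] > min_price]  # drop $0 / nominal transfers
    out = out.dropna(subset=["GROSS SQUARE FEET", "LAND SQUARE FEET", "YEAR BUILT"])
    out = out[(out["GROSS SQUARE FEET"] > 0) & (out["LAND SQUARE FEET"] > 0)]
    return out

# df = pd.read_csv("nyc_sales.csv")          # placeholder path
# manhattan_clean = clean_borough(df, 1)
# brooklyn_clean = clean_borough(df, 3)
```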
---

# 1. One-borough analysis (Manhattan, BOROUGH = 1)

---
## 1(a) Patterns, trends, and modeling plan

For Manhattan, I'm interested first in how **SALE PRICE** is distributed and how it relates to building characteristics like **GROSS SQUARE FEET**, **LAND SQUARE FEET**, **YEAR BUILT**, and **unit counts**. Because this is a citywide residential sales file, I expect the price distribution to be extremely right-skewed, with a small number of ultra-expensive transactions and many more moderately priced ones.

I'd start with **univariate** distributions of price, square footage, and year built, and then move to **bivariate** relationships (scatter plots of price vs. size, boxplots of price by neighborhood) and **correlation matrices**. For modeling, I'd use **log-transformed sale price** as the response to stabilize variance and compare a baseline **linear regression** to a non-linear **Random Forest** regressor. Finally, I'd treat **NEIGHBORHOOD** as a label and build supervised classifiers (k-NN, Random Forest, logistic regression) using quantitative features (price, areas, units, year) to see how well location can be inferred from physical characteristics and price alone.

---
## 1(b) Exploratory Data Analysis and Outliers

### Findings

1. **Distribution of Sale Price.**
For Manhattan, after cleaning, `SALE PRICE` is extremely right-skewed:

- Count is approx **6,313**.
- Median is approx **$3.86M**.
- Mean is approx **$15.7M**.
- 75th percentile is approx **$9.25M**.
- 90th percentile is approx **$21.8M**.
- 99th percentile is approx **$228.9M**.
- Maximum is approx **$2.40B**.

That huge gap between the median and the maximum confirms a heavy upper tail.
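The quantile summary above is straightforward to reproduce; a sketch assuming a `manhattan_clean` frame as built earlier (the helper name is mine):

```python
import pandas as pd

def price_summary(prices: pd.Series) -> pd.Series:
    """Quantile summary of sale prices: count, median, mean, upper percentiles, max."""
    return pd.Series({
        "count": prices.count(),
        "median": prices.median(),
        "mean": prices.mean(),
        "p75": prices.quantile(0.75),
        "p90": prices.quantile(0.90),
        "p99": prices.quantile(0.99),
        "max": prices.max(),
    })

# price_summary(manhattan_clean["SALE PRICE"])
```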
2. **Outliers.**
Using the 1.5 IQR rule, the upper outlier threshold is around **$21.1M**; there are about **650** outlier sales above this bound. The most extreme sale (approx. $2.4B) is orders of magnitude larger than a typical sale and shows up as a solitary point far above the rest in the boxplot. On the lower side, I also saw many sales at or near zero in the raw data, which is why I filtered out sales ≤ $10,000 as likely non-arm's-length transfers or data errors.
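The 1.5 IQR (Tukey fence) computation used above can be sketched as follows; the $21.1M threshold and ~650 count quoted in the text come from the actual data, this just shows the rule:

```python
import pandas as pd

def iqr_upper_outliers(prices: pd.Series, k: float = 1.5) -> tuple:
    """Return the Tukey upper fence Q3 + k*IQR and the number of sales above it."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    fence = q3 + k * (q3 - q1)
    return fence, int((prices > fence).sum())

# fence, n_outliers = iqr_upper_outliers(manhattan_clean["SALE PRICE"])
```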
3. **Effect of log transformation.**
When I plotted a histogram of `log(1 + SALE PRICE)`, the distribution became much closer to symmetric: the bulk of log-prices fell approximately between ~14 and ~17 (roughly $1.2M to $24M), with a long but much more manageable upper tail. This supported using the log scale for regression.
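A quick numeric check of the log-scale reading above (`np.log1p` computes log(1 + x); `np.expm1` inverts it):

```python
import numpy as np

log_price = np.log1p(4_000_000.0)  # a $4M sale on the log(1 + price) scale
assert 14 < log_price < 17         # lands in the bulk of the Manhattan distribution
# expm1 inverts log1p, recovering the original dollar price
assert abs(np.expm1(log_price) - 4_000_000.0) < 1e-3
```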
4. **Relationships with size and other features.**
The correlation of `SALE PRICE` with **GROSS SQUARE FEET** was about **0.49**, substantially higher than for any other feature. Correlations with **COMMERCIAL UNITS**, **LAND SQUARE FEET**, and **TOTAL UNITS** were modestly positive (~0.16–0.21), and the correlation with **YEAR BUILT** was essentially zero. This suggests that building size is the main driver captured in these quantitative covariates, while age and unit counts are relatively weak predictors by themselves.
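The correlation screen described above amounts to one `corr()` call; a sketch assuming `manhattan_clean` with the listed numeric columns (the helper name is mine):

```python
import pandas as pd

def price_correlations(df: pd.DataFrame, features: list) -> pd.Series:
    """Pearson correlation of each feature with SALE PRICE, sorted descending."""
    corr = df[features + ["SALE PRICE"]].corr()["SALE PRICE"]
    return corr.drop("SALE PRICE").sort_values(ascending=False)

# price_correlations(manhattan_clean,
#                    ["GROSS SQUARE FEET", "LAND SQUARE FEET", "TOTAL UNITS",
#                     "COMMERCIAL UNITS", "YEAR BUILT"])
```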
5. **Scatter plots and heteroskedasticity.**
The scatter plot of price vs. gross square feet (with a log scale on price) showed a clear upward trend but with wide vertical spread, especially for larger buildings. High-end buildings with similar square footage can sell at very different prices, which is consistent with neighborhood effects, building quality, and other unobserved characteristics. Overall, the EDA shows strong skewness, many high-end outliers, and a moderate but noisy link between size and price.

---
## 1(c) Regression analysis to predict sale price

### Findings

1. **Cleaning for regression.**
In addition to the general cleaning above, I required that observations have non-missing **GROSS SQUARE FEET**, **LAND SQUARE FEET**, **YEAR BUILT**, and unit counts, and that the areas be strictly positive. I also removed sales at or below $10,000 as non-market outliers. I did **not** remove very high prices; instead, I relied on the log transformation and the tree-based model to reduce their influence.
2. **Linear regression (baseline).**
The standardized linear model on log-price achieved only **R² approx 0.13** on the test set, with RMSE approx **1.71** and MAE approx **1.24** in log units. A purely linear relationship between size, units, year built, and log price is therefore a poor approximation, which is unsurprising given the complexity of Manhattan's housing market and the missing effects of location and building quality.
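The baseline can be sketched as a standardize-then-regress pipeline on log-price. The 80/20 split and `random_state` are assumptions for illustration, not necessarily the report's exact settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_baseline(X, y_log):
    """Standardized linear regression on log(1 + SALE PRICE); returns model and test metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y_log, test_size=0.2, random_state=0)
    model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return model, {
        "r2": r2_score(y_te, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "mae": mean_absolute_error(y_te, pred),
    }
```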
3. **Random Forest regression (non-linear).**
The Random Forest model performed much better, with **R² approx 0.75**, RMSE approx **0.92**, and MAE approx **0.59** on log-price. Interpreting roughly, an MAE of 0.59 in log units corresponds to a typical multiplicative error of about a factor of 1.8 (because exp(0.59) approx 1.8), i.e. roughly +80% on the upside and −45% on the downside, which is not great for individual deals but decent for a coarse citywide model based only on a few structural features.
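The Random Forest variant under the same split convention is a drop-in replacement; `n_estimators=200` is an assumption here, not necessarily the report's setting (trees need no feature scaling):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_rf(X, y_log, seed=0):
    """Random Forest regression on log-price; returns the model and test R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y_log, test_size=0.2, random_state=seed)
    rf = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    return rf, r2_score(y_te, rf.predict(X_te))
```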
4. **Interpretation of predictors.**
Based on the earlier correlations and typical real-estate patterns, most of the predictive power comes from **GROSS SQUARE FEET**, with **LAND SQUARE FEET** and unit counts adding secondary information. Year built contributes little signal by itself, consistent with its near-zero correlation with sale price and the fact that historic vs. modern buildings can command premiums or discounts depending on context.

5. **Model choice.**
Because the Random Forest explains substantially more variance in log price and is more robust to non-linearity and heteroskedasticity than linear regression, I treated it as the **best Manhattan regression model** and used it as the model to generalize to Brooklyn in Question 2.

---
## 1(d) Classification: predicting neighborhood from quantitative variables

### Findings

1. **Cleaning for classification.**
I restricted to Manhattan neighborhoods with at least **100** sales to avoid tiny classes. That left **5,476** observations and **23 neighborhoods** (e.g., Midtown West, multiple Upper East/West Side segments, several Harlem neighborhoods, Chelsea, Lower East Side). I also dropped any rows with missing numeric features.
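The minimum-class-size filter can be sketched as follows (the helper name and defaults are mine; the 100-sale threshold is from the text):

```python
import pandas as pd

def filter_min_class(df: pd.DataFrame, label: str = "NEIGHBORHOOD",
                     min_count: int = 100) -> pd.DataFrame:
    """Keep only rows whose label occurs at least min_count times."""
    counts = df[label].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[label].isin(keep)].copy()

# manhattan_cls = filter_min_class(manhattan_clean)
```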
2. **k-NN classifier.**
With standardized features and k = 7, k-NN achieved **accuracy approx 0.50**, macro **F1 approx 0.45**, and weighted **F1 approx 0.50**. The confusion matrix showed that it often confused nearby or similar neighborhoods (e.g., different segments of the Upper East/West Side), which is intuitive because those areas have similar price/size profiles.
3. **Random Forest classifier (best).**
The Random Forest neighborhood classifier performed best, with **accuracy approx 0.61**, macro **precision approx 0.59**, macro **recall approx 0.55**, and macro **F1 approx 0.57** (weighted F1 approx 0.61). The confusion matrix had a reasonably strong diagonal for major neighborhoods like Midtown West and Central Harlem, though there were still frequent misclassifications between adjacent, similar market segments.

4. **Logistic regression.**
Multinomial logistic regression performed poorly on this feature set, with **accuracy approx 0.27** and macro **F1 approx 0.12**. This suggests that the decision boundaries between neighborhoods in this feature space are highly non-linear, and a simple linear model in the original feature space is not expressive enough.
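The three-model comparison can be sketched in one harness. k = 7 is from the text; the other hyperparameters and the stratified 80/20 split are assumptions for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def compare_classifiers(X, y, seed=0):
    """Fit k-NN, Random Forest, and logistic regression; report accuracy and macro F1."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    models = {
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
        "rf": RandomForestClassifier(n_estimators=200, random_state=seed),
        "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {"accuracy": accuracy_score(y_te, pred),
                         "macro_f1": f1_score(y_te, pred, average="macro")}
    return results
```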
5. **Overall assessment.**
Even the best classifier (Random Forest) makes many mistakes, which is expected: we are trying to reconstruct a very fine-grained location label (neighborhood) from crude variables (square feet, units, year, plus price) while ignoring explicit spatial coordinates. The contingency tables show that neighborhoods with similar densities and price levels are systematically confused, highlighting the limits of using only structural attributes and price to infer location.

---
# 2. Second borough (Brooklyn, BOROUGH = 3)

For Question 2, I used **Brooklyn** as the second borough, cleaned with the same logic as Manhattan (BOROUGH = 3, same minimum price, same handling of missing areas and year built).

---
## 2(a) Applying the Manhattan regression model to Brooklyn

### Findings

1. **Performance metrics.**
When the Manhattan Random Forest regression model was applied directly to the cleaned Brooklyn data (no retraining), I got:

- **R² approx -0.77** on log-price.
- **RMSE approx 1.21**.
- **MAE approx 0.95**.

The negative R² means the model does **worse** than simply predicting the mean log price for every Brooklyn sale.
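Scoring a model trained on one borough against another borough's data is a plain out-of-domain evaluation; a sketch (the helper name is mine, and `r2_score` is indeed unbounded below, so values like -0.77 are legitimate):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_transfer(model, X_target, y_log_target):
    """Score a fitted model on a different borough's (features, log-price) data."""
    pred = model.predict(X_target)
    return {
        "r2": r2_score(y_log_target, pred),  # negative = worse than predicting the mean
        "rmse": float(np.sqrt(mean_squared_error(y_log_target, pred))),
        "mae": mean_absolute_error(y_log_target, pred),
    }

# metrics = evaluate_transfer(manhattan_rf, X_brooklyn, y_brooklyn_log)
```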
2. **Predicted vs. actual plot.**
The predicted vs. actual log-price scatter for Brooklyn is very diffuse and does not cluster around the 45° line. The model tends to **systematically mis-scale Brooklyn prices**: for some segments it over-predicts (especially cheaper properties), and for more expensive Brooklyn neighborhoods it under-predicts relative to actual sale prices.

3. **Residual diagnostics.**
The residual-vs-prediction plot shows large, structured patterns rather than random noise, and the mean residual is substantially negative (indicating systematic under-prediction on the log scale). This indicates poor generalization: the relationships between square footage, units, year built, and price that the model learned in Manhattan do not transfer to Brooklyn.

4. **Interpretation.**
Brooklyn has a very different mix of housing types, neighborhood price levels, and land availability than Manhattan. Without explicit location variables and more detailed building characteristics, a model calibrated on Manhattan cannot capture Brooklyn's pricing structure, so it generalizes poorly even though the algorithms themselves are fairly powerful.

---
## 2(b) Applying Manhattan neighborhood classifiers to Brooklyn

### Findings

1. **Label-space mismatch.**
The classifiers trained on Manhattan were all trained to predict **Manhattan neighborhoods** (e.g., "MIDTOWN WEST", "UPPER EAST SIDE (79–96)") as labels. When evaluated on Brooklyn, the **true labels** are Brooklyn neighborhoods ("BAY RIDGE", "WILLIAMSBURG", etc.), which **do not overlap at all** with the Manhattan label set. As a result, the Manhattan models will *never* predict the correct Brooklyn neighborhood name.
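The disjointness of the two label spaces can be checked directly before scoring anything; a sketch assuming `manhattan_clean` / `brooklyn_clean` frames as above:

```python
def label_overlap(labels_a, labels_b) -> set:
    """Intersection of two label spaces; empty means zero achievable accuracy."""
    return set(labels_a) & set(labels_b)

# overlap = label_overlap(manhattan_clean["NEIGHBORHOOD"],
#                         brooklyn_clean["NEIGHBORHOOD"])
# if not overlap: accuracy of a Manhattan-trained classifier on Brooklyn is necessarily 0
```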
2. **Metrics.**
As expected, for all three models (k-NN, Random Forest, logistic regression), I got **accuracy approx 0.0** and macro/weighted **F1 approx 0.0** when using Brooklyn neighborhood names as the true labels. In other words, the models made effectively zero correct predictions across all observations.

3. **Contingency tables.**
The resulting "confusion matrices" are degenerate: for each Brooklyn neighborhood, all counts are off-diagonal because the classifier outputs Manhattan labels that never equal the Brooklyn labels on the y-axis. This still technically produces a contingency table, but it visually demonstrates that the model is fundamentally mis-specified for this task when moved to a different borough.

4. **Interpretation.**
This exercise highlights a key point: **classification models cannot generalize across domains when the label space itself changes**. Because Brooklyn neighborhoods are a completely different set of categories, a Manhattan neighborhood classifier cannot be expected to perform well without re-training on Brooklyn labels. At best, you might interpret the predictions as a "closest Manhattan analog," but they have no predictive validity for the true Brooklyn neighborhood names.

---
## 2(c) General observations & confidence

The datasets for Manhattan and Brooklyn are both large but quite **noisy**, with many missing or inconsistent values for square footage and units, and a lot of non-arm's-length sales at very low or zero prices. Even after cleaning, high-end outliers still exert influence and reflect market segments (e.g., trophy assets) that behave differently from the bulk of the distribution. Across both boroughs, models based only on size, units, and building age capture some signal but miss many important drivers such as exact location, building quality, and amenities. My confidence in the **relative** conclusions (e.g., Random Forest beats linear regression; Manhattan-trained models generalize poorly to Brooklyn) is high, but I would not rely on these models for **precise valuation** of individual properties.

---
# 3. 6000-level: Conclusions about model types & suitability

Across this study, **non-linear tree-based models (Random Forests)** consistently outperformed simpler linear models for both regression and classification. In the Manhattan regression, linear regression on log-price captured only a small fraction of the variance, while the Random Forest achieved a much higher R² and more realistic error levels, indicating that the relationship between size/units and price is non-linear, with interactions and thresholds that linear models cannot represent. For classification, the Random Forest neighborhood model again beat k-NN and especially logistic regression, suggesting that neighborhood decision boundaries in this feature space are complex and benefit from hierarchical splits rather than a single global linear separator.

However, these stronger models are still limited by **feature quality and domain shift**. When the Manhattan regression model is applied to Brooklyn, its performance collapses (negative R²), showing that even a flexible model trained on one borough's distribution cannot simply be transplanted to another borough with different price levels and housing stock. The classification models fail even more dramatically when moved across boroughs, because the label space itself changes; this is a reminder that good predictive performance is inherently **domain-specific** and that models must be re-trained, or at least adapted, when the domain or label space shifts.

Methodologically, what "worked" was combining **sensible cleaning (dropping non-arm's-length sales, fixing square-footage fields, filtering small classes), log transformations, and non-linear models**; what did not work was assuming that limited structural variables alone could fully explain prices, or that a model trained on Manhattan would generalize to Brooklyn without explicit location features. In a production setting, I would layer on richer covariates (latitude/longitude, transit accessibility, building quality proxies, zoning, etc.) and likely move to gradient-boosted trees or other ensemble methods, but the main lessons about non-linearity, outliers, and domain dependence would remain the same.