Assignment V

I used Manhattan (BOROUGH code = 1) for Question 1 and Brooklyn (BOROUGH code = 3) for Question 2.

NYC Dept. of Finance / BBL conventions:

  • 1 = Manhattan
  • 2 = Bronx
  • 3 = Brooklyn
  • 4 = Queens
  • 5 = Staten Island
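
For convenience in code, this mapping can be kept as a small lookup table (just a restatement of the codes above, not something read from the dataset itself):

```python
# NYC Dept. of Finance / BBL borough codes as a Python lookup.
BOROUGH_NAMES = {
    1: "Manhattan",
    2: "Bronx",
    3: "Brooklyn",
    4: "Queens",
    5: "Staten Island",
}

# The two boroughs analyzed in this assignment:
print(BOROUGH_NAMES[1], "and", BOROUGH_NAMES[3])  # Manhattan and Brooklyn
```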

The dataset itself is NYC's annualized file of residential property sales across all five boroughs.

Loading and Cleaning the Data

  • manhattan_clean ended up with 6,313 sales.
  • brooklyn_clean ended up with 40,921 sales.
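
The cleaning described here can be sketched roughly as follows. The column names (`BOROUGH`, `SALE PRICE`) match the NYC annualized sales file, but the tiny DataFrame below is synthetic, and the full set of filters used in the assignment (missing areas, year built, etc.) is not reproduced:

```python
# Rough sketch of the borough filter and minimum-price filter; the
# real cleaning also handled missing square footage and YEAR BUILT.
import pandas as pd

def clean_borough(df: pd.DataFrame, borough_code: int,
                  min_price: float = 10_000) -> pd.DataFrame:
    """Keep one borough and drop likely non-arm's-length sales."""
    out = df.copy()
    # SALE PRICE is often read as a string with commas; coerce to numeric.
    out["SALE PRICE"] = pd.to_numeric(
        out["SALE PRICE"].astype(str).str.replace(",", "", regex=False),
        errors="coerce",
    )
    out = out[out["BOROUGH"] == borough_code]
    out = out[out["SALE PRICE"] > min_price]  # drop $0 / nominal transfers
    return out.reset_index(drop=True)

# Tiny synthetic example (not the real data):
raw = pd.DataFrame({
    "BOROUGH":    [1, 1, 3, 3, 3],
    "SALE PRICE": ["5,000,000", "0", "850,000", "10", "1,200,000"],
})
manhattan_clean = clean_borough(raw, 1)
brooklyn_clean = clean_borough(raw, 3)
print(len(manhattan_clean), len(brooklyn_clean))  # 1 2
```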

1. One-borough analysis (Manhattan, BOROUGH = 1)


For Manhattan, I'm interested first in how SALE PRICE is distributed and how it relates to building characteristics like GROSS SQUARE FEET, LAND SQUARE FEET, YEAR BUILT, and unit counts. Because this is a citywide residential sales file, I expect the price distribution to be extremely right-skewed, with a small number of ultra-expensive transactions and many more moderately priced ones.

I'd start with univariate distributions of price, square footage, and year built, and then move to bivariate relationships (scatter plots of price vs. size, boxplots of price by neighborhood) and correlation matrices. For modeling, I'd use log-transformed sale price as the response to stabilize variance and compare a baseline linear regression to a nonlinear Random Forest regressor. Finally, I'd treat NEIGHBORHOOD as a label and build supervised classifiers (kNN, Random Forest, logistic regression) using quantitative features (price, areas, units, year) to see how well location can be inferred from physical characteristics and price alone.


1(b) Exploratory Data Analysis and Outliers

Findings

  1. Distribution of sale price. For Manhattan, after cleaning, SALE PRICE is extremely right-skewed:

    • Count is approx 6,313.
    • Median is approx $3.86M.
    • Mean is approx $15.7M.
    • 75th percentile is approx $9.25M.
    • 90th percentile is approx $21.8M.
    • 99th percentile is approx $228.9M.
    • Maximum is approx $2.40B.

    That huge gap between the median and max confirms a heavy upper tail.

  2. Outliers. Using the 1.5 × IQR rule, the upper outlier threshold is around $21.1M; there are about 650 outlier sales above this bound. The most extreme sale (approx $2.4B) is orders of magnitude larger than a typical sale and shows up as a solitary point far above the rest in the boxplot. On the lower side, I also saw many sales at or near zero in the raw data, which is why I filtered out sales ≤ $10,000 as likely non-arm's-length transfers or data errors.

  3. Effect of log transformation. When I plotted a histogram of log(1 + SALE PRICE), the distribution became much closer to symmetric: the bulk of log-prices fell approximately between ~14 and ~17 (roughly $1.2M to $24M), with a long but much more manageable upper tail. This supported using the log scale for regression.

  4. Relationships with size & other features. The correlation of SALE PRICE with GROSS SQUARE FEET was about 0.49, substantially higher than any other feature. Correlations with COMMERCIAL UNITS, LAND SQUARE FEET, and TOTAL UNITS were modestly positive (~0.16 to 0.21), and correlation with YEAR BUILT was essentially near zero. This suggests that building size is the main driver captured in these quantitative covariates, while age and unit counts are relatively weak predictors by themselves.

  5. Scatter plots & heteroskedasticity. The scatter plot of price vs. gross square feet (with a log scale on price) showed a clear upward trend but with wide vertical spread, especially for larger buildings. High-end buildings with similar square footage can sell at very different prices, which is consistent with neighborhood effects, building quality, and other unobserved characteristics. Overall, the EDA shows strong skewness, many high-end outliers, and a moderate but noisy link between size and price.
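
The outlier fence and log transformation described in findings 2 and 3 can be illustrated on synthetic log-normal prices (the thresholds here are not the real Manhattan numbers):

```python
# Minimal sketch of the 1.5 × IQR upper-fence rule and the log(1 + x)
# transformation, on synthetic right-skewed prices (not the real data).
import numpy as np

rng = np.random.default_rng(0)
prices = np.exp(rng.normal(loc=15.0, scale=1.0, size=5_000))  # log-normal

q1, q3 = np.percentile(prices, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_outliers = int((prices > upper_fence).sum())
print(f"upper fence ≈ ${upper_fence / 1e6:.1f}M, {n_outliers} upper outliers")

# log(1 + price) pulls in the heavy right tail:
log_prices = np.log1p(prices)
skew_before = float(((prices - prices.mean()) ** 3).mean() / prices.std() ** 3)
skew_after = float(((log_prices - log_prices.mean()) ** 3).mean() / log_prices.std() ** 3)
print(f"skewness before: {skew_before:.1f}, after log: {skew_after:.1f}")
```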


1(c) Regression analysis to predict sale price

Findings

  1. Cleaning for regression. In addition to the general cleaning above, I required that observations have non-missing GROSS SQUARE FEET, LAND SQUARE FEET, YEAR BUILT, and unit counts, and that the areas be strictly positive. I also removed sales at or below $10,000 as non-market outliers. I did not remove very high prices; instead I relied on the log transformation and the tree-based model to reduce their influence.

  2. Linear regression (baseline). The standardized linear model on log-price achieved only R² approx 0.13 on the test set, with RMSE approx 1.71 and MAE approx 1.24 in log units. This means a purely linear relationship between size, units, year built, and log price is a poor approximation, which is unsurprising given the complexity of Manhattan's housing market and the missing effects of location and building quality.

  3. Random Forest regression (nonlinear). The Random Forest model performed much better, with R² approx 0.75, RMSE approx 0.92, and MAE approx 0.59 on log-price. Interpreting roughly, an MAE of 0.59 in log units corresponds to a typical multiplicative error of about 1.8× (since exp(0.59) ≈ 1.8), i.e., prediction errors on the order of ±80% in price, which is not great for individual deals but decent for a coarse citywide model based only on a few structural features.

  4. Interpretation of predictors. Based on the earlier correlations and typical real-estate patterns, most of the predictive power comes from GROSS SQUARE FEET, with LAND SQUARE FEET and unit counts adding secondary information. Year built contributes little signal by itself, consistent with its near-zero correlation with sale price and the fact that historic vs. modern buildings can command premiums or discounts depending on context.

  5. Model choice. Because the Random Forest explains substantially more variance in log price and is more robust to nonlinearity and heteroskedasticity than linear regression, I treated it as the best Manhattan regression model and used it as the model to generalize to Brooklyn in Question 2.
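
A minimal sketch of the linear-vs-Random-Forest comparison, using synthetic data with a deliberately nonlinear (saturating) size-to-log-price curve. The real model used GROSS SQUARE FEET, LAND SQUARE FEET, YEAR BUILT, and unit counts, and the R² values here will not match the ones reported above:

```python
# Baseline linear regression vs. Random Forest on a synthetic log-price
# target; year built deliberately carries no signal, mirroring its
# near-zero correlation in the real data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 3_000
gross_sqft = rng.uniform(500, 20_000, n)
year_built = rng.integers(1900, 2020, n).astype(float)
X = np.column_stack([gross_sqft, year_built])
# Saturating, nonlinear size → log-price curve plus noise.
log_price = 14.0 + 1.5 * np.tanh((gross_sqft - 5_000) / 2_000) + rng.normal(0, 0.4, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, log_price, random_state=0)
lin_r2 = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rf_r2 = r2_score(y_te, rf.predict(X_te))
print(f"linear R² = {lin_r2:.2f}, random forest R² = {rf_r2:.2f}")
```

Because the tree ensemble can follow the saturating curve while the linear model cannot, the forest should score clearly higher here, qualitatively matching the Manhattan result.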


1(d) Classification: predicting neighborhood from quantitative variables

Findings

  1. Cleaning for classification. I restricted to Manhattan neighborhoods with at least 100 sales to avoid tiny classes. That left 5,476 observations and 23 neighborhoods (e.g., Midtown West, multiple Upper East/West Side segments, several Harlem neighborhoods, Chelsea, Lower East Side, etc.). I also dropped any rows with missing numeric features.

  2. kNN classifier. With standardized features and k = 7, kNN achieved accuracy approx 0.50, macro F1 approx 0.45, and weighted F1 approx 0.50. The confusion matrix showed that it often confused nearby or similar neighborhoods (e.g., different segments of the Upper East/West Side), which is intuitive because those areas have similar price/size profiles.

  3. Random Forest classifier (best). The Random Forest neighborhood classifier performed best, with accuracy approx 0.61, macro precision approx 0.59, macro recall approx 0.55, and macro F1 approx 0.57 (weighted F1 approx 0.61). The confusion matrix had a reasonably strong diagonal for major neighborhoods like Midtown West and Central Harlem, though there were still frequent misclassifications between adjacent, similar market segments.

  4. Logistic regression. Multinomial logistic regression performed poorly on this feature set, with accuracy approx 0.27 and macro F1 approx 0.12. This suggests that the decision boundaries between neighborhoods in this feature space are highly nonlinear, and a simple linear model in the original feature space is not expressive enough.

  5. Overall assessment. Even the best classifier (Random Forest) makes many mistakes, which is expected: we are trying to reconstruct a very fine-grained location label (neighborhood) from crude variables (square feet, units, year, plus price) and ignoring explicit spatial coordinates. The contingency tables show that neighborhoods with similar densities and price levels systematically get confused, highlighting the limits of using only structural attributes and price to infer location.
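
The three-classifier comparison can be sketched on a synthetic multi-class problem, with scikit-learn's `make_classification` standing in for the real neighborhood-labeling task (the scores below are not the Manhattan numbers):

```python
# kNN vs. Random Forest vs. multinomial logistic regression on a
# synthetic 4-class task (a stand-in for neighborhood prediction).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2_000, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

models = {
    "kNN (k=7)":     make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Logistic":      make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = accuracy_score(y_te, pred)
    print(f"{name}: accuracy = {results[name]:.2f}, "
          f"macro F1 = {f1_score(y_te, pred, average='macro'):.2f}")
```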


2. Second borough (Brooklyn, BOROUGH = 3)

For Question 2, I used Brooklyn as the second borough, cleaned with the same logic as Manhattan (BOROUGH=3, same min price, same handling of missing areas and year built).


2(a) Applying the Manhattan regression model to Brooklyn

Findings

  1. Performance metrics. When the Manhattan Random Forest regression model was applied directly to the cleaned Brooklyn data (no retraining), I got:

    • R² is approx −0.77 on log-price.
    • RMSE is approx 1.21.
    • MAE is approx 0.95.

    The negative R² means the model does worse than simply predicting the mean log price for every Brooklyn sale.

  2. Predicted vs. actual plot. The predicted-vs-actual log-price scatter for Brooklyn is very diffuse and does not cluster around the 45° line. The model tends to systematically mis-scale Brooklyn prices: for some segments it overpredicts (especially cheaper properties), and for more expensive Brooklyn neighborhoods it underpredicts relative to their actual sale prices.

  3. Residual diagnostics. The residual-vs-prediction plot shows large, structured patterns rather than random noise; the mean residual is substantially negative (indicating systematic underprediction on the log scale). This indicates poor generalization: the relationships between square footage, units, year built, and price that the model learned in Manhattan do not transfer well to Brooklyn.

  4. Interpretation. Brooklyn has a very different mix of housing types, neighborhood price levels, and land availability compared to Manhattan. Without explicit location variables and more detailed building characteristics, a model calibrated on Manhattan cannot capture Brooklyn's pricing structure, so it generalizes poorly even though the algorithms themselves are fairly powerful.
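
The negative R² under domain shift is easy to reproduce: `sklearn.metrics.r2_score` is not bounded below by zero, and a model carried to a population with a different price level can score far below the predict-the-mean baseline. A synthetic illustration (not the real boroughs):

```python
# A model fit on one "borough" is scored on data with the same slope
# but a much lower price level; R² on the shifted domain goes negative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
# Source domain: high intercept (Manhattan-like log-prices).
X_a = rng.uniform(6, 10, (1_000, 1))  # e.g., log square feet
y_a = 9.0 + 0.9 * X_a[:, 0] + rng.normal(0, 0.3, 1_000)
# Target domain: same slope, much lower overall price level.
X_b = rng.uniform(6, 10, (1_000, 1))
y_b = 6.5 + 0.9 * X_b[:, 0] + rng.normal(0, 0.3, 1_000)

model = LinearRegression().fit(X_a, y_a)
r2_source = r2_score(y_a, model.predict(X_a))
r2_target = r2_score(y_b, model.predict(X_b))
print(f"R² on source domain: {r2_source:.2f}")
print(f"R² on shifted target domain: {r2_target:.2f}")  # well below zero
```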


2(b) Applying Manhattan neighborhood classifiers to Brooklyn

Findings

  1. Label-space mismatch. The classifiers trained on Manhattan all learned to predict Manhattan neighborhoods (e.g., “MIDTOWN WEST”, “UPPER EAST SIDE (79-96)”) as labels. When evaluated on Brooklyn, the true labels are Brooklyn neighborhoods (“BAY RIDGE”, “WILLIAMSBURG”, etc.), which do not overlap at all with the Manhattan label set. As a result, the Manhattan models will never predict the correct Brooklyn neighborhood name.

  2. Metrics. As expected, for all three models (kNN, Random Forest, logistic regression), I got accuracy approx 0.0 and macro/weighted F1 approx 0.0 when using Brooklyn neighborhood names as the true labels. In other words, the models made effectively zero correct predictions across all observations.

  3. Contingency tables. The resulting “confusion matrices” are degenerate: for each Brooklyn neighborhood, all counts are off-diagonal because the classifier is outputting Manhattan labels that never equal the Brooklyn labels on the y-axis. This still technically produces a contingency table, but it visually demonstrates that the model is fundamentally mis-specified for this task when moved to a different borough.

  4. Interpretation. This exercise highlights a key point: classification models can't generalize across domains when the label space itself changes. Because Brooklyn neighborhoods are a completely different set of categories, a Manhattan neighborhood classifier cannot be expected to perform well without re-training on Brooklyn labels. At best, you might interpret the predictions as a “closest Manhattan analog,” but they have no predictive validity for the true Brooklyn neighborhood names.
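
The degenerate metrics follow directly from the disjoint label sets; a minimal illustration with made-up rows:

```python
# When the classifier's output vocabulary never overlaps the true
# labels, accuracy and macro F1 are exactly zero by construction.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["BAY RIDGE", "WILLIAMSBURG", "BAY RIDGE", "PARK SLOPE"]  # Brooklyn labels
y_pred = ["MIDTOWN WEST", "CHELSEA", "MIDTOWN WEST", "CHELSEA"]    # Manhattan-only outputs

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(acc, macro_f1)  # 0.0 0.0
```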


2(c) General observations & confidence

The datasets for Manhattan and Brooklyn are both large but quite noisy, with many missing or inconsistent values for square footage and units and with a lot of non-arm's-length sales at very low or zero prices. Even after cleaning, high-end outliers still exert influence and reflect market segments (e.g., trophy assets) that behave differently from the bulk of the distribution. Across both boroughs, models based only on size, units, and age of the building capture some signal but miss many important drivers like exact location, building quality, and amenities. My confidence in the relative conclusions (e.g., Random Forest beats linear regression; Manhattan-trained models generalize poorly to Brooklyn) is high, but I would not rely on these models for precise valuation of individual properties.


3. 6000-level: Conclusions about model types & suitability

Across this study, nonlinear tree-based models (Random Forests) consistently outperformed simpler linear models for both regression and classification. In the Manhattan regression, linear regression on log-price captured only a small fraction of the variance, while the Random Forest achieved a much higher R² and more realistic error levels, indicating that the relationship between size/units and price is nonlinear, with interactions and thresholds that linear models cannot represent. For classification, the Random Forest neighborhood model again beat kNN and especially logistic regression, suggesting that neighborhood decision boundaries in this feature space are complex and benefit from hierarchical splits rather than a single global linear separator.

However, these stronger models are still limited by feature quality and domain shift. When the Manhattan regression model is applied to Brooklyn, its performance collapses (negative R²), showing that even a flexible model trained on one borough's distribution cannot simply be transplanted to another borough with different price levels and housing stock. The classification models fail even more dramatically when moved across boroughs, because the label space itself changes; this is a reminder that good predictive performance is inherently domain-specific and that models must be re-trained or at least adapted when the domain or label space shifts.

Methodologically, what “worked” was combining sensible cleaning (dropping non-arm's-length sales, fixing square-footage fields, filtering small classes), log transformations, and nonlinear models; what did not work was assuming that limited structural variables alone could fully explain prices or that a model trained on Manhattan would generalize to Brooklyn without explicit location features. In a production setting, I would layer on richer covariates (latitude/longitude, transit accessibility, building quality proxies, zoning, etc.) and likely move to gradient-boosted trees or other ensemble methods, but the main lessons about nonlinearity, outliers, and domain dependence would remain the same.