1. One-Borough Analysis: Manhattan
I filtered the dataset to Manhattan by keeping only rows where BOROUGH == 1. After cleaning (parsing numeric fields, dropping implausible or missing values, etc.), I had 7,294 usable sales out of 96,088 raw Manhattan rows, plus 42,743 cleaned Brooklyn rows for later comparison.
1(a) Planned patterns, trends, and modeling approach
For Manhattan, the overall big-picture trends I wanted to look at were: how sale price changes with building size (gross and land square footage), intensity of use (number of residential and commercial units), and building age (year built). Because NYC housing prices are known to be highly skewed with extreme luxury outliers - particularly in Manhattan's condo and co-op markets - I wanted to examine both raw prices and log-transformed prices to better see the central bulk of transactions.
I also expected nonlinear relationships, such as the fact that price per square foot starts to decrease for very large buildings, and there could be thresholds beyond which additional units change value in a nonadditive way. To detect these, I compared the simple linear regression on log sale price against a more flexible model like a random forest regression, which is well-suited for modeling nonlinear relationships and interactions among predictors.
For the classification, I treated "neighborhood" as the target and price/size variables as predictors to see whether basic quantitative attributes are enough to distinguish markets like "Upper East Side", "Harlem", and "SoHo". Then I compared three supervised models: k-NN, random forest, and multinomial logistic regression, using accuracy and macro-F1 as evaluation metrics.
1(b) Exploratory data analysis and outliers
I first cleaned the raw NYC file by converting the string columns SALE PRICE, LAND SQUARE FEET, GROSS SQUARE FEET, and YEAR BUILT to numeric, dropping obvious invalid numeric codes (e.g., “0”, “.”), removing very small sales below $10,000 to exclude non–arm’s-length transactions and recording artifacts, treating years before 1800 as missing, and requiring positive, non-missing values for land area, gross area, and year built. I set the remaining missing unit counts to 0 for residential, commercial, and total units.
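The cleaning rules above can be sketched as a single function. This is an illustrative Python/pandas version (the analysis itself appears to use R); the column names follow the NYC rolling-sales schema, and the thresholds are the ones stated in the text.

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules described above to a raw rolling-sales frame."""
    out = df.copy()
    # Coerce string columns to numeric; invalid codes like "." become NaN.
    for col in ["SALE PRICE", "LAND SQUARE FEET", "GROSS SQUARE FEET", "YEAR BUILT"]:
        out[col] = pd.to_numeric(out[col], errors="coerce").astype(float)
    # Drop non-arm's-length sales and recording artifacts below $10,000.
    out = out[out["SALE PRICE"] >= 10_000]
    # Treat years before 1800 as missing, then require complete positive values.
    out.loc[out["YEAR BUILT"] < 1800, "YEAR BUILT"] = float("nan")
    out = out[(out["LAND SQUARE FEET"] > 0) & (out["GROSS SQUARE FEET"] > 0)]
    out = out.dropna(subset=["YEAR BUILT"])
    # Remaining missing unit counts default to 0.
    for col in ["RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS"]:
        out[col] = out[col].fillna(0)
    return out
```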
In Manhattan, sale prices are extremely right-skewed: the minimum valid sale is about $10,050, the median $4.0M, the mean $16.3M, the 75th percentile $9.58M, and the maximum an enormous $2.40B. A relative few ultra-high-value transactions pull the mean far above the median, which is fairly normal for a high-end market like Manhattan.
Using the quartiles, I then created an outlier rule based on the interquartile range: with Q1 approx $1.37M and Q3 approx $9.58M, the IQR is about $8.21M, so the upper “fence” is Q3 + 1.5 × IQR approx $21.9M. Any sale above that is classified as an outlier; this 1.5 × IQR rule is the usual box-plot definition of outliers. With that rule I found 756 outliers in Manhattan, the largest being the $2.40B maximum sale.
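The fence computation itself is a one-liner; a minimal sketch, plugging in the approximate Manhattan quartiles quoted above:

```python
def iqr_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper Tukey fence: values above Q3 + k * IQR are flagged as outliers."""
    return q3 + k * (q3 - q1)

# Approximate Manhattan quartiles from the analysis above (in dollars).
fence = iqr_upper_fence(1.37e6, 9.58e6)  # roughly $21.9M
```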
The histogram of raw sale prices, Figure 1, has almost all sales piled up near the left-hand side, with a long low-frequency tail extending out toward billions of dollars; this visually confirms heavy skew and extreme outliers.
Once I applied log1p(sale_price), the histogram on the log scale, Fig. 2, is much more symmetric and interpretable, compressing the ultra-luxury outliers into a reasonable range while still preserving their relative ordering.
Figure 2: Manhattan sale price distribution (log scale).
Figure 3: Manhattan sale prices with outliers.
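The transform behind the log-scale histogram is just log1p; a minimal sketch of why it tames the tail:

```python
import math

def log_price(sale_price: float) -> float:
    # log1p(x) = log(1 + x): maps a $0 price to 0 and compresses the heavy
    # right tail while preserving the ordering of sales.
    return math.log1p(sale_price)

# On the log scale, the $2.40B maximum and a $10k minimum sale both land
# in a single-digit-to-low-twenties range instead of spanning five orders
# of magnitude.
```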
Finally, the scatterplot of gross square feet vs. sale price (with price on a log10 scale; Figure 4) displays a strong positive association (larger buildings sell for more) but also a lot of vertical spread, indicating that factors beyond size, namely location, building class, and quality, drive price differences as well. This is supported by the correlation analysis: sale price is moderately correlated with gross square feet (about 0.49), weakly with land area (about 0.16) and total units (about 0.17), and almost uncorrelated with year built (about 0.02).
Figure 4: Manhattan sale price vs. gross square feet (log10 price scale).
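The correlations reported above are plain pairwise Pearson correlations; an illustrative Python/pandas sketch (column names match the predictor names used later in the report):

```python
import pandas as pd

def price_correlations(df: pd.DataFrame) -> pd.Series:
    """Pearson correlations of sale price with the structural predictors."""
    cols = ["gross_sqft", "land_sqft", "total_units", "year_built"]
    return df[cols].corrwith(df["sale_price"])
```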
1(c) Regression analysis for sale price
I then created a modeling dataset with the following regression predictors:
- land_sqft, gross_sqft,
- year_built,
- res_units, comm_units, and total_units
and the target log_price = log1p(sale_price), removing any rows with missing values in those fields. I then split the data into a 75% training set and 25% test set using createDataPartition to preserve the distribution of log_price.
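createDataPartition (from R's caret) stratifies a numeric outcome by binning it into quantile groups before sampling. A rough Python equivalent, as a sketch (function name and bin count are my own choices, not the original code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_split(X, y, test_size=0.25, n_bins=5, seed=42):
    """Split into train/test while preserving the outcome's distribution by
    stratifying on quantile bins of y, similar in spirit to caret's
    createDataPartition applied to a numeric target."""
    cuts = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    strata = np.digitize(y, cuts)  # bin label per observation
    return train_test_split(X, y, test_size=test_size,
                            stratify=strata, random_state=seed)
```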
The baseline model was a multiple linear regression of log_price on all six predictors. On the held-out Manhattan test set this model achieved approximately R² = 0.19, RMSE = 1.69, and MAE = 1.26 on the log scale, meaning it explains only about 19% of the variance in log sale price and leaves relatively large residual errors. This suggests that linear relationships in these basic structural variables alone are not sufficient to capture the complexity of Manhattan real estate pricing.
I then trained a random forest regressor on the same predictors and target. The random forest did considerably better on the Manhattan test set, with approximately R² = 0.70, RMSE = 1.03, and MAE = 0.66 on log price, more than tripling the explained variance relative to the linear model. This performance gain is consistent with the idea that random forests are effective at modeling nonlinear relationships and interactions without requiring me to specify them by hand.
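The linear-vs-forest comparison can be sketched with a small evaluation helper; the metrics follow the text (R², RMSE, MAE on the log scale). This is an illustrative Python version, not the original R code, and the model settings are my own assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit on the training split and report held-out R², RMSE, and MAE."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "r2": r2_score(y_te, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "mae": mean_absolute_error(y_te, pred),
    }

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=300, random_state=42),
}
```

On data with strong nonlinearities, the forest's advantage shows up as a higher test R² and lower RMSE/MAE, mirroring the Manhattan gap reported above.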
The predicted-vs-actual plot for Manhattan's random forest (Figure 5) has points clustered around the 45-degree line, particularly in the mid-range of log price, indicating generally accurate predictions, though with more scatter at the extremes.
Figure 5: Random forest predicted vs. actual log(1 + sale price) (Manhattan).
Figure 6: Random forest residuals (Manhattan).
Overall, following fairly aggressive cleaning (dropping non-arm's-length sales, invalid years, and missing areas) and a log transformation, the random forest regression yields a reasonably strong model for Manhattan sale prices, whereas the linear model substantially underfits.
1(d) Classification: predicting neighborhood in Manhattan
For neighborhood classification I created a subset in which neighborhood is not missing and only neighborhoods with 100 or more sales are retained in order to avoid tiny, noisy classes. After this filtering I had 6,792 records spanning 28 neighborhoods. I then selected the quantitative predictors
- sale_price, land_sqft, gross_sqft,
- year_built, res_units, comm_units, total_units
and dropped any rows with remaining missing values.
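The class filtering described above (keep only neighborhoods with at least 100 sales, drop unlabeled rows) can be sketched as follows; an illustrative Python/pandas version, with column names matching the report's variables:

```python
import pandas as pd

def filter_small_classes(df: pd.DataFrame, label: str = "neighborhood",
                         min_count: int = 100) -> pd.DataFrame:
    """Keep only labels with at least `min_count` sales, so that per-class
    metrics like macro-F1 are not dominated by tiny, noisy classes."""
    counts = df[label].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[label].isin(keep)].dropna(subset=[label])
```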
First, I fit a k-NN classifier with k = 7 and standardization (center/scale) applied to all predictors. On the Manhattan test set, k-NN achieved accuracy of about 0.48 and macro-F1 of about 0.42, meaning it correctly predicts the neighborhood for just under half of the held-out sales with moderate average F1 across classes. Next, I trained a random forest classifier on the same predictors with 300 trees and mtry = 3. This performed best of the three, with accuracy of about 0.58 and macro-F1 of about 0.53.
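The standardized k-NN setup and the two metrics can be sketched like this (illustrative Python; the original analysis appears to use R). Standardization matters here because sale_price's dollar scale would otherwise dominate the Euclidean distance over unit counts and year built.

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Center/scale every predictor before distance-based classification.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))

def score(model, X_tr, y_tr, X_te, y_te):
    """Fit a classifier and report held-out accuracy and macro-F1."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro")
```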
Lastly, I fit a multinomial logistic regression model, whose performance, despite its interpretability, was substantially worse (accuracy about 0.25 and macro-F1 about 0.30); its linear decision boundaries are evidently too restrictive to separate 28 neighborhoods using these numeric predictors alone. The random forest's confusion matrix, a 28 × 28 table, reveals that the most common confusions are between geographically close or socio-economically similar neighbourhoods, like different slices of the Upper East/West Side or adjacent parts of Harlem, which makes sense given the heavy overlap in their price and size profiles.
This cleaning for the classification task was meant to remove neighbourhoods with too few samples that would make F1 metrics unstable, enforce complete predictor data, and restrict attention to clearly labelled neighbourhoods. Even after cleaning, the modest accuracy and macro-F1 remind me that neighbourhood captures many qualitative aspects-exact location, school zones, views, amenities-that are not fully represented by simple size and unit counts.
2. Second-borough analysis: Brooklyn, using Manhattan models
For this second derived dataset I focused on Brooklyn (BOROUGH == 3). I repeated all of the same cleaning and feature engineering steps: numeric parsing, filtering on sale price and physical characteristics, construction of the same predictor variables (land_sqft, gross_sqft, year_built, res_units, comm_units, total_units, sale_price). This yielded 42,743 cleaned Brooklyn records-a far larger sample than Manhattan’s 7,294 cleaned rows.
2(a) Applying Manhattan regression to Brooklyn
I tested how well the Manhattan-trained random forest regressor generalizes by applying it directly to the Brooklyn dataset. On Brooklyn, the model's performance fell to approximately R² = 0.24, RMSE = 1.31, and MAE = 1.08 on log price, far worse than its Manhattan performance of R² = 0.70 and MAE = 0.66.
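The transfer test is simply scoring the already-fitted Manhattan model on Brooklyn rows, with no refitting. A sketch in Python (the fitted model and the other borough's arrays are assumed inputs):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def transfer_scores(fitted_model, X_other, y_other):
    """Evaluate a model trained on one borough against another borough's data.
    No refitting: a drop relative to the home-borough test score points to a
    different price structure or building-stock mix in the new borough."""
    pred = fitted_model.predict(X_other)
    return {
        "r2": r2_score(y_other, pred),
        "rmse": float(np.sqrt(np.mean((y_other - pred) ** 2))),
        "mae": mean_absolute_error(y_other, pred),
    }
```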
Figure 7: The predicted-vs-actual plot for Brooklyn displays a broad cloud of points with a strong linear trend but much more scatter around the 45-degree line than in Manhattan, suggesting that the model systematically under- and over-predicts across different price ranges.
The residual plot in Figure 8 shows residuals that are not symmetrically centered around zero; certain ranges of predicted price generate clusters of large negative residuals, indicating consistent overestimation in parts of the Brooklyn market.
This poor generalization makes sense: Manhattan and Brooklyn have different mixes of property types-e.g. ultra-luxury high-rise condos vs. more rowhouses and small multi-family homes-and different price levels, even as Brooklyn has become increasingly expensive. A model that has been trained only on Manhattan data learns relationships calibrated to its particular price structure and building stock, so when it is applied to Brooklyn it picks up some broad patterns-e.g. larger buildings are more expensive-but misses borough-specific effects leading to a large drop in R².
2(b) Manhattan neighborhood classifiers applied to Brooklyn
For the classification task, I built a Brooklyn subset with the same methodology as Manhattan, keeping only neighborhoods with 100 or more sales and removing rows with missing predictors, which gave me 42,533 records across 56 Brooklyn neighborhoods. I then applied the three Manhattan-trained classifiers (k-NN, random forest, and multinomial logistic regression) to predict Brooklyn neighborhoods.
The basic problem is a label mismatch: the Manhattan models' output classes are 28 Manhattan neighborhoods, while the true labels in the Brooklyn data are 56 different Brooklyn neighborhoods. The resulting contingency tables are thus 56 × 28, with nonzero counts almost entirely off the diagonal: Brooklyn neighborhoods systematically get mapped to whichever Manhattan label best matches their numeric features, but there is no notion of a “correct” prediction in terms of neighborhood identity.
Because of this mismatch, usual metrics like accuracy, precision, recall, and F1 are not meaningful-every prediction is technically wrong with respect to the true Brooklyn neighborhood labels. The contingency tables are still useful descriptively: they show which Manhattan neighborhoods Brooklyn neighborhoods look like in terms of price and size (for example, some high-end Brooklyn areas may be frequently mapped to Upper East/West Side or SoHo/Tribeca). But as a generalization test of a neighborhood classifier, these results show that a model trained in one borough does not transfer to another when the class labels themselves change.
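The descriptive 56 × 28 table is just a cross-tabulation of true Brooklyn labels against predicted Manhattan labels; a sketch, assuming pandas (the example neighborhood pairings are purely hypothetical):

```python
import pandas as pd

def similarity_table(true_bk_labels, predicted_mn_labels) -> pd.DataFrame:
    """Cross-tabulate Brooklyn neighborhoods against the Manhattan label each
    sale was assigned. Accuracy is meaningless here because the label sets are
    disjoint, but row-wise modes show which Manhattan market each Brooklyn
    neighborhood most resembles in price and size terms."""
    return pd.crosstab(pd.Series(true_bk_labels, name="brooklyn"),
                       pd.Series(predicted_mn_labels, name="manhattan_like"))
```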
The Manhattan neighborhood classifiers therefore do not generalize to Brooklyn, in the sense of predicting Brooklyn neighborhood names; they act more like a rough borough-agnostic similarity mapping.
2(c) Further remarks and confidence
One thing that stood out right away is how much more filtering Manhattan needed. Only about 7.6% of its original rows survived the cleaning process, while Brooklyn kept a far larger share. That points to either more missing or odd entries in the Manhattan data, or simply stricter criteria knocking out extreme cases there. Even after cleaning, both boroughs still have very skewed price distributions with plenty of outliers, which makes modeling tough in a market where a handful of ultra-expensive properties dominate the landscape.
Within Manhattan, I’m fairly confident in the regression results for mid-range homes-random forest handles that part of the market well-but the model struggles with luxury listings and doesn’t transfer cleanly across boroughs. The weak performance of multinomial logistic regression, along with the only-okay results from k-NN and random forest for neighborhood prediction, makes it clear that numeric features alone aren’t enough. Getting neighborhood right would require richer location details and stronger categorical features.
Overall conclusions about model types and suitability
A log transform on sale price and basic cleaning-removing outliers, invalid years, missing area data, and tiny non-arm’s-length transactions-are the minimum steps needed before modeling NYC housing data. Without them, extreme values and inconsistent records overwhelm everything and drag down model performance. The contrast between the raw and log-scaled price histograms, and between linear regression and random forest, shows how strongly the distribution’s shape affects model behavior.
Using only structural features, linear regression reaches an R² of about 0.19 for Manhattan, which reflects the strong nonlinearities and missing variables at play. Random forest, on the other hand, captures most of the variation with an R² around 0.70 and far smaller errors, highlighting the advantage of flexible ensemble methods in a market as messy as this one.
But even the best Manhattan model doesn’t travel well: applying it to Brooklyn drops performance to roughly R² approx 0.24. That makes it obvious that models trained in one borough don’t work elsewhere without accounting for differences in market structure and price levels. Purely structural, cross-sectional models miss the spatial and neighborhood effects needed for transferability.
For neighborhood classification, random forest again does better than k-NN or multinomial logistic regression, but overall accuracy is still modest (about 0.58, macro-F1 around 0.53). This reinforces that simple numerical features-price, area, and the like-aren’t enough to reliably identify neighborhoods. More detailed spatial information, building class categories, and possibly time-related features would likely improve results.
Overall, random forests are the strongest of the models I tested. They handle nonlinear, heavy-tailed relationships and deliver solid within-borough predictions, though they’re harder to interpret and struggle when applied to a different borough. Linear and logistic models are easy to explain but miss too much of the structure here. Taken together, the results point toward flexible nonlinear models for within-borough price predictions, with careful feature engineering and borough-specific training needed to generalize across NYC.