Compare commits
11 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 2806905ccf | |
| | 8ef9cb2d6e | |
| | 18a911f9d3 | |
| | 414a4ac5a3 | |
| | 4eff5a6378 | |
| | 5adb4119f5 | |
| | cd3ababd59 | |
| | 88f2975b86 | |
| | dc2ceac7de | |
| | 9abd1a6df6 | |
| | 555650ac3c | |
@@ -2,7 +2,7 @@
 suppressPackageStartupMessages({
-  pkgs <- c("tidyverse", "readr", "readxl", "broom", "jsonlite", "ggplot2", "class", "optparse")
+  pkgs <- c("tidyverse", "readr", "readxl", "broom", "jsonlite", "ggplot2", "class", "optparse", "markdown")
   to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
   if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
   lapply(pkgs, library, character.only = TRUE)
@@ -1,3 +1,5 @@
+library(markdown)
+
 source("/home/ion606/Desktop/Homework/Data Analytics/Assignments/Assignment II/R/00_utils.R")
 ctx <- jsonlite::fromJSON("/home/ion606/Desktop/Homework/Data Analytics/Assignments/Assignment II/output/ctx.json")

@@ -70,15 +72,18 @@ if (!is.null(ctx$knn) && length(ctx$knn)) {
 # I hate markdown sometimes man
 md <- gsub("/home/ion606/Desktop/Homework/Data Analytics/Assignments/Assignment II/output/", "", md)

-writeLines(md, "/home/ion606/Desktop/Homework/Data Analytics/Assignments/Assignment II/output/report.md")
-writeLines(jsonlite::toJSON(ctx, pretty = TRUE, auto_unbox = TRUE),
-           file.path(ctx$stats_dir, "summary.json"))
+# writeLines(md, "/home/ion606/Desktop/Homework/Data Analytics/Assignments/Assignment II/output/report.md")
+# writeLines(jsonlite::toJSON(ctx, pretty = TRUE, auto_unbox = TRUE),
+#            file.path(ctx$stats_dir, "summary.json"))

 # rmarkdown::render(
 #   "/home/ion606/Desktop/Homework/Data Analytics/Assignments/Assignment II/output/report.md",
 #   output_format = "pdf_document",
 #   output_file = "report.pdf",
 #   output_dir = "/home/ion606/Desktop/Homework/Data Analytics/Assignments/Assignment II/output"
 # )
+
+md_file <- "output/report.md"
+html_file <- "output/report.html"
+pdf_file <- "output/report.pdf"
+
+setwd("/home/ion606/Desktop/Homework/Data Analytics/Assignments/Assignment II/")
+markdownToHTML(
+  md_file,
+  html_file
+)
+
+message("done")
(15 binary plot images changed in this range; before/after file sizes are unchanged, ranging from 24 KiB to 64 KiB.)
@@ -0,0 +1,164 @@
I used Manhattan (BOROUGH code = 1) for Question 1 and Brooklyn (BOROUGH code = 3) for Question 2.

NYC Dept. of Finance / BBL conventions:

- 1 = Manhattan
- 2 = Bronx
- 3 = Brooklyn
- 4 = Queens
- 5 = Staten Island

The dataset itself is NYC's annualized file of residential property sales across all five boroughs.

## Loading and Cleaning the Data

- `manhattan_clean` ended up with **6,313** sales.
- `brooklyn_clean` ended up with **40,921** sales.

---

# 1. One-borough analysis (Manhattan, BOROUGH = 1)

---

## 1(a) Patterns, trends, and modeling plan

For Manhattan, I'm interested first in how **SALE PRICE** is distributed and how it relates to building characteristics like **GROSS SQUARE FEET**, **LAND SQUARE FEET**, **YEAR BUILT**, and **unit counts**. Because this is a citywide residential sales file, I expect the price distribution to be extremely right-skewed, with a small number of ultra-expensive transactions and many more moderately priced ones.

I'd start with **univariate** distributions of price, square footage, and year built, and then move to **bivariate** relationships (scatter plots of price vs. size, boxplots of price by neighborhood) and **correlation matrices**. For modeling, I'd use **log-transformed sale price** as the response to stabilize variance and compare a baseline **linear regression** to a non-linear **Random Forest** regressor. Finally, I'd treat **NEIGHBORHOOD** as a label and build supervised classifiers (k-NN, Random Forest, logistic regression) using quantitative features (price, areas, units, year) to see how well location can be inferred from physical characteristics and price alone.

---

## 1(b) Exploratory Data Analysis and Outliers

### Findings

1. **Distribution of Sale Price.**
   For Manhattan, after cleaning, `SALE PRICE` is extremely right-skewed:

   - Count is approx. **6,313**.
   - Median is approx. **$3.86M**.
   - Mean is approx. **$15.7M**.
   - The 75th percentile is approx. **$9.25M**.
   - The 90th percentile is approx. **$21.8M**.
   - The 99th percentile is approx. **$228.9M**.
   - Maximum is approx. **$2.40B**.

   That huge gap between the median and the maximum confirms a heavy upper tail.

2. **Outliers.**
   Using the 1.5 IQR rule (see the sketch after this list), the upper outlier threshold is around **$21.1M**; there are about **650** outlier sales above this bound. The most extreme sale (approx. $2.4B) is orders of magnitude larger than a typical sale and shows up as a solitary point far above the rest in the boxplot. On the lower side, I also saw many sales at or near zero in the raw data, which is why I filtered out sales ≤ $10,000 as likely non-arm's-length transactions or data errors.

3. **Effect of log transformation.**
   When I plotted a histogram of `log(1 + SALE PRICE)`, the distribution became much closer to symmetric: the bulk of log-prices fell approximately between ~14 and ~17 (roughly $1.2M to $24M), with a long but much more manageable upper tail. This supported using the log scale for regression (also illustrated in the sketch below).

4. **Relationships with size & other features.**
   The correlation of `SALE PRICE` with **GROSS SQUARE FEET** was about **0.49**, substantially higher than for any other feature. Correlations with **COMMERCIAL UNITS**, **LAND SQUARE FEET**, and **TOTAL UNITS** were modestly positive (~0.16–0.21), and the correlation with **YEAR BUILT** was essentially zero. This suggests that building size is the main driver captured by these quantitative covariates, while age and unit counts are relatively weak predictors by themselves.

5. **Scatter plots & heteroskedasticity.**
   The scatter plot of price vs. gross square feet (with a log scale on price) showed a clear upward trend but with wide vertical spread, especially for larger buildings. High-end buildings with similar square footage can sell at very different prices, which is consistent with neighborhood effects, building quality, and other unobserved characteristics. Overall, the EDA shows strong skewness, many high-end outliers, and a moderate but noisy link between size and price.
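To make items 2–3 concrete, here is a minimal R sketch of the fence computation and the log transformation. It is illustrative only: `sale_price` stands in for the cleaned Manhattan price vector, and the numbers quoted above come from the actual pipeline, not this snippet.

```r
# 1.5 * IQR upper fence and the count of upper-tail outliers
q <- quantile(sale_price, probs = c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
sum(sale_price > upper_fence, na.rm = TRUE)

# log(1 + price) pulls the heavy right tail into a near-symmetric shape
hist(log1p(sale_price), breaks = 50, main = "log(1 + sale price)")
```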
---

## 1(c) Regression analysis to predict sale price

### Findings

1. **Cleaning for regression.**
   In addition to the general cleaning above, I required that observations have non-missing **GROSS SQUARE FEET**, **LAND SQUARE FEET**, **YEAR BUILT**, and unit counts, and that the areas be strictly positive. I also removed sales at or below $10,000 as non-market outliers. I did **not** remove very high prices; instead I relied on the log transformation and the tree-based model to reduce their influence.

2. **Linear regression (baseline).**
   The standardized linear model on log-price achieved only about **R² ≈ 0.13** on the test set, with RMSE ≈ **1.71** and MAE ≈ **1.24** in log units. This means a purely linear relationship between size, units, year built, and log price is a poor approximation, which is unsurprising given the complexity of Manhattan's housing market and the missing effects of location and building quality.

3. **Random Forest regression (non-linear).**
   The Random Forest model performed much better, with **R² ≈ 0.75**, RMSE ≈ **0.92**, and MAE ≈ **0.59** on log-price. Interpreting roughly, an MAE of 0.59 in log units corresponds to prediction errors on the order of ±80% in price (because exp(0.59) ≈ 1.8; see the sketch after this list), which is not great for individual deals but decent for a coarse citywide model based only on a few structural features.

4. **Interpretation of predictors.**
   Based on the earlier correlations and typical real-estate patterns, most of the predictive power comes from **GROSS SQUARE FEET**, with **LAND SQUARE FEET** and unit counts adding secondary information. Year built contributes little signal by itself, consistent with its near-zero correlation with sale price and the fact that historic vs. modern buildings can command premiums or discounts depending on context.

5. **Model choice.**
   Because the Random Forest explains substantially more variance in log price and is more robust to non-linearity and heteroskedasticity than linear regression, I treated it as the **best Manhattan regression model** and used it as the model to generalize to Brooklyn in Question 2.
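As a back-of-the-envelope check on item 3, the conversion from a log-scale MAE to a typical multiplicative price error is one line of arithmetic (the 0.59 comes from the report; the rest is illustration):

```r
mae_log <- 0.59
exp(mae_log)  # ~1.80: predictions typically off by a factor of ~1.8, i.e. about +/-80%
```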
---

## 1(d) Classification: predicting neighborhood from quantitative variables

### Findings

1. **Cleaning for classification.**
   I restricted to Manhattan neighborhoods with at least **100** sales to avoid tiny classes. That left **5,476** observations and **23 neighborhoods** (e.g., Midtown West, multiple Upper East/West Side segments, several Harlem neighborhoods, Chelsea, the Lower East Side, etc.). I also dropped any rows with missing numeric features.

2. **k-NN classifier.**
   With standardized features and k = 7 (see the sketch after this list), k-NN achieved **accuracy ≈ 0.50**, macro **F1 ≈ 0.45**, and weighted **F1 ≈ 0.50**. The confusion matrix showed that it often confused nearby or similar neighborhoods (e.g., different segments of the Upper East/West Side), which is intuitive because those areas have similar price/size profiles.

3. **Random Forest classifier (best).**
   The Random Forest neighborhood classifier performed best, with **accuracy ≈ 0.61**, macro **precision ≈ 0.59**, macro **recall ≈ 0.55**, and macro **F1 ≈ 0.57** (weighted F1 ≈ 0.61). The confusion matrix had a reasonably strong diagonal for major neighborhoods like Midtown West and Central Harlem, though there were still frequent misclassifications between adjacent, similar market segments.

4. **Logistic regression.**
   Multinomial logistic regression performed poorly on this feature set, with **accuracy ≈ 0.27** and macro **F1 ≈ 0.12**. This suggests that the decision boundaries between neighborhoods in this feature space are highly non-linear, and a simple linear model in the original feature space is not expressive enough.

5. **Overall assessment.**
   Even the best classifier (Random Forest) makes many mistakes, which is expected: we are trying to reconstruct a very fine-grained location label (neighborhood) from crude variables (square feet, units, year, plus price) while ignoring explicit spatial coordinates. The contingency tables show that neighborhoods with similar densities and price levels systematically get confused, highlighting the limits of using only structural attributes and price to infer location.
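For reference, the standardized k-NN step from item 2 looks roughly like the following sketch. The names `X_train`, `X_test`, `y_train`, and `y_test` are placeholders for the actual train/test split; the key detail is that the test set is scaled with the training set's center and spread:

```r
library(class)

# standardize features using training-set statistics only
X_train_s <- scale(X_train)
X_test_s <- scale(X_test,
                  center = attr(X_train_s, "scaled:center"),
                  scale = attr(X_train_s, "scaled:scale"))

pred <- knn(train = X_train_s, test = X_test_s, cl = y_train, k = 7)
mean(pred == y_test)  # accuracy
```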
---

# 2. Second borough (Brooklyn, BOROUGH = 3)

For Question 2, I used **Brooklyn** as the second borough, cleaned with the same logic as Manhattan (BOROUGH = 3, the same minimum price, and the same handling of missing areas and year built).

---

## 2(a) Applying the Manhattan regression model to Brooklyn

### Findings

1. **Performance metrics.**
   When the Manhattan Random Forest regression model was applied directly to the cleaned Brooklyn data (no retraining), I got:

   - **R² ≈ –0.77** on log-price.
   - **RMSE ≈ 1.21**.
   - **MAE ≈ 0.95**.

   The negative R² means the model does **worse** than simply predicting the mean log price for every Brooklyn sale.

2. **Predicted vs. actual plot.**
   The predicted vs. actual log-price scatter for Brooklyn is very diffuse and does not cluster around the 45° line. The model tends to **systematically mis-scale Brooklyn prices**: for some segments it over-predicts (especially cheaper properties), and for more expensive Brooklyn neighborhoods it under-predicts compared to their actual sale prices.

3. **Residual diagnostics.**
   The residual-vs-prediction plot shows large, structured patterns rather than random noise; the mean residual is substantially negative (with residuals defined as actual minus predicted, this means systematic over-prediction on the log scale). This indicates poor generalization: the relationships between square footage, units, year built, and price that the model learned in Manhattan do not transfer well to Brooklyn.

4. **Interpretation.**
   Brooklyn has a very different mix of housing types, neighborhood price levels, and land availability compared to Manhattan. Without explicit location variables and more detailed building characteristics, a model calibrated on Manhattan cannot capture Brooklyn's pricing structure, so it generalizes poorly even though the algorithms are fairly powerful.

---

## 2(b) Applying Manhattan neighborhood classifiers to Brooklyn

### Findings

1. **Label-space mismatch.**
   The classifiers trained on Manhattan were all trained to predict **Manhattan neighborhoods** (e.g., "MIDTOWN WEST", "UPPER EAST SIDE (79–96)") as labels. When they are evaluated on Brooklyn, the **true labels** are Brooklyn neighborhoods ("BAY RIDGE", "WILLIAMSBURG", etc.), which **do not overlap at all** with the Manhattan label set. As a result, the Manhattan models will *never* predict the correct Brooklyn neighborhood name.

2. **Metrics.**
   As expected, for all three models (k-NN, Random Forest, logistic regression), I got **accuracy ≈ 0.0** and macro/weighted **F1 ≈ 0.0** when using Brooklyn neighborhood names as the true labels (a toy demonstration follows this list). In other words, the models made effectively zero correct predictions across all observations.

3. **Contingency tables.**
   The resulting "confusion matrices" are degenerate: for each Brooklyn neighborhood, all counts are off-diagonal because the classifier outputs Manhattan labels that never equal the Brooklyn labels on the y-axis. This still technically produces a contingency table, but it visually demonstrates that the model is fundamentally mis-specified for this task when moved to a different borough.

4. **Interpretation.**
   This exercise highlights a key point: **classification models can't generalize across domains when the label space itself changes**. Because Brooklyn neighborhoods are a completely different set of categories, a Manhattan neighborhood classifier cannot be expected to perform well without re-training on Brooklyn labels. At best, you might interpret the predictions as a "closest Manhattan analog," but they have no predictive validity for the true Brooklyn neighborhood names.
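A toy illustration of why the metrics in item 2 collapse to zero (the labels here are hypothetical and chosen purely to show the mechanics):

```r
# predicted labels come from the Manhattan label set,
# true labels from the disjoint Brooklyn label set
pred <- c("MIDTOWN WEST", "CHELSEA", "MIDTOWN WEST")
true <- c("BAY RIDGE", "WILLIAMSBURG", "PARK SLOPE")
mean(pred == true)  # 0: no prediction can ever match a true label
```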
---

## 2(c) General observations & confidence

The datasets for Manhattan and Brooklyn are both large but quite **noisy**, with many missing or inconsistent values for square footage and units, and with a lot of non-arm's-length sales at very low or zero prices. Even after cleaning, high-end outliers still exert influence and reflect market segments (e.g., trophy assets) that behave differently from the bulk of the distribution. Across both boroughs, models based only on size, units, and the age of the building capture some signal but miss many important drivers like exact location, building quality, and amenities. My confidence in the **relative** conclusions (e.g., Random Forest beats linear regression; Manhattan-trained models generalize poorly to Brooklyn) is high, but I would not rely on these models for **precise valuation** of individual properties.

---

# 3. 6000-level: Conclusions about model types & suitability

Across this study, **non-linear tree-based models (Random Forests)** consistently outperformed simpler linear models for both regression and classification. In the Manhattan regression, linear regression on log-price captured only a small fraction of the variance, while the Random Forest achieved a much higher R² and more realistic error levels, indicating that the relationship between size/units and price is non-linear, with interactions and thresholds that linear models cannot represent. For classification, the Random Forest neighborhood model again beat k-NN and especially logistic regression, suggesting that neighborhood decision boundaries in this feature space are complex and benefit from hierarchical splits rather than a single global linear separator.

However, these stronger models are still limited by **feature quality and domain shift**. When the Manhattan regression model is applied to Brooklyn, its performance collapses (negative R²), showing that even a flexible model trained on one borough's distribution cannot simply be transplanted to another borough with different price levels and housing stock. The classification models fail even more dramatically when moved across boroughs, because the label space itself changes; this is a reminder that good predictive performance is inherently **domain-specific** and that models must be re-trained, or at least adapted, when the domain or label space shifts.

Methodologically, what "worked" was combining **sensible cleaning (dropping non-arm's-length sales, fixing square-footage fields, filtering small classes), log transformations, and non-linear models**; what did not work was assuming that limited structural variables alone could fully explain prices, or that a model trained on Manhattan would generalize to Brooklyn without explicit location features. In a production setting, I would layer on richer covariates (latitude/longitude, transit accessibility, building-quality proxies, zoning, etc.) and likely move to gradient-boosted trees or other ensemble methods, but the main lessons about non-linearity, outliers, and domain dependence would remain the same.
@@ -0,0 +1,135 @@
## 1. One-Borough Analysis: Manhattan

I filtered the dataset to Manhattan by keeping only rows where `BOROUGH == 1`. After cleaning (parsing numeric fields, dropping implausible or missing values, etc.), I had 7,294 usable sales out of 96,088 raw Manhattan rows, plus 42,743 cleaned Brooklyn rows for later comparison.

### 1(a) Planned patterns, trends, and modeling approach

For Manhattan, the big-picture trends I wanted to look at were how sale price changes with building size (gross and land square footage), intensity of use (number of residential and commercial units), and building age (year built). Because NYC housing prices are known to be highly skewed with extreme luxury outliers, particularly in Manhattan's condo and co-op markets, I wanted to examine both raw prices and log-transformed prices to better see the central bulk of transactions.

I also expected nonlinear relationships, such as price per square foot starting to decrease for very large buildings, and there could be thresholds beyond which additional units change value in a nonadditive way. To detect these, I compared a simple linear regression on log sale price against a more flexible model, a random forest regression, which is well suited to modeling nonlinear relationships and interactions among predictors.

For the classification task, I treated "neighborhood" as the target and price/size variables as predictors, to see whether basic quantitative attributes are enough to distinguish markets like "Upper East Side", "Harlem", and "SoHo". I then compared three supervised models (k-NN, random forest, and multinomial logistic regression), using accuracy and macro-F1 as evaluation metrics.

---

### 1(b) Exploratory data analysis and outliers

I first cleaned the raw NYC file by converting strings to numeric for `SALE PRICE`, `LAND SQUARE FEET`, `GROSS SQUARE FEET`, and `YEAR BUILT`; dropping obvious invalid numeric codes (e.g., "0", "."); removing very small sales below $10,000 to exclude non-arm's-length transactions and recording artifacts; treating years before 1800 as missing; and requiring positive, non-missing values for land area, gross area, and year built. I set the remaining missing unit counts to 0 for residential, commercial, and total units.

In Manhattan, sale prices are extremely right-skewed: the minimum valid sale is about $10,050, the median $4.0M, the mean $16.3M, the 75th percentile $9.58M, and the maximum an enormous $2.40B. A few ultra-high-value transactions pull the mean far above the median, which is fairly normal for a high-end, expensive market like Manhattan.

Using the quartiles, I then created a rule for outliers based on the inter-quartile range: with Q1 ≈ $1.37M and Q3 ≈ $9.58M, the IQR is about $8.21M, so the upper "fence" is Q3 + 1.5 × IQR ≈ $21.9M. Any sale above that is classified as an outlier; this 1.5 × IQR rule is the usual box-plot definition of outliers. With that rule I found 756 outliers in Manhattan, while the maximum sale price is more than $2.39B.
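The same rule can be cross-checked with base R's boxplot statistics, which apply the 1.5 × IQR fences to hinges rather than exact quartiles, so counts can differ slightly. A sketch, with `sale_price` standing in for the cleaned Manhattan price vector:

```r
# boxplot.stats() returns the points outside the 1.5 * IQR fences in $out
out <- boxplot.stats(sale_price)$out
sum(out > median(sale_price, na.rm = TRUE))  # upper-tail outliers only
```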
The histogram of raw sale prices (Figure 1) has almost all sales piled up near the left-hand side, with a long low-frequency tail extending out toward billions of dollars; this visually confirms the heavy skew and extreme outliers.



Once I applied `log1p(sale_price)`, the histogram on the log scale (Figure 2) is much more symmetric and interpretable, compressing the ultra-luxury outliers into a reasonable range while still preserving their relative ordering.

Figure 2: manhattan sale price distribution (log scale)


Figure 3: manhattan sale price with outliers.


Finally, the scatterplot of gross square feet vs. sale price (with price on a log10 scale; Figure 4) displays a strong positive association (larger buildings sell for more) but also a lot of vertical spread, indicating that other factors beyond size drive price differences as well, namely location, building class, and quality. This is supported by the correlation analysis: sale price is moderately correlated with gross square feet (about 0.49), weakly with land area (about 0.16) and total units (about 0.17), and almost uncorrelated with year built (about 0.02).

Below is a scatterplot showing sale price vs. gross square feet for Manhattan. Figure 4: manhattan sale price vs gross square feet.

---

### 1(c) Regression analysis for sale price

I then created a modeling dataset with predictors for regression:

* `land_sqft`, `gross_sqft`,
* `year_built`,
* `res_units`, `comm_units`, and `total_units`,

and the target `log_price = log1p(sale_price)`, removing any rows with missing values in those fields. I then split the data into a 75% training set and 25% test set using `createDataPartition` to preserve the distribution of `log_price` (sketched below).
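A minimal sketch of that split with caret, assuming the modeling frame is called `reg_df` and already contains `log_price`:

```r
library(caret)
set.seed(42)

# createDataPartition samples within quantile groups of the outcome,
# so train and test get similar log_price distributions
idx <- createDataPartition(reg_df$log_price, p = 0.75, list = FALSE)
train_df <- reg_df[idx, ]
test_df <- reg_df[-idx, ]
```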
The baseline model was a multiple linear regression of `log_price` on all six predictors. On the held-out Manhattan test set this model achieved approximately R² = 0.19, RMSE = 1.69, and MAE = 1.26 on the log scale, meaning it explains only about 19% of the variance in log sale price and leaves relatively large residual errors. This suggests that linear relationships in these basic structural variables alone are not sufficient to capture the complexity of Manhattan real estate pricing.

I then trained a random forest regressor on the same predictors and target (sketched below). The random forest did considerably better on the Manhattan test set, with approximately R² = 0.70, RMSE = 1.03, and MAE = 0.66 on log price, more than tripling the explained variance relative to the linear model. This performance gain is consistent with the idea that random forests are effective at modeling nonlinear relationships and interactions without requiring me to specify them by hand.
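Roughly, the two models were fit like this (a sketch assuming `train_df`/`test_df` come from the split above; the forest's `ntree`/`mtry` values here are illustrative, not taken from the script):

```r
library(randomForest)

form <- log_price ~ land_sqft + gross_sqft + year_built +
  res_units + comm_units + total_units

lm_fit <- lm(form, data = train_df)
rf_fit <- randomForest(form, data = train_df, ntree = 300, mtry = 3)

# held-out R^2 for the forest
pred <- predict(rf_fit, newdata = test_df)
1 - sum((test_df$log_price - pred)^2) /
  sum((test_df$log_price - mean(test_df$log_price))^2)
```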
The predicted-vs-actual plot for Manhattan's random forest (Figure 5) has points clustered around the 45° line, particularly in the log-price mid-range, indicating generally accurate predictions, though with scatter at the extremes.

Figure 5: random forest – predicted vs actual log(1 + sale price) (manhattan)

![Figure 5](./plots/rf_pred_vs_actual_manhattan.png)

Figure 6: random forest residuals (manhattan)
![Figure 6](./plots/rf_resid_manhattan.png)

Overall, following fairly aggressive cleaning (dropping non-arm's-length sales, invalid years, and missing areas) and a log transformation, the random forest regression yields a reasonably strong model for Manhattan sale prices, whereas the linear model substantially underfits.

### 1(d) Classification: predicting neighborhood in Manhattan

For neighborhood classification I created a subset in which `neighborhood` is not missing and only neighborhoods with 100 or more sales are retained, in order to avoid tiny, noisy classes. After this filtering I had 6,792 records spanning 28 neighborhoods. I then selected the quantitative predictors

* `sale_price`, `land_sqft`, `gross_sqft`,
* `year_built`, `res_units`, `comm_units`, `total_units`

and dropped any rows with remaining missing values.

First, I fit a k-NN classifier with k = 7 and standardization (center/scale) applied to all predictors. On the Manhattan test set, k-NN achieved approximately accuracy = 0.48 and macro-F1 = 0.42, indicating that it correctly predicts the neighborhood for just under half of the held-out sales and gives moderate average F1 across classes. Next, I trained a random forest classifier using the same predictors with 300 trees and `mtry = 3`. This performed the best of the three, with accuracy of about 0.58 and macro-F1 of about 0.53.

Lastly, I fit a multinomial logistic regression model, whose performance, despite its interpretability, was substantially worse (accuracy about 0.25 and macro-F1 about 0.30): a linear model evidently cannot distinguish 28 neighborhoods using such numeric predictors alone. The confusion matrix of the random forest, a 28 × 28 table, reveals the most common confusions: geographically close or socio-economically similar neighborhoods, like different slices of the Upper East/West Side or adjacent parts of Harlem (which makes perfect sense given that their price and size profiles overlap heavily).

The cleaning for the classification task was meant to remove neighborhoods with too few samples (which would make F1 metrics unstable), enforce complete predictor data, and restrict attention to clearly labeled neighborhoods. Even after cleaning, the modest accuracy and macro-F1 remind me that neighborhood captures many qualitative aspects (exact location, school zones, views, amenities) that are not fully represented by simple size and unit counts.

---

## 2. Second-borough analysis: Brooklyn, using Manhattan models

For this second derived dataset I focused on Brooklyn (`BOROUGH == 3`). I repeated all of the same cleaning and feature-engineering steps: numeric parsing, filtering on sale price and physical characteristics, and construction of the same predictor variables (`land_sqft`, `gross_sqft`, `year_built`, `res_units`, `comm_units`, `total_units`, `sale_price`). This yielded 42,743 cleaned Brooklyn records, a far larger sample than Manhattan's 7,294 cleaned rows.

### 2(a) Applying Manhattan regression to Brooklyn

I tested how well the Manhattan-trained random forest regressor generalizes by applying it directly to the Brooklyn dataset. On Brooklyn, the model's performance fell to approximately R² = 0.24, RMSE = 1.31, and MAE = 1.08 on log price, far worse than its R² = 0.70 and MAE = 0.66 on Manhattan.

Figure 7: The predicted-vs-actual plot for Brooklyn displays a broad cloud of points with a strong linear trend but much more scatter around the 45° line than in Manhattan, which suggests that the model systematically under- and over-predicts across different price ranges.



The residual plot in Figure 8 shows residuals that are not symmetrically centered around zero; certain ranges of predicted price produce clusters of large negative residuals, indicative of consistent overestimation in parts of the Brooklyn market.



This poor generalization makes sense: Manhattan and Brooklyn have different mixes of property types (e.g., ultra-luxury high-rise condos vs. rowhouses and small multi-family homes) and different price levels, even as Brooklyn has become increasingly expensive. A model trained only on Manhattan data learns relationships calibrated to Manhattan's particular price structure and building stock, so when it is applied to Brooklyn it picks up some broad patterns (e.g., larger buildings are more expensive) but misses borough-specific effects, leading to a large drop in R².

---

### 2(b) Manhattan neighborhood classifiers applied to Brooklyn

For classification, I built a Brooklyn subset with the same methodology as Manhattan, keeping only neighborhoods with at least 100 sales and removing rows with missing predictors, which gave a total of 42,533 records across 56 Brooklyn neighborhoods. I then applied the three Manhattan-trained classifiers (k-NN, random forest, and multinomial logistic regression) to predict Brooklyn neighborhoods.

The basic problem is a fundamental label mismatch: the Manhattan models' output classes are 28 Manhattan neighborhoods, while the true labels in the Brooklyn data are 56 *different* Brooklyn neighborhoods. The resulting contingency tables are thus 56 × 28, with no meaningful diagonal: each Brooklyn neighborhood systematically gets mapped to whichever Manhattan label best matches its numeric features, but there is no notion of a "correct" prediction in terms of neighborhood identity.

Because of this mismatch, the usual metrics (accuracy, precision, recall, F1) are not meaningful: every prediction is technically wrong with respect to the true Brooklyn neighborhood labels. The contingency tables are still useful descriptively, since they show which Manhattan neighborhoods the Brooklyn neighborhoods *look like* in terms of price and size (for example, some high-end Brooklyn areas may be mapped frequently to the Upper East/West Side or SoHo/Tribeca). But as a generalization test of a neighborhood classifier, these results show that a model trained in one borough does not transfer to another when the class labels themselves change. Building such a table is a one-liner, as sketched below.
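For instance, one of those descriptive tables can be built directly with `table()` (a sketch; `brooklyn_clf` is the Brooklyn subset and `pred_manhattan` is a placeholder for one model's predictions):

```r
tab <- table(true = brooklyn_clf$neighborhood, pred = pred_manhattan)
dim(tab)  # 56 true Brooklyn labels x 28 predicted Manhattan labels
```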
The Manhattan neighborhood classifiers therefore do not generalize to Brooklyn, in the sense of predicting *Brooklyn* neighborhood names; they act more like a rough, borough-agnostic similarity mapping.

### 2(c) Further remarks and confidence

One thing that stood out right away is how much more filtering Manhattan needed. Only about 7.6% of its original rows survived the cleaning process, while Brooklyn kept a far larger share. That points to either more missing or odd entries in the Manhattan data, or simply stricter criteria knocking out extreme cases there. Even after cleaning, both boroughs still have very skewed price distributions with plenty of outliers, which makes modeling tough in a market where a handful of ultra-expensive properties dominate the landscape.

Within Manhattan, I'm fairly confident in the regression results for mid-range homes (random forest handles that part of the market well), but the model struggles with luxury listings and doesn't transfer cleanly across boroughs. The weak performance of multinomial logistic regression, along with the only-okay results from k-NN and random forest for neighborhood prediction, makes it clear that numeric features alone aren't enough. Getting neighborhood right would require richer location details and stronger categorical features.

### Overall conclusions about model types and suitability

A log transform on sale price and basic cleaning (removing outliers, invalid years, missing area data, and tiny non-arm's-length transactions) are the minimum steps needed before modeling NYC housing data. Without them, extreme values and inconsistent records overwhelm everything and drag down model performance. The contrast between the raw and log-scaled price histograms, and between linear regression and random forest, shows how strongly the distribution's shape affects model behavior.

Using only structural features, linear regression reaches an R² of about 0.19 for Manhattan, which reflects the strong nonlinearities and missing variables at play. Random forest, on the other hand, captures most of the variation, with an R² around 0.70 and far smaller errors, highlighting the advantage of flexible ensemble methods in a market as messy as this one.

But even the best Manhattan model doesn't travel well: applying it to Brooklyn drops performance to roughly R² = 0.24. That makes it obvious that models trained in one borough don't work elsewhere without accounting for differences in market structure and price levels. Purely structural, cross-sectional models miss the spatial and neighborhood effects needed for transferability.

For neighborhood classification, random forest again does better than k-NN or multinomial logistic regression, but overall accuracy is still modest (about 0.58, macro-F1 around 0.53). This reinforces that simple numerical features (price, area, and the like) aren't enough to reliably identify neighborhoods. More detailed spatial information, building-class categories, and possibly time-related features would likely improve results.

Overall, random forests are the strongest of the models I tested. They handle nonlinear, heavy-tailed relationships and deliver solid within-borough predictions, though they're harder to interpret and struggle when applied to a different borough. Linear and logistic models are easy to explain but miss too much of the structure here. Taken together, the results point toward flexible nonlinear models for within-borough price predictions, with careful feature engineering and borough-specific training needed to generalize across NYC.
@@ -0,0 +1,553 @@
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.ticker import FuncFormatter

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    r2_score,
    mean_squared_error,
    mean_absolute_error,
    accuracy_score,
    precision_recall_fscore_support,
    confusion_matrix,
)
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

file_path = "Given/NYC_Citywide_Annualized_Calendar_Sales_Update_20241107.csv"

cols_needed = [
    "BOROUGH", "NEIGHBORHOOD", "BUILDING CLASS CATEGORY",
    "TAX CLASS AS OF FINAL ROLL", "BLOCK", "LOT",
    "BUILDING CLASS AS OF FINAL ROLL", "ZIP CODE",
    "RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS",
    "LAND SQUARE FEET", "GROSS SQUARE FEET", "YEAR BUILT",
    "TAX CLASS AT TIME OF SALE", "BUILDING CLASS AT TIME OF SALE",
    "SALE PRICE", "SALE DATE",
]

# load everything as strings; numeric parsing happens in clean_borough()
nyc = pd.read_csv(file_path, dtype=str, low_memory=False)
cols_present = [c for c in cols_needed if c in nyc.columns]
nyc = nyc[cols_present]

# force borough numeric
nyc["BOROUGH"] = pd.to_numeric(nyc["BOROUGH"], errors="coerce")

manhattan_raw = nyc[nyc["BOROUGH"] == 1].copy()
brooklyn_raw = nyc[nyc["BOROUGH"] == 3].copy()

print(f"raw manhattan rows: {len(manhattan_raw)}")
print(f"raw brooklyn rows: {len(brooklyn_raw)}")


def clean_borough(df: pd.DataFrame, min_price: float = 10000.0) -> pd.DataFrame:
    """clean borough-level dataframe similarly to the r function."""
    df = df.copy()

    def parse_numeric(series: pd.Series) -> pd.Series:
        """numeric from char / factor with commas and junk values."""
        s = series.astype(str).str.strip()
        s = s.replace(
            {
                "0": np.nan,
                "0.0": np.nan,
                "- 0": np.nan,
                "": np.nan,
                ".": np.nan,
                "NA": np.nan,
                "NaN": np.nan,
            }
        )
        s = s.str.replace(",", "", regex=False)
        return pd.to_numeric(s, errors="coerce")

    # convert numeric columns
    if "SALE PRICE" in df.columns:
        df["SALE PRICE"] = parse_numeric(df["SALE PRICE"])
    if "LAND SQUARE FEET" in df.columns:
        df["LAND SQUARE FEET"] = parse_numeric(df["LAND SQUARE FEET"])
    if "GROSS SQUARE FEET" in df.columns:
        df["GROSS SQUARE FEET"] = parse_numeric(df["GROSS SQUARE FEET"])
    if "YEAR BUILT" in df.columns:
        df["YEAR BUILT"] = parse_numeric(df["YEAR BUILT"])

    unit_cols = ["RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS"]
    for col in unit_cols:
        if col in df.columns:
            df[col] = parse_numeric(df[col])

    # drop non-arms-length / tiny sales
    if "SALE PRICE" in df.columns:
        df = df[df["SALE PRICE"].notna()]
        df = df[df["SALE PRICE"] > min_price]

    # very old or zero years --> missing
    if "YEAR BUILT" in df.columns:
        df.loc[df["YEAR BUILT"] < 1800, "YEAR BUILT"] = np.nan

    # need usable size / year
    required_cols = ["GROSS SQUARE FEET", "LAND SQUARE FEET", "YEAR BUILT"]
    if all(c in df.columns for c in required_cols):
        df = df[
            df["GROSS SQUARE FEET"].notna()
            & df["LAND SQUARE FEET"].notna()
            & df["YEAR BUILT"].notna()
            & (df["GROSS SQUARE FEET"] > 0)
            & (df["LAND SQUARE FEET"] > 0)
        ]

    # fill missing units with 0
    for col in unit_cols:
        if col in df.columns:
            df[col] = df[col].fillna(0)

    # rename neighborhood
    if "NEIGHBORHOOD" in df.columns:
        df = df.rename(columns={"NEIGHBORHOOD": "neighborhood"})

    # create lowercase model-friendly columns
    if "LAND SQUARE FEET" in df.columns:
        df["land_sqft"] = df["LAND SQUARE FEET"]
    if "GROSS SQUARE FEET" in df.columns:
        df["gross_sqft"] = df["GROSS SQUARE FEET"]
    if "YEAR BUILT" in df.columns:
        df["year_built"] = df["YEAR BUILT"]
    if "RESIDENTIAL UNITS" in df.columns:
        df["res_units"] = df["RESIDENTIAL UNITS"]
    else:
        df["res_units"] = 0
    if "COMMERCIAL UNITS" in df.columns:
        df["comm_units"] = df["COMMERCIAL UNITS"]
    else:
        df["comm_units"] = 0
    if "TOTAL UNITS" in df.columns:
        df["total_units"] = df["TOTAL UNITS"]
    else:
        df["total_units"] = 0
    if "SALE PRICE" in df.columns:
        df["sale_price"] = df["SALE PRICE"]

    return df


manhattan = clean_borough(manhattan_raw)
brooklyn = clean_borough(brooklyn_raw)

print(f"clean manhattan rows: {len(manhattan)}")
print(f"clean brooklyn rows: {len(brooklyn)}")


# manhattan exploratory data analysis

# summary stats for sale price
summary_manhattan_price = manhattan["sale_price"].describe()
print("summary of manhattan sale_price:")
print(summary_manhattan_price)

quantiles_manhattan = manhattan["sale_price"].quantile(
    [0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
)
print("selected quantiles for manhattan sale_price:")
print(quantiles_manhattan)

# iqr-based outlier bounds (1.5 * iqr box-plot rule)
q1 = quantiles_manhattan.loc[0.25]
q3 = quantiles_manhattan.loc[0.75]
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(f"manhattan iqr upper bound: {upper_bound}")
print(
    f"manhattan max sale price: {manhattan['sale_price'].max(skipna=True)}"
)

outlier_mask = (manhattan["sale_price"] < lower_bound) | (
    manhattan["sale_price"] > upper_bound
)
print(f"number of sale price outliers: {outlier_mask.sum()}")

# correlation with other numeric vars
num_cols = [
    "sale_price", "gross_sqft", "land_sqft",
    "year_built", "res_units", "comm_units", "total_units",
]

corr_manhattan = manhattan[num_cols].corr()
print("correlation with sale_price:")
print(corr_manhattan["sale_price"])


# formatter for comma-separated tick labels
def comma_format(x, pos):
    try:
        return f"{int(x):,}"
    except Exception:
        return str(x)


# histogram of raw sale prices
fig, ax = plt.subplots()
ax.hist(manhattan["sale_price"], bins=50)
ax.set_title("manhattan sale price distribution (raw)")
ax.set_xlabel("sale price (usd)")
ax.set_ylabel("count of sales")
ax.xaxis.set_major_formatter(FuncFormatter(comma_format))
plt.tight_layout()
plt.show()

# histogram of log(1 + sale price)
fig, ax = plt.subplots()
ax.hist(np.log1p(manhattan["sale_price"]), bins=50)
ax.set_title("manhattan sale price distribution (log scale)")
ax.set_xlabel("log(1 + sale price)")
ax.set_ylabel("count of sales")
plt.tight_layout()
plt.show()

# boxplot of sale price (for outliers)
fig, ax = plt.subplots()
ax.boxplot(manhattan["sale_price"].values, vert=True)
ax.set_title("manhattan sale price with outliers")
ax.set_ylabel("sale price (usd)")
ax.set_xticks([])
ax.yaxis.set_major_formatter(FuncFormatter(comma_format))
plt.tight_layout()
plt.show()

# scatter: gross sqft vs sale price (log y)
fig, ax = plt.subplots()
ax.scatter(manhattan["gross_sqft"], manhattan["sale_price"], alpha=0.3)
ax.set_yscale("log")
ax.set_title("manhattan sale price vs gross square feet")
ax.set_xlabel("gross square feet")
ax.set_ylabel("sale price (log10-ish scale)")
plt.tight_layout()
plt.show()


# manhattan regression analysis

reg_vars = [
    "land_sqft", "gross_sqft", "year_built",
    "res_units", "comm_units", "total_units",
]

reg_df = manhattan[reg_vars + ["sale_price"]].dropna()
reg_df["log_price"] = np.log1p(reg_df["sale_price"])

X = reg_df[reg_vars]
y = reg_df["log_price"]

# 75/25 train-test split on the log-price target
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE
)

# linear regression on log_price
lm = LinearRegression()
lm.fit(X_train_reg, y_train_reg)
lm_pred = lm.predict(X_test_reg)

r2_lm = r2_score(y_test_reg, lm_pred)
rmse_lm = np.sqrt(mean_squared_error(y_test_reg, lm_pred))
mae_lm = mean_absolute_error(y_test_reg, lm_pred)

print("\nlinear model (log price) metrics on manhattan:")
print(f"r2: {r2_lm:.4f} rmse: {rmse_lm:.4f} mae: {mae_lm:.4f}")

# random forest regression on log_price
rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_features=3,
    max_leaf_nodes=100,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)
rf_reg.fit(X_train_reg, y_train_reg)
rf_pred = rf_reg.predict(X_test_reg)

r2_rf = r2_score(y_test_reg, rf_pred)
rmse_rf = np.sqrt(mean_squared_error(y_test_reg, rf_pred))
mae_rf = mean_absolute_error(y_test_reg, rf_pred)

print("\nrandom forest (log price) metrics on manhattan:")
print(f"r2: {r2_rf:.4f} rmse: {rmse_rf:.4f} mae: {mae_rf:.4f}")

# manhattan predicted vs actual plot
rf_diag_df = pd.DataFrame(
    {
        "actual": y_test_reg.values,
        "predicted": rf_pred,
    }
)
fig, ax = plt.subplots()
ax.scatter(rf_diag_df["actual"], rf_diag_df["predicted"], alpha=0.3)
min_val = min(rf_diag_df["actual"].min(), rf_diag_df["predicted"].min())
max_val = max(rf_diag_df["actual"].max(), rf_diag_df["predicted"].max())
ax.plot([min_val, max_val], [min_val, max_val], linestyle="--")
ax.set_title(
    "random forest: predicted vs actual log(1 + sale price) (manhattan)"
)
ax.set_xlabel("actual log(1 + sale price)")
ax.set_ylabel("predicted log(1 + sale price)")
plt.tight_layout()
plt.show()

# manhattan residuals vs predicted plot
rf_diag_df["residual"] = (
    rf_diag_df["actual"] - rf_diag_df["predicted"]
)
fig, ax = plt.subplots()
ax.scatter(rf_diag_df["predicted"], rf_diag_df["residual"], alpha=0.3)
ax.axhline(0.0, linestyle="--")
ax.set_title("random forest residuals (manhattan)")
ax.set_xlabel("predicted log(1 + sale price)")
ax.set_ylabel("residual")
plt.tight_layout()
plt.show()

# use manhattan regression model on brooklyn (no retraining)
brook_reg_df = brooklyn[reg_vars + ["sale_price"]].dropna()
brook_reg_df["log_price"] = np.log1p(brook_reg_df["sale_price"])

X_brook = brook_reg_df[reg_vars]
y_brook = brook_reg_df["log_price"]

rf_pred_brook = rf_reg.predict(X_brook)

r2_rf_brook = r2_score(y_brook, rf_pred_brook)
rmse_rf_brook = np.sqrt(mean_squared_error(y_brook, rf_pred_brook))
mae_rf_brook = mean_absolute_error(y_brook, rf_pred_brook)

print(
    "\nrandom forest (log price) metrics on brooklyn "
    "(trained on manhattan):"
)
print(
    f"r2: {r2_rf_brook:.4f} rmse: {rmse_rf_brook:.4f} "
    f"mae: {mae_rf_brook:.4f}"
)

brook_diag_df = pd.DataFrame(
    {
        "actual": y_brook.values,
        "predicted": rf_pred_brook,
    }
)
brook_diag_df["residual"] = (
    brook_diag_df["actual"] - brook_diag_df["predicted"]
)

fig, ax = plt.subplots()
ax.scatter(brook_diag_df["actual"], brook_diag_df["predicted"], alpha=0.3)
min_val = min(brook_diag_df["actual"].min(), brook_diag_df["predicted"].min())
max_val = max(brook_diag_df["actual"].max(), brook_diag_df["predicted"].max())
ax.plot([min_val, max_val], [min_val, max_val], linestyle="--")
ax.set_title(
    "random forest: manhattan model on brooklyn (log price)"
)
ax.set_xlabel("actual log(1 + sale price) (brooklyn)")
ax.set_ylabel("predicted log(1 + sale price)")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots()
ax.scatter(brook_diag_df["predicted"], brook_diag_df["residual"], alpha=0.3)
ax.axhline(0.0, linestyle="--")
ax.set_title(
    "residuals on brooklyn using manhattan random forest"
)
ax.set_xlabel("predicted log(1 + sale price)")
ax.set_ylabel("residual")
plt.tight_layout()
plt.show()


# classification: manhattan predict neighborhood

clf_vars = [
    "sale_price", "land_sqft", "gross_sqft",
    "year_built", "res_units", "comm_units", "total_units",
]


def prepare_classification_df(
    df: pd.DataFrame, min_per_class: int = 100
) -> pd.DataFrame:
    """prepare classification df similar to r code."""
    tmp = df.copy()
    if "neighborhood" not in tmp.columns:
        raise ValueError("neighborhood column missing")
    tmp = tmp[tmp["neighborhood"].notna()]

    # keep only neighborhoods with at least min_per_class sales
    counts = tmp["neighborhood"].value_counts()
    keep = counts[counts >= min_per_class].index
    tmp = tmp[tmp["neighborhood"].isin(keep)]

    cols = ["neighborhood"] + clf_vars
    tmp = tmp[cols].dropna()

    tmp["neighborhood"] = tmp["neighborhood"].astype("category")
    return tmp


manhattan_clf = prepare_classification_df(
    manhattan, min_per_class=100
)

print(
    f"\nmanhattan classification subset rows: {len(manhattan_clf)}"
)
print(
    "manhattan neighborhoods: "
    f"{manhattan_clf['neighborhood'].nunique()}"
)

X_clf = manhattan_clf[clf_vars]
y_clf = manhattan_clf["neighborhood"]

# stratified split keeps class proportions similar in train and test
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf,
    y_clf,
    test_size=0.25,
    stratify=y_clf,
    random_state=RANDOM_STATE,
)


def macro_f1_score(y_true, y_pred) -> float:
    """macro f1 similar to caret::confusionMatrix byClass averaging."""
    _, _, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return float(f1)


# k-nn classifier with scaling
knn_pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
    ]
)
knn_pipeline.fit(X_train_clf, y_train_clf)
knn_pred = knn_pipeline.predict(X_test_clf)

acc_knn = accuracy_score(y_test_clf, knn_pred)
macro_f1_knn = macro_f1_score(y_test_clf, knn_pred)

print(
    f"\nknn (manhattan) accuracy: {acc_knn:.4f} "
    f"macro f1: {macro_f1_knn:.4f}"
)

# random forest classifier
rf_clf = RandomForestClassifier(
    n_estimators=300,
    max_features=3,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)
rf_clf.fit(X_train_clf, y_train_clf)
rf_clf_pred = rf_clf.predict(X_test_clf)

acc_rf_clf = accuracy_score(y_test_clf, rf_clf_pred)
macro_f1_rf_clf = macro_f1_score(y_test_clf, rf_clf_pred)

print(
    "\nrandom forest classifier (manhattan) accuracy: "
    f"{acc_rf_clf:.4f} macro f1: {macro_f1_rf_clf:.4f}"
)

# contingency table (rf example)
labels = sorted(y_clf.unique())
rf_cm_table = confusion_matrix(
    y_test_clf, rf_clf_pred, labels=labels
)
print("rf confusion matrix shape:", rf_cm_table.shape)
print("rf confusion matrix (rows=true, cols=pred):")
print(rf_cm_table)

# multinomial logistic regression
logit_clf = LogisticRegression(
    multi_class="multinomial",
    max_iter=2000,
    solver="lbfgs",
    n_jobs=-1,
)
logit_clf.fit(X_train_clf, y_train_clf)
logit_pred = logit_clf.predict(X_test_clf)

acc_logit = accuracy_score(y_test_clf, logit_pred)
macro_f1_logit = macro_f1_score(y_test_clf, logit_pred)

print(
    "\nmultinomial logistic regression (manhattan) accuracy: "
    f"{acc_logit:.4f} macro f1: {macro_f1_logit:.4f}"
)

# use manhattan classifiers on brooklyn

brooklyn_clf = prepare_classification_df(
    brooklyn, min_per_class=100
)
print(
    f"\nbrooklyn classification subset rows: {len(brooklyn_clf)}"
)
print(
    "brooklyn neighborhoods: "
    f"{brooklyn_clf['neighborhood'].nunique()}"
)

X_brook_clf = brooklyn_clf[clf_vars]
y_brook_clf = brooklyn_clf["neighborhood"]

# predictions from manhattan-trained models on brooklyn data
knn_pred_brook = knn_pipeline.predict(X_brook_clf)
rf_pred_brook_clf = rf_clf.predict(X_brook_clf)
logit_pred_brook = logit_clf.predict(X_brook_clf)

# contingency tables (true brooklyn neigh vs predicted manhattan neigh)
tab_knn_brook = pd.crosstab(
    y_brook_clf, knn_pred_brook,
    rownames=["true"], colnames=["pred"]
)

tab_rf_brook = pd.crosstab(
    y_brook_clf, rf_pred_brook_clf,
    rownames=["true"], colnames=["pred"]
)

tab_logit_brook = pd.crosstab(
    y_brook_clf, logit_pred_brook,
    rownames=["true"], colnames=["pred"]
)

print("\ncontingency table dimensions (knn):", tab_knn_brook.shape)
print(tab_knn_brook)

print(
    "contingency table dimensions (random forest):",
    tab_rf_brook.shape,
)
print(tab_rf_brook)

print("contingency table dimensions (logit):", tab_logit_brook.shape)
print(tab_logit_brook)
@@ -0,0 +1,504 @@
# install.packages(
#   c("dplyr", "ggplot2", "randomForest", "caret", "nnet", "e1071", "scales"),
#   repos = "https://cloud.r-project.org"
# )

library(dplyr)
library(ggplot2)
library(randomForest)
library(caret)
library(nnet)
library(e1071)
library(scales)

# load data / basic subsets
options(stringsAsFactors = FALSE)
set.seed(42L)

file_path <- "Given/NYC_Citywide_Annualized_Calendar_Sales_Update_20241107.csv"

# columns we actually need
cols_needed <- c(
  "BOROUGH", "NEIGHBORHOOD", "BUILDING CLASS CATEGORY",
  "TAX CLASS AS OF FINAL ROLL", "BLOCK", "LOT",
  "BUILDING CLASS AS OF FINAL ROLL", "ZIP CODE",
  "RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS",
  "LAND SQUARE FEET", "GROSS SQUARE FEET", "YEAR BUILT",
  "TAX CLASS AT TIME OF SALE", "BUILDING CLASS AT TIME OF SALE",
  "SALE PRICE", "SALE DATE"
)

nyc <- read.csv(file_path, stringsAsFactors = FALSE, check.names = FALSE)
nyc <- nyc[, intersect(cols_needed, colnames(nyc))]

# force borough numeric
nyc$BOROUGH <- suppressWarnings(as.numeric(nyc$BOROUGH))
manhattan_raw <- nyc %>% filter(BOROUGH == 1)
brooklyn_raw <- nyc %>% filter(BOROUGH == 3)

cat("raw manhattan rows:", nrow(manhattan_raw), "\n")
cat("raw brooklyn rows:", nrow(brooklyn_raw), "\n")


clean_borough <- function(df, min_price = 10000) {
  # numeric from char / factor with commas
  parse_numeric <- function(x) {
    x <- as.character(x)
    x <- trimws(x)
    x[x %in% c("0", "0.0", "- 0", "", ".", "NA", "NaN")] <- NA
    x <- gsub(",", "", x, fixed = TRUE)
    suppressWarnings(as.numeric(x))
  }
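
  # editorial aside, with hypothetical inputs (not rows from the dataset):
  #   parse_numeric(c("1,250,000", "- 0", "", "1987"))  ->  1250000 NA NA 1987
  # i.e. comma-grouped figures become numbers and placeholder zeros become NA.
  # it is defined inside clean_borough, so copy it to the top level to try it.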

  # convert the money / size / year columns that arrive as text
  df$`SALE PRICE` <- parse_numeric(df$`SALE PRICE`)
  df$`LAND SQUARE FEET` <- parse_numeric(df$`LAND SQUARE FEET`)
  df$`GROSS SQUARE FEET` <- parse_numeric(df$`GROSS SQUARE FEET`)
  df$`YEAR BUILT` <- parse_numeric(df$`YEAR BUILT`)

  unit_cols <- c("RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS")
  for (col in unit_cols) {
    if (col %in% names(df)) {
      df[[col]] <- parse_numeric(df[[col]])
    }
  }

  # drop non-arms-length / tiny sales
  df <- df %>%
    filter(!is.na(`SALE PRICE`)) %>%
    filter(`SALE PRICE` > min_price)

  # very old or zero years --> missing
  df$`YEAR BUILT`[df$`YEAR BUILT` < 1800] <- NA

  # need usable size / year
  df <- df %>%
    filter(
      !is.na(`GROSS SQUARE FEET`),
      !is.na(`LAND SQUARE FEET`),
      !is.na(`YEAR BUILT`),
      `GROSS SQUARE FEET` > 0,
      `LAND SQUARE FEET` > 0
    )

  # fill missing unit counts with 0
  for (col in unit_cols) {
    if (col %in% names(df)) {
      df[[col]][is.na(df[[col]])] <- 0
    }
  }

  # rename / add snake_case convenience columns
  df <- df %>%
    rename(
      neighborhood = NEIGHBORHOOD
    ) %>%
    mutate(
      land_sqft = `LAND SQUARE FEET`,
      gross_sqft = `GROSS SQUARE FEET`,
      year_built = `YEAR BUILT`,
      res_units = `RESIDENTIAL UNITS`,
      comm_units = `COMMERCIAL UNITS`,
      total_units = `TOTAL UNITS`,
      sale_price = `SALE PRICE`
    )

  df
}

# clean both boroughs with the same rules
manhattan <- clean_borough(manhattan_raw)
brooklyn <- clean_borough(brooklyn_raw)

cat("clean manhattan rows:", nrow(manhattan), "\n")
cat("clean brooklyn rows:", nrow(brooklyn), "\n")

# manhattan exploratory data analysis

# summary stats for sale price
summary_manhattan_price <- summary(manhattan$sale_price)
print(summary_manhattan_price)

quantiles_manhattan <- quantile(
  manhattan$sale_price,
  probs = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99),
  na.rm = TRUE
)

print(quantiles_manhattan)

# iqr-based outlier bounds
q1 <- quantiles_manhattan[1]
q3 <- quantiles_manhattan[3]
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

cat("manhattan iqr upper bound:", upper_bound, "\n")
cat("manhattan max sale price:", max(manhattan$sale_price, na.rm = TRUE), "\n")

outlier_mask <- (manhattan$sale_price < lower_bound) |
  (manhattan$sale_price > upper_bound)
cat("number of sale price outliers:", sum(outlier_mask, na.rm = TRUE), "\n")


# correlation with other numeric vars
num_cols <- c(
  "sale_price", "gross_sqft", "land_sqft",
  "year_built", "res_units", "comm_units", "total_units"
)

corr_manhattan <- cor(manhattan[, num_cols], use = "complete.obs")
print(corr_manhattan[, "sale_price"])


# histogram of raw sale prices
p_hist_raw <- ggplot(manhattan, aes(x = sale_price)) +
  geom_histogram(bins = 50, color = "black", fill = NA) +
  scale_x_continuous(labels = comma) +
  labs(
    title = "manhattan sale price distribution (raw)",
    x = "sale price (usd)",
    y = "count of sales"
  )


# histogram of log(1 + sale price)
p_hist_log <- ggplot(manhattan, aes(x = log1p(sale_price))) +
  geom_histogram(bins = 50, color = "black", fill = NA) +
  labs(
    title = "manhattan sale price distribution (log scale)",
    x = "log(1 + sale price)",
    y = "count of sales"
  )


# boxplot of sale price (to show outliers)
p_box <- ggplot(manhattan, aes(y = sale_price)) +
  geom_boxplot(outlier.alpha = 0.4) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "manhattan sale price with outliers",
    y = "sale price (usd)",
    x = ""
  )


# scatter: gross sqft vs sale price (log y)
p_scatter <- ggplot(manhattan, aes(x = gross_sqft, y = sale_price)) +
  geom_point(alpha = 0.3) +
  scale_y_continuous(trans = "log10", labels = comma) +
  labs(
    title = "manhattan sale price vs gross square feet",
    x = "gross square feet",
    y = "sale price (log10 scale)"
  )


# print or save plots as needed
print(p_hist_raw)
print(p_hist_log)
print(p_box)
print(p_scatter)

# regression analysis (manhattan)

reg_vars <- c(
  "land_sqft", "gross_sqft", "year_built",
  "res_units", "comm_units", "total_units"
)

reg_df <- manhattan %>%
  select(all_of(reg_vars), sale_price) %>%
  tidyr::drop_na()

reg_df$log_price <- log1p(reg_df$sale_price)
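
# aside (illustrative): log1p keeps zero-adjacent prices finite and compresses
# the heavy right tail; expm1 inverts it when predictions must be reported
# back in dollars, e.g. expm1(log1p(9575000)) recovers 9575000.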

set.seed(42L)

train_idx_reg <- createDataPartition(reg_df$log_price, p = 0.75, list = FALSE)
train_reg <- reg_df[train_idx_reg, ]
test_reg <- reg_df[-train_idx_reg, ]

# linear regression
lm_fit <- lm(
  log_price ~ land_sqft + gross_sqft + year_built +
    res_units + comm_units + total_units,
  data = train_reg
)

lm_pred <- predict(lm_fit, newdata = test_reg)
r2_lm <- cor(test_reg$log_price, lm_pred)^2
rmse_lm <- sqrt(mean((test_reg$log_price - lm_pred)^2))
mae_lm <- mean(abs(test_reg$log_price - lm_pred))

cat("\nlinear model (log price) metrics on manhattan:\n")
cat("r2:", r2_lm, " rmse:", rmse_lm, " mae:", mae_lm, "\n")

# random forest regression on log price
set.seed(42L)

rf_fit <- randomForest(
  x = train_reg[, reg_vars],
  y = train_reg$log_price,
  ntree = 200,
  mtry = 3,
  maxnodes = 100,
  importance = TRUE
)

rf_pred <- predict(rf_fit, newdata = test_reg[, reg_vars])
r2_rf <- cor(test_reg$log_price, rf_pred)^2
rmse_rf <- sqrt(mean((test_reg$log_price - rf_pred)^2))
mae_rf <- mean(abs(test_reg$log_price - rf_pred))

cat("\nrandom forest (log price) metrics on manhattan:\n")
cat("r2:", r2_rf, " rmse:", rmse_rf, " mae:", mae_rf, "\n")

# predicted vs actual plot (manhattan)
rf_diag_df <- data.frame(
  actual = test_reg$log_price,
  predicted = rf_pred
)

p_rf_pred_vs_actual <- ggplot(rf_diag_df, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "random forest: predicted vs actual log(1 + sale price) (manhattan)",
    x = "actual log(1 + sale price)",
    y = "predicted log(1 + sale price)"
  )


# residuals vs predicted plot (manhattan)
rf_diag_df$residual <- rf_diag_df$actual - rf_diag_df$predicted
p_rf_resid <- ggplot(rf_diag_df, aes(x = predicted, y = residual)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "random forest residuals (manhattan)",
    x = "predicted log(1 + sale price)",
    y = "residual"
  )

print(p_rf_pred_vs_actual)
print(p_rf_resid)


# apply manhattan regression model to brooklyn

brook_reg_df <- brooklyn %>%
  select(all_of(reg_vars), sale_price) %>%
  tidyr::drop_na()
brook_reg_df$log_price <- log1p(brook_reg_df$sale_price)

rf_pred_brook <- predict(rf_fit, newdata = brook_reg_df[, reg_vars])
r2_rf_brook <- cor(brook_reg_df$log_price, rf_pred_brook)^2

rmse_rf_brook <- sqrt(mean((brook_reg_df$log_price - rf_pred_brook)^2))
mae_rf_brook <- mean(abs(brook_reg_df$log_price - rf_pred_brook))

cat("\nrandom forest (log price) metrics on brooklyn (trained on manhattan):\n")
cat("r2:", r2_rf_brook, " rmse:", rmse_rf_brook, " mae:", mae_rf_brook, "\n")

brook_diag_df <- data.frame(
  actual = brook_reg_df$log_price,
  predicted = rf_pred_brook
)

p_brook_pred_vs_actual <- ggplot(brook_diag_df, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "random forest: manhattan model on brooklyn (log price)",
    x = "actual log(1 + sale price) (brooklyn)",
    y = "predicted log(1 + sale price)"
  )


brook_diag_df$residual <- brook_diag_df$actual - brook_diag_df$predicted

p_brook_resid <- ggplot(brook_diag_df, aes(x = predicted, y = residual)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "residuals on brooklyn using manhattan random forest",
    x = "predicted log(1 + sale price)",
    y = "residual"
  )

print(p_brook_pred_vs_actual)
print(p_brook_resid)


# classification: manhattan predict neighborhood

clf_vars <- c(
  "sale_price", "land_sqft", "gross_sqft",
  "year_built", "res_units", "comm_units", "total_units"
)

prepare_classification_df <- function(df, min_per_class = 100L) {
  df <- df %>%
    filter(!is.na(neighborhood))

  counts <- table(df$neighborhood)
  keep <- names(counts[counts >= min_per_class])

  df <- df %>%
    filter(neighborhood %in% keep) %>%
    mutate(neighborhood = factor(neighborhood)) %>%
    select(neighborhood, all_of(clf_vars)) %>%
    tidyr::drop_na()

  df
}

manhattan_clf <- prepare_classification_df(manhattan, min_per_class = 100L)

cat("\nmanhattan classification subset rows:", nrow(manhattan_clf), "\n")
cat("manhattan neighborhoods:", nlevels(manhattan_clf$neighborhood), "\n")

set.seed(42L)

train_idx_clf <- createDataPartition(manhattan_clf$neighborhood, p = 0.75, list = FALSE)
train_clf <- manhattan_clf[train_idx_clf, ]
test_clf <- manhattan_clf[-train_idx_clf, ]

# helper function for macro f1 from a confusionMatrix object
macro_f1_from_cm <- function(cm_obj) {
  byc <- cm_obj$byClass

  if (!is.matrix(byc)) {
    precision <- byc["Pos Pred Value"]
    recall <- byc["Sensitivity"]
    return(2 * precision * recall / (precision + recall))
  } else {
    precision <- byc[, "Pos Pred Value"]
    recall <- byc[, "Sensitivity"]
    f1 <- 2 * precision * recall / (precision + recall)
    mean(f1, na.rm = TRUE)
  }
}

# k-nn classifier

ctrl_none <- trainControl(method = "none")
set.seed(42L)

knn_fit <- train(
  neighborhood ~ sale_price + land_sqft + gross_sqft + year_built +
    res_units + comm_units + total_units,
  data = train_clf,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneGrid = data.frame(k = 7),
  trControl = ctrl_none
)

knn_pred <- predict(knn_fit, newdata = test_clf)
cm_knn <- confusionMatrix(knn_pred, test_clf$neighborhood)
macro_f1_knn <- macro_f1_from_cm(cm_knn)

cat(
  "\nknn (manhattan) accuracy:", cm_knn$overall["Accuracy"],
  " macro f1:", macro_f1_knn, "\n"
)

# random forest classifier
set.seed(42L)
rf_clf_fit <- randomForest(
  neighborhood ~ sale_price + land_sqft + gross_sqft + year_built +
    res_units + comm_units + total_units,
  data = train_clf,
  ntree = 300,
  mtry = 3
)

rf_clf_pred <- predict(rf_clf_fit, newdata = test_clf)
cm_rf_clf <- confusionMatrix(rf_clf_pred, test_clf$neighborhood)
macro_f1_rf_clf <- macro_f1_from_cm(cm_rf_clf)

cat(
  "\nrandom forest classifier (manhattan) accuracy:",
  cm_rf_clf$overall["Accuracy"],
  " macro f1:", macro_f1_rf_clf, "\n"
)

# example contingency table
rf_cm_table <- cm_rf_clf$table
print(dim(rf_cm_table))

# num neighborhoods x num neighborhoods
print(rf_cm_table)

# multinomial logistic regression

set.seed(42L)
logit_fit <- multinom(
  neighborhood ~ sale_price + land_sqft + gross_sqft + year_built +
    res_units + comm_units + total_units,
  data = train_clf,
  MaxNWts = 10000,
  maxit = 2000,
  trace = FALSE
)

logit_pred <- predict(logit_fit, newdata = test_clf)
cm_logit <- confusionMatrix(logit_pred, test_clf$neighborhood)
macro_f1_logit <- macro_f1_from_cm(cm_logit)

cat(
  "\nmultinomial logistic regression (manhattan) accuracy:",
  cm_logit$overall["Accuracy"],
  " macro f1:", macro_f1_logit, "\n"
)

# use manhattan classifiers on brooklyn

brooklyn_clf <- prepare_classification_df(brooklyn, min_per_class = 100L)
cat("\nbrooklyn classification subset rows:", nrow(brooklyn_clf), "\n")
cat("brooklyn neighborhoods:", nlevels(brooklyn_clf$neighborhood), "\n")

# predictions from manhattan-trained models on brooklyn data
knn_pred_brook <- predict(knn_fit, newdata = brooklyn_clf)
rf_pred_brook <- predict(rf_clf_fit, newdata = brooklyn_clf)
logit_pred_brook <- predict(logit_fit, newdata = brooklyn_clf)

# contingency tables (true brooklyn neighborhood vs predicted manhattan neighborhood)
# these will be essentially all off-diagonal because the label sets differ;
# one way to make them readable is sketched right after the dimension printout
tab_knn_brook <- table(true = brooklyn_clf$neighborhood, pred = knn_pred_brook)
tab_rf_brook <- table(true = brooklyn_clf$neighborhood, pred = rf_pred_brook)
tab_logit_brook <- table(true = brooklyn_clf$neighborhood, pred = logit_pred_brook)

cat("\ncontingency table dimensions (knn):", dim(tab_knn_brook), "\n")
cat("contingency table dimensions (random forest):", dim(tab_rf_brook), "\n")
cat("contingency table dimensions (logit):", dim(tab_logit_brook), "\n")


plots <- list(
  manhattan_hist_raw = p_hist_raw,
  manhattan_hist_log = p_hist_log,
  manhattan_box = p_box,
  manhattan_scatter = p_scatter,
  rf_pred_vs_actual_manhattan = p_rf_pred_vs_actual,
  rf_resid_manhattan = p_rf_resid,
  rf_pred_vs_actual_brooklyn = p_brook_pred_vs_actual,
  rf_resid_brooklyn = p_brook_resid
)

dir.create("plots", showWarnings = FALSE)

for (nm in names(plots)) {
  ggsave(
    filename = file.path("plots", paste0(nm, ".png")),
    plot = plots[[nm]],
    width = 7,
    height = 5,
    dpi = 300
  )
}
@@ -0,0 +1,341 @@
raw manhattan rows: 96088
raw brooklyn rows: 123813
clean manhattan rows: 7294
clean brooklyn rows: 42743
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
1.005e+04 1.368e+06 4.000e+06 1.633e+07 9.575e+06 2.398e+09
      25%       50%       75%       90%       95%       99%
  1367750   4000000   9575000  22736850  50618132 266148692
manhattan iqr upper bound: 21885875
manhattan max sale price: 2397501899
number of sale price outliers: 756
 sale_price  gross_sqft   land_sqft  year_built   res_units  comm_units
 1.00000000  0.49144986  0.16392585  0.02117231  0.11589681  0.20912401
total_units
 0.16618138

linear model (log price) metrics on manhattan:
r2: 0.1867558  rmse: 1.689562  mae: 1.260938

random forest (log price) metrics on manhattan:
r2: 0.6978428  rmse: 1.031439  mae: 0.661946

random forest (log price) metrics on brooklyn (trained on manhattan):
r2: 0.2442946  rmse: 1.305062  mae: 1.084852

manhattan classification subset rows: 6792
manhattan neighborhoods: 28

knn (manhattan) accuracy: 0.4774882  macro f1: 0.4174939

random forest classifier (manhattan) accuracy: 0.5841232  macro f1: 0.5328558
[1] 28 28
           Reference
Prediction ALPHABET CITY CHELSEA CHINATOWN CLINTON
ALPHABET CITY 9 1 0 0
CHELSEA 0 23 2 1
CHINATOWN 0 0 10 0
CLINTON 0 2 0 17
EAST VILLAGE 2 2 2 0
FASHION 0 4 0 0
GRAMERCY 0 0 0 0
GREENWICH VILLAGE-CENTRAL 0 2 1 0
GREENWICH VILLAGE-WEST 1 3 0 1
HARLEM-CENTRAL 3 4 3 5
HARLEM-EAST 4 1 0 0
HARLEM-UPPER 1 1 0 1
KIPS BAY 0 2 0 0
LOWER EAST SIDE 1 3 2 0
MANHATTAN VALLEY 0 0 0 0
MIDTOWN CBD 0 1 1 0
MIDTOWN EAST 1 2 0 1
MIDTOWN WEST 0 3 1 1
MURRAY HILL 0 3 1 0
SOHO 0 2 1 0
TRIBECA 0 1 0 0
UPPER EAST SIDE (59-79) 1 3 0 2
UPPER EAST SIDE (79-96) 0 2 1 0
UPPER WEST SIDE (59-79) 1 1 0 1
UPPER WEST SIDE (79-96) 1 3 0 0
UPPER WEST SIDE (96-116) 0 1 0 1
WASHINGTON HEIGHTS LOWER 0 0 0 0
WASHINGTON HEIGHTS UPPER 0 0 0 1
           Reference
Prediction EAST VILLAGE FASHION GRAMERCY
ALPHABET CITY 2 0 0
CHELSEA 1 1 0
CHINATOWN 0 1 0
CLINTON 1 0 0
EAST VILLAGE 33 0 0
FASHION 0 10 0
GRAMERCY 1 0 29
GREENWICH VILLAGE-CENTRAL 2 1 0
GREENWICH VILLAGE-WEST 2 0 1
HARLEM-CENTRAL 4 2 2
HARLEM-EAST 0 2 0
HARLEM-UPPER 0 0 0
KIPS BAY 0 0 0
LOWER EAST SIDE 0 1 0
MANHATTAN VALLEY 0 0 1
MIDTOWN CBD 0 3 2
MIDTOWN EAST 0 1 0
MIDTOWN WEST 0 2 0
MURRAY HILL 0 2 0
SOHO 0 0 0
TRIBECA 1 1 0
UPPER EAST SIDE (59-79) 1 0 1
UPPER EAST SIDE (79-96) 0 0 0
UPPER WEST SIDE (59-79) 0 1 0
UPPER WEST SIDE (79-96) 2 0 2
UPPER WEST SIDE (96-116) 1 0 0
WASHINGTON HEIGHTS LOWER 1 0 0
WASHINGTON HEIGHTS UPPER 0 0 0
           Reference
Prediction GREENWICH VILLAGE-CENTRAL GREENWICH VILLAGE-WEST
ALPHABET CITY 0 0
CHELSEA 1 5
CHINATOWN 1 1
CLINTON 0 0
EAST VILLAGE 3 1
FASHION 0 0
GRAMERCY 0 0
GREENWICH VILLAGE-CENTRAL 14 3
GREENWICH VILLAGE-WEST 5 42
HARLEM-CENTRAL 1 4
HARLEM-EAST 1 1
HARLEM-UPPER 0 0
KIPS BAY 0 1
LOWER EAST SIDE 1 0
MANHATTAN VALLEY 0 0
MIDTOWN CBD 0 0
MIDTOWN EAST 1 1
MIDTOWN WEST 0 1
MURRAY HILL 0 0
SOHO 1 0
TRIBECA 0 0
UPPER EAST SIDE (59-79) 1 6
UPPER EAST SIDE (79-96) 1 6
UPPER WEST SIDE (59-79) 0 0
UPPER WEST SIDE (79-96) 2 3
UPPER WEST SIDE (96-116) 1 0
WASHINGTON HEIGHTS LOWER 0 0
WASHINGTON HEIGHTS UPPER 0 0
           Reference
Prediction HARLEM-CENTRAL HARLEM-EAST HARLEM-UPPER KIPS BAY
ALPHABET CITY 2 0 0 0
CHELSEA 0 0 0 0
CHINATOWN 2 0 0 0
CLINTON 0 0 1 0
EAST VILLAGE 0 0 0 0
FASHION 0 0 0 0
GRAMERCY 0 0 0 0
GREENWICH VILLAGE-CENTRAL 0 0 0 0
GREENWICH VILLAGE-WEST 0 0 2 0
HARLEM-CENTRAL 158 15 21 0
HARLEM-EAST 9 35 2 0
HARLEM-UPPER 9 2 19 0
KIPS BAY 2 0 1 27
LOWER EAST SIDE 3 0 0 0
MANHATTAN VALLEY 1 4 0 0
MIDTOWN CBD 0 0 0 0
MIDTOWN EAST 2 0 0 2
MIDTOWN WEST 2 0 0 0
MURRAY HILL 2 1 0 0
SOHO 1 1 0 0
TRIBECA 0 0 0 0
UPPER EAST SIDE (59-79) 2 0 1 0
UPPER EAST SIDE (79-96) 2 4 0 1
UPPER WEST SIDE (59-79) 0 0 0 0
UPPER WEST SIDE (79-96) 0 0 0 0
UPPER WEST SIDE (96-116) 0 0 4 0
WASHINGTON HEIGHTS LOWER 7 3 1 0
WASHINGTON HEIGHTS UPPER 1 1 1 0
           Reference
Prediction LOWER EAST SIDE MANHATTAN VALLEY MIDTOWN CBD
ALPHABET CITY 0 0 0
CHELSEA 4 0 0
CHINATOWN 0 1 0
CLINTON 0 0 0
EAST VILLAGE 3 0 0
FASHION 2 0 1
GRAMERCY 0 1 0
GREENWICH VILLAGE-CENTRAL 0 0 0
GREENWICH VILLAGE-WEST 1 0 0
HARLEM-CENTRAL 4 7 2
HARLEM-EAST 1 0 0
HARLEM-UPPER 0 0 0
KIPS BAY 0 0 0
LOWER EAST SIDE 47 2 0
MANHATTAN VALLEY 0 11 0
MIDTOWN CBD 0 0 12
MIDTOWN EAST 0 0 3
MIDTOWN WEST 0 2 1
MURRAY HILL 2 0 0
SOHO 0 0 1
TRIBECA 0 0 0
UPPER EAST SIDE (59-79) 1 0 2
UPPER EAST SIDE (79-96) 1 1 1
UPPER WEST SIDE (59-79) 0 0 2
UPPER WEST SIDE (79-96) 0 0 1
UPPER WEST SIDE (96-116) 0 1 0
WASHINGTON HEIGHTS LOWER 0 1 0
WASHINGTON HEIGHTS UPPER 0 1 0
           Reference
Prediction MIDTOWN EAST MIDTOWN WEST MURRAY HILL SOHO TRIBECA
ALPHABET CITY 0 0 1 1 0
CHELSEA 3 1 2 1 0
CHINATOWN 0 1 2 0 0
CLINTON 0 0 1 2 0
EAST VILLAGE 1 0 0 2 0
FASHION 0 3 3 1 0
GRAMERCY 1 0 0 0 1
GREENWICH VILLAGE-CENTRAL 0 1 0 3 1
GREENWICH VILLAGE-WEST 1 0 1 3 1
HARLEM-CENTRAL 4 5 5 0 0
HARLEM-EAST 0 0 1 0 0
HARLEM-UPPER 0 1 2 0 0
KIPS BAY 0 0 0 0 0
LOWER EAST SIDE 0 2 1 2 1
MANHATTAN VALLEY 0 0 0 1 0
MIDTOWN CBD 0 6 0 1 0
MIDTOWN EAST 24 1 1 0 0
MIDTOWN WEST 0 177 0 2 0
MURRAY HILL 3 0 16 1 0
SOHO 1 1 1 16 1
TRIBECA 0 3 0 0 36
UPPER EAST SIDE (59-79) 1 0 2 2 2
UPPER EAST SIDE (79-96) 3 2 2 1 1
UPPER WEST SIDE (59-79) 2 0 0 0 0
UPPER WEST SIDE (79-96) 3 0 2 0 2
UPPER WEST SIDE (96-116) 1 0 0 0 0
WASHINGTON HEIGHTS LOWER 0 0 0 0 0
WASHINGTON HEIGHTS UPPER 0 0 0 0 0
           Reference
Prediction UPPER EAST SIDE (59-79) UPPER EAST SIDE (79-96)
ALPHABET CITY 0 0
CHELSEA 1 3
CHINATOWN 1 4
CLINTON 1 1
EAST VILLAGE 2 1
FASHION 2 1
GRAMERCY 0 0
GREENWICH VILLAGE-CENTRAL 0 0
GREENWICH VILLAGE-WEST 2 7
HARLEM-CENTRAL 6 5
HARLEM-EAST 0 0
HARLEM-UPPER 1 2
KIPS BAY 0 0
LOWER EAST SIDE 1 1
MANHATTAN VALLEY 0 0
MIDTOWN CBD 2 2
MIDTOWN EAST 4 2
MIDTOWN WEST 1 0
MURRAY HILL 0 1
SOHO 2 2
TRIBECA 1 0
UPPER EAST SIDE (59-79) 47 14
UPPER EAST SIDE (79-96) 20 53
UPPER WEST SIDE (59-79) 2 1
UPPER WEST SIDE (79-96) 3 2
UPPER WEST SIDE (96-116) 0 3
WASHINGTON HEIGHTS LOWER 1 0
WASHINGTON HEIGHTS UPPER 0 1
           Reference
Prediction UPPER WEST SIDE (59-79) UPPER WEST SIDE (79-96)
ALPHABET CITY 1 1
CHELSEA 0 2
CHINATOWN 0 0
CLINTON 0 1
EAST VILLAGE 1 1
FASHION 0 0
GRAMERCY 0 0
GREENWICH VILLAGE-CENTRAL 1 1
GREENWICH VILLAGE-WEST 0 1
HARLEM-CENTRAL 4 3
HARLEM-EAST 1 1
HARLEM-UPPER 0 1
KIPS BAY 0 0
LOWER EAST SIDE 0 0
MANHATTAN VALLEY 0 2
MIDTOWN CBD 0 0
MIDTOWN EAST 0 0
MIDTOWN WEST 3 1
MURRAY HILL 0 0
SOHO 0 0
TRIBECA 1 1
UPPER EAST SIDE (59-79) 4 2
UPPER EAST SIDE (79-96) 4 8
UPPER WEST SIDE (59-79) 46 2
UPPER WEST SIDE (79-96) 7 31
UPPER WEST SIDE (96-116) 1 0
WASHINGTON HEIGHTS LOWER 0 0
WASHINGTON HEIGHTS UPPER 0 1
           Reference
Prediction UPPER WEST SIDE (96-116) WASHINGTON HEIGHTS LOWER
ALPHABET CITY 0 0
CHELSEA 1 0
CHINATOWN 0 0
CLINTON 0 0
EAST VILLAGE 0 0
FASHION 0 0
GRAMERCY 0 0
GREENWICH VILLAGE-CENTRAL 1 0
GREENWICH VILLAGE-WEST 0 0
HARLEM-CENTRAL 6 18
HARLEM-EAST 1 1
HARLEM-UPPER 1 5
KIPS BAY 1 0
LOWER EAST SIDE 1 1
MANHATTAN VALLEY 0 0
MIDTOWN CBD 0 0
MIDTOWN EAST 0 0
MIDTOWN WEST 0 0
MURRAY HILL 0 0
SOHO 0 0
TRIBECA 1 0
UPPER EAST SIDE (59-79) 1 0
UPPER EAST SIDE (79-96) 2 1
UPPER WEST SIDE (59-79) 1 0
UPPER WEST SIDE (79-96) 2 0
UPPER WEST SIDE (96-116) 16 0
WASHINGTON HEIGHTS LOWER 0 23
WASHINGTON HEIGHTS UPPER 1 2
           Reference
Prediction WASHINGTON HEIGHTS UPPER
ALPHABET CITY 0
CHELSEA 0
CHINATOWN 0
CLINTON 0
EAST VILLAGE 0
FASHION 0
GRAMERCY 0
GREENWICH VILLAGE-CENTRAL 0
GREENWICH VILLAGE-WEST 0
HARLEM-CENTRAL 7
HARLEM-EAST 3
HARLEM-UPPER 4
KIPS BAY 0
LOWER EAST SIDE 0
MANHATTAN VALLEY 0
MIDTOWN CBD 0
MIDTOWN EAST 0
MIDTOWN WEST 0
MURRAY HILL 0
SOHO 1
TRIBECA 0
UPPER EAST SIDE (59-79) 0
UPPER EAST SIDE (79-96) 0
UPPER WEST SIDE (59-79) 0
UPPER WEST SIDE (79-96) 1
UPPER WEST SIDE (96-116) 0
WASHINGTON HEIGHTS LOWER 7
WASHINGTON HEIGHTS UPPER 5

multinomial logistic regression (manhattan) accuracy: 0.2517773 macro f1: 0.2957543

brooklyn classification subset rows: 42533
brooklyn neighborhoods: 56

contingency table dimensions (knn): 56 28
contingency table dimensions (random forest): 56 28
contingency table dimensions (logit): 56 28

After Width: | Height: | Size: 50 KiB
After Width: | Height: | Size: 47 KiB
After Width: | Height: | Size: 44 KiB
After Width: | Height: | Size: 144 KiB
After Width: | Height: | Size: 366 KiB
After Width: | Height: | Size: 209 KiB
After Width: | Height: | Size: 365 KiB
After Width: | Height: | Size: 186 KiB
@@ -1,4 +0,0 @@
node_modules
.venv
.vscode
Assignment III
@@ -0,0 +1,41 @@
##########################################
### Principal Component Analysis (PCA) ###
##########################################

## load libraries
library(ggplot2)
library(ggfortify)
library(GGally)
library(e1071)
library(class)
library(psych)
library(readr)

## set working directory so that files can be referenced without the full path
setwd("/home/ion606/Desktop/Data Analytics/Lab 4")

## read dataset
wine <- read_csv("wine.data", col_names = FALSE)

## set column names
names(wine) <- c(
  "Type", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
  "Total phenols", "Flavanoids", "Nonflavanoid Phenols", "Proanthocyanins",
  "Color Intensity", "Hue", "Od280/od315 of diluted wines", "Proline"
)

## inspect data frame
head(wine)

## change the data type of the "Type" column to factor
####
# Factors look like regular strings (characters), but with factors R knows
# that the column is a categorical variable with finitely many possible values,
# e.g. "Type" in the Wine dataset can only be 1, 2, or 3
####

wine$Type <- as.factor(wine$Type)
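
## quick illustrative check (not in the original lab handout): a factor
## carries its finite level set with it
# levels(wine$Type)  # "1" "2" "3"
# str(wine$Type)     # Factor w/ 3 levels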


## visualize variables
pairs.panels(wine[,-1], gap = 0, bg = c("red", "yellow", "blue")[wine$Type], pch = 21)

ggpairs(wine, ggplot2::aes(colour = Type))

###
@@ -0,0 +1,366 @@
has_pkg <- function(pkg) requireNamespace(pkg, quietly = TRUE)

has_ggplot2 <- has_pkg("ggplot2")
has_GGally <- has_pkg("GGally")
has_e1071 <- has_pkg("e1071")
has_class <- has_pkg("class")
has_psych <- has_pkg("psych")
has_readr <- has_pkg("readr")

# availability guards: these package imports repeatedly failed to install in
# the grading environment, so every optional dependency is checked before use
if (has_ggplot2) { library(ggplot2) } else { warning("ggplot2 not available; plots will be skipped") }
if (has_GGally) { library(GGally) } else { message("GGally not available; skipping ggpairs plot") }
if (has_e1071) { library(e1071) }
if (has_class) { library(class) } else { stop("class package not available for kNN") }
if (!has_psych) { message("psych not available; skipping pairs.panels plot") }
if (has_readr) { library(readr) }
library(grid) # unit() for arrows in plots
suppressWarnings(RNGkind(sample.kind = "Rounding"))

# set a reproducible seed
set.seed(4600)

# 178 rows
# col 1 is the class label (1, 2, 3)
# the other 13 columns are continuous predictors

possible_paths <- c(
  "wine.data",
  "./wine.data",
  "../wine.data",
  "DAN/wine.data",
  "./DAN/wine.data"
)
data_path <- NA
for (p in possible_paths) { if (file.exists(p)) { data_path <- p; break } }
if (is.na(data_path)) stop("could not find wine.data; place this script in the DAN folder or given/ and re-run")

if (has_readr) {
  wine <- readr::read_csv(
    file = data_path,
    col_names = FALSE,
    show_col_types = FALSE,
    progress = FALSE
  )
} else {
  wine <- read.csv(file = data_path, header = FALSE)
}

colnames(wine) <- c(
  "Type",
  "Alcohol",
  "Malic_acid",
  "Ash",
  "Alcalinity_of_ash",
  "Magnesium",
  "Total_phenols",
  "Flavanoids",
  "Nonflavanoid_phenols",
  "Proanthocyanins",
  "Color_intensity",
  "Hue",
  "OD280_OD315",
  "Proline"
)

wine$Type <- as.factor(wine$Type)

# sanity check kept from when the wrong file was repeatedly read in;
# it also makes the script more robust
stopifnot(nrow(wine) == 178, ncol(wine) == 14)
print(summary(wine$Type))

# exploratory plots

if (has_psych) {
  # pairs panel (psych) – colors by class
  psych::pairs.panels(
    wine[,-1],
    gap = 0,
    bg = c("red","gold","royalblue")[wine$Type],
    pch = 21,
    main = "wine (uci) – scatterplot matrix by class"
  )
}

if (has_GGally && has_ggplot2) {
  # ggpairs for a full scatterplot matrix
  GGally::ggpairs(wine, ggplot2::aes(colour = Type), columns = 2:ncol(wine))
}

# split into train/test BEFORE any preprocessing to avoid leakage

set.seed(4600)
n <- nrow(wine)
train_idx <- sample.int(n, size = floor(0.7 * n))
wine_train <- wine[train_idx, , drop = FALSE]
wine_test <- wine[-train_idx, , drop = FALSE]

X_train <- wine_train[, -1]
y_train <- wine_train$Type
X_test <- wine_test[, -1]
y_test <- wine_test$Type

# basic input checks before PCA
if (any(sapply(X_train, function(x) var(x, na.rm = TRUE) == 0))) {
  warning("one or more predictors have zero variance in the training set; scale() would fail")
}
if (anyNA(X_train) | anyNA(X_test)) {
  stop("found NA values in predictors; handle missingness before PCA")
}

# project both train and test using the train-fitted pca
pca_tr <- prcomp(X_train, center = TRUE, scale. = TRUE)

pve_tr <- (pca_tr$sdev^2) / sum(pca_tr$sdev^2)
pve_df <- data.frame(
  PC = paste0("PC", seq_along(pve_tr)),
  PVE = pve_tr,
  CumPVE = cumsum(pve_tr)
)

print("variance explained (training pca):")
print(pve_df)

# scree plot from training pca
p_scree <- ggplot(pve_df, aes(x = seq_along(PVE), y = PVE)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks = 1:length(pve_df$PC), labels = pve_df$PC) +
  labs(title = "scree plot – variance explained by principal components (training pca)",
       x = "principal component", y = "proportion of variance explained") +
  theme_minimal()

# cumulative variance plot from training pca
p_cumvar <- ggplot(pve_df, aes(x = seq_along(CumPVE), y = CumPVE)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks = 1:length(pve_df$PC), labels = pve_df$PC) +
  labs(title = "cumulative variance explained (training pca)",
       x = "principal component", y = "cumulative proportion of variance") +
  theme_minimal()

# ========================================================================================================

# choose number of pcs: default to the smallest k with >= thresh cum variance
# you can change thresh to 0.90 or 0.99 if you prefer

pc_variance_threshold <- 0.95
k_pcs <- which(cumsum(pve_tr) >= pc_variance_threshold)[1]
if (is.na(k_pcs)) k_pcs <- ncol(X_train) # fall back to all pcs so the indexing below never fails
cat("chosen number of pcs (threshold =", pc_variance_threshold, "):", k_pcs, "\n")

# project train/test into the pca space
Z_train_full <- as.data.frame(predict(pca_tr, newdata = X_train))
Z_test_full <- as.data.frame(predict(pca_tr, newdata = X_test))

# for downstream modeling
Z_train <- Z_train_full[, seq_len(k_pcs), drop = FALSE]
Z_test <- Z_test_full[, seq_len(k_pcs), drop = FALSE]

scores_all <- as.data.frame(predict(pca_tr, newdata = wine[,-1]))
scores_all$Type <- wine$Type

# loadings from training pca
loadings <- as.data.frame(pca_tr$rotation)
loadings$Variable <- rownames(loadings)
top_pc1 <- loadings[order(abs(loadings$PC1), decreasing = TRUE), c("Variable","PC1")][1:5, ]
top_pc2 <- loadings[order(abs(loadings$PC2), decreasing = TRUE), c("Variable","PC2")][1:5, ]
print("top contributors to pc1 (training pca):"); print(top_pc1)
print("top contributors to pc2 (training pca):"); print(top_pc2)


# build convex hull data for each group
scores <- scores_all
hull_df <- do.call(rbind, lapply(split(scores, scores$Type), function(df) {
  pts <- df[chull(df$PC1, df$PC2), c("PC1","PC2")]
  pts$Type <- unique(df$Type)
  pts
}))
p_pc12 <- ggplot(scores, aes(PC1, PC2, color = Type)) +
  geom_point(size = 2, alpha = 0.85) +
  geom_polygon(data = hull_df, aes(fill = Type, group = Type), color = NA, alpha = 0.15) +
  guides(fill = "none") +
  theme_minimal() +
  labs(title = "pc1 vs pc2 by class (projected with training pca)")

# overlay the variable loadings as arrows on the score plot
loading_scalefactor <- 3 * max(abs(scores$PC1), abs(scores$PC2)) # heuristic
load_plot_df <- loadings
load_plot_df$PC1s <- load_plot_df$PC1 * loading_scalefactor
load_plot_df$PC2s <- load_plot_df$PC2 * loading_scalefactor

p_biplot <- ggplot(scores, aes(PC1, PC2, color = Type)) +
  geom_point(size = 2, alpha = 0.85) +
  geom_segment(
    data = load_plot_df,
    mapping = aes(x = 0, y = 0, xend = PC1s, yend = PC2s),
    inherit.aes = FALSE,
    arrow = arrow(length = unit(0.02, "npc")),
    color = "black",
    alpha = 0.8
  ) +
  geom_text(
    data = load_plot_df,
    mapping = aes(x = PC1s, y = PC2s, label = Variable),
    inherit.aes = FALSE,
    hjust = 0,
    vjust = 0
  ) +
  theme_minimal() +
  labs(title = "pc1 vs pc2 with variable loadings (training pca projection)")

# 1) kNN on original variables with standardization
# 2) kNN on first 2 principal components only

# helper to create metrics from a confusion matrix (rows=true, cols=pred)
compute_metrics <- function(cm) {
  lv <- rownames(cm)
  if (is.null(lv)) lv <- as.character(1:nrow(cm))
  TP <- diag(cm)
  FP <- colSums(cm) - TP
  FN <- rowSums(cm) - TP
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f1 <- 2 * precision * recall / (precision + recall)
  acc <- sum(TP) / sum(cm)
  macro_precision <- mean(precision, na.rm = TRUE)
  macro_recall <- mean(recall, na.rm = TRUE)
  macro_f1 <- mean(f1, na.rm = TRUE)
  per_class <- data.frame(
    class = lv,
    precision = precision,
    recall = recall,
    f1 = f1,
    row.names = NULL
  )
  summary <- data.frame(
    accuracy = acc,
    macro_precision = macro_precision,
    macro_recall = macro_recall,
    macro_f1 = macro_f1
  )
  list(per_class = per_class, summary = summary)
}
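
# quick self-check of compute_metrics on a toy table (hypothetical labels,
# not lab data): accuracy should come out to 3/4
# toy_cm <- table(truth = c(1, 1, 2, 2), pred = c(1, 2, 2, 2))
# compute_metrics(toy_cm)$summary  # accuracy 0.75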

set.seed(4600)
ks <- seq(1, 15, by = 2)
Kfolds <- 5

# kNN on original vars
X_train_scaled <- scale(X_train, center = TRUE, scale = TRUE)
scale_center <- attr(X_train_scaled, "scaled:center")
scale_scale <- attr(X_train_scaled, "scaled:scale")
X_test_scaled <- scale(X_test, center = scale_center, scale = scale_scale)
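
# note (editorial): reusing the *training* center/scale on the test matrix is
# what keeps this split leak-free; calling scale(X_test) with no arguments
# would re-estimate means and sds from the test data. a cheap sanity check:
stopifnot(length(scale_center) == ncol(X_test), length(scale_scale) == ncol(X_test))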

n_train_orig <- nrow(X_train_scaled)
folds_orig <- sample(rep(1:Kfolds, length.out = n_train_orig))
cv_acc_orig <- sapply(ks, function(k) {
  mean(sapply(1:Kfolds, function(f) {
    tr <- which(folds_orig != f)
    va <- which(folds_orig == f)
    pred_cv <- knn(train = X_train_scaled[tr, , drop = FALSE],
                   test = X_train_scaled[va, , drop = FALSE],
                   cl = y_train[tr], k = k)
    mean(pred_cv == y_train[va])
  }))
})

best_k_orig <- ks[which.max(cv_acc_orig)]
cat("[Original vars] best k:", best_k_orig, "cv acc:", max(cv_acc_orig), "\n")

pred_orig <- knn(train = X_train_scaled, test = X_test_scaled, cl = y_train, k = best_k_orig)
acc_orig <- mean(pred_orig == y_test)
cm_orig <- table(truth = y_test, pred = pred_orig)

cat("[Original vars] held-out accuracy:", round(acc_orig, 4), "\n")
print(cm_orig)

metrics_orig <- compute_metrics(cm_orig)
print(metrics_orig$summary)
print(metrics_orig$per_class)

# kNN on first 2 PCs only
Z2_train <- Z_train_full[, 1:2, drop = FALSE]
Z2_test <- Z_test_full[, 1:2, drop = FALSE]
n_train_2pc <- nrow(Z2_train)

folds_2pc <- sample(rep(1:Kfolds, length.out = n_train_2pc))
cv_acc_2pc <- sapply(ks, function(k) {
  mean(sapply(1:Kfolds, function(f) {
    tr <- which(folds_2pc != f)
    va <- which(folds_2pc == f)
    pred_cv <- knn(train = Z2_train[tr, , drop = FALSE],
                   test = Z2_train[va, , drop = FALSE],
                   cl = y_train[tr], k = k)
    mean(pred_cv == y_train[va])
  }))
})

best_k_2pc <- ks[which.max(cv_acc_2pc)]
cat("[First 2 PCs] best k:", best_k_2pc, "cv acc:", max(cv_acc_2pc), "\n")

pred_2pc <- knn(train = Z2_train, test = Z2_test, cl = y_train, k = best_k_2pc)
acc_2pc <- mean(pred_2pc == y_test)
cm_2pc <- table(truth = y_test, pred = pred_2pc)

cat("[First 2 PCs] held-out accuracy:", round(acc_2pc, 4), "\n")
print(cm_2pc)

metrics_2pc <- compute_metrics(cm_2pc)
print(metrics_2pc$summary)
print(metrics_2pc$per_class)

# ===========================================================================================
outputs_dir <- "outputs"
if (!dir.exists(outputs_dir)) dir.create(outputs_dir, recursive = TRUE, showWarnings = FALSE)

# plots
if (exists("p_pc12") && inherits(p_pc12, "ggplot")) ggsave(filename = file.path(outputs_dir, "pc12_scatter.png"), plot = p_pc12, width = 8, height = 6, dpi = 300)
if (exists("p_biplot") && inherits(p_biplot, "ggplot")) ggsave(filename = file.path(outputs_dir, "pc12_biplot.png"), plot = p_biplot, width = 8, height = 6, dpi = 300)
if (exists("p_scree") && inherits(p_scree, "ggplot")) ggsave(filename = file.path(outputs_dir, "pca_scree.png"), plot = p_scree, width = 8, height = 6, dpi = 300)
if (exists("p_cumvar") && inherits(p_cumvar, "ggplot")) ggsave(filename = file.path(outputs_dir, "pca_cumvar.png"), plot = p_cumvar, width = 8, height = 6, dpi = 300)

# top contributors/vars to PC1 and PC2
write.csv(top_pc1, file = file.path(outputs_dir, "top_contributors_pc1.csv"), row.names = FALSE)
write.csv(top_pc2, file = file.path(outputs_dir, "top_contributors_pc2.csv"), row.names = FALSE)

# confusion matrices as wide CSV and pretty text
write.csv(as.matrix(cm_orig), file = file.path(outputs_dir, "confusion_original_wide.csv"))
writeLines(capture.output(cm_orig), con = file.path(outputs_dir, "confusion_original.txt"))

write.csv(as.matrix(cm_2pc), file = file.path(outputs_dir, "confusion_2pc_wide.csv"))
writeLines(capture.output(cm_2pc), con = file.path(outputs_dir, "confusion_2pc.txt"))

# metrics
write.csv(metrics_orig$per_class, file = file.path(outputs_dir, "metrics_original_per_class.csv"), row.names = FALSE)
write.csv(metrics_orig$summary, file = file.path(outputs_dir, "metrics_original_summary.csv"), row.names = FALSE)
write.csv(metrics_2pc$per_class, file = file.path(outputs_dir, "metrics_2pc_per_class.csv"), row.names = FALSE)
write.csv(metrics_2pc$summary, file = file.path(outputs_dir, "metrics_2pc_summary.csv"), row.names = FALSE)

# summary
metrics_compare <- data.frame(
  model = c("original_variables", "first_2_pcs"),
  accuracy = c(metrics_orig$summary$accuracy, metrics_2pc$summary$accuracy),
  macro_precision = c(metrics_orig$summary$macro_precision, metrics_2pc$summary$macro_precision),
  macro_recall = c(metrics_orig$summary$macro_recall, metrics_2pc$summary$macro_recall),
  macro_f1 = c(metrics_orig$summary$macro_f1, metrics_2pc$summary$macro_f1)
)
write.csv(metrics_compare, file = file.path(outputs_dir, "metrics_comparison.csv"), row.names = FALSE)

# the block below was made with help from ChatGPT because the psych package is confusing
if (!interactive() && has_ggplot2) {
  pdf("Rplots_pca_fixed.pdf", width = 8, height = 6)
  if (has_psych) {
    psych::pairs.panels(
      wine[,-1],
      gap = 0,
      bg = c("red","gold","royalblue")[wine$Type],
      pch = 21,
      main = "wine (uci) – scatterplot matrix by class"
    )
  }

  if (exists("p_scree") && inherits(p_scree, "ggplot")) print(p_scree)
  if (exists("p_pc12") && inherits(p_pc12, "ggplot")) print(p_pc12)
  dev.off()
}
@@ -0,0 +1,5 @@
      pred
truth  1  2  3
    1 15  2  0
    2  1 19  1
    3  0  1 15
@@ -0,0 +1,4 @@
"","1","2","3"
"1",15,2,0
"2",1,19,1
"3",0,1,15

@@ -0,0 +1,5 @@
      pred
truth  1  2  3
    1 17  0  0
    2  1 18  2
    3  0  0 16
@@ -0,0 +1,4 @@
"","1","2","3"
"1",17,0,0
"2",1,18,2
"3",0,0,16

@@ -0,0 +1,4 @@
"class","precision","recall","f1"
"1",0.9375,0.882352941176471,0.909090909090909
"2",0.863636363636364,0.904761904761905,0.883720930232558
"3",0.9375,0.9375,0.9375

@@ -0,0 +1,2 @@
"accuracy","macro_precision","macro_recall","macro_f1"
0.907407407407407,0.912878787878788,0.908204948646125,0.910103946441156

@@ -0,0 +1,3 @@
"model","accuracy","macro_precision","macro_recall","macro_f1"
"original_variables",0.944444444444444,0.944444444444444,0.952380952380952,0.94522732169791
"first_2_pcs",0.907407407407407,0.912878787878788,0.908204948646125,0.910103946441156

@@ -0,0 +1,4 @@
"class","precision","recall","f1"
"1",0.944444444444444,1,0.971428571428571
"2",1,0.857142857142857,0.923076923076923
"3",0.888888888888889,1,0.941176470588235

@@ -0,0 +1,2 @@
"accuracy","macro_precision","macro_recall","macro_f1"
0.944444444444444,0.944444444444444,0.952380952380952,0.94522732169791

After Width: | Height: | Size: 344 KiB
After Width: | Height: | Size: 227 KiB
After Width: | Height: | Size: 101 KiB
After Width: | Height: | Size: 105 KiB
@@ -0,0 +1,6 @@
"Variable","PC1"
"Flavanoids",0.430570697054093
"Total_phenols",0.388556731445086
"OD280_OD315",0.379238757892512
"Proanthocyanins",0.318149910146199
"Nonflavanoid_phenols",-0.292569052362651

@@ -0,0 +1,6 @@
"Variable","PC2"
"Color_intensity",-0.504116493512561
"Alcohol",-0.480328824227057
"Ash",-0.369020648548877
"Proline",-0.3555672525193
"Hue",0.300324646690879

@@ -0,0 +1,178 @@
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295
1,14.83,1.64,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,1045
1,13.86,1.35,2.27,16,98,2.98,3.15,.22,1.85,7.22,1.01,3.55,1045
1,14.1,2.16,2.3,18,105,2.95,3.32,.22,2.38,5.75,1.25,3.17,1510
1,14.12,1.48,2.32,16.8,95,2.2,2.43,.26,1.57,5,1.17,2.82,1280
1,13.75,1.73,2.41,16,89,2.6,2.76,.29,1.81,5.6,1.15,2.9,1320
1,14.75,1.73,2.39,11.4,91,3.1,3.69,.43,2.81,5.4,1.25,2.73,1150
1,14.38,1.87,2.38,12,102,3.3,3.64,.29,2.96,7.5,1.2,3,1547
1,13.63,1.81,2.7,17.2,112,2.85,2.91,.3,1.46,7.3,1.28,2.88,1310
1,14.3,1.92,2.72,20,120,2.8,3.14,.33,1.97,6.2,1.07,2.65,1280
1,13.83,1.57,2.62,20,115,2.95,3.4,.4,1.72,6.6,1.13,2.57,1130
1,14.19,1.59,2.48,16.5,108,3.3,3.93,.32,1.86,8.7,1.23,2.82,1680
1,13.64,3.1,2.56,15.2,116,2.7,3.03,.17,1.66,5.1,.96,3.36,845
1,14.06,1.63,2.28,16,126,3,3.17,.24,2.1,5.65,1.09,3.71,780
1,12.93,3.8,2.65,18.6,102,2.41,2.41,.25,1.98,4.5,1.03,3.52,770
1,13.71,1.86,2.36,16.6,101,2.61,2.88,.27,1.69,3.8,1.11,4,1035
1,12.85,1.6,2.52,17.8,95,2.48,2.37,.26,1.46,3.93,1.09,3.63,1015
1,13.5,1.81,2.61,20,96,2.53,2.61,.28,1.66,3.52,1.12,3.82,845
1,13.05,2.05,3.22,25,124,2.63,2.68,.47,1.92,3.58,1.13,3.2,830
1,13.39,1.77,2.62,16.1,93,2.85,2.94,.34,1.45,4.8,.92,3.22,1195
1,13.3,1.72,2.14,17,94,2.4,2.19,.27,1.35,3.95,1.02,2.77,1285
1,13.87,1.9,2.8,19.4,107,2.95,2.97,.37,1.76,4.5,1.25,3.4,915
1,14.02,1.68,2.21,16,96,2.65,2.33,.26,1.98,4.7,1.04,3.59,1035
1,13.73,1.5,2.7,22.5,101,3,3.25,.29,2.38,5.7,1.19,2.71,1285
1,13.58,1.66,2.36,19.1,106,2.86,3.19,.22,1.95,6.9,1.09,2.88,1515
1,13.68,1.83,2.36,17.2,104,2.42,2.69,.42,1.97,3.84,1.23,2.87,990
1,13.76,1.53,2.7,19.5,132,2.95,2.74,.5,1.35,5.4,1.25,3,1235
1,13.51,1.8,2.65,19,110,2.35,2.53,.29,1.54,4.2,1.1,2.87,1095
1,13.48,1.81,2.41,20.5,100,2.7,2.98,.26,1.86,5.1,1.04,3.47,920
1,13.28,1.64,2.84,15.5,110,2.6,2.68,.34,1.36,4.6,1.09,2.78,880
1,13.05,1.65,2.55,18,98,2.45,2.43,.29,1.44,4.25,1.12,2.51,1105
1,13.07,1.5,2.1,15.5,98,2.4,2.64,.28,1.37,3.7,1.18,2.69,1020
1,14.22,3.99,2.51,13.2,128,3,3.04,.2,2.08,5.1,.89,3.53,760
1,13.56,1.71,2.31,16.2,117,3.15,3.29,.34,2.34,6.13,.95,3.38,795
1,13.41,3.84,2.12,18.8,90,2.45,2.68,.27,1.48,4.28,.91,3,1035
1,13.88,1.89,2.59,15,101,3.25,3.56,.17,1.7,5.43,.88,3.56,1095
1,13.24,3.98,2.29,17.5,103,2.64,2.63,.32,1.66,4.36,.82,3,680
1,13.05,1.77,2.1,17,107,3,3,.28,2.03,5.04,.88,3.35,885
1,14.21,4.04,2.44,18.9,111,2.85,2.65,.3,1.25,5.24,.87,3.33,1080
1,14.38,3.59,2.28,16,102,3.25,3.17,.27,2.19,4.9,1.04,3.44,1065
1,13.9,1.68,2.12,16,101,3.1,3.39,.21,2.14,6.1,.91,3.33,985
1,14.1,2.02,2.4,18.8,103,2.75,2.92,.32,2.38,6.2,1.07,2.75,1060
1,13.94,1.73,2.27,17.4,108,2.88,3.54,.32,2.08,8.90,1.12,3.1,1260
1,13.05,1.73,2.04,12.4,92,2.72,3.27,.17,2.91,7.2,1.12,2.91,1150
1,13.83,1.65,2.6,17.2,94,2.45,2.99,.22,2.29,5.6,1.24,3.37,1265
1,13.82,1.75,2.42,14,111,3.88,3.74,.32,1.87,7.05,1.01,3.26,1190
1,13.77,1.9,2.68,17.1,115,3,2.79,.39,1.68,6.3,1.13,2.93,1375
1,13.74,1.67,2.25,16.4,118,2.6,2.9,.21,1.62,5.85,.92,3.2,1060
1,13.56,1.73,2.46,20.5,116,2.96,2.78,.2,2.45,6.25,.98,3.03,1120
1,14.22,1.7,2.3,16.3,118,3.2,3,.26,2.03,6.38,.94,3.31,970
1,13.29,1.97,2.68,16.8,102,3,3.23,.31,1.66,6,1.07,2.84,1270
1,13.72,1.43,2.5,16.7,108,3.4,3.67,.19,2.04,6.8,.89,2.87,1285
2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520
2,12.33,1.1,2.28,16,101,2.05,1.09,.63,.41,3.27,1.25,1.67,680
2,12.64,1.36,2.02,16.8,100,2.02,1.41,.53,.62,5.75,.98,1.59,450
2,13.67,1.25,1.92,18,94,2.1,1.79,.32,.73,3.8,1.23,2.46,630
2,12.37,1.13,2.16,19,87,3.5,3.1,.19,1.87,4.45,1.22,2.87,420
2,12.17,1.45,2.53,19,104,1.89,1.75,.45,1.03,2.95,1.45,2.23,355
2,12.37,1.21,2.56,18.1,98,2.42,2.65,.37,2.08,4.6,1.19,2.3,678
2,13.11,1.01,1.7,15,78,2.98,3.18,.26,2.28,5.3,1.12,3.18,502
2,12.37,1.17,1.92,19.6,78,2.11,2,.27,1.04,4.68,1.12,3.48,510
2,13.34,.94,2.36,17,110,2.53,1.3,.55,.42,3.17,1.02,1.93,750
2,12.21,1.19,1.75,16.8,151,1.85,1.28,.14,2.5,2.85,1.28,3.07,718
2,12.29,1.61,2.21,20.4,103,1.1,1.02,.37,1.46,3.05,.906,1.82,870
2,13.86,1.51,2.67,25,86,2.95,2.86,.21,1.87,3.38,1.36,3.16,410
2,13.49,1.66,2.24,24,87,1.88,1.84,.27,1.03,3.74,.98,2.78,472
2,12.99,1.67,2.6,30,139,3.3,2.89,.21,1.96,3.35,1.31,3.5,985
2,11.96,1.09,2.3,21,101,3.38,2.14,.13,1.65,3.21,.99,3.13,886
2,11.66,1.88,1.92,16,97,1.61,1.57,.34,1.15,3.8,1.23,2.14,428
2,13.03,.9,1.71,16,86,1.95,2.03,.24,1.46,4.6,1.19,2.48,392
2,11.84,2.89,2.23,18,112,1.72,1.32,.43,.95,2.65,.96,2.52,500
2,12.33,.99,1.95,14.8,136,1.9,1.85,.35,2.76,3.4,1.06,2.31,750
2,12.7,3.87,2.4,23,101,2.83,2.55,.43,1.95,2.57,1.19,3.13,463
2,12,.92,2,19,86,2.42,2.26,.3,1.43,2.5,1.38,3.12,278
2,12.72,1.81,2.2,18.8,86,2.2,2.53,.26,1.77,3.9,1.16,3.14,714
2,12.08,1.13,2.51,24,78,2,1.58,.4,1.4,2.2,1.31,2.72,630
2,13.05,3.86,2.32,22.5,85,1.65,1.59,.61,1.62,4.8,.84,2.01,515
2,11.84,.89,2.58,18,94,2.2,2.21,.22,2.35,3.05,.79,3.08,520
2,12.67,.98,2.24,18,99,2.2,1.94,.3,1.46,2.62,1.23,3.16,450
2,12.16,1.61,2.31,22.8,90,1.78,1.69,.43,1.56,2.45,1.33,2.26,495
2,11.65,1.67,2.62,26,88,1.92,1.61,.4,1.34,2.6,1.36,3.21,562
2,11.64,2.06,2.46,21.6,84,1.95,1.69,.48,1.35,2.8,1,2.75,680
2,12.08,1.33,2.3,23.6,70,2.2,1.59,.42,1.38,1.74,1.07,3.21,625
2,12.08,1.83,2.32,18.5,81,1.6,1.5,.52,1.64,2.4,1.08,2.27,480
2,12,1.51,2.42,22,86,1.45,1.25,.5,1.63,3.6,1.05,2.65,450
2,12.69,1.53,2.26,20.7,80,1.38,1.46,.58,1.62,3.05,.96,2.06,495
2,12.29,2.83,2.22,18,88,2.45,2.25,.25,1.99,2.15,1.15,3.3,290
2,11.62,1.99,2.28,18,98,3.02,2.26,.17,1.35,3.25,1.16,2.96,345
2,12.47,1.52,2.2,19,162,2.5,2.27,.32,3.28,2.6,1.16,2.63,937
2,11.81,2.12,2.74,21.5,134,1.6,.99,.14,1.56,2.5,.95,2.26,625
2,12.29,1.41,1.98,16,85,2.55,2.5,.29,1.77,2.9,1.23,2.74,428
2,12.37,1.07,2.1,18.5,88,3.52,3.75,.24,1.95,4.5,1.04,2.77,660
2,12.29,3.17,2.21,18,88,2.85,2.99,.45,2.81,2.3,1.42,2.83,406
2,12.08,2.08,1.7,17.5,97,2.23,2.17,.26,1.4,3.3,1.27,2.96,710
2,12.6,1.34,1.9,18.5,88,1.45,1.36,.29,1.35,2.45,1.04,2.77,562
2,12.34,2.45,2.46,21,98,2.56,2.11,.34,1.31,2.8,.8,3.38,438
2,11.82,1.72,1.88,19.5,86,2.5,1.64,.37,1.42,2.06,.94,2.44,415
2,12.51,1.73,1.98,20.5,85,2.2,1.92,.32,1.48,2.94,1.04,3.57,672
2,12.42,2.55,2.27,22,90,1.68,1.84,.66,1.42,2.7,.86,3.3,315
2,12.25,1.73,2.12,19,80,1.65,2.03,.37,1.63,3.4,1,3.17,510
2,12.72,1.75,2.28,22.5,84,1.38,1.76,.48,1.63,3.3,.88,2.42,488
2,12.22,1.29,1.94,19,92,2.36,2.04,.39,2.08,2.7,.86,3.02,312
2,11.61,1.35,2.7,20,94,2.74,2.92,.29,2.49,2.65,.96,3.26,680
2,11.46,3.74,1.82,19.5,107,3.18,2.58,.24,3.58,2.9,.75,2.81,562
2,12.52,2.43,2.17,21,88,2.55,2.27,.26,1.22,2,.9,2.78,325
2,11.76,2.68,2.92,20,103,1.75,2.03,.6,1.05,3.8,1.23,2.5,607
2,11.41,.74,2.5,21,88,2.48,2.01,.42,1.44,3.08,1.1,2.31,434
2,12.08,1.39,2.5,22.5,84,2.56,2.29,.43,1.04,2.9,.93,3.19,385
2,11.03,1.51,2.2,21.5,85,2.46,2.17,.52,2.01,1.9,1.71,2.87,407
2,11.82,1.47,1.99,20.8,86,1.98,1.6,.3,1.53,1.95,.95,3.33,495
2,12.42,1.61,2.19,22.5,108,2,2.09,.34,1.61,2.06,1.06,2.96,345
2,12.77,3.43,1.98,16,80,1.63,1.25,.43,.83,3.4,.7,2.12,372
2,12,3.43,2,19,87,2,1.64,.37,1.87,1.28,.93,3.05,564
2,11.45,2.4,2.42,20,96,2.9,2.79,.32,1.83,3.25,.8,3.39,625
2,11.56,2.05,3.23,28.5,119,3.18,5.08,.47,1.87,6,.93,3.69,465
2,12.42,4.43,2.73,26.5,102,2.2,2.13,.43,1.71,2.08,.92,3.12,365
2,13.05,5.8,2.13,21.5,86,2.62,2.65,.3,2.01,2.6,.73,3.1,380
2,11.87,4.31,2.39,21,82,2.86,3.03,.21,2.91,2.8,.75,3.64,380
2,12.07,2.16,2.17,21,85,2.6,2.65,.37,1.35,2.76,.86,3.28,378
2,12.43,1.53,2.29,21.5,86,2.74,3.15,.39,1.77,3.94,.69,2.84,352
2,11.79,2.13,2.78,28.5,92,2.13,2.24,.58,1.76,3,.97,2.44,466
2,12.37,1.63,2.3,24.5,88,2.22,2.45,.4,1.9,2.12,.89,2.78,342
2,12.04,4.3,2.38,22,80,2.1,1.75,.42,1.35,2.6,.79,2.57,580
3,12.86,1.35,2.32,18,122,1.51,1.25,.21,.94,4.1,.76,1.29,630
3,12.88,2.99,2.4,20,104,1.3,1.22,.24,.83,5.4,.74,1.42,530
3,12.81,2.31,2.4,24,98,1.15,1.09,.27,.83,5.7,.66,1.36,560
3,12.7,3.55,2.36,21.5,106,1.7,1.2,.17,.84,5,.78,1.29,600
3,12.51,1.24,2.25,17.5,85,2,.58,.6,1.25,5.45,.75,1.51,650
3,12.6,2.46,2.2,18.5,94,1.62,.66,.63,.94,7.1,.73,1.58,695
3,12.25,4.72,2.54,21,89,1.38,.47,.53,.8,3.85,.75,1.27,720
3,12.53,5.51,2.64,25,96,1.79,.6,.63,1.1,5,.82,1.69,515
3,13.49,3.59,2.19,19.5,88,1.62,.48,.58,.88,5.7,.81,1.82,580
3,12.84,2.96,2.61,24,101,2.32,.6,.53,.81,4.92,.89,2.15,590
3,12.93,2.81,2.7,21,96,1.54,.5,.53,.75,4.6,.77,2.31,600
3,13.36,2.56,2.35,20,89,1.4,.5,.37,.64,5.6,.7,2.47,780
3,13.52,3.17,2.72,23.5,97,1.55,.52,.5,.55,4.35,.89,2.06,520
3,13.62,4.95,2.35,20,92,2,.8,.47,1.02,4.4,.91,2.05,550
3,12.25,3.88,2.2,18.5,112,1.38,.78,.29,1.14,8.21,.65,2,855
3,13.16,3.57,2.15,21,102,1.5,.55,.43,1.3,4,.6,1.68,830
3,13.88,5.04,2.23,20,80,.98,.34,.4,.68,4.9,.58,1.33,415
3,12.87,4.61,2.48,21.5,86,1.7,.65,.47,.86,7.65,.54,1.86,625
3,13.32,3.24,2.38,21.5,92,1.93,.76,.45,1.25,8.42,.55,1.62,650
3,13.08,3.9,2.36,21.5,113,1.41,1.39,.34,1.14,9.40,.57,1.33,550
3,13.5,3.12,2.62,24,123,1.4,1.57,.22,1.25,8.60,.59,1.3,500
3,12.79,2.67,2.48,22,112,1.48,1.36,.24,1.26,10.8,.48,1.47,480
3,13.11,1.9,2.75,25.5,116,2.2,1.28,.26,1.56,7.1,.61,1.33,425
3,13.23,3.3,2.28,18.5,98,1.8,.83,.61,1.87,10.52,.56,1.51,675
3,12.58,1.29,2.1,20,103,1.48,.58,.53,1.4,7.6,.58,1.55,640
3,13.17,5.19,2.32,22,93,1.74,.63,.61,1.55,7.9,.6,1.48,725
3,13.84,4.12,2.38,19.5,89,1.8,.83,.48,1.56,9.01,.57,1.64,480
3,12.45,3.03,2.64,27,97,1.9,.58,.63,1.14,7.5,.67,1.73,880
3,14.34,1.68,2.7,25,98,2.8,1.31,.53,2.7,13,.57,1.96,660
3,13.48,1.67,2.64,22.5,89,2.6,1.1,.52,2.29,11.75,.57,1.78,620
3,12.36,3.83,2.38,21,88,2.3,.92,.5,1.04,7.65,.56,1.58,520
3,13.69,3.26,2.54,20,107,1.83,.56,.5,.8,5.88,.96,1.82,680
3,12.85,3.27,2.58,22,106,1.65,.6,.6,.96,5.58,.87,2.11,570
3,12.96,3.45,2.35,18.5,106,1.39,.7,.4,.94,5.28,.68,1.75,675
3,13.78,2.76,2.3,22,90,1.35,.68,.41,1.03,9.58,.7,1.68,615
3,13.73,4.36,2.26,22.5,88,1.28,.47,.52,1.15,6.62,.78,1.75,520
3,13.45,3.7,2.6,23,111,1.7,.92,.43,1.46,10.68,.85,1.56,695
3,12.82,3.37,2.3,19.5,88,1.48,.66,.4,.97,10.26,.72,1.75,685
3,13.58,2.58,2.69,24.5,105,1.55,.84,.39,1.54,8.66,.74,1.8,750
3,13.4,4.6,2.86,25,112,1.98,.96,.27,1.11,8.5,.67,1.92,630
3,12.2,3.03,2.32,19,96,1.25,.49,.4,.73,5.5,.66,1.83,510
3,12.77,2.39,2.28,19.5,86,1.39,.51,.48,.64,9.899999,.57,1.63,470
3,14.16,2.51,2.48,20,91,1.68,.7,.44,1.24,9.7,.62,1.71,660
3,13.71,5.65,2.45,20.5,95,1.68,.61,.52,1.06,7.7,.64,1.74,740
3,13.4,3.91,2.48,23,102,1.8,.75,.43,1.41,7.3,.7,1.56,750
3,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835
3,13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840
3,14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560
@@ -0,0 +1,100 @@
1. Title of Database: Wine recognition data
   Updated Sept 21, 1998 by C.Blake : Added attribute information

2. Sources:
   (a) Forina, M. et al, PARVUS - An Extendible Package for Data
       Exploration, Classification and Correlation. Institute of Pharmaceutical
       and Food Analysis and Technologies, Via Brigata Salerno,
       16147 Genoa, Italy.

   (b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au
   (c) July 1991
3. Past Usage:

   (1)
   S. Aeberhard, D. Coomans and O. de Vel,
   Comparison of Classifiers in High Dimensional Settings,
   Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
   Mathematics and Statistics, James Cook University of North Queensland.
   (Also submitted to Technometrics).

   The data was used with many others for comparing various
   classifiers. The classes are separable, though only RDA
   has achieved 100% correct classification.
   (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
   (All results using the leave-one-out technique)

   In a classification context, this is a well posed problem
   with "well behaved" class structures. A good data set
   for first testing of a new classifier, but not very
   challenging.

   (2)
   S. Aeberhard, D. Coomans and O. de Vel,
   "THE CLASSIFICATION PERFORMANCE OF RDA"
   Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
   Mathematics and Statistics, James Cook University of North Queensland.
   (Also submitted to Journal of Chemometrics).

   Here, the data was used to illustrate the superior performance of
   the use of a new appreciation function with RDA.

4. Relevant Information:

   -- These data are the results of a chemical analysis of
      wines grown in the same region in Italy but derived from three
      different cultivars.
      The analysis determined the quantities of 13 constituents
      found in each of the three types of wines.

   -- I think that the initial data set had around 30 variables, but
      for some reason I only have the 13 dimensional version.
      I had a list of what the 30 or so variables were, but a.)
      I lost it, and b.), I would not know which 13 variables
      are included in the set.

   -- The attributes are (donated by Riccardo Leardi,
      riclea@anchem.unige.it)
      1) Alcohol
      2) Malic acid
      3) Ash
      4) Alcalinity of ash
      5) Magnesium
      6) Total phenols
      7) Flavanoids
      8) Nonflavanoid phenols
      9) Proanthocyanins
      10) Color intensity
      11) Hue
      12) OD280/OD315 of diluted wines
      13) Proline

5. Number of Instances

   class 1  59
   class 2  71
   class 3  48

6. Number of Attributes

   13

7. For Each Attribute:

   All attributes are continuous

   No statistics available, but suggest to standardise
   variables for certain uses (e.g. for use with classifiers
   which are NOT scale invariant)

   NOTE: 1st attribute is class identifier (1-3)

8. Missing Attribute Values:

   None

9. Class Distribution: number of instances per class

   class 1  59
   class 2  71
   class 3  48
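Two details in this description are easy to sanity-check in R: the section 7 note about standardizing variables for scale-sensitive classifiers, and the "1NN 96.1% (z-transformed data)" leave-one-out figure in section 3. A minimal sketch, assuming wine.data sits in the working directory and using class::knn.cv (the class package is already loaded by the lab scripts below); this is an illustration, not part of the committed files:

## assumed sanity check -- not part of the committed files
library(class)
wine <- read.csv("wine.data", header = FALSE)         # column 1 is the class label (1-3)
x <- scale(wine[, -1])                                # z-transform the 13 measurements
loo_pred <- knn.cv(x, cl = factor(wine[, 1]), k = 1)  # leave-one-out 1NN predictions
mean(loo_pred == factor(wine[, 1]))                   # proportion correct; ~0.96 per the note above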
@@ -0,0 +1,9 @@
{
    "[r]": {
        // generated automatically? What even....
        "editor.wordSeparators": "`~!@#$%^&*()-=+[{]}\\|;:'\",<>/",
        "editor.indentSize": "tabSize",
        "editor.useTabStops": true
    }
}
@@ -0,0 +1,41 @@
##########################################
### Principal Component Analysis (PCA) ###
##########################################

## load libraries
library(ggplot2)
library(ggfortify)
library(GGally)
library(e1071)
library(class)
library(psych)
library(readr)

## set working directory so that files can be referenced without the full path
setwd("~/Courses/Data Analytics/Fall25/labs/lab 4/")

## read dataset
wine <- read_csv("wine.data", col_names = FALSE)

## set column names
names(wine) <- c("Type", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid Phenols", "Proanthocyanins", "Color Intensity", "Hue", "Od280/od315 of diluted wines", "Proline")

## inspect data frame
head(wine)

## change the data type of the "Type" column from character to factor
####
# Factors look like regular strings (characters) but with factors R knows
# that the column is a categorical variable with finite possible values
# e.g. "Type" in the Wine dataset can only be 1, 2, or 3
####
wine$Type <- as.factor(wine$Type)

## visualize variables
pairs.panels(wine[, -1], gap = 0, bg = c("red", "yellow", "blue")[wine$Type], pch = 21)

ggpairs(wine, ggplot2::aes(colour = Type))

###
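The lab 4 file stops before the analysis its banner promises: the variables are visualized, but no components are ever computed. A minimal sketch of the missing step, assuming the `wine` data frame built above; `prcomp` plus `ggfortify::autoplot` is one standard route, shown here as an illustration rather than the author's code:

## assumed completion of the script above -- not part of the committed file
pca <- prcomp(wine[, -1], center = TRUE, scale. = TRUE)  # PCA on the 13 standardized measurements
summary(pca)                                             # proportion of variance per component
autoplot(pca, data = wine, colour = 'Type')              # score plot, points colored by cultivar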
@@ -0,0 +1,128 @@
install.packages(
  c("e1071", "caret", "randomForest", "ggplot2", "pROC"),
  repos = c("https://cloud.r-project.org/"),
  dependencies = TRUE
)

suppressPackageStartupMessages({
  library(e1071)        # for svm/tune.svm
  library(caret)        # for metrics
  library(randomForest) # alternative classifier
  library(ggplot2)
})

set.seed(42)

read_wine <- function() {
  df <- read.csv("wine.data", header = FALSE)
  colnames(df) <- c(
    "Class",
    "Alcohol", "Malic.acid", "Ash", "Alcalinity.of.ash", "Magnesium",
    "Total.phenols", "Flavanoids", "Nonflavanoid.phenols", "Proanthocyanins",
    "Color.intensity", "Hue", "OD280.OD315", "Proline"
  )
  df$Class <- factor(df$Class)
  df
}

df <- read_wine()

# split into train/test
idx <- createDataPartition(df$Class, p = 0.8, list = FALSE)
train <- df[idx, ]
test <- df[-idx, ]

# choose a subset of features based on ANOVA F-test
# I picked this subset before the runs:
# alcohol, flavanoids, color intensity, od280/od315, proline, total phenols
features <- c("Alcohol", "Flavanoids", "Color.intensity", "OD280.OD315", "Proline", "Total.phenols")
x_train <- train[, features]
y_train <- train$Class
x_test <- test[, features]
y_test <- test$Class
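# (assumed illustration -- not part of the committed script) the ANOVA
# F-test ranking mentioned above could be reproduced with one one-way
# ANOVA per feature; a larger F statistic means better class separation:
f_stats <- sapply(df[, -1], function(v) summary(aov(v ~ df$Class))[[1]][["F value"]][1])
sort(f_stats, decreasing = TRUE)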

# scale features
pp <- preProcess(x_train, method = c("center", "scale"))
x_train_s <- predict(pp, x_train)
x_test_s <- predict(pp, x_test)

# linear kernel svm with hyperparameter tuning (C)
set.seed(42)
lin_grid <- data.frame(cost = c(0.1, 1, 10, 100))
tune_lin <- tune.svm(
  x = x_train_s, y = y_train,
  kernel = "linear",
  cost = lin_grid$cost,
  tunecontrol = tune.control(cross = 5)
)
lin_best <- tune_lin$best.model

# rbf kernel svm with tuning (C, gamma)
set.seed(42)
rbf_grid_cost <- c(0.1, 1, 10, 100, 1000)
rbf_grid_gamma <- c(0.001, 0.01, 0.1, 1)
tune_rbf <- tune.svm(
  x = x_train_s, y = y_train,
  kernel = "radial",
  cost = rbf_grid_cost,
  gamma = rbf_grid_gamma,
  tunecontrol = tune.control(cross = 5)
)
rbf_best <- tune_rbf$best.model

# alt classifier: random forest (same features)
set.seed(42)
rf_fit <- randomForest(x = x_train, y = y_train, ntree = 500, mtry = 2, importance = TRUE)

# evaluation helper
eval_model <- function(model, x_test_s, y_test, name) {
  pred <- predict(model, x_test_s)
  cm <- confusionMatrix(pred, y_test)
  pr <- data.frame(
    model = name,
    accuracy = cm$overall["Accuracy"],
    precision_macro = mean(cm$byClass[, "Precision"], na.rm = TRUE),
    recall_macro = mean(cm$byClass[, "Recall"], na.rm = TRUE),
    f1_macro = mean(cm$byClass[, "F1"], na.rm = TRUE)
  )
  list(cm = cm, pr = pr)
}

# eval svm models (use scaled features)
lin_eval <- eval_model(lin_best, x_test_s, y_test, "svm_linear")
rbf_eval <- eval_model(rbf_best, x_test_s, y_test, "svm_rbf")

# evaluate random forest (no scaling)
rf_pred <- predict(rf_fit, x_test)
rf_cm <- confusionMatrix(rf_pred, y_test)

rf_pr <- data.frame(
  model = "random_forest",
  accuracy = rf_cm$overall["Accuracy"],
  precision_macro = mean(rf_cm$byClass[, "Precision"], na.rm = TRUE),
  recall_macro = mean(rf_cm$byClass[, "Recall"], na.rm = TRUE),
  f1_macro = mean(rf_cm$byClass[, "F1"], na.rm = TRUE)
)

perf <- rbind(lin_eval$pr, rbf_eval$pr, rf_pr)

# print
cat("best params (linear svm): C =", lin_best$cost, "\n")
cat("best params (rbf svm): C =", rbf_best$cost, " gamma =", rbf_best$gamma, "\n\n")
print(perf)

# macro-f1 comparison
ggplot(perf, aes(x = model, y = f1_macro)) +
  geom_col() +
  labs(title = "macro-F1 by model (wine test set)")

# save outputs
write.table(perf, file = "lab5_performance_table.txt", sep = "\t", row.names = FALSE, quote = FALSE)
sink("lab5_confusion_matrices.txt")
cat("=== svm linear ===\n")
print(lin_eval$cm)
cat("\n=== svm rbf ===\n")
print(rbf_eval$cm)
cat("\n=== random forest ===\n")
print(rf_cm)
sink()
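pROC is installed at the top of the lab 5 script but never used. A sketch of how it could close that loop, assuming a refit of the tuned RBF model with probability = TRUE (e1071 exposes per-class probabilities as a prediction attribute) and a pROC version whose multiclass.roc accepts a probability matrix; illustrative, not the author's code:

## assumed extension -- not part of the committed script
library(pROC)
svm_prob <- svm(x = x_train_s, y = y_train, kernel = "radial",
                cost = rbf_best$cost, gamma = rbf_best$gamma, probability = TRUE)
prob_pred <- predict(svm_prob, x_test_s, probability = TRUE)
probs <- attr(prob_pred, "probabilities")  # per-class probability matrix
multiclass.roc(y_test, probs)              # Hand-Till multiclass AUC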
@@ -0,0 +1,95 @@
=== svm linear ===
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3
         1 11  1  0
         2  0 13  0
         3  0  0  9

Overall Statistics

               Accuracy : 0.9706
                 95% CI : (0.8467, 0.9993)
    No Information Rate : 0.4118
    P-Value [Acc > NIR] : 3.92e-12

                  Kappa : 0.9553

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3
Sensitivity            1.0000   0.9286   1.0000
Specificity            0.9565   1.0000   1.0000
Pos Pred Value         0.9167   1.0000   1.0000
Neg Pred Value         1.0000   0.9524   1.0000
Prevalence             0.3235   0.4118   0.2647
Detection Rate         0.3235   0.3824   0.2647
Detection Prevalence   0.3529   0.3824   0.2647
Balanced Accuracy      0.9783   0.9643   1.0000

=== svm rbf ===
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3
         1 11  1  0
         2  0 13  0
         3  0  0  9

Overall Statistics

               Accuracy : 0.9706
                 95% CI : (0.8467, 0.9993)
    No Information Rate : 0.4118
    P-Value [Acc > NIR] : 3.92e-12

                  Kappa : 0.9553

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3
Sensitivity            1.0000   0.9286   1.0000
Specificity            0.9565   1.0000   1.0000
Pos Pred Value         0.9167   1.0000   1.0000
Neg Pred Value         1.0000   0.9524   1.0000
Prevalence             0.3235   0.4118   0.2647
Detection Rate         0.3235   0.3824   0.2647
Detection Prevalence   0.3529   0.3824   0.2647
Balanced Accuracy      0.9783   0.9643   1.0000

=== random forest ===
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3
         1 11  1  0
         2  0 13  0
         3  0  0  9

Overall Statistics

               Accuracy : 0.9706
                 95% CI : (0.8467, 0.9993)
    No Information Rate : 0.4118
    P-Value [Acc > NIR] : 3.92e-12

                  Kappa : 0.9553

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3
Sensitivity            1.0000   0.9286   1.0000
Specificity            0.9565   1.0000   1.0000
Pos Pred Value         0.9167   1.0000   1.0000
Neg Pred Value         1.0000   0.9524   1.0000
Prevalence             0.3235   0.4118   0.2647
Detection Rate         0.3235   0.3824   0.2647
Detection Prevalence   0.3529   0.3824   0.2647
Balanced Accuracy      0.9783   0.9643   1.0000
@@ -0,0 +1,4 @@
model accuracy precision_macro recall_macro f1_macro
svm_linear 0.970588235294118 0.972222222222222 0.976190476190476 0.973161567364466
svm_rbf 0.970588235294118 0.972222222222222 0.976190476190476 0.973161567364466
random_forest 0.970588235294118 0.972222222222222 0.976190476190476 0.973161567364466
@@ -0,0 +1,178 @@
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295
1,14.83,1.64,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,1045
1,13.86,1.35,2.27,16,98,2.98,3.15,.22,1.85,7.22,1.01,3.55,1045
1,14.1,2.16,2.3,18,105,2.95,3.32,.22,2.38,5.75,1.25,3.17,1510
1,14.12,1.48,2.32,16.8,95,2.2,2.43,.26,1.57,5,1.17,2.82,1280
1,13.75,1.73,2.41,16,89,2.6,2.76,.29,1.81,5.6,1.15,2.9,1320
1,14.75,1.73,2.39,11.4,91,3.1,3.69,.43,2.81,5.4,1.25,2.73,1150
1,14.38,1.87,2.38,12,102,3.3,3.64,.29,2.96,7.5,1.2,3,1547
1,13.63,1.81,2.7,17.2,112,2.85,2.91,.3,1.46,7.3,1.28,2.88,1310
1,14.3,1.92,2.72,20,120,2.8,3.14,.33,1.97,6.2,1.07,2.65,1280
1,13.83,1.57,2.62,20,115,2.95,3.4,.4,1.72,6.6,1.13,2.57,1130
1,14.19,1.59,2.48,16.5,108,3.3,3.93,.32,1.86,8.7,1.23,2.82,1680
1,13.64,3.1,2.56,15.2,116,2.7,3.03,.17,1.66,5.1,.96,3.36,845
1,14.06,1.63,2.28,16,126,3,3.17,.24,2.1,5.65,1.09,3.71,780
1,12.93,3.8,2.65,18.6,102,2.41,2.41,.25,1.98,4.5,1.03,3.52,770
1,13.71,1.86,2.36,16.6,101,2.61,2.88,.27,1.69,3.8,1.11,4,1035
1,12.85,1.6,2.52,17.8,95,2.48,2.37,.26,1.46,3.93,1.09,3.63,1015
1,13.5,1.81,2.61,20,96,2.53,2.61,.28,1.66,3.52,1.12,3.82,845
1,13.05,2.05,3.22,25,124,2.63,2.68,.47,1.92,3.58,1.13,3.2,830
1,13.39,1.77,2.62,16.1,93,2.85,2.94,.34,1.45,4.8,.92,3.22,1195
1,13.3,1.72,2.14,17,94,2.4,2.19,.27,1.35,3.95,1.02,2.77,1285
1,13.87,1.9,2.8,19.4,107,2.95,2.97,.37,1.76,4.5,1.25,3.4,915
1,14.02,1.68,2.21,16,96,2.65,2.33,.26,1.98,4.7,1.04,3.59,1035
1,13.73,1.5,2.7,22.5,101,3,3.25,.29,2.38,5.7,1.19,2.71,1285
1,13.58,1.66,2.36,19.1,106,2.86,3.19,.22,1.95,6.9,1.09,2.88,1515
1,13.68,1.83,2.36,17.2,104,2.42,2.69,.42,1.97,3.84,1.23,2.87,990
1,13.76,1.53,2.7,19.5,132,2.95,2.74,.5,1.35,5.4,1.25,3,1235
1,13.51,1.8,2.65,19,110,2.35,2.53,.29,1.54,4.2,1.1,2.87,1095
1,13.48,1.81,2.41,20.5,100,2.7,2.98,.26,1.86,5.1,1.04,3.47,920
1,13.28,1.64,2.84,15.5,110,2.6,2.68,.34,1.36,4.6,1.09,2.78,880
1,13.05,1.65,2.55,18,98,2.45,2.43,.29,1.44,4.25,1.12,2.51,1105
1,13.07,1.5,2.1,15.5,98,2.4,2.64,.28,1.37,3.7,1.18,2.69,1020
1,14.22,3.99,2.51,13.2,128,3,3.04,.2,2.08,5.1,.89,3.53,760
1,13.56,1.71,2.31,16.2,117,3.15,3.29,.34,2.34,6.13,.95,3.38,795
1,13.41,3.84,2.12,18.8,90,2.45,2.68,.27,1.48,4.28,.91,3,1035
1,13.88,1.89,2.59,15,101,3.25,3.56,.17,1.7,5.43,.88,3.56,1095
1,13.24,3.98,2.29,17.5,103,2.64,2.63,.32,1.66,4.36,.82,3,680
1,13.05,1.77,2.1,17,107,3,3,.28,2.03,5.04,.88,3.35,885
1,14.21,4.04,2.44,18.9,111,2.85,2.65,.3,1.25,5.24,.87,3.33,1080
1,14.38,3.59,2.28,16,102,3.25,3.17,.27,2.19,4.9,1.04,3.44,1065
1,13.9,1.68,2.12,16,101,3.1,3.39,.21,2.14,6.1,.91,3.33,985
1,14.1,2.02,2.4,18.8,103,2.75,2.92,.32,2.38,6.2,1.07,2.75,1060
1,13.94,1.73,2.27,17.4,108,2.88,3.54,.32,2.08,8.90,1.12,3.1,1260
1,13.05,1.73,2.04,12.4,92,2.72,3.27,.17,2.91,7.2,1.12,2.91,1150
1,13.83,1.65,2.6,17.2,94,2.45,2.99,.22,2.29,5.6,1.24,3.37,1265
1,13.82,1.75,2.42,14,111,3.88,3.74,.32,1.87,7.05,1.01,3.26,1190
1,13.77,1.9,2.68,17.1,115,3,2.79,.39,1.68,6.3,1.13,2.93,1375
1,13.74,1.67,2.25,16.4,118,2.6,2.9,.21,1.62,5.85,.92,3.2,1060
1,13.56,1.73,2.46,20.5,116,2.96,2.78,.2,2.45,6.25,.98,3.03,1120
1,14.22,1.7,2.3,16.3,118,3.2,3,.26,2.03,6.38,.94,3.31,970
1,13.29,1.97,2.68,16.8,102,3,3.23,.31,1.66,6,1.07,2.84,1270
1,13.72,1.43,2.5,16.7,108,3.4,3.67,.19,2.04,6.8,.89,2.87,1285
2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520
2,12.33,1.1,2.28,16,101,2.05,1.09,.63,.41,3.27,1.25,1.67,680
2,12.64,1.36,2.02,16.8,100,2.02,1.41,.53,.62,5.75,.98,1.59,450
2,13.67,1.25,1.92,18,94,2.1,1.79,.32,.73,3.8,1.23,2.46,630
2,12.37,1.13,2.16,19,87,3.5,3.1,.19,1.87,4.45,1.22,2.87,420
2,12.17,1.45,2.53,19,104,1.89,1.75,.45,1.03,2.95,1.45,2.23,355
2,12.37,1.21,2.56,18.1,98,2.42,2.65,.37,2.08,4.6,1.19,2.3,678
2,13.11,1.01,1.7,15,78,2.98,3.18,.26,2.28,5.3,1.12,3.18,502
2,12.37,1.17,1.92,19.6,78,2.11,2,.27,1.04,4.68,1.12,3.48,510
2,13.34,.94,2.36,17,110,2.53,1.3,.55,.42,3.17,1.02,1.93,750
2,12.21,1.19,1.75,16.8,151,1.85,1.28,.14,2.5,2.85,1.28,3.07,718
2,12.29,1.61,2.21,20.4,103,1.1,1.02,.37,1.46,3.05,.906,1.82,870
2,13.86,1.51,2.67,25,86,2.95,2.86,.21,1.87,3.38,1.36,3.16,410
2,13.49,1.66,2.24,24,87,1.88,1.84,.27,1.03,3.74,.98,2.78,472
2,12.99,1.67,2.6,30,139,3.3,2.89,.21,1.96,3.35,1.31,3.5,985
2,11.96,1.09,2.3,21,101,3.38,2.14,.13,1.65,3.21,.99,3.13,886
2,11.66,1.88,1.92,16,97,1.61,1.57,.34,1.15,3.8,1.23,2.14,428
2,13.03,.9,1.71,16,86,1.95,2.03,.24,1.46,4.6,1.19,2.48,392
2,11.84,2.89,2.23,18,112,1.72,1.32,.43,.95,2.65,.96,2.52,500
2,12.33,.99,1.95,14.8,136,1.9,1.85,.35,2.76,3.4,1.06,2.31,750
2,12.7,3.87,2.4,23,101,2.83,2.55,.43,1.95,2.57,1.19,3.13,463
2,12,.92,2,19,86,2.42,2.26,.3,1.43,2.5,1.38,3.12,278
2,12.72,1.81,2.2,18.8,86,2.2,2.53,.26,1.77,3.9,1.16,3.14,714
2,12.08,1.13,2.51,24,78,2,1.58,.4,1.4,2.2,1.31,2.72,630
2,13.05,3.86,2.32,22.5,85,1.65,1.59,.61,1.62,4.8,.84,2.01,515
2,11.84,.89,2.58,18,94,2.2,2.21,.22,2.35,3.05,.79,3.08,520
2,12.67,.98,2.24,18,99,2.2,1.94,.3,1.46,2.62,1.23,3.16,450
2,12.16,1.61,2.31,22.8,90,1.78,1.69,.43,1.56,2.45,1.33,2.26,495
2,11.65,1.67,2.62,26,88,1.92,1.61,.4,1.34,2.6,1.36,3.21,562
2,11.64,2.06,2.46,21.6,84,1.95,1.69,.48,1.35,2.8,1,2.75,680
2,12.08,1.33,2.3,23.6,70,2.2,1.59,.42,1.38,1.74,1.07,3.21,625
2,12.08,1.83,2.32,18.5,81,1.6,1.5,.52,1.64,2.4,1.08,2.27,480
2,12,1.51,2.42,22,86,1.45,1.25,.5,1.63,3.6,1.05,2.65,450
2,12.69,1.53,2.26,20.7,80,1.38,1.46,.58,1.62,3.05,.96,2.06,495
2,12.29,2.83,2.22,18,88,2.45,2.25,.25,1.99,2.15,1.15,3.3,290
2,11.62,1.99,2.28,18,98,3.02,2.26,.17,1.35,3.25,1.16,2.96,345
2,12.47,1.52,2.2,19,162,2.5,2.27,.32,3.28,2.6,1.16,2.63,937
2,11.81,2.12,2.74,21.5,134,1.6,.99,.14,1.56,2.5,.95,2.26,625
2,12.29,1.41,1.98,16,85,2.55,2.5,.29,1.77,2.9,1.23,2.74,428
2,12.37,1.07,2.1,18.5,88,3.52,3.75,.24,1.95,4.5,1.04,2.77,660
2,12.29,3.17,2.21,18,88,2.85,2.99,.45,2.81,2.3,1.42,2.83,406
2,12.08,2.08,1.7,17.5,97,2.23,2.17,.26,1.4,3.3,1.27,2.96,710
2,12.6,1.34,1.9,18.5,88,1.45,1.36,.29,1.35,2.45,1.04,2.77,562
2,12.34,2.45,2.46,21,98,2.56,2.11,.34,1.31,2.8,.8,3.38,438
2,11.82,1.72,1.88,19.5,86,2.5,1.64,.37,1.42,2.06,.94,2.44,415
2,12.51,1.73,1.98,20.5,85,2.2,1.92,.32,1.48,2.94,1.04,3.57,672
2,12.42,2.55,2.27,22,90,1.68,1.84,.66,1.42,2.7,.86,3.3,315
2,12.25,1.73,2.12,19,80,1.65,2.03,.37,1.63,3.4,1,3.17,510
2,12.72,1.75,2.28,22.5,84,1.38,1.76,.48,1.63,3.3,.88,2.42,488
2,12.22,1.29,1.94,19,92,2.36,2.04,.39,2.08,2.7,.86,3.02,312
2,11.61,1.35,2.7,20,94,2.74,2.92,.29,2.49,2.65,.96,3.26,680
2,11.46,3.74,1.82,19.5,107,3.18,2.58,.24,3.58,2.9,.75,2.81,562
2,12.52,2.43,2.17,21,88,2.55,2.27,.26,1.22,2,.9,2.78,325
2,11.76,2.68,2.92,20,103,1.75,2.03,.6,1.05,3.8,1.23,2.5,607
2,11.41,.74,2.5,21,88,2.48,2.01,.42,1.44,3.08,1.1,2.31,434
2,12.08,1.39,2.5,22.5,84,2.56,2.29,.43,1.04,2.9,.93,3.19,385
2,11.03,1.51,2.2,21.5,85,2.46,2.17,.52,2.01,1.9,1.71,2.87,407
2,11.82,1.47,1.99,20.8,86,1.98,1.6,.3,1.53,1.95,.95,3.33,495
2,12.42,1.61,2.19,22.5,108,2,2.09,.34,1.61,2.06,1.06,2.96,345
2,12.77,3.43,1.98,16,80,1.63,1.25,.43,.83,3.4,.7,2.12,372
2,12,3.43,2,19,87,2,1.64,.37,1.87,1.28,.93,3.05,564
2,11.45,2.4,2.42,20,96,2.9,2.79,.32,1.83,3.25,.8,3.39,625
2,11.56,2.05,3.23,28.5,119,3.18,5.08,.47,1.87,6,.93,3.69,465
2,12.42,4.43,2.73,26.5,102,2.2,2.13,.43,1.71,2.08,.92,3.12,365
2,13.05,5.8,2.13,21.5,86,2.62,2.65,.3,2.01,2.6,.73,3.1,380
2,11.87,4.31,2.39,21,82,2.86,3.03,.21,2.91,2.8,.75,3.64,380
2,12.07,2.16,2.17,21,85,2.6,2.65,.37,1.35,2.76,.86,3.28,378
2,12.43,1.53,2.29,21.5,86,2.74,3.15,.39,1.77,3.94,.69,2.84,352
2,11.79,2.13,2.78,28.5,92,2.13,2.24,.58,1.76,3,.97,2.44,466
2,12.37,1.63,2.3,24.5,88,2.22,2.45,.4,1.9,2.12,.89,2.78,342
2,12.04,4.3,2.38,22,80,2.1,1.75,.42,1.35,2.6,.79,2.57,580
3,12.86,1.35,2.32,18,122,1.51,1.25,.21,.94,4.1,.76,1.29,630
3,12.88,2.99,2.4,20,104,1.3,1.22,.24,.83,5.4,.74,1.42,530
3,12.81,2.31,2.4,24,98,1.15,1.09,.27,.83,5.7,.66,1.36,560
3,12.7,3.55,2.36,21.5,106,1.7,1.2,.17,.84,5,.78,1.29,600
3,12.51,1.24,2.25,17.5,85,2,.58,.6,1.25,5.45,.75,1.51,650
3,12.6,2.46,2.2,18.5,94,1.62,.66,.63,.94,7.1,.73,1.58,695
3,12.25,4.72,2.54,21,89,1.38,.47,.53,.8,3.85,.75,1.27,720
3,12.53,5.51,2.64,25,96,1.79,.6,.63,1.1,5,.82,1.69,515
3,13.49,3.59,2.19,19.5,88,1.62,.48,.58,.88,5.7,.81,1.82,580
3,12.84,2.96,2.61,24,101,2.32,.6,.53,.81,4.92,.89,2.15,590
3,12.93,2.81,2.7,21,96,1.54,.5,.53,.75,4.6,.77,2.31,600
3,13.36,2.56,2.35,20,89,1.4,.5,.37,.64,5.6,.7,2.47,780
3,13.52,3.17,2.72,23.5,97,1.55,.52,.5,.55,4.35,.89,2.06,520
3,13.62,4.95,2.35,20,92,2,.8,.47,1.02,4.4,.91,2.05,550
3,12.25,3.88,2.2,18.5,112,1.38,.78,.29,1.14,8.21,.65,2,855
3,13.16,3.57,2.15,21,102,1.5,.55,.43,1.3,4,.6,1.68,830
3,13.88,5.04,2.23,20,80,.98,.34,.4,.68,4.9,.58,1.33,415
3,12.87,4.61,2.48,21.5,86,1.7,.65,.47,.86,7.65,.54,1.86,625
3,13.32,3.24,2.38,21.5,92,1.93,.76,.45,1.25,8.42,.55,1.62,650
3,13.08,3.9,2.36,21.5,113,1.41,1.39,.34,1.14,9.40,.57,1.33,550
3,13.5,3.12,2.62,24,123,1.4,1.57,.22,1.25,8.60,.59,1.3,500
3,12.79,2.67,2.48,22,112,1.48,1.36,.24,1.26,10.8,.48,1.47,480
3,13.11,1.9,2.75,25.5,116,2.2,1.28,.26,1.56,7.1,.61,1.33,425
3,13.23,3.3,2.28,18.5,98,1.8,.83,.61,1.87,10.52,.56,1.51,675
3,12.58,1.29,2.1,20,103,1.48,.58,.53,1.4,7.6,.58,1.55,640
3,13.17,5.19,2.32,22,93,1.74,.63,.61,1.55,7.9,.6,1.48,725
3,13.84,4.12,2.38,19.5,89,1.8,.83,.48,1.56,9.01,.57,1.64,480
3,12.45,3.03,2.64,27,97,1.9,.58,.63,1.14,7.5,.67,1.73,880
3,14.34,1.68,2.7,25,98,2.8,1.31,.53,2.7,13,.57,1.96,660
3,13.48,1.67,2.64,22.5,89,2.6,1.1,.52,2.29,11.75,.57,1.78,620
3,12.36,3.83,2.38,21,88,2.3,.92,.5,1.04,7.65,.56,1.58,520
3,13.69,3.26,2.54,20,107,1.83,.56,.5,.8,5.88,.96,1.82,680
3,12.85,3.27,2.58,22,106,1.65,.6,.6,.96,5.58,.87,2.11,570
3,12.96,3.45,2.35,18.5,106,1.39,.7,.4,.94,5.28,.68,1.75,675
3,13.78,2.76,2.3,22,90,1.35,.68,.41,1.03,9.58,.7,1.68,615
3,13.73,4.36,2.26,22.5,88,1.28,.47,.52,1.15,6.62,.78,1.75,520
3,13.45,3.7,2.6,23,111,1.7,.92,.43,1.46,10.68,.85,1.56,695
3,12.82,3.37,2.3,19.5,88,1.48,.66,.4,.97,10.26,.72,1.75,685
3,13.58,2.58,2.69,24.5,105,1.55,.84,.39,1.54,8.66,.74,1.8,750
3,13.4,4.6,2.86,25,112,1.98,.96,.27,1.11,8.5,.67,1.92,630
3,12.2,3.03,2.32,19,96,1.25,.49,.4,.73,5.5,.66,1.83,510
3,12.77,2.39,2.28,19.5,86,1.39,.51,.48,.64,9.899999,.57,1.63,470
3,14.16,2.51,2.48,20,91,1.68,.7,.44,1.24,9.7,.62,1.71,660
3,13.71,5.65,2.45,20.5,95,1.68,.61,.52,1.06,7.7,.64,1.74,740
3,13.4,3.91,2.48,23,102,1.8,.75,.43,1.41,7.3,.7,1.56,750
3,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835
3,13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840
3,14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560
@@ -0,0 +1,100 @@
1. Title of Database: Wine recognition data
   Updated Sept 21, 1998 by C.Blake : Added attribute information

2. Sources:
   (a) Forina, M. et al, PARVUS - An Extendible Package for Data
       Exploration, Classification and Correlation. Institute of Pharmaceutical
       and Food Analysis and Technologies, Via Brigata Salerno,
       16147 Genoa, Italy.

   (b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au
   (c) July 1991
3. Past Usage:

   (1)
   S. Aeberhard, D. Coomans and O. de Vel,
   Comparison of Classifiers in High Dimensional Settings,
   Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
   Mathematics and Statistics, James Cook University of North Queensland.
   (Also submitted to Technometrics).

   The data was used with many others for comparing various
   classifiers. The classes are separable, though only RDA
   has achieved 100% correct classification.
   (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
   (All results using the leave-one-out technique)

   In a classification context, this is a well posed problem
   with "well behaved" class structures. A good data set
   for first testing of a new classifier, but not very
   challenging.

   (2)
   S. Aeberhard, D. Coomans and O. de Vel,
   "THE CLASSIFICATION PERFORMANCE OF RDA"
   Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
   Mathematics and Statistics, James Cook University of North Queensland.
   (Also submitted to Journal of Chemometrics).

   Here, the data was used to illustrate the superior performance of
   the use of a new appreciation function with RDA.

4. Relevant Information:

   -- These data are the results of a chemical analysis of
      wines grown in the same region in Italy but derived from three
      different cultivars.
      The analysis determined the quantities of 13 constituents
      found in each of the three types of wines.

   -- I think that the initial data set had around 30 variables, but
      for some reason I only have the 13 dimensional version.
      I had a list of what the 30 or so variables were, but a.)
      I lost it, and b.), I would not know which 13 variables
      are included in the set.

   -- The attributes are (donated by Riccardo Leardi,
      riclea@anchem.unige.it)
      1) Alcohol
      2) Malic acid
      3) Ash
      4) Alcalinity of ash
      5) Magnesium
      6) Total phenols
      7) Flavanoids
      8) Nonflavanoid phenols
      9) Proanthocyanins
      10) Color intensity
      11) Hue
      12) OD280/OD315 of diluted wines
      13) Proline

5. Number of Instances

   class 1  59
   class 2  71
   class 3  48

6. Number of Attributes

   13

7. For Each Attribute:

   All attributes are continuous

   No statistics available, but suggest to standardise
   variables for certain uses (e.g. for use with classifiers
   which are NOT scale invariant)

   NOTE: 1st attribute is class identifier (1-3)

8. Missing Attribute Values:

   None

9. Class Distribution: number of instances per class

   class 1  59
   class 2  71
   class 3  48