# exploratory data analysis and models on the epi dataset

date: 2025-10-13

## dataset and choices

- **file**: `epi_results_2024_pop_gdp_v2.csv`
- **region column**: `region`
- **response var**: `EPI.new`
- **regions**: `Sub-Saharan Africa` vs `Latin America & Caribbean`

## 1) variable distributions

### 1.1 boxplots and histograms (with density!)

### 1.2 qq plot (two-sample)
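
a two-sample qq plot pairs the quantiles of one sample with the quantiles of the other at the same probability levels. the sketch below shows only that pairing step, in Python with the standard library. the function name `two_sample_qq`, the choice of nine probability levels, and the sample values are all illustrative assumptions, not values from the epi file or the code that produced this report.

```python
from statistics import quantiles

def two_sample_qq(a, b, n_points=9):
    """Pair the quantiles of two samples at the same probability levels."""
    # quantiles(data, n=k) returns k-1 cut points, so n_points + 1 gives n_points levels.
    qa = quantiles(a, n=n_points + 1, method="inclusive")
    qb = quantiles(b, n=n_points + 1, method="inclusive")
    return list(zip(qa, qb))

# Made-up values for illustration only (NOT taken from the EPI dataset).
region_a = [31.2, 35.0, 36.4, 38.9, 40.1, 42.7, 44.3, 47.8]
region_b = [41.0, 43.5, 45.2, 46.9, 48.8, 50.3, 52.6, 55.1, 57.4, 60.2]

pairs = two_sample_qq(region_a, region_b)
for qa, qb in pairs:
    # Each pair is one point on the qq plot: x = quantile of a, y = quantile of b.
    print(f"{qa:6.2f}  {qb:6.2f}")
```

if the two samples had the same distribution, the pairs would fall close to the line y = x. the samples may have different sizes, since only the quantiles are compared.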

## 2) linear models

### full: EPI.new ~ gdp

### full: EPI.new ~ gdp + population

### 2.2 same models on one region (comparison)

for region `Sub-Saharan Africa`, the better of the two models is **EPI.new ~ gdp + population** (r² = 0.361, aic = 265.4, bic = 272.7).
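
the comparison above relies on r², aic and bic. the sketch below shows one way these could be computed for an ordinary least squares fit, in Python with the standard library. it assumes the Gaussian log-likelihood and counts the error variance as one extra parameter, which is the convention I understand R's `AIC()` and `BIC()` to follow for `lm` fits. the helpers `solve` and `fit_ols` and the toy data are hypothetical and do not reproduce the numbers in this report.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Fit y ~ 1 + X by least squares; return coefficients, R^2, AIC and BIC."""
    n = len(y)
    Z = [[1.0] + list(row) for row in X]          # prepend the intercept column
    p = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    beta = solve(ZtZ, Zty)                        # normal equations
    fitted = [sum(b * v for b, v in zip(beta, z)) for z in Z]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    # Gaussian log-likelihood evaluated at the ML variance estimate rss / n.
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = p + 1                                     # coefficients plus error variance
    return {"beta": beta, "r2": 1 - rss / tss,
            "aic": 2 * k - 2 * loglik, "bic": k * math.log(n) - 2 * loglik}

# Made-up toy data (NOT from the EPI file): columns are [gdp, population].
X = [[1.2, 30.0], [2.5, 12.0], [3.1, 45.0], [4.8, 8.0],
     [5.5, 60.0], [6.9, 22.0], [8.2, 15.0], [9.4, 50.0]]
y = [33.0, 36.5, 35.2, 41.8, 39.1, 44.6, 47.9, 46.3]

m1 = fit_ols([[row[0]] for row in X], y)   # response ~ gdp
m2 = fit_ols(X, y)                         # response ~ gdp + population
for name, m in (("gdp", m1), ("gdp + population", m2)):
    print(f"{name}: r2={m['r2']:.3f} aic={m['aic']:.1f} bic={m['bic']:.1f}")
```

r² can only go up when a predictor is added, so on its own it cannot rank nested models. aic and bic add a penalty per parameter, and lower values are preferred.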

## 3) classification (knn, label = region)

### model A

- **k**: 5 | **accuracy**: 0.5581 | **test n**: 43

variables: `c("AGR.new", "AIR.new", "APO.new")`

### model B

- **k**: 5 | **accuracy**: 0.5116 | **test n**: 43

variables: `c("BCA.new", "BDH.new", "CBP.new")`
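
both models in this section use k-nearest-neighbours with k = 5 on three features and report accuracy on a held-out test set. a minimal sketch of that procedure follows, in Python with the standard library. the helper names, the 70/30 split, the labels `region_1` / `region_2`, and the synthetic points are assumptions for illustration. they are not the split, features, or data used for the numbers above.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Predict the label of x by majority vote among its k nearest training points."""
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k=5):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_points(center, label, n):
    """Draw n synthetic 3-feature points around a centre (illustrative data only)."""
    return [([rng.gauss(c, 1.0) for c in center], label) for _ in range(n)]

# Synthetic, well-separated clusters (NOT the EPI variables).
data = make_points([0, 0, 0], "region_1", 40) + make_points([10, 10, 10], "region_2", 40)
rng.shuffle(data)

split = len(data) * 7 // 10            # hypothetical 70/30 train/test split
train, test = data[:split], data[split:]
train_X, train_y = zip(*train)
test_X, test_y = zip(*test)

acc = accuracy(train_X, train_y, test_X, test_y, k=5)
print(f"k=5 accuracy={acc:.4f} test n={len(test_y)}")
```

knn works on raw euclidean distances, so features on very different scales may need standardising first. whether that was done for models A and B is not stated in this report.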