# exploratory data analysis and models on the epi dataset

date: 2025-10-13

## dataset and choices

- **file**: `epi_results_2024_pop_gdp_v2.csv`
- **region column**: `region`
- **response var**: `EPI.new`
- **regions**: `Sub-Saharan Africa` vs `Latin America & Caribbean`

## 1) variable distributions

### 1.1 boxplots and histograms (with density!)

![](figures/box_Sub-Saharan_Africa_EPI.new.png)

![](figures/box_Latin_America_Caribbean_EPI.new.png)

![](figures/hist_Sub-Saharan_Africa_EPI.new.png)

![](figures/hist_Latin_America_Caribbean_EPI.new.png)

### 1.2 qq plot (two-sample)

![](figures/qq_EPI.new_Sub-Saharan_Africa_vs_Latin_America_Caribbean.png)

## 2) linear models

### full: EPI.new ~ gdp

### full: EPI.new ~ gdp + population

### 2.2 same models on one region (comparison)

on region `Sub-Saharan Africa`, the better model is **region Sub-Saharan Africa: EPI.new ~ gdp + population** (r²=0.361, aic=265.4, bic=272.7).

## 3) classification (knn, label = region)

### model A

- **k**: 5 | **accuracy**: 0.5581 | **test n**: 43

variables: `c("AGR.new", "AIR.new", "APO.new")`

![](figures/knn_confusion_model_A.png)

### model B

- **k**: 5 | **accuracy**: 0.5116 | **test n**: 43

variables: `c("BCA.new", "BDH.new", "CBP.new")`

![](figures/knn_confusion_model_B.png)
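
a minimal python sketch of how a k=5 nearest-neighbour accuracy like the ones above can be computed. this is not the code that produced the report: the toy rows, the helper names `knn_predict` and `accuracy`, and the use of unscaled euclidean distance are all assumptions made for illustration. the last lines only check arithmetic: with a test set of 43 rows, the reported accuracies of 0.5581 and 0.5116 are consistent with 24 and 22 correct predictions.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training rows closest to x (Euclidean distance)."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Share of test rows whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# hypothetical toy rows (NOT the epi data): three features per row, two regions
train_X = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0),
           (10.0, 10.0, 10.0), (10.0, 11.0, 10.0), (11.0, 10.0, 10.0)]
train_y = ["Sub-Saharan Africa"] * 3 + ["Latin America & Caribbean"] * 3
test_X = [(0.5, 0.5, 0.0), (10.5, 10.5, 10.0)]
test_y = ["Sub-Saharan Africa", "Latin America & Caribbean"]

pred_y = [knn_predict(train_X, train_y, x, k=5) for x in test_X]
toy_accuracy = accuracy(test_y, pred_y)
print("toy accuracy:", toy_accuracy)

# correct-prediction counts implied by the reported accuracies with test n = 43
implied_correct = {"A": round(0.5581 * 43), "B": round(0.5116 * 43)}
print("implied correct predictions:", implied_correct)
```

with two regions, always predicting one label already scores the share of the larger class in the test set, so accuracies this close to 0.5 are best read next to the confusion matrices.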