This was my final project for STATS 401: Applied Statistical Methods II. The goal was to analyze several variables within a dataset and build a linear model in which several predictors model a response variable from that dataset. Because the regression technique was Ordinary Least Squares (OLS), I also had to check that the model met the OLS assumptions and assess the overall fit of the final model.
Date Started
March 12, 2023
Date Completed
April 18, 2023
My Responsibilities
Analyze variables within a dataset
Iterate on models with variable transformations, interactions, and additions
Test OLS assumptions of variables through diagnostic plots
Assess model fit, independence, and multicollinearity of final model
Interpret results of final model and its practical significance
Tools Used
RStudio
ggplot2
Google Suite
1. Background
Dataset Summary
This study focused on cancer mortality rates within the United States and the factors associated with them. The dataset aggregates data from several sources, including census.gov, clinicaltrials.gov, and cancer.gov, and provides county-level demographic data for a random sample of 537 American counties.
Variable Descriptions
This project required several iterations of a model built from 6 variables. The response variable was county-level cancer mortalities per 100,000 people. The predictors were county population, median income, percentage of residents living in poverty, geographic region, and state.
2. Analysis and Exploration
Initial Model and Exploration
The initial linear model used 4 predictors for the response variable: population, median income, poverty rate, and region. No variables were transformed in this model, but a histogram of population (See Figure 1) showed a heavy right skew, suggesting that a log transformation could help the model better meet the OLS assumptions. This model’s R-squared value indicates that 27% of the variance in the response can be explained by the model (See Figure 2).
Figure 1 (Population Variable Histogram)
Figure 2 (Initial Model Summary Output)
Figure 3 (Initial Model Residuals Plot)
Figure 4 (Initial Model QQ Plot)
While this was a good foundation, there were areas to improve in later iterations. When checking the OLS assumptions through diagnostic plots, the residual plot (See Figure 3) shows a random scatter of residuals above and below zero in a band of relatively constant width, signaling that the assumptions of linearity and constant variance are not violated. The QQ plot (See Figure 4), however, shows that the initial model violates the normality assumption. The points follow the reference line through the middle of the distribution, but the clear pattern of deviations at both ends means normality cannot be considered reasonably met for the initial model.
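To make the normality check concrete, here is a minimal sketch of how the points on a normal QQ plot can be computed. The residual values are hypothetical, not taken from the project, and the plotting positions (i - 0.5) / n are one common convention that may differ slightly from what a given plotting function uses.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal QQ plot."""
    n = len(residuals)
    mu, sd = mean(residuals), stdev(residuals)
    # Standardize and sort the residuals to get the sample quantiles.
    sample = sorted((r - mu) / sd for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))


# Hypothetical residuals with one large positive value (a heavy upper tail).
example_residuals = [-12.0, -7.5, -3.0, -1.0, 0.5, 2.0, 4.0, 8.0, 35.0]

for theo, samp in qq_points(example_residuals):
    print(f"theoretical={theo:6.2f}  sample={samp:6.2f}")
```

If the residuals were normal, the pairs would fall close to the line sample = theoretical. Points that pull away from that line at the ends are the tail deviations described above.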
Model Alterations
To improve the model, the region predictor was replaced with state, in the hope that the added specificity would improve the fit. The adjusted R-squared did increase, indicating improvement, but a closer look at the diagnostic plots showed that the normality assumption worsened (See Figure 5). A closer look at the dataset also revealed that sample sizes differ by state, with some states, such as Arizona, having a sample size of only 2. To avoid overfitting, each group in this model should have a sample size of at least 40, and this dataset does not meet that requirement. To model the data accurately without overfitting, the region predictor was kept instead of state.
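The group-size check described above can be sketched as a simple count per category. The threshold of 40 and the Arizona count of 2 come from the write-up. The other state labels and counts are hypothetical placeholders.

```python
from collections import Counter

MIN_GROUP_SIZE = 40  # threshold stated in the write-up


def undersized_groups(labels, minimum=MIN_GROUP_SIZE):
    """Return {group: count} for every group with fewer than `minimum` rows."""
    counts = Counter(labels)
    return {group: n for group, n in counts.items() if n < minimum}


# Hypothetical state column: only the Arizona count (2) is from the write-up.
states = ["Arizona"] * 2 + ["Texas"] * 45 + ["Ohio"] * 12

print(undersized_groups(states))  # {'Arizona': 2, 'Ohio': 12}
```

Any non-empty result signals that a categorical predictor has levels too sparse to estimate reliably under this rule of thumb.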
Another attempt to improve the model applied a log transformation to the population predictor, given its heavy right skew. After the transformation, the predictor was approximately normally distributed (See Figure 6). The log transformation improved the linearity and constant variance assumptions (See Figure 7) and slightly improved the normality assumption (See Figure 8). The adjusted R-squared value for this model is lower than that of the initial model (See Figure 9), but because this model’s residual plot better meets the regression assumptions, the log transformation was kept.
Figure 7 (Model 3 Residual Plot)
Figure 8 (Model 3 QQ Plot)
Figure 9 (Model 3 Summary Output)
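As a rough illustration of why a log transformation helps a right-skewed predictor such as population, the sketch below compares sample skewness before and after taking logs. The population values are hypothetical and chosen only to mimic a heavy right tail.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical county populations with a few very large counties.
populations = [1_000, 2_000, 5_000, 10_000, 20_000,
               50_000, 100_000, 1_000_000, 5_000_000]
log_populations = [math.log(p) for p in populations]

print("skewness, raw:", round(skewness(populations), 2))
print("skewness, log:", round(skewness(log_populations), 2))
```

A skewness much closer to zero after the transformation is consistent with the more symmetric histogram described for the log-transformed predictor.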
3. Final Model
Model Predictors and Rationale
For the 4th iteration, an interaction between log-transformed population and region was added. This interaction further improved the linearity and constant variance assumptions: of all the models, this one has the straightest red line on the residual plot and the most evenly spread variance (See Figure 10). The normality assumption also improved, as the deviations on the QQ plot were less severe than before (See Figure 11). Plotting the interaction shows that the relationship between cancer mortality rates and log-transformed population differs by region, with some regions showing positive relationships and others negative. The strength of the relationship also differs by region; for example, the Southeast has a stronger negative relationship than the West (See Figure 12). Beyond these regional differences, the summary output shows that this model has the highest adjusted R-squared value (0.27) of all the models, making it the best fit (See Figure 13). For these reasons, the interaction term was included in the final model. The conditional mean function for this model is:
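Written generally, a model with these terms takes the form below. This is a sketch under stated assumptions: the coefficient symbols are generic placeholders rather than fitted values, one region is assumed to be the baseline level, and K stands for the number of regions, which is not stated here.

```latex
\[
E[\text{Mortality} \mid X]
  = \beta_0
  + \beta_1\,\text{Income}
  + \beta_2\,\text{Poverty}
  + \beta_3\,\log(\text{Population})
  + \sum_{k=1}^{K-1} \gamma_k\,\mathbb{1}[\text{Region}=k]
  + \sum_{k=1}^{K-1} \delta_k\,\mathbb{1}[\text{Region}=k]\,\log(\text{Population})
\]
```

Under this form, the slope on log-transformed population is \(\beta_3\) in the baseline region and \(\beta_3 + \delta_k\) in region \(k\), which is how the relationship can be positive in some regions and negative in others.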
The assumptions of ordinary least squares seem to be reasonably met with this model. As shown in Figure 12, linearity and constant variance are appropriately met, as the average value of the residuals falls on a straight line and the spread of points remains about the same width throughout the plot. Normality also seems to be reasonably met (see Figure 13), as the residuals fall along a reasonably straight line on the QQ plot.
The independence assumption is also reasonably met: each observation in this dataset is a separate county, so one county's demographic data should not influence the observations for another county. The predictors were also checked for multicollinearity. The variance inflation factors (VIFs) of the final model (see Figure 15) flagged potential multicollinearity between region and the interaction term, which is expected because the interaction term is built from region. A side model was therefore run without the interaction term. In that model, none of the base predictors produced a high VIF (see Figure 16), indicating that the final model does not show evidence of problematic multicollinearity between predictors.
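A small sketch can show why an interaction term tends to inflate VIFs. It uses the definition VIF = 1 / (1 - R²), simplified to the two-predictor case where R² equals the squared Pearson correlation. The data are hypothetical, and a full VIF for the project's model would regress each predictor on all of the others.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)


# Hypothetical data: a log-population column, a 0/1 region dummy,
# and their product (the interaction term).
log_pop = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
region_dummy = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
interaction = [d * x for d, x in zip(region_dummy, log_pop)]

print("VIF, dummy vs. log_pop:    ", round(vif_two_predictors(region_dummy, log_pop), 2))
print("VIF, dummy vs. interaction:", round(vif_two_predictors(region_dummy, interaction), 2))
```

The dummy is nearly uncorrelated with the base predictor but strongly correlated with its own interaction term, so its VIF rises once the interaction enters. This is the structural pattern that motivates checking VIFs on a side model without the interaction.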
Model Fit
Overall fit was assessed with the R-squared and root mean square error (RMSE) values. The R-squared value of 0.2931 (see Figure 14) shows that about 29.31% of the variation in cancer mortalities (per 100,000 people) can be explained by the linear model with county median income, poverty rate, log-transformed population, and geographic region, along with the interaction between log-transformed population and region. The RMSE value of 23.21 (see Figure 14) means the model’s typical prediction error is around 23 mortalities per 100,000 people, which does not seem to be a bad estimate considering the units. Expressed as a share of county population, this error is about 0.023% (23.21 / 100,000). It is not a 0.023% relative error in the predicted mortality rate itself.
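For reference, the two fit measures and the unit conversion can be sketched as follows. The observed and fitted values are toy numbers, not project data. Only the 23.21 figure comes from the write-up. This RMSE divides by n, whereas the residual standard error in a regression summary divides by the residual degrees of freedom, so the two can differ slightly.

```python
import math


def r_squared(observed, fitted):
    """1 - SSE/SST: share of response variance explained by the fit."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    sst = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - sse / sst


def rmse(observed, fitted):
    """Root mean square error, dividing by n (an assumed convention)."""
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return math.sqrt(sse / len(observed))


def per_100k_to_percent_of_population(rate_per_100k):
    """Convert a count per 100,000 people into a percent of the population."""
    return rate_per_100k / 100_000 * 100


# Toy example: every fitted value misses by 0.5.
toy_observed = [1.0, 2.0, 3.0, 4.0]
toy_fitted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(toy_observed, toy_fitted), rmse(toy_observed, toy_fitted))

# 23.21 deaths per 100,000 people as a percent of population.
print(per_100k_to_percent_of_population(23.21))
```

The last conversion shows that 23.21 per 100,000 corresponds to roughly 0.023% of a county's population, which is a statement about population share rather than about relative prediction error.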
While the R-squared suggests a fairly weak model that leaves most of the variability in the data unexplained, the model still explains a meaningful portion of that variability and has a low average error in context. Cancer mortality can be influenced by numerous factors (smoking, proximity to chemicals, genetics, etc.) that are not measured in the dataset. Given the variables available, this model still provides adequate estimates for such a complex problem space.
4. Discussion
Interpretation and Significance
While median income, Southeast region, and log-transformed population are all statistically significant, statistical significance does not necessarily translate into practical significance. For the region predictor, when all other predictors are held at zero, the average cancer mortalities (per 100,000 people) are between 169 and 200 across the regions. Practically, this is not meaningful, as no county has a median income of zero, a poverty rate of zero, and a log-transformed population of zero (a population of one). For population, a 1% increase in county population is associated with a 0.04 increase in cancer mortalities (per 100,000 people), controlling for other variables. So, if the population increased by 99%, cancer mortalities would only increase by 2.76, which is small for such a steep increase. For the poverty percentage, holding other variables constant, a 1 percentage point increase in county poverty percentage is associated with a 0.02705 increase in cancer mortalities. For cancer mortalities to increase by 1 (per 100,000 people), the county poverty percentage would need to increase by 36.97 percentage points. The effect of this variable is minuscule, with only minor changes in the response despite such a steep increase in poverty percentage, demonstrating practical insignificance. For median income, with other predictors held constant, a 1 dollar increase in median income is associated with an average decrease of 0.0007211 in cancer mortalities. An increase of $10,000 in median income therefore corresponds to an average decrease of 7.2 cancer mortalities (per 100,000 people). This indicates practical significance, as this is a large change in mortality for an income increase of that size.
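The arithmetic behind these interpretations can be checked with a short sketch. The 2.76, 0.02705, and 0.0007211 figures come from the write-up. The coefficient on log-transformed population is not reported here, so the value below is back-calculated from the 99% figure and should be read as an assumption, as should the negative sign given to the income coefficient to represent a decrease.

```python
import math

# Figures quoted in the write-up.
POP_EFFECT_AT_99_PCT = 2.76   # change in mortalities for a 99% population increase
POVERTY_COEF = 0.02705        # change per 1 percentage point of poverty
INCOME_COEF = -0.0007211      # change per 1 dollar of median income (sign assumed)

# Assumption: back-calculated coefficient on log(population), not a reported value.
implied_log_pop_coef = POP_EFFECT_AT_99_PCT / math.log(1.99)


def log_predictor_effect(coef, pct_increase):
    """Change in the response when a log-transformed predictor grows by pct_increase %."""
    return coef * math.log(1 + pct_increase / 100)


effect_1_pct = log_predictor_effect(implied_log_pop_coef, 1)
poverty_points_for_one_death = 1 / POVERTY_COEF
income_effect_10k = INCOME_COEF * 10_000

print("implied log-population coefficient:", round(implied_log_pop_coef, 3))
print("effect of a 1% population increase:", round(effect_1_pct, 2))
print("poverty points per +1 mortality:   ", round(poverty_points_for_one_death, 2))
print("effect of +$10,000 median income:  ", round(income_effect_10k, 1))
```

Because the population effect scales with the log of the ratio, a 99% increase produces far less than 99 times the effect of a 1% increase.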
Concluding Remarks and Next Steps
This model does an adequate job of predicting cancer mortality rates per 100,000 people with the information and predictors provided, as it explains around 30% of the variance in the response. While relationships between region, population, and mortality rates were shown and practical significance was found for median income, more data could strengthen this model in the future. With more time, it would be useful to collect more demographic information on counties. If data were collected on county smoking rates, air quality, and cancer diagnosis rates, assessing their relationships with mortality rates and the other predictors could further strengthen the model.