Excel, Python, Tableau, ML
Have you ever wondered how the Covid-19 pandemic impacted the leading causes of death in the United States? This question inspired my latest research project, where I investigated the correlation between Covid-19-related deaths in 2020 and the leading causes of death in the country from 2014-2019. Through careful analysis of available data, I uncovered some interesting findings that shed light on the relationship between the pandemic and public health in the United States. My hope is that this study can inform future public health policies and initiatives that aim to prevent and address the leading causes of death in the country.
As someone who is passionate about public health, I have been closely following the Covid-19 pandemic and its impact on our lives. One area that particularly caught my attention is how the pandemic has affected the leading causes of death in the United States. This motivated me to conduct a study exploring the correlation between Covid-19-related deaths in 2020 and the leading causes of death in the country from 2014-2019. By delving deeper into this topic, I aimed to contribute to our understanding of how the pandemic has impacted public health in the United States. Ultimately, I hope that my research can inform future policies and initiatives aimed at improving public health and preventing unnecessary deaths in the country.
About Data
The main dataset comes from the U.S. government’s open data website and covers causes of death between 2014 and 2019. Each row shows one week’s number of deaths caused by one of the selected causes of death. We looked at this data to identify any patterns or trends in the leading causes of death. We also used two other datasets to help us with our analysis. The first was the yearly population data for these states. This information helped us understand the demographics of the populations we were analyzing and how they might affect the results.
The second additional dataset was essentially the same as the first one, but for 2020, and it included Covid-19 deaths. We compared this dataset with our original one to see if there was any correlation between the leading causes of death and Covid-19 mortality rates.
By using these three datasets together, we were able to gain a more comprehensive understanding of the relationship between Covid-19 and the leading causes of death in the United States.
Weekly No. of 13 top death causes in all US states from 2014 - 2019
16,903 Rows
2 Categorical Features
17 Numeric Features
Weekly No. of 13 death causes in all US states from 2020 - 2022
3,623 Rows
2 Categorical Features
19 Numeric Features
Estimation of population in each US state for 2010-2019
51 Rows
1 Categorical Feature
19 Numeric Features
Estimation of population in each US state for 2020
52 Rows
1 Categorical Feature
3 Numeric Features
This project is an interesting study that explores the possible link between the leading causes of death in the United States and the mortality rate of Covid-19 in 2020.
The project uses appropriate health-policy indices and measurements to understand the historical and present profiles of diseases and health issues. By doing so, the study aims to identify potential implications and relationships between these causes and the mortality rate of Covid-19 in 2020.
It's worth noting that the Covid-19 pandemic took the world by surprise in 2020, despite warnings from public health experts and risk modelers. The study also takes into account the CDC's claim that underlying conditions have contributed to the deaths caused by Covid-19.
The project is significant because it explores how the top death causes in different states in the U.S. from 2014 to 2019 could have implications for the mortality rate of Covid-19 in 2020. Overall, the study provides valuable insights into the relationship between Covid-19 and the leading causes of death in the United States, and highlights the importance of using data-driven insights to make informed decisions in health policy.
We encountered several issues when cleaning and preparing the data, including numbers stored as text, commas inside numerical values, inconsistent state names across datasets, missing values, and unnormalized raw data. Despite these challenges, we were able to successfully clean and merge the data, creating a reliable dataset for our analysis. This process was crucial in ensuring the accuracy and validity of our findings, and helped us gain a comprehensive understanding of the relationship between Covid-19 and the leading causes of death in the United States. There were also other issues with the dataset, listed here along with the steps we took:
Not comparable (normalized)
On a weekly basis
Add population data (yearly)
Aggregate data (yearly)
Aggregate the data for the years 2014-2019
Ranking death causes in each state based on the normalized data
Extracting top (n) causes in each state
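The cleaning and preparation steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the sample values, the helper functions `to_number` and `clean_state`, and the choice of "deaths per 100,000 residents" as the normalized measure are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical weekly counts in long form: one row per state, week, and cause.
# The counts arrive as text with commas, as described in the write-up.
weekly = pd.DataFrame({
    "state":  ["Texas"] * 4 + ["Ohio"] * 4,
    "year":   [2019] * 8,
    "week":   [1, 1, 2, 2, 1, 1, 2, 2],
    "cause":  ["Diseases of heart", "Malignant neoplasms"] * 4,
    "deaths": ["1,000", "800", "1,100", "900", "500", "600", "550", "650"],
})

# Hypothetical yearly population table whose state names are formatted differently.
population = pd.DataFrame({
    "state": [" texas", "OHIO "],
    "year": [2019, 2019],
    "population": ["29,000,000", "11,700,000"],
})

def to_number(series):
    """Turn text such as '1,000' into numbers; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

def clean_state(series):
    """Bring state names to one spelling so both tables can be merged."""
    return series.str.strip().str.title()

weekly["deaths"] = to_number(weekly["deaths"])
population["population"] = to_number(population["population"])
weekly["state"] = clean_state(weekly["state"])
population["state"] = clean_state(population["state"])

# Aggregate weekly counts to yearly totals, then add the yearly population.
yearly = weekly.groupby(["state", "year", "cause"], as_index=False)["deaths"].sum()
yearly = yearly.merge(population, on=["state", "year"], how="left")

# Normalize so that states of different sizes become comparable (assumed unit).
yearly["rate_per_100k"] = yearly["deaths"] / yearly["population"] * 100_000

# Aggregate over all years (here: the mean rate), rank causes per state, keep top n.
TOP_N = 1
avg = yearly.groupby(["state", "cause"], as_index=False)["rate_per_100k"].mean()
avg["rank"] = (avg.groupby("state")["rate_per_100k"]
                  .rank(ascending=False, method="first")
                  .astype(int))
top_n = avg[avg["rank"] <= TOP_N].sort_values(["state", "rank"])
print(top_n)
```

Setting `TOP_N` to a larger value keeps more causes per state; the ranking itself does not change.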
After addressing these issues, we ran some descriptive analytics:
As can be seen in these charts, the death ratio (shown by the boldness and size of the circles) stays relatively similar between 2014 and 2019, but in 2020 the ratios rose for almost every state.
And here we can see the trend of top death causes (per population) in the U.S., 2014-2022:
See how the top two causes of death are distributed among the states.
Insights from Descriptive analysis
The top two causes of death in every state:
Diseases of heart
Malignant neoplasms
Two groups of states based on their top cause of death
Our objective was to examine the leading causes of death in different states in the U.S. from 2014 to 2019, and investigate any potential links or relationships between these causes and the mortality rate of Covid-19 in 2020. In essence, we wanted to determine if the top causes of death in previous years had any impact on the Covid-19 mortality rate in 2020.
Looking at the map, there appears to be a noticeable difference in the Covid-19 death rates between the two groups. To get a better idea, we also created a chart that shows the distribution of the Covid-19 death rate in each group. One group is made up of states where the top cause of death is "Diseases of heart," while the other group is made up of states where the top cause of death is "Malignant neoplasms."
This box plot is also consistent with the pattern seen on the map. So we formulated hypotheses:
How about the second top cause? No need! The top two causes are the same in every state.
How about the Third top cause?
To explore further, we looked at the second top cause of death and found that it adds no new information: the first and second causes of death are the same two causes in every state (if the rank is ignored). So, to go further, we decided to inspect the group A states and the third top cause of death in these states.
There are 38 states in group A, and only two unique causes appear as the third top cause of death in these states: “Chronic lower respiratory diseases (J40-J47)” and “Cerebrovascular diseases (I60-I69)”. The chart below illustrates the difference in the mean Covid-19 death ratio between two groups of states:
The states with Chronic lower respiratory diseases (J40-J47) as their third top death cause (Group AC) and
the states with Cerebrovascular diseases (I60-I69) as their third top cause of death (Group AD)
Based on these results, there is no significant evidence of a difference in the average Covid-19 death ratio between these two groups of states.
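A comparison of group means like this one can be sketched with a two-sample test. The write-up does not say which test was used, so the example below assumes Welch's t-test (which does not require equal variances) and a 5% significance level. The numbers are synthetic placeholders, not the study's data.

```python
from scipy import stats

# Synthetic, illustrative Covid-19 death ratios per state (NOT the study's data).
group_ac = [82.0, 95.5, 101.2, 77.8, 110.4, 90.1]  # third cause: chronic lower respiratory diseases
group_ad = [88.3, 99.0, 84.6, 105.7, 93.2]         # third cause: cerebrovascular diseases

# H0: both groups have the same mean Covid-19 death ratio.
# equal_var=False selects Welch's t-test (an assumption for this sketch).
result = stats.ttest_ind(group_ac, group_ad, equal_var=False)

alpha = 0.05  # assumed significance level
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
if result.pvalue < alpha:
    print("Reject H0: the group means differ.")
else:
    print("Fail to reject H0: no significant evidence of a difference in means.")
```

The same call applies to the first comparison (states grouped by their top cause of death); only the two input lists change.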
How about other causes of death?
Utilizing Machine Learning Algorithms
We discovered that the average death ratio of Covid-19 in 2020 varied among states depending on their top cause of death in previous years. In this section, our objective is to use several machine learning algorithms to investigate the possible correlation between the top five causes of death in different states and their death rates due to Covid-19 in 2020. To achieve this, we will represent these top five causes as features for each state and use the Covid-19 death ratio as the target variable. Since there is a high correlation between the two variables that indicate the death rate of Covid-19, we will only use one of them as the target variable.
We have chosen the PyCaret library for Python to implement machine learning. The input dataset comprises the states as rows and the aggregation of the top five death causes as features (the unique death causes that appeared among the top five death causes in at least one state). The values represent the average death rate of each cause in the respective states. The first few rows of the transformed dataset are illustrated in the following figure:
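A transformation of this kind could be sketched in pandas as follows. The state names, rates, column names, and the target name `covid_death_ratio` are hypothetical, and `TOP_N` is set to 3 only to keep the sample small (the study used five).

```python
import pandas as pd

# Hypothetical average death rates by state and cause (long form).
rates = pd.DataFrame({
    "state": ["State A"] * 5 + ["State B"] * 5,
    "cause": ["Diseases of heart", "Malignant neoplasms",
              "Chronic lower respiratory diseases", "Cerebrovascular diseases",
              "Diabetes mellitus"] * 2,
    "rate":  [200.0, 180.0, 50.0, 40.0, 20.0,
              170.0, 190.0, 30.0, 45.0, 25.0],
})

TOP_N = 3  # the study used the top five causes

# Rank causes within each state, then collect every cause that is
# among the top n in at least one state.
rates["rank"] = rates.groupby("state")["rate"].rank(ascending=False, method="first")
selected = rates.loc[rates["rank"] <= TOP_N, "cause"].unique()

# One row per state, one column per selected cause, values = average death rate.
features = (rates[rates["cause"].isin(selected)]
            .pivot(index="state", columns="cause", values="rate"))

# Attach the target variable (hypothetical Covid-19 death ratios for 2020).
covid = pd.Series({"State A": 90.0, "State B": 120.0}, name="covid_death_ratio")
dataset = features.join(covid)
print(dataset)
```

Note that a cause selected because of one state still contributes a value for every other state, so the matrix has no gaps as long as each state reports every cause.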
Next, we built and trained regression models in PyCaret. The results are shown in the following figure.
The figure indicates that almost every model scores quite low on this dataset. To gain a better understanding, we chose the decision tree model to see whether tuning could produce better results. Although the model's performance improved after tuning, it remained far from satisfactory.
Despite the fact that the scores of the decision tree model are not ideal, one notable observation is that the most significant feature that we extracted by tuning this model is "Heart Diseases," which is consistent with the statistical model.
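The modeling steps described in this section (comparing regressors, tuning a decision tree, and reading its feature importances) might look roughly like this in PyCaret. This is a sketch under assumptions, not the project's code: it assumes PyCaret 3.x, the function name `fit_and_inspect` and its parameters are made up for illustration, and mapping importances back to column names assumes that PyCaret's preprocessing leaves the all-numeric feature columns unchanged in number and order.

```python
import pandas as pd

def fit_and_inspect(dataset: pd.DataFrame, target: str, seed: int = 42):
    """Compare regressors, tune a decision tree, and return its feature importances.

    `dataset` is expected to hold one row per state, numeric feature columns,
    and the target column (hypothetical layout for this sketch).
    """
    # Imported here so the function can be defined without PyCaret installed.
    from pycaret.regression import (
        setup, compare_models, create_model, tune_model, pull,
    )

    # Initialize the experiment; session_id fixes the random seed.
    setup(data=dataset, target=target, session_id=seed, verbose=False)

    # Cross-validated comparison of the available regressors, sorted by R2.
    compare_models(sort="R2")
    leaderboard = pull()  # score grid produced by the last PyCaret call

    # "dt" is PyCaret's ID for a decision tree regressor.
    tree = create_model("dt", verbose=False)
    tuned = tune_model(tree, optimize="R2", verbose=False)

    # Assumption: feature columns keep their number and order through preprocessing.
    feature_names = [c for c in dataset.columns if c != target]
    importances = pd.Series(tuned.feature_importances_, index=feature_names)
    return leaderboard, importances.sort_values(ascending=False)
```

A call such as `fit_and_inspect(dataset, "covid_death_ratio")` would return the comparison table and the ranked importances. With only about fifty rows (one per state), cross-validated scores are likely to be unstable, which fits the low scores reported above.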
Based on the statistical tests we conducted, we observed a significant difference between the two groups of states in terms of the Covid-19 death rate in 2020. Despite the imperfection of our machine learning approach, the feature importance analysis yielded consistent results with the statistical analysis.
This project was created by Maryam Aliakbari, with contributions from Farzin Valiloo.
Together, the team worked collaboratively to create a compelling and informative data analysis project that showcases the power of data visualization and storytelling.