A Covid Analysis

From a country-wise analysis of external factor impacts on Covid-19 cases to predictions of future vaccination rates

Deepika Vijay

Published in

Towards Data Science

17 min read · Apr 11, 2021

In this article, we are going to walk through

Relationships between variables such as a country’s poverty rate, life expectancy and stringency index, and its Covid-19 cases and deaths
Vaccination rates per country and their predictions up until the end of June
How I went about the data and the analyses (only the Jupyter notebook part)

Covid Analysis dashboard — Image by Author

For the past year, Covid-19 has captured the world’s attention, and many questions about the virus remain unanswered. Almost any careful analysis can therefore add to our understanding and offer useful insights for ourselves and potentially for society. I’m hoping the following analyses give readers a better idea of which factors likely influence the rates of infection and death from Covid-19 in a given country. Bear in mind that this doesn’t give us the full picture. In my view, literacy rates, population density, obesity rates, and cultural aspects such as social norms and behaviour all play a large role in a country’s cases and deaths, but I don’t have data for all of them, and some of these factors can’t even be measured in usable ways.

Let’s get right to the analysis, starting with the data. I got this amazing dataset from Our World in Data. The description and the source of each column are available here. I have taken the liberty of computing a few columns myself, namely country-wise vaccination rates, testing rates, infection rates and death rates. The scope for analysis with this dataset is incredible, and this article covers only a part of it!

Python snippet basic Analysis: columns list, data shape, range of dates and a groupby df — Image by Author

The data comprises the 42 columns listed above, where “location” is the country column. Each country has data ranging from 1 January 2020 to 10 April 2021. While much of the data is available almost from the beginning of this range, testing data only became available later and vaccination data later still, from late December 2020 for most advanced countries.

Impact of testing rate on the rate of covid infections and deaths
Impact of stringency index on the rate of covid infections and deaths
Impact of age on covid related deaths
Impact of poverty on life expectancy (unrelated to covid)
Vaccination rates and predicting future rates by country

1. Impact of testing rate on the rate of covid infections and deaths

Does aggressive testing help curb deaths and infections, or do high rates of infections and deaths simply mean more testing?

Tableau viz: Impacts of Testing on Cases and Deaths — Image by Author

Note: Unfortunately, embedding the interactive Tableau dashboard was rather challenging. The vizes can be followed here.

Above is a Tableau visualization of infection and death rates by country, tinted by testing rate. I used “Maximum” as the aggregate because it gives the most recent observation for each of these columns. For instance, the testing rate is computed as:

testing_rate = (total_tests/population)*100

Since each total_tests value is a cumulative sum of a country’s daily tests, the maximum testing_rate is both the country’s highest and its most recent testing rate. There is no obvious pattern implying a correlation between testing and either infection or death rates. The next viz looks at this more closely.
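To see why the maximum of a cumulative column doubles as its latest value, here is a minimal sketch. The two countries and their numbers are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy cumulative test counts for two hypothetical countries (not real data).
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-04-08", "2021-04-09", "2021-04-10"] * 2),
    "total_tests": [100, 150, 210, 40, 40, 55],
    "population": [1000, 1000, 1000, 500, 500, 500],
})
toy["testing_rate"] = toy["total_tests"] / toy["population"] * 100

# Highest rate per country.
max_rate = toy.groupby("location")["testing_rate"].max()

# Most recent non-missing rate per country.
latest_rate = toy.sort_values("date").groupby("location")["testing_rate"].last()

# Because total_tests never decreases, the two agree.
print(max_rate.equals(latest_rate))
```

This only holds for columns that never decrease over time. For a column such as stringency_index, the maximum and the latest value can differ.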

Tableau viz: Impacts of Testing on Cases and Deaths 1 — Image by Author

Note that the testing rate can exceed 100% because total_tests counts tests rather than people, and one person may have been tested more than once.

To the left is a scatter plot of each country’s most recent (and highest) infection rate against its most recent (and highest) testing rate. It shows a subtle positive correlation between infection rate and testing rate, but not a strong one. The same can be said of the scatter plot on the right, which relates testing rates to death rates and appears even more scattered. Hovering over the lower points of the plot shows that they are mostly low- to middle-income countries such as Brazil, Peru, Colombia and Indonesia, indicating low testing rates in these economies. Let’s see the correlations:

Python correlations snippet — Image by Author

While it cannot be determined for certain whether high infections or deaths directly imply more testing, it is clear that factors such as a country’s economic standing could also play a role. A study published in Scientific Reports found a very low correlation between mortality and testing rates, while the correlation between testing and infection rates, although above average, is higher still in higher-income countries, according to Our World in Data as well as Scientific Reports [2]. This correlation gap could be attributed to the lack of testing resources in low-income countries, which biases the overall analysis: where there is a lack of testing, there is a lack of knowledge of existing cases. The study remedied this bias by splitting the data into subgroups of low-, middle- and high-income countries and then conducting individual regression analyses [3].
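The subgroup idea can be tried on this dataset with a rough sketch like the one below. It is not the study’s method: it computes plain Pearson correlations rather than regressions, and it assumes that quantiles of gdp_per_capita are an acceptable stand-in for income groups. The helper name corr_by_income and the toy numbers are hypothetical.

```python
import pandas as pd

def corr_by_income(latest, x, y, n_groups=3):
    """Pearson correlation of x and y within GDP-per-capita quantile groups.

    `latest` is assumed to hold one row per country.
    """
    data = latest.dropna(subset=[x, y, "gdp_per_capita"]).copy()
    # Assumption: quantiles of gdp_per_capita stand in for income groups.
    data["income_group"] = pd.qcut(data["gdp_per_capita"], q=n_groups, labels=False)
    return pd.Series(
        {grp: g[x].corr(g[y]) for grp, g in data.groupby("income_group")}
    )

# Made-up numbers purely to show the call; not real country data.
toy = pd.DataFrame({
    "gdp_per_capita": [1, 2, 3, 10, 20, 30],
    "test_rate": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "case_rate": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})
by_group = corr_by_income(toy, "test_rate", "case_rate", n_groups=2)
print(by_group)
```

In the toy data the two groups have opposite relationships, which is exactly the kind of pattern a single pooled correlation would hide.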

2. Impact of stringency index on the rate of covid infections and deaths

Does being a “stricter” country with regard to the implementation of measures in the covid era make the infections or deaths look any better? Or does having high rates of infections and deaths influence the stringency index?

Tableau viz: Impacts of Stringency Index on Cases and Deaths — Image by Author

As defined by Our World in Data, the stringency index measures how strict a country’s Covid-19 restrictions are on a scale from 0 to 100, 0 being the least strict. Stringency alone shows a rather low correlation with death or infection rates, as the points representing countries are scattered throughout the plot. Only a very small share of the variance in death and infection rates is explained by stringency, as the following correlation statistics confirm.

Python correlations snippet 1 — Image by Author

Although one would expect countries to get stricter as cases and deaths grow, stringency seems to depend largely on each country’s governing body. The countries in the strictest range (close to 100) come from around the globe, and not all of them have an overwhelmingly high rate of cases or deaths compared to the rest of the world (RoW). There is little literature on the impact of stringency on cases and deaths, potentially because the measure is so new. Countries may well impose stricter measures following a surge of Covid-19 cases; however, statistics based on this measure may not represent reality accurately. As Our World in Data put it, this is because “It does not measure or imply the appropriateness or effectiveness of a country’s response. A higher score does not necessarily mean that a country’s response is ‘better’ than others lower on the index.” [3]
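The “variance explained” wording used in this section can be made concrete. For a straight-line relationship between two variables, the share of variance explained equals the square of Pearson’s r, so a correlation of 0.2 explains only about 4% of the variance. A minimal sketch follows; the helper name and the values are made up for illustration.

```python
import numpy as np

def variance_explained(x, y):
    """Return Pearson's r and r squared for two equal-length sequences."""
    r = float(np.corrcoef(x, y)[0, 1])
    return r, r ** 2

# Made-up values for illustration only.
stringency = [40, 55, 60, 75, 90]
death_rate = [0.10, 0.02, 0.15, 0.05, 0.12]
r, r2 = variance_explained(stringency, death_rate)
print(f"r = {r:.2f}, share of variance explained = {r2:.2f}")
```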

3. Impact of age on covid related deaths

Do countries with a larger share of the population aged 65 and above see more deaths?

Tableau viz: Impacts of Aging on Cases and Deaths — Image by Author

Yes, from the above plot it is reasonable to say that there is a persistent pattern: much of the variance in death rates can be explained by the share of the population aged 65 and above. The point at the bottom right is an outlier: 27% of its population is 65 or older, and yet it has seen very few deaths compared to the RoW. This point, unsurprisingly, is Japan. Could this be attributed to their advancements in technology or simply their famous reputation for immunity and generally great health?

Python correlations snippet 2— Image by Author

The correlation statistics computed in Python confirm an above-average correlation between deaths and the percentage of the population older than 65. Various studies have claimed that age plays a vital role in Covid-19 mortality. One such article claims that “several researchers have pointed to factors such as different scales and profiles of social interactions within households, endemic infections and median population age as affecting COVID-19 risk and mortality” [4]. Older populations face greater risks of infection and of death from infection as a result of the physiological changes and underlying conditions that accompany ageing, reports the WHO [5].

4. Impact of poverty on life expectancy (unrelated to covid)

Albeit unrelated to covid, it is interesting to see how well the poverty rate is correlated to low life expectancy as in the figure below.

Tableau viz: Impacts of Poverty on Life Expectancy— Image by Author

As the chart above shows, countries with high rates of extreme poverty have low life expectancy. The darker tint of blue, indicating high rates of poverty, lies mostly in the life expectancy range of 50 to 70 years, while the majority of countries with low rates of poverty enjoy longer lives. It is alarming to see that some countries have life expectancies as low as the 50s and 60s. The following figure suggests that a large share of the variance in life expectancy can be explained by poverty rates.

Tableau viz: Impacts of Poverty on Life Expectancy 1 — Image by Author

The above chart indicates pretty strongly that lower rates of poverty explain higher life expectancy. The following correlation statistic confirms the significant negative correlation.

Python correlations snippet 3— Image by Author

A very articulate report by the Urban Institute and the Center on Society and Health explores the role that poverty plays in life expectancy and provides gripping statistics to support inferences such as “People with Lower Incomes Report Poorer Health and Have a Higher Risk of Disease and Death” [6]. Countless factors contribute to this gap between income and life expectancy, ranging from hygiene standards to the ability to seek timely medical attention to the availability of resources.

5. Vaccination rates and predicting future rates by country

Now, moving on to the long-awaited analysis of vaccination rates by country. Note that I consider only the fully vaccinated population for this computation, i.e. people who have received both first and second doses where applicable. Having studied some of the columns closely in the sections above, we understand the data well enough to do some more complex analyses.

Tableau viz: Vaccination rates by country, Geographic map — Image by Author

Although not very visible on the above map, the country (a territory, rather) with the highest vaccination rate, 87.4%, is Gibraltar, located to the south of Spain. The runner-up is Israel, which is almost 60% vaccinated. Evidently, the RoW still has a lot of catching up to do. The US, Morocco, Serbia, Chile and a few others are also slightly ahead of the RoW, with around 20% of their populations already vaccinated.

Now that we have an understanding of the current rates, I’m going to try to predict vaccination rates two months down the road. Are we about to gain freedom yet? Scientists have placed the threshold for lifting the majority of restrictions at 65–70% [1], meaning that this percentage of each country’s population must be vaccinated or immune before restrictions can be eased. I’d like to determine which countries can get there through vaccinations in the next two months, using polynomial regression.
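Checking a forecast against that threshold can be sketched as a small helper. The function name first_date_reaching and the toy forecast are hypothetical and the numbers are made up; the helper assumes the forecast is a pandas Series indexed by date in ascending order.

```python
import pandas as pd

def first_date_reaching(forecast, threshold=65.0):
    """Return the first index label where forecast >= threshold, else None."""
    reached = forecast[forecast >= threshold]
    return reached.index[0] if not reached.empty else None

# Made-up forecast values (percent fully vaccinated), for illustration only.
toy_forecast = pd.Series(
    [58.0, 61.5, 64.9, 66.2, 68.0],
    index=pd.date_range("2021-06-01", periods=5, freq="D"),
)
print(first_date_reaching(toy_forecast))        # lower bound of the 65-70% band
print(first_date_reaching(toy_forecast, 70.0))  # upper bound, never reached here
```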

Note that I am simply going by current and historic vaccination rates. By no means am I suggesting that these predictions are as good as they get. In reality, adding variables such as scheduled vaccine deliveries for each country might give better results, but I’m just going with what I have here. Without further ado, let’s dive in!

Import libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
from IPython.display import display
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import statsmodels.formula.api as smf
from statsmodels.regression.linear_model import OLS
from sklearn.linear_model import LinearRegression
import math
from math import sqrt
from sklearn.metrics import mean_squared_error
from random import random
import datetime as dt

Prepare and Analyse Dataframe

df = pd.read_csv(r'C:\Users\Deepika\OneDrive\Documents\Professional\owid-covid-data.csv')

# Keeping only relevant columns
df = df[['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases','total_deaths', 'new_deaths','reproduction_rate', 'icu_patients',
         'hosp_patients','new_tests', 'total_tests','positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
         'people_vaccinated', 'people_fully_vaccinated', 'new_vaccinations','stringency_index',
         'population', 'population_density', 'median_age', 'aged_65_older',
         'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
         'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers',
         'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
         'life_expectancy', 'human_development_index']]

# A very important step: parse the date strings into datetimes
df['date'] = pd.to_datetime(df['date'])

Creating a function for additional columns and some analysis

# Creating function for additional columns and some analyses
def analyse_df(df):
    df['case_rate'] = (df['total_cases']/df['population'])*100
    df['death_rate'] = (df['total_deaths']/df['population'])*100
    df['test_rate'] = (df['total_tests']/df['population'])*100
    df['admissions_rate'] = (df['hosp_patients']/df['population'])*100
    df['critical_rate'] = (df['icu_patients']/df['population'])*100
    df['vaccination_rate'] = (df['people_fully_vaccinated']/df['population'])*100
    print('Columns: ', df.columns)
    print('Dataframe shape: ', df.shape)
    print('Date Range', df['date'].min(), df['date'].max())
    # Get some stats for each country using groupby
    stats_df = df.groupby('location')[['date','case_rate','death_rate','test_rate','vaccination_rate',
                                       'admissions_rate','critical_rate','stringency_index',
                                       'population']].agg({'date':['max', 'count'],
                 'case_rate':'max','death_rate':'max','test_rate':'max','vaccination_rate':'max',
                 'admissions_rate':'mean','critical_rate':'mean','stringency_index':'mean','population':'mean'})
    display(stats_df)
    # Correlations between the numeric columns only (recent pandas versions raise on text columns)
    corr = df.select_dtypes('number').corr()
    display(corr)
    # Set the figure size before drawing the heatmap so that it takes effect
    rcParams['figure.figsize'] = 12,8
    sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=True)
    plt.xticks(rotation=45)
    return df, stats_df

Calling the function

# Call once and unpack both outputs, so the stats are not computed and printed twice
df, stats_df = analyse_df(df)

The above code snippet returns a bunch of stats about some interesting rates in each country and how well or poorly each variable is correlated with another. We then have to save the data frame (as df here) with the newly created columns for further analyses. The below GIF provides a quick peek into this output.

Python analysis: df properties, correlation and correlation plot, column-wise grouped stats — Image by Author

Build Regression Model

I chose the polynomial model for this particular case as it is the most straightforward way to analyse and predict the upward trending behaviour of the vaccination rates. I did try a more advanced time-series analysis with the ARIMA model and found the results of the polynomial model to be more realistic and accurate. In the following section, I have shared the code of my analysis. Be sure to read the comments explaining each and every step along the way.

Create the model function

def poly(name, group):
    # Work on a copy so the caller's filtered slice is not modified
    group = group.copy()
    # Transform the date into an integer to be able to fit it into the model
    group['date_transformed'] = group['date'].map(dt.datetime.toordinal)
    # Create a range to tell the model later which dates to predict for.
    # I want to predict for a range that is about 10 points more than half the number of observations in the input data.
    Range = group['date_transformed'].max() + round(len(group)/2) + 10
    predict_dates = list(range(group['date_transformed'].max() + 1, Range))

    # Build the model
    # Make sure to transform the input data
    x = group['date_transformed'].values[:,np.newaxis]
    y = group['vaccination_rate'].values
    polynomial_features = PolynomialFeatures(degree=2)
    x_poly = polynomial_features.fit_transform(x)
    model = LinearRegression()
    model.fit(x_poly, y)
    # Test the model and its accuracy (on the data it was fitted to)
    y_poly_pred = model.predict(x_poly)
    rmse = np.sqrt(mean_squared_error(y,y_poly_pred))
    r2 = r2_score(y,y_poly_pred)
    # Save the predictions as a column of the input data
    group['Pred'] = y_poly_pred
    group_export = group[['date','vaccination_rate','Pred']].set_index('date')
    # View results
    print(name)
    print('rmse: ', rmse)
    print('r2: ', r2)
    return model, polynomial_features, predict_dates, group_export

Model results

Group the data by location (countries).

# Create grouped data for access later
df_grouped = df.groupby(['iso_code','continent','location'])

Apply the model to each group of the grouped data to get country-wise predictions.

# Dictionaries to save the results of the model
dct_original = {}
dct_future = {}

# Access each country's data separately
for name, group in df_grouped:
    # Make sure to keep only rows without NaN values in people_fully_vaccinated
    group1 = group[group['people_fully_vaccinated'].notna()]
    # Countries with more than 50 vaccination data points for better predictions
    if len(group1) > 50:
        # Fit the model once and unpack the outputs of the function
        model, polynomial_features, predict_dates, group_export = poly(name, group1)
        group_export['Location'] = name[2]
        # Future predictions for the range of dates specified in the function. Again, remember to transform the input
        Predictions = model.predict(polynomial_features.transform(np.array(predict_dates).reshape(-1,1)))
        # Putting the predictions and dates into a dataframe
        Predictions_df = pd.DataFrame({'Future_dates': list(predict_dates),'Predictions': list(Predictions)})
        # Converting the transformed dates to original date format
        Predictions_df = Predictions_df.set_index(Predictions_df['Future_dates'].map(dt.datetime.fromordinal))
        # Add country to the dataframe to identify the data
        Predictions_df['Location'] = name[2]

        # Save input data predictions and future predictions into dictionaries to access later
        dct_original[name] = group_export
        dct_future[name] = Predictions_df
        # Plot current observed, predicted and future predicted data
        plt.figure(figsize=(10,5))
        plt.xticks(rotation=45)
        plt.title('Model ' + name[2])
        plt.xlabel('Date', fontsize=11)
        plt.ylabel('Vaccination Rate', fontsize=11)
        plt.scatter(group_export.index, group_export['vaccination_rate'])
        plt.plot(group_export['Pred'], color = 'g')
        plt.plot(Predictions_df[['Predictions']], color = 'm')
        plt.legend(['Validation', 'Predictions', 'Train'], loc='lower right')
        plt.show()
        # View the Actual vs Predicted data and their data count
        print('Observations in Actual Data = %d, Predicted Observations=%d' % (len(group1), len(Predictions)))
        print("\n".join("{} {}".format(x, y) for x, y in zip(predict_dates, Predictions)))

Sample outputs of some of the 40 countries with > 50 vaccination datapoints

From the above prediction for the UK, we can see that the RMSE is low relative to the range of the vaccination rate data, and the R² indicates a close fit. The prediction says that about 30% of the UK could be vaccinated by the beginning of June.

The RMSE for Chile is reasonably low and the R² suggests a good fit. The predictions indicate that potentially 90% of the population could be vaccinated by the beginning of June.

Now for the US predictions: nearly half of the US population could be vaccinated by the beginning of June. Although this might be a reasonable forecast, models with a very low RMSE and a very high R² on the data they were fitted to might be overfitted. Overfitting occurs when the fitted values lie very close to the observed data but the forecasts for unseen data points (test or future data) are unrealistic or far-fetched. In this particular scenario, the model yields plausible results despite the very close fit, and 50% seems attainable given the current rate. Still, models that fit the training points this closely often wind up overfitted. One way to remedy overfitting in regression models is to lower the degree of the polynomial. This model uses a degree of 2, the lowest degree that still gives a curved (non-linear) fit.

# Polynomial degree used before 
polynomial_features = PolynomialFeatures(degree=2)

Model alternative

Alternatively, to get more general predictions and avoid possible overfitting, we could change the degree from 2 to 1, which makes it a linear regression model. Given the somewhat linear shape of the plot, it is reasonable to expect a linear model to suffice for a trend like this. So let’s see if the linear model predicts any differently for this particular country. I am going to replace the polynomial model for the US only. We will use the same code structure as for the polynomial model, except that this time we will not use “polynomial_features”.

for name, group in df_grouped:
    if name[2] == "United States": # Only for the US
        # Copy so that adding columns does not modify a slice of df
        group1 = group[group['people_fully_vaccinated'].notna()].copy()
        group1['date_transformed'] = group1['date'].map(dt.datetime.toordinal) # transform date column into integer to be able to build model
        Range = group1['date_transformed'].max() + round(len(group1)/2) + 10
        predict_dates = list(range(group1['date_transformed'].max() + 1, Range)) # create a range of dates to make future predictions
        x = group1['date_transformed'].values[:,np.newaxis] # input data transformed
        y = group1['vaccination_rate'].values # input train data
        model = LinearRegression()
        model.fit(x, y) # Fitting linear regression
        y_pred = model.predict(x)
        group1['Pred'] = y_pred
        r2 = model.score(x,y) # alternatively, r-squared can also be measured this way
        rmse = np.sqrt(mean_squared_error(y, y_pred)) # same RMSE computation as in poly(); works across scikit-learn versions
        group_export = group1[['date','vaccination_rate','Pred']].set_index('date')
        Predictions = model.predict(np.array(predict_dates).reshape(-1,1))
        Predictions_df = pd.DataFrame({'Future_dates': list(predict_dates),'Predictions': list(Predictions)})
        Predictions_df = Predictions_df.set_index(Predictions_df['Future_dates'].map(dt.datetime.fromordinal))
        plt.xticks(rotation=45)
        print(name)
        print('rmse: ', rmse)
        print('r2: ', r2)
        plt.scatter(group_export.index, group_export['vaccination_rate'])
        plt.plot(group_export['Pred'], color = 'g')
        plt.plot(Predictions_df[['Predictions']], color = 'm')
        plt.legend(['Validation', 'Predictions', 'Train'], loc='lower right')
        plt.show()

The linear model shows lower accuracy statistics than the polynomial model and no longer shows symptoms of overfitting. Its forecast of 30% is less optimistic, but the key takeaway is that when overfitting is a possibility, lowering the degree might yield more general and hence more reliable predictions for unseen data points.
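One way to check, rather than assume, which degree generalizes better is to hold out the most recent observations, fit on the rest, and compare the error on the held-out tail. The sketch below is not part of the original notebook. It uses np.polyfit, which performs the same least-squares polynomial fit as PolynomialFeatures followed by LinearRegression, and it assumes days are counted from the first observation. The function name holdout_rmse and the synthetic series are hypothetical.

```python
import numpy as np

def holdout_rmse(days, rates, degree, n_holdout=14):
    """Fit a polynomial on all but the last n_holdout points and
    return the RMSE on the held-out tail."""
    days = np.asarray(days, dtype=float)
    rates = np.asarray(rates, dtype=float)
    train_x, test_x = days[:-n_holdout], days[-n_holdout:]
    train_y, test_y = rates[:-n_holdout], rates[-n_holdout:]
    coeffs = np.polyfit(train_x, train_y, deg=degree)
    predicted = np.polyval(coeffs, test_x)
    return float(np.sqrt(np.mean((predicted - test_y) ** 2)))

# Synthetic, noise-free series with a gently accelerating trend (not real data).
days = np.arange(60)
rates = 0.005 * days**2 + 0.1 * days

for degree in (1, 2):
    print(degree, holdout_rmse(days, rates, degree))
```

On this synthetic series the quadratic wins because the data really are quadratic. On real vaccination data the check can favour either degree, which is the point of running it.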

Vaccination Rates at the beginning of June

Tableau viz: Geographical map of forecast vaccination values

The above is a Tableau visualization of the final output. Since I only made predictions for countries with more than 50 days of vaccination data, the predictions are limited to the coloured countries. This filter can be adjusted to get predictions for more countries; however, doing so might compromise the accuracy.
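To experiment with that filter, it helps to see which countries pass a given cut-off before fitting anything. A minimal sketch follows; the helper name and the min_obs parameter are hypothetical, the toy rows are made up, and the column names come from the dataset.

```python
import numpy as np
import pandas as pd

def countries_with_enough_data(df, min_obs=50):
    """List the locations with more than min_obs non-missing
    people_fully_vaccinated observations."""
    counts = df.groupby("location")["people_fully_vaccinated"].count()
    return counts[counts > min_obs].index.tolist()

# Made-up rows for illustration: A has 3 observations, B has only 1.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "people_fully_vaccinated": [10.0, 20.0, 30.0, np.nan, np.nan, 5.0],
})
print(countries_with_enough_data(toy, min_obs=2))
```

The comparison uses `>` to match the `len(group1) > 50` check in the loop above.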

A comparable visualization by The Economist [8] boldly highlights the disparity between the rich and the poor with regard to vaccination gains.

Limitations of Model-based approach

While it is often tempting to blindly follow a model-based approach, we must bear in mind that all forecasts are based on assumptions and this one is no exception. This holds especially true for forecasting in the medical, economic, political, social or behavioural context over which we have little to no control. For this particular model, for instance, the following are the limitations:

As mentioned before, the model doesn’t factor in scheduled vaccine deliveries per country, which would likely improve the predictions.
There is certainly not enough Covid-19 vaccination data out there for a supervised model to be fully reliable. As seen in the geographic visualization of this model, many countries don’t even have 50 observations of vaccination data yet.
This model assumes consistency in vaccine roll-outs. However, for many countries, this assumption has proven false. A country’s roll-out consistency is affected by the ever-changing rules of the Covid era. Consider Denmark’s halt of the AstraZeneca vaccine, for instance [7]. The impact of such unexpected external factors can rarely be built into a machine learning model.
With the many variants that have been discovered, questions persist about whether vaccination means immunity. Whether the outcomes of the current predictive models translate into immunity is not known. Hence this study treats the vaccination rate simply as a vaccination rate, not as immunity.

Despite its limitations, a good model-based approach is, in many cases, better than blind intuition and guesswork. It may not give us perfect answers, but it can give us an idea of what to expect and help us be better prepared and make better decisions. The ideal outcome is probably the combination of a robust model and a touch of educated intuition.

If you did follow along and manage to acquire your own predictions, great! I hope they are looking as optimistic! Again, here’s the data and here are the Tableau vizes. Thanks for reading!

References

[1] D. Gypsyamber and D. David, What is Herd Immunity and How Can We Achieve It With COVID-19? (Apr, 2021), Johns Hopkins Bloomberg School of Public Health

[2] R. Hannah et al., Coronavirus (COVID-19) Testing (Apr, 2021), Our World in Data

[3] L. Li-Lin, T. Ching-Hung, H. Hsiu J & W. Chun-Ying, Covid-19 mortality is negatively associated with test number and government effectiveness (Jul, 2020), Scientific Reports

[4] T. Liji, Countries with older populations have higher SARS-CoV-2 infections and deaths, says study (Feb, 2021), News Medical Life Sciences

[5] WHO, Supporting older people during the COVID-19 pandemic is everyone’s business, (Apr, 2020), World Health Organization

[6] H. W. Steven, How Are Income and Wealth Linked to Health and Longevity? (Apr, 2015), Urban Institute & Center on Society and Health

[7] BBC, AstraZeneca vaccine: Denmark stops rollout completely (Apr, 2021), BBC News

[8] The EIU, The EIU’s latest vaccine rollout forecasts (Mar, 2021), The Economist