Exploring Covid-19 Dataset

Published in The Startup

4 min read · Oct 5, 2020

In this article we walk through handling the Covid-19 dataset in several steps: loading the data into a data frame, cleaning it, extracting statistics and finally building visualizations from which we can draw conclusions based on the facts in the data.

Let's begin by importing the libraries needed to download the data and load it into a data frame.

Now let's download the Covid-19 dataset from the online repository.
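The import and download steps could look like the sketch below. It assumes the opendatasets package is installed and that the dataset id is owid-covid-19-latest, a name inferred from the download directory mentioned in this article rather than stated by it. The helper name download_covid_dataset is hypothetical.

```python
DATASET_ID = "owid-covid-19-latest"  # assumed id, inferred from the directory name


def download_covid_dataset(dataset_id: str = DATASET_ID) -> None:
    """Download the dataset into a folder under the current directory."""
    # Imported here so this module still loads where opendatasets is missing.
    import opendatasets as od  # third-party: pip install opendatasets

    od.download(dataset_id)


# Usage (needs network access):
# download_covid_dataset()
```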

The dataset was downloaded using the opendatasets library provided by jovian.ml.
Let’s verify that the dataset was downloaded into the directory owid-covid-19-latest and retrieve the list of files in it.
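One way to check the download is to list the directory with the standard library. This is a minimal sketch; the helper name list_dataset_files is an assumption made for illustration.

```python
import os
from typing import List


def list_dataset_files(data_dir: str) -> List[str]:
    """Return the file names inside data_dir in alphabetical order."""
    return sorted(os.listdir(data_dir))


# Usage, once the dataset has been downloaded:
# list_dataset_files("owid-covid-19-latest")
```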

You can browse the downloaded files using the File > Open menu option in Jupyter. The dataset contains 3 files:
owid-covid-data-last-updated-timestamp.txt - the timestamp of the last update.
owid-covid-codebook.csv - a detailed description of each code used as a column name in owid-covid-data.csv, along with the source the data was collected from.
owid-covid-data.csv - the full Covid-19 data recorded for different countries.

Let’s load the CSV files using the Pandas library. We’ll use the name covid_raw_df for the data frame, to indicate that this is unprocessed data.
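Loading the main file might look like this. The path is assembled from the directory and file names mentioned above, and the helper name load_covid_data is hypothetical.

```python
import pandas as pd

# Path built from the directory and file names listed in this article.
DATA_PATH = "owid-covid-19-latest/owid-covid-data.csv"


def load_covid_data(path: str = DATA_PATH) -> pd.DataFrame:
    """Read the raw CSV into a data frame without any cleaning."""
    return pd.read_csv(path)


# Usage, once the dataset has been downloaded:
# covid_raw_df = load_covid_data()
```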

The dataset contains over 47,000 rows and 41 columns recording Covid-19 data from all over the world.
Let’s view the list of columns in the data frame.
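The size and column names can be read straight off the data frame. The sketch below uses a tiny made-up frame as a stand-in for covid_raw_df, so its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for covid_raw_df; the values are illustrative, not real data.
toy_raw_df = pd.DataFrame(
    {
        "location": ["A", "A", "B"],
        "date": ["2020-01-01", "2020-01-02", "2020-01-01"],
        "total_deaths": [0.0, 1.0, None],
    }
)

n_rows, n_cols = toy_raw_df.shape  # number of rows and columns
column_names = toy_raw_df.columns.tolist()  # column labels as a plain list
print(n_rows, n_cols)
print(column_names)
```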

We can refer to the owid-covid-codebook.csv file to see the full text of each code. The codebook file contains only three columns: column, description and source. We can therefore load it as a Pandas Series with column as the index and description as the value, omitting the source column, which is not required for this analysis. If you are interested, you can load the source column as well.

We can now use codebook_raw to retrieve the full explanation text for any column in covid_raw_df. Now let's see the explanation of the stringency_index column.
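The codebook loading and lookup described above can be sketched as follows. The column names column and description come from the article; the helper name load_codebook is an assumption.

```python
import pandas as pd


def load_codebook(path: str) -> pd.Series:
    """Load the codebook as a Series mapping column name to description."""
    codebook_df = pd.read_csv(path, index_col="column")
    # Keep only the description; the source column is dropped here.
    return codebook_df["description"]


# Usage, once the dataset has been downloaded:
# codebook_raw = load_codebook("owid-covid-19-latest/owid-covid-codebook.csv")
# codebook_raw["stringency_index"]
```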

While the Covid-19 dataset contains a wealth of information, we’ll limit our analysis to the following question:
Which countries were worst affected by the pandemic, and how did those countries deal with it?
Let’s select a subset of columns with the relevant data for our analysis.

Let’s extract a copy of the data from these columns into a new data frame covid_df, which we can continue to modify further without affecting the original data frame.
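Selecting and copying the columns might look like the sketch below. The article does not list the exact columns it kept, so selected_columns is an assumed selection that fits the question above, and select_columns is a hypothetical helper.

```python
import pandas as pd

# Assumed selection; the article does not list the exact columns it kept.
selected_columns = [
    "location", "date", "total_cases", "new_cases",
    "total_deaths", "new_deaths", "stringency_index",
]


def select_columns(raw_df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return an independent copy holding only the requested columns."""
    return raw_df[list(columns)].copy()


# Usage:
# covid_df = select_columns(covid_raw_df, selected_columns)
```

The .copy() call matters: without it, later edits to covid_df could raise warnings or leak back into covid_raw_df.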

Let’s view some basic information about the data frame.

Most of the columns have the data type float64. Apart from the location and date columns, every column contains some empty values, since its Non-Null count is lower than the total number of rows (47749). We’ll need to deal with empty values and, where required, manually adjust the data type of each column on a case-by-case basis. Only the date column, which has the object data type, needs to be converted to a datetime type.
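The inspection and the date conversion can be sketched on a small made-up frame. The values below are illustrative stand-ins for covid_df, not real data.

```python
import pandas as pd

# Toy stand-in for covid_df; the values are illustrative, not real data.
toy_df = pd.DataFrame(
    {
        "location": ["A", "A"],
        "date": ["2020-03-01", "2020-03-02"],
        "new_deaths": [None, 2.0],
    }
)

toy_df.info()  # prints the dtype and non-null count of each column

# Parse the date strings; the other columns are left untouched.
toy_df["date"] = pd.to_datetime(toy_df["date"])

# Count missing values per column to decide how to handle each one.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```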

Let's start the visualizations by importing matplotlib and seaborn.

To find the worst affected countries and answer this question, I have considered the following factors:

1. To identify the 5 worst affected countries, the total_deaths recorded is considered.
2. To gauge the action those countries took to contain the virus, such as shutting down malls and schools, the stringency index is considered.
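The first factor can be computed roughly as follows. This is a sketch, not the original notebook code: the helper name worst_affected is hypothetical, and the list of aggregate rows to exclude is an assumption about the data.

```python
import pandas as pd

# Rows assumed to be aggregates rather than individual countries.
AGGREGATE_LOCATIONS = ("World", "International")


def worst_affected(covid_df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n locations with the highest cumulative death count."""
    countries_df = covid_df[~covid_df["location"].isin(AGGREGATE_LOCATIONS)]
    # total_deaths is cumulative, so the maximum per location is its latest total.
    latest_totals = countries_df.groupby("location")["total_deaths"].max()
    return latest_totals.nlargest(n)


# Usage:
# worst_affected(covid_df).index.tolist()
```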

For the Python code behind the visualizations discussed below, please use the view file option.
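A chart comparing monthly deaths with the stringency index could be produced along these lines. This is not the original notebook code: it is a sketch that assumes the date column has already been converted to datetime, uses plain matplotlib, and introduces the hypothetical helpers monthly_summary and plot_country.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd


def monthly_summary(covid_df: pd.DataFrame, country: str) -> pd.DataFrame:
    """Monthly death totals and mean stringency index for one country."""
    country_df = covid_df[covid_df["location"] == country]
    month = country_df["date"].dt.to_period("M")  # requires a datetime column
    return country_df.groupby(month).agg(
        new_deaths=("new_deaths", "sum"),
        stringency_index=("stringency_index", "mean"),
    )


def plot_country(covid_df: pd.DataFrame, country: str):
    """Bars for monthly deaths, with the stringency index on a second axis."""
    monthly = monthly_summary(covid_df, country)
    labels = [str(period) for period in monthly.index]  # e.g. "2020-03"
    fig, deaths_ax = plt.subplots(figsize=(8, 4))
    deaths_ax.bar(labels, monthly["new_deaths"], color="tab:red", alpha=0.6)
    deaths_ax.set_ylabel("Deaths per month")
    index_ax = deaths_ax.twinx()
    index_ax.plot(labels, monthly["stringency_index"], color="tab:blue", marker="o")
    index_ax.set_ylabel("Mean stringency index")
    index_ax.set_ylim(0, 100)
    deaths_ax.set_title(country)
    return fig


# Usage:
# fig = plot_country(covid_df, "United Kingdom")
```

Putting the two series on separate y-axes keeps the stringency index readable even when monthly deaths run into the thousands.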

United States: This country responded with a stringency index mostly in the range of 60 to 65 and curbed the deaths occurring every month: both deaths and cases decrease month by month. We can say that this country has been handling the Covid-19 crisis effectively since August.

Brazil: This country responded with a stringency index mostly in the range of 75 to 80 but failed to bring down the death rate. There is a slight dip from August onward; hopefully this continues and the country soon becomes effective in handling the Covid-19 crisis.

India: This country responded with a stringency index mostly in the range of 70 to 100 but failed to contain Covid-19 related deaths, with its huge population as a factor, despite reaching a stringency index of 100 in April. Let us all hope that the death rate in this country decreases and that it becomes better at handling the crisis.

Mexico: This country's response is similar to that of Brazil; the only difference is that the number of deaths is lower than in Brazil.

United Kingdom: This country responded with a stringency index mostly in the range of 65 to 75. It has flattened the curve of deaths, which indicates that it has handled the crisis effectively. On the other hand, Covid-19 cases are on the rise again, i.e. a second wave, which we discuss in the next section.

The second wave of a pandemic is not a distinctly defined stage. It is generally taken to be the stage when the spread of the disease appears to be contained before infections start shooting up in a different group of the population or in a different locality. As of now, we see this curve in the United Kingdom.

Conclusion:

We have looked at the stringency curves of the worst affected countries, compared them with the death rate curves and asked whether each country's response had any effect on its handling of the Covid-19 crisis. The answer turned out to be yes: the response did have an effect in countries like the United States and the United Kingdom, but not in India, even though it implemented strict lockdowns. Finally, we described the second wave of Covid-19.


Thanks to the jovian.ml team for the opportunity to learn effectively through the zero-to-pandas course.


Written by Mahesh Varma