Data visualization with Pandas and Matplotlib
COVID-19 data from WHO
Pandas and Matplotlib are two Python libraries that are helpful for data manipulation (cleaning and analysis) as well as visualization. In particular, data visualization is an important skill to express (scientific) ideas effectively. In this tutorial, I try to cover a few basic skills of data visualization using COVID-19 data from WHO (https://covid19.who.int/data/).
“Daily cases and deaths by date reported to WHO” and “Vaccination data” will be used in this practical activity.
An important first step before you start the data project: Data inspection
The most important step before starting a data analysis is to understand your dataset, especially its metadata (the variable definitions). In other words, the “metadata table” is the guidebook (sometimes called a codebook) for a data scientist. For example, the metadata table of “Daily cases and deaths by date reported to WHO” is shown below.
“Field name” is the variable name that we will use in the following analysis. “Type” is the data type of each variable, which determines the operations we can apply during the analysis. For example, a type of “Integer” can mean that the variable is continuous, so the corresponding analyses can be applied. However, an “Integer” can also be an ordinal variable representing a scale (e.g., a pain score between 0 and 10); in this case, we have to look at the “Description” column to understand the meaning of the variable. Another key point to look for in the “Description” is how the variable was generated. For example, it states clearly how “New_cases” is calculated, which later affects the interpretation of the analysis results.
Data inspection using Python
After downloading the data (provided as a CSV file), we import it into a Jupyter Notebook.
First, we import a few libraries and read the csv file into a Pandas data frame
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import collections
from operator import is_not
from functools import partial
# read the file
df = pd.read_csv("WHO-COVID-19-global-data.csv")
df.head()
We inspect the data types and the number of entries by using
df.info()
A few things to check here.
1. Column: variable name
2. Non-Null Count: check if there is any missing data (there is none if the non-null count equals the number of entries)
3. Dtype: data types; object as string data, int64 as numerical data
Also, we can use this code to see the number of rows and columns in the dataframe
# number of rows, number of columns
df.shape
# (210930, 8)
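The checks listed above can also be done programmatically. The following is a minimal sketch on a small made-up table (the values are illustrative and are not WHO figures); it counts missing values per column and converts the date strings into real datetime values, which is an optional extra step that the rest of this tutorial does not rely on.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up, not WHO figures.
toy = pd.DataFrame({
    "Date_reported": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "Country": ["A", "A", "B"],
    "New_cases": [10, np.nan, 5],
})

# Number of missing values per column (complements the Non-Null Count of df.info())
missing_per_column = toy.isna().sum()

# Convert the date strings into datetime values so they sort and plot as dates
toy["Date_reported"] = pd.to_datetime(toy["Date_reported"])

print(missing_per_column)
print(toy.dtypes)
```

On the real data, the same two calls would be applied to df instead of the toy table.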
New cases by day in the previous 30 days: Line graph
# get the latest 30 days of each country
df = df.sort_values(by=["Country", "Date_reported"], ascending=[True, False])
# after sorting, the first 30 rows of each country are its latest 30 days
df2 = df.groupby('Country').head(30).reset_index(drop=True)
# check if all countries are included after getting the latest 30-day data
print("the original number of countries: ", df.Country.nunique())
print("the number of countries after getting the latest 30 days: ", df2.Country.nunique())
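To see what the sort-then-select step does, here is a small sketch on made-up data. It keeps the first n rows of each group with groupby().head(n), and the window size n_days is a hypothetical parameter chosen for the toy example (the tutorial uses 30).

```python
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Date_reported": ["2022-01-01", "2022-01-02", "2022-01-03",
                      "2022-01-01", "2022-01-02", "2022-01-03"],
    "New_cases": [1, 2, 3, 10, 20, 30],
})

n_days = 2  # hypothetical window size; the tutorial uses 30

# Sort so that the newest date comes first within each country,
# then keep the first n_days rows of every group.
toy_sorted = toy.sort_values(by=["Country", "Date_reported"], ascending=[True, False])
latest = toy_sorted.groupby("Country").head(n_days).reset_index(drop=True)

# Sanity check: every country should contribute exactly n_days rows.
rows_per_country = latest.groupby("Country").size()
print(latest)
print(rows_per_country)
```

Sorting the ISO-formatted date strings works here because the year-month-day layout sorts in chronological order even as text.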
Sum up the cases by day and plot the trend
# groupby(["the column you want to sum up by"]).sum()
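# numeric_only=True sums the numeric columns and skips text columns such as Country
df3 = df2.groupby(['Date_reported']).sum(numeric_only=True).reset_index()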
df3.head()
# plot
plt.rcParams["figure.figsize"] = (20, 10) # set figure size (before the figure is created)
ax = plt.gca()
plt.plot(df3["Date_reported"], df3['New_cases']) # plt.plot(x, y)
ax.set_xticks(ax.get_xticks()[::7]) # show date every 7 days
plt.xticks(fontsize=15) # change fontsize of the x ticks
### add legends
# add legends inside the figure
plt.legend(labels=["New cases"], loc="best", fontsize=10)
# add legends outside the figure
# plt.legend(labels=["New cases"], bbox_to_anchor=(1.2, 0))
### save the figure
plt.savefig("new cases.jpg", dpi=200, bbox_inches="tight")
### show the figure
plt.show()
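When the dates are stored as datetime values rather than strings, Matplotlib can place the ticks by calendar interval instead of by position. The sketch below is one possible variant, not the method used above: it uses made-up daily totals, the object-oriented interface (fig, ax), and the DayLocator and DateFormatter helpers from matplotlib.dates. The "Agg" backend line is only there so the sketch runs without a display; in a Jupyter Notebook it can be removed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Toy daily totals for illustration only (made-up values).
dates = pd.date_range("2022-01-01", periods=30, freq="D")
daily = pd.DataFrame({"Date_reported": dates, "New_cases": range(100, 130)})

# The figure size is set when the figure is created
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(daily["Date_reported"], daily["New_cases"], label="New cases")

# Put one tick every 7 days and format the labels as year-month-day
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
ax.legend(loc="best")

fig.canvas.draw()  # render the figure once
```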
New cases by country: Bar chart
df3 = df2.groupby(['Country'])[["New_cases"]].sum().sort_values(by=["New_cases"], ascending=False).reset_index()
# top 10 countries
df3.head(10)
Show the new cases of the top 10 countries
plt.bar(df3['Country'][:10], df3['New_cases'][:10], width=0.6, color = "#069AF3")
plt.show()
We can change the bar color by specifying the “color” argument in plt.bar(). A list of available colors can be found at https://matplotlib.org/3.5.0/tutorials/colors/colors.html.
New cases by region: Pie chart
We first sum up the cases by region
# select the column before summing so that text columns are not summed
df6 = df.groupby(['WHO_region'])[["New_cases"]].sum().sort_values(by=["New_cases"], ascending=False).reset_index()
df6 = df6.transpose()
cols = list(df6.loc["WHO_region"].values)
df6.columns = cols
df6 = df6.reset_index(drop=True)
df6
Since “Other” is not defined in the metadata table, we delete this column.
df6 = df6.drop(columns=["Other"])
Draw the pie plot
data = df6.iloc[1,:].values
plt.pie(data, autopct='%1.4f%%', colors = ["#7BC8F6", "#E6DAA6", "#13EAC9", "#FF796C", "#FFD700", "#FF81C0"])
plt.legend(loc="center right", labels=list(df6.columns))
plt.show()
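The pie values can also be prepared without transposing the dataframe. The following sketch, on made-up data, keeps the regional sums as a Series, drops the “Other” category, and computes the percentage shares that autopct would print on the chart. It is an alternative shown for comparison, and the numbers are illustrative only.

```python
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "WHO_region": ["EURO", "EURO", "AMRO", "Other", "AFRO"],
    "New_cases": [40, 20, 30, 5, 10],
})

# Sum per region, drop the undefined "Other" category, and sort descending
by_region = (
    toy.groupby("WHO_region")["New_cases"].sum()
    .drop("Other")
    .sort_values(ascending=False)
)

# Share of each region in percent, i.e. what autopct prints on the pie
share = by_region / by_region.sum() * 100
print(share.round(1))

# The Series can then be passed to Matplotlib, for example:
# plt.pie(by_region.values, labels=by_region.index, autopct="%1.1f%%")
```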
Vaccination data: manipulation of string data
vac = pd.read_csv("vaccination-data.csv")
In this practice, we want to observe the distribution of vaccine types. In this dataset, there are two columns we need.
NUMBER_VACCINES_TYPES_USED indicates the number of vaccines used in the country and VACCINES_USED is a string variable specifying all the vaccine names.
Note that in the VACCINES_USED column, all vaccine names are separated by the symbol “,”. In Pandas, we can split a string value into multiple columns. However, we first need to know how many columns we need, which is the maximum number of vaccines used among all countries.
We can get such information by using the following code
vac.NUMBER_VACCINES_TYPES_USED.max()
# 12.0
There are at most 12 types of vaccines used in one country. Now let’s split VACCINES_USED into multiple columns, each containing one type of vaccine.
# n=11 allows at most 11 splits, i.e. at most 12 columns
vac[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']] = vac['VACCINES_USED'].str.split(',', n=11, expand=True)
# create a new dataframe containing only the vaccine types
vac_used = vac[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']]
vac_used.head()
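An alternative to fixed columns is a long format with one row per country and vaccine, which does not require knowing the maximum number of vaccines in advance. The sketch below uses made-up vaccine names and a hypothetical COUNTRY column. It also strips stray spaces around the names, on the assumption that some names might follow a comma and a space.

```python
import pandas as pd

# Toy data for illustration only (made-up vaccine names).
toy = pd.DataFrame({
    "COUNTRY": ["X", "Y", "Z"],
    "VACCINES_USED": ["Vax A,Vax B, Vax C", "Vax A", None],
})

# Split each string into a list, then give every list element its own row
long_form = (
    toy.assign(VACCINE=toy["VACCINES_USED"].str.split(","))
    .explode("VACCINE")
    .dropna(subset=["VACCINE"])  # countries without any listed vaccine are left out
)

# Remove stray spaces around names, e.g. " Vax C" becomes "Vax C"
long_form["VACCINE"] = long_form["VACCINE"].str.strip()

print(long_form[["COUNTRY", "VACCINE"]])
```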
Now we can count how often each vaccine appears in this new dataframe using the following code.
vac_summary = pd.DataFrame([dict(collections.Counter(vac_used.values.flatten()))]).transpose()
vac_summary = vac_summary.rename(columns={0:"Quantity"}).reset_index()
vac_summary = vac_summary.drop([11,16]).reset_index(drop=True)
vac_summary = vac_summary.sort_values(by="Quantity", ascending = False).reset_index(drop=True)
# show top 10 most used vaccines
vac_summary.head(10)
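The same count can be sketched without dropping rows by position. In the made-up example below, the wide table is flattened into one Series, missing cells (None or NaN) are removed with dropna(), and value_counts() returns the counts already sorted in descending order. The column names Vaccine and Quantity are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide table for illustration only (made-up names); None/NaN mark unused slots.
toy_used = pd.DataFrame({
    "A": ["Vax A", "Vax A", "Vax B"],
    "B": ["Vax B", None, "Vax C"],
    "C": [None, np.nan, "Vax A"],
})

# Flatten all cells into one Series, drop the missing ones, and count each name
values = pd.Series(toy_used.to_numpy().ravel())
counts = values.dropna().value_counts()

# Turn the result into a two-column dataframe
vac_counts = counts.rename_axis("Vaccine").reset_index(name="Quantity")
print(vac_counts)
```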
Vaccination types by WHO regions: Filtering dataframe and the use of functions
In programming, we often write functions so that a task can be repeated easily. We already have code that counts the number of vaccines used. Now we are interested in finding the top vaccines used in each WHO region.
We can do this by filtering the vaccination dataframe. The unique names of the WHO regions can be obtained with
vac["WHO_REGION"].unique()
Define a function for extracting data by WHO regions
def find_region(region):
    # keep the region column together with the vaccine columns
    vac_used2 = vac[['WHO_REGION', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']]
    # keep only the rows of the requested region, then drop the region column
    vac_used3 = vac_used2[vac_used2["WHO_REGION"] == region][['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']].reset_index(drop=True)
    display(vac_used3.head())
    return vac_used3
Use the function
vac_used3 = find_region("EMRO")
Define a function for finding top 10 vaccines used
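def top_10_vac(df):
    # count every vaccine name, skipping empty (None) cells
    vac_summary = pd.DataFrame([dict(collections.Counter(list(filter(partial(is_not, None), df.values.flatten()))))]).transpose()
    vac_summary = vac_summary.rename(columns={0:"Quantity"}).reset_index()
    vac_summary = vac_summary.drop(vac_summary[vac_summary["index"]=="NaN"].index).reset_index(drop=True)
    # sort by count and keep the 10 most used vaccines
    vac_summary = vac_summary.sort_values(by="Quantity", ascending=False).reset_index(drop=True).head(10)
    display(vac_summary)
### use the function
top_10_vac(vac_used3)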
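The two steps, filtering by region and counting, can also be combined into one function that returns its result instead of only displaying it. The sketch below is one possible design on made-up data: the function name top_vaccines and its parameter n are hypothetical, and it works directly on the VACCINES_USED strings rather than on the split columns.

```python
import pandas as pd

def top_vaccines(data, region, n=10):
    """Return the n most frequently listed vaccines for one WHO region."""
    # Keep only the vaccine strings of the requested region
    subset = data.loc[data["WHO_REGION"] == region, "VACCINES_USED"]
    # One row per vaccine name, with surrounding spaces removed
    names = subset.dropna().str.split(",").explode().str.strip()
    # value_counts() sorts by frequency, so head(n) gives the top n
    return names.value_counts().head(n)

# Toy data for illustration only (made-up vaccine names).
toy = pd.DataFrame({
    "WHO_REGION": ["EMRO", "EMRO", "EURO"],
    "VACCINES_USED": ["Vax A,Vax B", "Vax A", "Vax C"],
})

result = top_vaccines(toy, "EMRO", n=1)
print(result)
```

Returning the result makes it possible to reuse it, for example to loop over all regions and plot one bar chart per region.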