Data visualization with Pandas and Matplotlib

COVID-19 data from WHO

YaLinChen (Amber)
Jun 12, 2022

Pandas and Matplotlib are two Python libraries for data manipulation (cleaning and analysis) and visualization. Data visualization in particular is an important skill for expressing (scientific) ideas effectively. In this tutorial, I cover a few basic data visualization skills using COVID-19 data from WHO (https://covid19.who.int/data/).

“Daily cases and deaths by date reported to WHO” and “Vaccination data” will be used in this practical activity.

An important first step before you start the data project: Data inspection

The most important step before starting a data analysis is to understand your dataset, especially the metadata (the variable definitions). In other words, the “metadata table” is the guidebook (sometimes called a codebook) for a data scientist. For example, “Daily cases and deaths by date reported to WHO” comes with a metadata table that lists a field name, a type, and a description for each variable.

“Field name” is the variable name that we will use in the following analysis. “Type” is the data type of each variable, which determines the operations we can apply during the analysis. For example, a type of “Integer” can mean that the variable is a continuous variable, so the corresponding analysis can be applied. However, an “Integer” can also be an ordinal variable representing a scale (e.g., a pain score between 0 and 10); in this case, we have to look at the “Description” column to understand the meaning of the variable. Another key point to look for in the “Description” is how the variable was generated. For example, it states clearly how “New_cases” is calculated, which later affects the interpretation of the analysis results.
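The point about integer types can be illustrated with a small sketch. The table below is made up for this purpose (the column names `new_cases` and `pain_score` are hypothetical and do not come from the WHO file); it only shows that two columns with the same data type can mean very different things.

```python
import pandas as pd

# Hypothetical toy data: both columns are stored as integers
toy = pd.DataFrame({
    "new_cases": [0, 15, 230, 1200, 87, 4031],  # a count
    "pain_score": [0, 3, 7, 10, 3, 5],          # an ordinal 0-10 scale
})

# Both columns report the same integer dtype
print(toy.dtypes)

# The number of distinct values and the value range give a first hint,
# but only the metadata description can confirm what a column means
summary = toy.agg(["nunique", "min", "max"])
print(summary)
```

A small value range with few distinct values suggests a scale or a code, but this is only a hint; the “Description” column remains the reference.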

Data inspection using Python

After downloading the data (provided as a CSV file), we import it into a Jupyter Notebook.

First, we import a few libraries and read the CSV file into a Pandas data frame.

# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import collections
from operator import is_not
from functools import partial
# read the file
df = pd.read_csv("WHO-COVID-19-global-data.csv")
df.head()

We inspect the data types and the number of non-missing values in each column with

df.info()

A few things to check here.
1. Column: variable name
2. Non-Null Count: check for missing data (there is none if the non-null count equals the number of entries)
3. Dtype: data type; object usually means string data, int64 means integer data
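The missing-data check in item 2 can also be done directly. The following is a minimal sketch on a made-up table (the values are hypothetical); `isna()` marks missing cells, and summing or averaging the result gives the count or share per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in "New_cases"
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "New_cases": [10, np.nan, 5, 7],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

On the real data, the same two lines can be run on `df` instead of `toy`.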

Also, we can use this code to see the number of rows and columns in the dataframe

# number of rows, number of columns
df.shape
#(210930, 8)

New cases by day in the previous 30 days: Line graph

# get the latest 30 days of each country
df = df.sort_values(by=["Country", "Date_reported"], ascending=[True, False])
df2 = df.groupby("Country").head(30).reset_index(drop=True)
# check that all countries are still included after keeping the latest 30 days
print("the original number of countries: ", df.Country.nunique())
print("the number of countries after keeping the latest 30 days: ", df2.Country.nunique())
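To see what the sort-then-take-the-first-rows step does, here is a small sketch on made-up data that reuses the WHO column names. Parsing the dates with `pd.to_datetime` is an addition of mine: ISO-formatted strings such as 2022-06-01 already sort correctly as text, but other date formats would not.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the WHO file
toy = pd.DataFrame({
    "Date_reported": ["2022-06-01", "2022-06-02", "2022-06-03",
                      "2022-06-01", "2022-06-02", "2022-06-03"],
    "Country": ["A", "A", "A", "B", "B", "B"],
    "New_cases": [1, 2, 3, 10, 20, 30],
})

# Parse the dates so that sorting is chronological for any date format
toy["Date_reported"] = pd.to_datetime(toy["Date_reported"])

# Newest rows first within each country, then keep the first rows per group
n_days = 2
latest = (
    toy.sort_values(["Country", "Date_reported"], ascending=[True, False])
       .groupby("Country")
       .head(n_days)
       .reset_index(drop=True)
)
print(latest)
```

Each country keeps only its two most recent rows here; with `n_days = 30` the same pattern gives the latest 30 days.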

Sum up the cases by day and plot the trend

# groupby(["the column you want to sum up by"]).sum()
df3 = df2.groupby(["Date_reported"]).sum(numeric_only=True).reset_index()
df3.head()

The result sums all the numerical variables for each date.

# plot
plt.figure(figsize=(20, 10)) # set the figure size before plotting
ax = plt.gca()
plt.plot(df3["Date_reported"], df3["New_cases"]) # .plot(x, y)
ax.set_xticks(ax.get_xticks()[::7]) # show date every 7 days
plt.xticks(fontsize=15) # change fontsize of the x ticks
### add legends
# add legends inside the figure
plt.legend(labels=["New cases"], loc="best", fontsize=10)
# add legends outside the figure
# plt.legend(labels=["New cases"], bbox_to_anchor=(1.2, 0))
### save the figure
plt.savefig("new cases.jpg", dpi=200, bbox_inches="tight")
### show the figure
plt.show()
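The same kind of line graph can also be written with Matplotlib's object-oriented interface, where the figure size is fixed at creation and the date ticks are handled by a locator. This is only a sketch on made-up data (the dates and values are hypothetical), not the code used for the figure in this article.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: 28 days of made-up daily totals
dates = pd.date_range("2022-05-01", periods=28, freq="D")
daily = pd.DataFrame({"Date_reported": dates, "New_cases": range(100, 128)})

# The figure size is set when the figure is created
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(daily["Date_reported"], daily["New_cases"], label="New cases")

# Place a tick every 7 days and format the labels as year-month-day
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
ax.tick_params(axis="x", labelsize=12)
ax.set_xlabel("Date reported")
ax.set_ylabel("New cases")
ax.legend(loc="best")
fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would follow.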

New cases by country: Bar chart

df3 = df2.groupby(["Country"])[["New_cases"]].sum().sort_values(by=["New_cases"], ascending=False).reset_index()
# top 10 countries
df3.head(10)

Show the new cases of the top 10 countries

plt.bar(df3['Country'][:10], df3['New_cases'][:10], width=0.6, color = "#069AF3")
plt.show()

We can change the bar color by specifying the “color” argument in plt.bar(). A list of available colors can be found at https://matplotlib.org/3.5.0/tutorials/colors/colors.html.
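As a brief sketch of the “color” argument, the example below passes one color per bar using three of the formats Matplotlib accepts: a named color, a hex code, and an xkcd color name. The data are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical toy data
countries = ["A", "B", "C"]
cases = [300, 200, 100]

fig, ax = plt.subplots()
# One color per bar: a named color, a hex code, and an xkcd color name
bars = ax.bar(countries, cases, width=0.6,
              color=["tab:blue", "#069AF3", "xkcd:salmon"])
```

Passing a single value instead of a list, as in the code above, colors all bars the same.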

The 1e6 at the top of the y-axis means the unit is 1,000,000 persons.
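If the 1e6 offset is hard to read, one option is to format the tick labels explicitly. The sketch below uses `matplotlib.ticker.FuncFormatter` with a small helper, `millions`, which is a name I made up for this example; the data are also hypothetical.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Hypothetical toy data, in persons
countries = ["A", "B", "C"]
cases = [3_200_000, 2_100_000, 900_000]

def millions(value, position):
    """Format a tick value given in persons as millions, e.g. 2000000 -> '2.0M'."""
    return f"{value / 1e6:.1f}M"

fig, ax = plt.subplots()
ax.bar(countries, cases, width=0.6)
# Replace the default offset notation with explicit labels
ax.yaxis.set_major_formatter(FuncFormatter(millions))
ax.set_ylabel("New cases (millions)")
```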

New cases by region: Pie chart

We first sum up the cases by region

df6 = df.groupby(["WHO_region"])[["New_cases"]].sum().sort_values(by=["New_cases"], ascending=False).reset_index()
df6 = df6.transpose()
cols = list(df6.loc["WHO_region"].values)
df6.columns = cols
df6 = df6.reset_index(drop=True)
df6

“Other” is not defined in the metadata table, so we delete this column.

df6 = df6.drop(columns=["Other"])

Draw the pie plot

data = df6.iloc[1,:].values
plt.pie(data, autopct='%1.4f%%', colors = ["#7BC8F6", "#E6DAA6", "#13EAC9", "#FF796C", "#FFD700", "#FF81C0"])
plt.legend(loc="center right", labels=list(df6.columns))
plt.show()
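As an alternative sketch, the pie chart can be drawn from a grouped Series without transposing the dataframe. The data below are made up and only reuse the WHO column names; the wedge labels replace the separate legend.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data using the same column names as the WHO file
toy = pd.DataFrame({
    "WHO_region": ["EURO", "EURO", "AMRO", "AFRO", "Other"],
    "New_cases": [50, 30, 15, 5, 1],
})

# Sum by region, drop the undefined "Other" group, largest first
by_region = (
    toy.groupby("WHO_region")["New_cases"].sum()
       .drop("Other")
       .sort_values(ascending=False)
)

fig, ax = plt.subplots()
# Label each wedge with its region name and its share in percent
ax.pie(by_region.values, labels=list(by_region.index), autopct="%1.1f%%")
```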

Vaccination data: manipulation of string data

vac = pd.read_csv("vaccination-data.csv")

In this practice, we want to observe the distribution of vaccine types. In this dataset, there are two columns we need.

NUMBER_VACCINES_TYPES_USED indicates the number of vaccines used in the country and VACCINES_USED is a string variable specifying all the vaccine names.

Note that in the VACCINES_USED column, all vaccine names are separated by the symbol “,”. In Pandas, we can split a string value into multiple columns. First, however, we need to know how many columns we need, which is the maximum number of vaccines used among all countries.

We can get such information by using the following code

vac.NUMBER_VACCINES_TYPES_USED.max()
# 12.0

There are at most 12 types of vaccines used in one country. Now let’s split VACCINES_USED into multiple columns, each containing one type of vaccine.

# split VACCINES_USED into 12 columns (at most 11 splits per row)
vac[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']] = vac['VACCINES_USED'].str.split(',', n=11, expand=True)
# create a new dataframe containing only the vaccine types
vac_used = vac[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']]
vac_used.head()

Now we can count how often each vaccine appears in this new dataframe using the following code.

vac_summary = pd.DataFrame([dict(collections.Counter(vac_used.values.flatten()))]).transpose()
vac_summary = vac_summary.rename(columns={0:"Quantity"}).reset_index()
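# drop rows 11 and 16 (presumably the None/NaN placeholder entries in this version of the file; check the index values in your own data)
vac_summary = vac_summary.drop([11,16]).reset_index(drop=True)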
vac_summary = vac_summary.sort_values(by="Quantity", ascending = False).reset_index(drop=True)
# show top 10 most used vaccines
vac_summary.head(10)
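The same count can be sketched without fixing the number of columns in advance, by splitting each string into a list and giving every vaccine its own row with `explode`. This is an alternative to the approach above, shown on made-up data (the vaccine names are hypothetical); the `str.strip()` step is an assumption that some names may carry stray spaces.

```python
import pandas as pd

# Hypothetical toy data using the same column name as the WHO file
toy = pd.DataFrame({
    "VACCINES_USED": ["Vaccine X,Vaccine Y", "Vaccine X", None,
                      "Vaccine Y, Vaccine Z,Vaccine X"],
})

vaccine_counts = (
    toy["VACCINES_USED"]
    .dropna()            # skip countries with no vaccine listed
    .str.split(",")      # each cell becomes a list of names
    .explode()           # one row per (country, vaccine) pair
    .str.strip()         # remove stray spaces around names
    .value_counts()      # count and sort in descending order
)
print(vaccine_counts.head(10))
```

Because missing values are dropped before counting, no placeholder rows need to be removed afterwards.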

Vaccination types by WHO regions: Filtering dataframe and the use of functions

In programming, we often write functions so that a task can be repeated easily. We already have code that counts how often each vaccine is used. Now we are interested in finding the top vaccines used in different WHO regions.

We can do this by filtering the vaccination dataframe. The unique names of the WHO regions can be obtained by

vac["WHO_REGION"].unique()

Define a function for extracting data by WHO regions

def find_region(region):
    # keep the region column together with the 12 vaccine columns
    vac_used2 = vac[['WHO_REGION', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']]
    # keep only the rows of the requested region, then drop the region column
    vac_used3 = vac_used2[vac_used2["WHO_REGION"] == region][['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']].reset_index(drop=True)
    display(vac_used3.head())
    return vac_used3

Use the function

vac_used3 = find_region("EMRO")

Define a function for finding top 10 vaccines used

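def top_10_vac(df):
    # count every vaccine name, skipping the None placeholders left by the split
    vac_summary = pd.DataFrame([dict(collections.Counter(list(filter(partial(is_not, None), df.values.flatten()))))]).transpose()
    vac_summary = vac_summary.rename(columns={0:"Quantity"}).reset_index()
    # drop missing values (NaN) that were counted as a vaccine name
    vac_summary = vac_summary[vac_summary["index"].notna()]
    # sort by quantity and keep the 10 most used vaccines
    vac_summary = vac_summary.sort_values(by="Quantity", ascending=False).head(10).reset_index(drop=True)
    display(vac_summary)
    return vac_summary
### use the function
top_10 = top_10_vac(vac_used3)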
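The two functions can also be combined into a single pass over all regions. The following is a sketch under my own assumptions, not the method used above: the function name `top_vaccines_by_region` and the helper column `vaccine` are made up, and the data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the WHO file
toy = pd.DataFrame({
    "WHO_REGION": ["EMRO", "EMRO", "EURO", "EURO"],
    "VACCINES_USED": ["Vaccine X,Vaccine Y", "Vaccine X",
                      "Vaccine Z", "Vaccine Z,Vaccine X"],
})

def top_vaccines_by_region(data, n=10):
    """Return the n most frequently listed vaccines for every WHO region."""
    # One row per (country, vaccine) pair
    long_df = (
        data.dropna(subset=["VACCINES_USED"])
            .assign(vaccine=lambda d: d["VACCINES_USED"].str.split(","))
            .explode("vaccine")
    )
    long_df["vaccine"] = long_df["vaccine"].str.strip()
    # Count each vaccine within each region, largest counts first
    counts = (
        long_df.groupby(["WHO_REGION", "vaccine"])
               .size()
               .rename("Quantity")
               .reset_index()
               .sort_values(["WHO_REGION", "Quantity"], ascending=[True, False])
    )
    # Keep the first n rows of every region
    return counts.groupby("WHO_REGION").head(n).reset_index(drop=True)

top = top_vaccines_by_region(toy, n=1)
print(top)
```

With the real data, calling the function on `vac` with `n=10` would give a top-10 table for every region at once.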
