See the Story Behind the Numbers (2024)

Panda performing data analytics on Covid-19

Exploring Data Analytics with COVID-19 Data

Welcome to another exciting edition of our newsletter! In this post, we’re diving into the world of data analytics using a real-world dataset: COVID-19 data. This project will help you understand how to manipulate, analyse, and visualise data to extract meaningful insights. By the end of this tutorial, you'll have a strong grasp of the fundamental concepts of data analytics, making your learning journey both interesting and motivating.

What You'll Learn:

Loading and Exploring Data: How to load a dataset and understand its structure.
Data Cleaning: How to handle missing values and prepare the data for analysis.
Data Analysis: How to perform basic analysis to extract meaningful insights.
Data Visualisation: How to visualise data using various types of plots.

Let’s get started!

Step 1: Set Up Your Environment

Before we dive into the data, make sure you have the necessary tools installed. We’ll use Python along with the following libraries:

pandas
numpy
matplotlib
seaborn

You can install these libraries using pip:

pip install pandas numpy matplotlib seaborn
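Before moving on, it can help to confirm that all four libraries import cleanly. The snippet below is a minimal sketch of such a check; the `required` list and `versions` dictionary are names chosen for this illustration, not part of any library.

```python
from importlib import import_module

# The four libraries this tutorial relies on
required = ["pandas", "numpy", "matplotlib", "seaborn"]

versions = {}
for name in required:
    try:
        # Every one of these packages exposes a __version__ string
        versions[name] = import_module(name).__version__
    except ImportError:
        versions[name] = None  # the package is not installed

for name, version in versions.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, rerun the pip command above in the same environment you use to run Python.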

Step 2: Load the Dataset

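We'll use the COVID-19 confirmed-cases time series published in Johns Hopkins University's GitHub repository. The repository stopped receiving updates on 10 March 2023, so the file is a fixed historical snapshot. pandas can read it straight from the raw URL.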

import pandas as pd

# Load the dataset

url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

data = pd.read_csv(url)

# Display the first few rows

print(data.head())

About the COVID-19 Dataset

The COVID-19 dataset used in this tutorial is sourced from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The repository tracks the global spread of COVID-19, with confirmed cases, deaths, and recoveries published as separate time series files across different countries and regions. This tutorial loads the confirmed-cases file only.

Key Features

Date Range: The file holds one column per day from 22 January 2020 to 9 March 2023, when JHU stopped updating the repository, which allows for detailed time series analysis.
Geographical Coverage: Data is available for countries worldwide. In this global file, some countries such as Canada, China, and Australia are split into province/state rows, while the United States appears as a single national row (state and county figures are published in a separate US file).
Metrics Tracked:
- Confirmed Cases: The total number of confirmed COVID-19 cases.
- Deaths: The total number of deaths attributed to COVID-19.
- Recovered: The total number of patients who have recovered from COVID-19.
Data Format: The dataset is structured in a time series format, with columns representing different dates and rows representing different geographical regions.

This dataset has been widely used by researchers, analysts, and public health officials to monitor the spread of the virus, analyse trends, and inform policy decisions. Its daily granularity and worldwide coverage make it a valuable resource for anyone looking to understand the impact of COVID-19 on a global scale.
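To make the wide time series layout concrete, here is a small sketch with a toy DataFrame. The values are hypothetical; only the column layout (four identifier columns followed by one column per date) mirrors the real CSV. The names `toy`, `id_columns`, and `date_columns` are chosen for this illustration.

```python
import pandas as pd

# Hypothetical values; only the column layout mirrors the real file
toy = pd.DataFrame({
    "Province/State": [None, "Alberta"],
    "Country/Region": ["Italy", "Canada"],
    "Lat": [41.9, 53.9],
    "Long": [12.6, -116.6],
    "1/22/20": [0, 0],
    "1/23/20": [0, 0],
    "1/24/20": [2, 1],
})

# Everything that is not an identifier column is a date column
id_columns = ["Province/State", "Country/Region", "Lat", "Long"]
date_columns = [c for c in toy.columns if c not in id_columns]

print(toy.shape)      # (2, 7): one row per region, one column per date
print(date_columns)   # ['1/22/20', '1/23/20', '1/24/20']
```

The same split works on the real `data` frame, where the list of date columns runs to more than a thousand entries.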

Step 3: Explore the Data

Understanding the structure and content of your dataset is crucial. Let’s take a closer look at the data.

# Display basic information about the dataset
# (info() prints its own summary and returns None, so it needs no print())

data.info()

# Show basic statistics

print(data.describe())

# Display the column names

print(data.columns)
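Beyond the built-in summaries, it is often useful to ask how many countries the file covers and which of them are split into several province/state rows. The sketch below shows one way to do that on a toy frame with hypothetical values; on the real dataset you would run the same calls on `data`.

```python
import pandas as pd

# Hypothetical rows: Canada is split by province, the others are single rows
toy = pd.DataFrame({
    "Province/State": ["Alberta", "Ontario", None, None],
    "Country/Region": ["Canada", "Canada", "Italy", "US"],
    "1/22/20": [0, 0, 0, 1],
    "1/23/20": [0, 0, 0, 1],
})

# Number of distinct countries
n_countries = toy["Country/Region"].nunique()

# Countries that appear on more than one row, i.e. are split into provinces
rows_per_country = toy["Country/Region"].value_counts()
split_countries = rows_per_country[rows_per_country > 1].index.tolist()

print(n_countries)       # 3
print(split_countries)   # ['Canada']
```

Knowing which countries span several rows explains why the later steps sum over rows before plotting a country's trend.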

Step 4: Data Cleaning

Real-world data often requires cleaning. We’ll handle missing values and transform the data into a more usable format.

# Check for missing values

print(data.isnull().sum())

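# Province/State is empty for country-level rows, so missing values there are expected.

# Avoid data.dropna(axis=1) here: it would drop Province/State (and any other column with gaps), and the melt below needs those columns.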

# Melt the dataset to have a better structure for analysis

data_melted = data.melt(id_vars=["Province/State", "Country/Region", "Lat", "Long"], var_name="Date", value_name="Confirmed")

data_melted["Date"] = pd.to_datetime(data_melted["Date"], format='%m/%d/%y')

# Display the first few rows of the cleaned data

print(data_melted.head())
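If you would rather not carry missing values in the identifier columns, filling them is usually safer than dropping them. The sketch below illustrates the idea on a toy frame with hypothetical values; the placeholder label "All" is an arbitrary choice for this example, not something the dataset defines.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: only Canada has a province filled in
toy = pd.DataFrame({
    "Province/State": ["Alberta", np.nan, np.nan],
    "Country/Region": ["Canada", "Italy", "US"],
    "1/22/20": [0, 0, 1],
})

# Count missing values per column before cleaning
missing_before = toy.isnull().sum()
print(missing_before["Province/State"])   # 2

# Fill rather than drop: these rows are valid country-level records.
# "All" is an arbitrary placeholder label chosen for this sketch.
toy["Province/State"] = toy["Province/State"].fillna("All")

missing_after = toy.isnull().sum().sum()
print(missing_after)   # 0
```

No rows are lost this way, and the Province/State column stays available as an identifier for the melt.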

Understanding `data.melt`

The melt method, called here as data.melt, converts a wide-format DataFrame into a long-format one. In wide format, each measurement has its own column (here, one column per date). In long format, the data is stacked so that each row is a single observation (here, one region on one date). This is particularly useful for time series data or data that needs to be grouped and aggregated.

Syntax

data.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)

Parameters

id_vars:
- This parameter specifies the columns that you want to keep as identifier variables.
- These columns will not be unpivoted (i.e., they will remain the same).
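value_vars:
- This parameter specifies the columns that you want to unpivot.
- These columns will be converted from wide to long format.
- If not specified, every column not listed in id_vars is unpivoted, which is what this tutorial relies on for the date columns.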
var_name:
- This parameter sets the name of the new variable column.
- If not specified, the default name will be variable.
value_name:
- This parameter sets the name of the new value column.
- The default name is value.
col_level:
- If columns are multi-indexed, this parameter can specify which level to melt.
ignore_index:
- If set to True, the original index will be ignored.
- If set to False, the original index will be retained.

Melt the COVID-19 dataset:

id_vars=["Province/State", "Country/Region", "Lat", "Long"]: We keep these columns as they are because they are identifiers for each record.
var_name="Date": We specify that the new variable column should be named Date.
value_name="Confirmed": We specify that the new value column should be named Confirmed.
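The effect of these parameters is easiest to see on a tiny example. The sketch below melts a two-region, two-date toy frame with hypothetical values and then parses the dates the same way as in the cleaning step; `wide` and `long_df` are names chosen for this illustration.

```python
import pandas as pd

# Hypothetical wide frame: one row per region, one column per date
wide = pd.DataFrame({
    "Country/Region": ["Italy", "US"],
    "1/22/20": [0, 1],
    "1/23/20": [0, 1],
})

# value_vars is omitted, so every non-identifier column is unpivoted
long_df = wide.melt(id_vars=["Country/Region"], var_name="Date", value_name="Confirmed")

# Convert the former column labels into real datetimes
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")

print(long_df.shape)   # (4, 3): 2 regions x 2 dates, 3 columns
print(long_df)
```

Each original date column becomes a block of rows, so the row count of the long frame is the number of regions multiplied by the number of dates.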

Step 5: Data Analysis

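Now that our data is clean, we can start analysing it. Let’s look at the global spread of COVID-19 over time. Because the Confirmed column holds cumulative counts, summing it across all regions for each date gives the running global total.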

# Group by Date to see the global confirmed cases over time

global_cases = data_melted.groupby("Date")["Confirmed"].sum().reset_index()

# Display the first few rows

print(global_cases.head())
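Since the totals are cumulative, a common next step is to derive daily new cases from them. The sketch below shows one way to do this with `diff()` on a toy frame of hypothetical values; `global_cases_toy` and the `New_Cases` column are names introduced for this example only.

```python
import pandas as pd

# Hypothetical cumulative totals standing in for global_cases
global_cases_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"]),
    "Confirmed": [10, 15, 15, 30],
})

# diff() subtracts the previous row from each row. The first row has no
# predecessor, so its NaN is filled with the first cumulative value.
new_cases = global_cases_toy["Confirmed"].diff()
global_cases_toy["New_Cases"] = new_cases.fillna(global_cases_toy["Confirmed"]).astype(int)

print(global_cases_toy["New_Cases"].tolist())   # [10, 5, 0, 15]
```

On the real data, retrospective corrections to the cumulative counts may occasionally produce negative daily values, so it is worth checking for them before plotting.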

Step 6: Data Visualisation

Visualising data helps in understanding trends and patterns. We’ll create some basic plots to visualise the spread of COVID-19.

import matplotlib.pyplot as plt

import seaborn as sns

# Plot the global confirmed cases over time

plt.figure(figsize=(10, 6))

plt.plot(global_cases["Date"], global_cases["Confirmed"], marker='o', linestyle='-')

plt.title('Global COVID-19 Confirmed Cases Over Time')

plt.xlabel('Date')

plt.ylabel('Confirmed Cases')

plt.grid(True)

plt.show()

Output: line chart of global COVID-19 confirmed cases over time.

Additional Analysis: Country-Specific Trends

We can also analyse the data for individual countries. Let’s look at the trend for the United States, which the dataset labels as "US".

# Filter data for the United States

us_data = data_melted[data_melted["Country/Region"] == "US"]

# Group by Date to see the confirmed cases over time for the US

us_cases = us_data.groupby("Date")["Confirmed"].sum().reset_index()

# Plot the confirmed cases over time for the US

plt.figure(figsize=(10, 6))

plt.plot(us_cases["Date"], us_cases["Confirmed"], marker='o', linestyle='-', color='red')

plt.title('COVID-19 Confirmed Cases Over Time in the US')

plt.xlabel('Date')

plt.ylabel('Confirmed Cases')

plt.grid(True)

plt.show()

Output: line chart of confirmed cases over time for the US.
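To repeat this analysis for other countries without copying the filter and groupby each time, you could wrap the two steps in a small function. The sketch below is one way to do it; `country_cases` is a hypothetical helper written for this tutorial, and the toy frame uses made-up values in the same long format as `data_melted`.

```python
import pandas as pd

def country_cases(long_df: pd.DataFrame, country: str) -> pd.DataFrame:
    """Sum confirmed cases per date for one country (hypothetical helper)."""
    subset = long_df[long_df["Country/Region"] == country]
    if subset.empty:
        # Catch typos and unexpected labels, e.g. "USA" instead of "US"
        raise ValueError(f"No rows for country {country!r}")
    # Summing merges province-level rows into one national total per date
    return subset.groupby("Date")["Confirmed"].sum().reset_index()

# Hypothetical long-format data: Canada is split by province, the US is not
toy_long = pd.DataFrame({
    "Province/State": ["Alberta", "Ontario", None, "Alberta", "Ontario", None],
    "Country/Region": ["Canada", "Canada", "US", "Canada", "Canada", "US"],
    "Date": pd.to_datetime(["2020-03-01"] * 3 + ["2020-03-02"] * 3),
    "Confirmed": [1, 2, 10, 3, 4, 20],
})

canada = country_cases(toy_long, "Canada")
print(canada["Confirmed"].tolist())   # [3, 7]
```

With the real data, `country_cases(data_melted, "US")` would give the same result as the `us_cases` frame built above.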

Conclusion

Congratulations! You've just completed a basic data analytics project using COVID-19 data. Here’s what we covered:

Loading and Exploring Data: Understanding the dataset’s structure and basic statistics.
Data Cleaning: Handling missing values and transforming the data for analysis.
Data Analysis: Performing basic grouping and summing operations to extract insights.
Data Visualisation: Creating plots to visualise trends and patterns in the data.

By working through these steps, you’ve gained valuable skills in data manipulation, analysis, and visualisation. Keep exploring different datasets and applying these techniques to uncover more insights. Data analytics is a powerful tool, and you’re well on your way to mastering it!

Ready for More Python Fun? 📬

Subscribe to our newsletter now and get a free Python cheat sheet! 📑 Dive deeper into Python programming with more exciting projects and tutorials designed just for beginners.

Keep exploring, keep coding 👩‍💻👨‍💻, and enjoy your journey into data analytics with Python!

Stay tuned for our next exciting project in the following edition!

Happy coding!🚀📊✨
