See the Story Behind the Numbers (2024)

[Image: Panda performing data analytics on COVID-19]

Exploring Data Analytics with COVID-19 Data

Welcome to another edition of our newsletter! In this post, we’re diving into data analytics using a real-world dataset: COVID-19 case counts. This project shows you how to manipulate, analyse, and visualise data to extract meaningful insights. By the end of this tutorial, you'll have worked through the core steps of a small analytics project: loading a dataset, cleaning it, summarising it, and plotting the results.

What You'll Learn:

  1. Loading and Exploring Data: How to load a dataset and understand its structure.

  2. Data Cleaning: How to handle missing values and prepare the data for analysis.

  3. Data Analysis: How to perform basic analysis to extract meaningful insights.

  4. Data Visualisation: How to visualise data using various types of plots.

Let’s get started!

Step 1: Set Up Your Environment

Before we dive into the data, make sure you have the necessary tools installed. We’ll use Python along with the following libraries:

  • pandas

  • numpy

  • matplotlib

  • seaborn

You can install these libraries using pip:

pip install pandas numpy matplotlib seaborn
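To confirm the installation worked, a small check like the one below can help. It is only a sketch: the helper name installed_versions is made up for this example, and it relies on importlib.metadata from the Python standard library (Python 3.8 or later).

```python
from importlib.metadata import PackageNotFoundError, version

# The four libraries this tutorial asks you to install
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(REQUIRED)
for name, ver in versions.items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed by running the pip command again.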

Step 2: Load the Dataset

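We'll use the global confirmed-cases time series from the COVID-19 repository that Johns Hopkins University maintains on GitHub. pandas can read the CSV file directly from its URL, so there is nothing to download by hand.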

import pandas as pd

# Load the dataset

url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

data = pd.read_csv(url)

# Display the first few rows

print(data.head())
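Reading from the URL downloads the file on every run. If you would rather download it once and reuse a local copy, one possible approach is sketched below. The function name load_confirmed and the file name confirmed_global.csv are assumptions made for this example, not part of the dataset or of pandas.

```python
from pathlib import Path

import pandas as pd

URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_confirmed_global.csv"
)


def load_confirmed(url=URL, cache_path="confirmed_global.csv"):
    """Read the cached CSV if it exists; otherwise download it once and save it."""
    cache = Path(cache_path)
    if cache.exists():
        # A local copy is available, so no network access is needed
        return pd.read_csv(cache)
    frame = pd.read_csv(url)
    frame.to_csv(cache, index=False)  # keep a copy for the next run
    return frame
```

With this in place, data = load_confirmed() could stand in for the pd.read_csv(url) call above. Delete the cached file whenever you want a fresh download.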

About the COVID-19 Dataset

The COVID-19 dataset used in this tutorial is sourced from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The repository tracks the global spread of COVID-19 across countries and regions, with confirmed cases, deaths, and recoveries stored in separate files. The file we loaded above contains cumulative confirmed cases only: one row per country or region, and one column per date.

Key Features

This dataset has been widely used by researchers, analysts, and public health officials to monitor the spread of the virus, analyse trends, and inform policy decisions. It was updated daily until JHU stopped collecting data in March 2023, so it now serves as a detailed historical record for anyone looking to understand the impact of COVID-19 on a global scale.
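To make the layout concrete, here is a tiny hand-made table with the same column structure. The country names, coordinates, and counts are invented purely for illustration; the real file has many more rows and date columns.

```python
import pandas as pd

# Invented values for illustration only. The real file has one row per
# country or region and one column per reporting date.
sample = pd.DataFrame(
    {
        "Province/State": [None, "Region A", "Region B"],
        "Country/Region": ["Country X", "Country Y", "Country Y"],
        "Lat": [10.0, 20.0, 21.5],
        "Long": [30.0, 40.0, 41.5],
        "1/22/20": [0, 1, 0],
        "1/23/20": [2, 1, 3],
        "1/24/20": [5, 4, 3],
    }
)

# The first four columns identify a location; every other column is a date
id_columns = ["Province/State", "Country/Region", "Lat", "Long"]
date_columns = [col for col in sample.columns if col not in id_columns]

print(sample)
print("Identifier columns:", id_columns)
print("Date columns:", date_columns)
```

Because the dates run across the columns, the table is in "wide" format, which is why we reshape it in Step 4.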


Step 3: Explore the Data

Understanding the structure and content of your dataset is crucial. Let’s take a closer look at the data.

# Display basic information about the dataset
# (info() prints its summary itself and returns None, so no print() is needed)

data.info()

# Show basic statistics

print(data.describe())

# Display the column names

print(data.columns)
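The sketch below runs the same exploration calls on a small invented table, so you can see what each one returns without downloading anything. The column names mirror the real file, but the values are made up.

```python
import pandas as pd

# A small invented table with the same kinds of columns as the real file
toy = pd.DataFrame(
    {
        "Province/State": [None, "Region A"],
        "Country/Region": ["Country X", "Country Y"],
        "Lat": [10.0, 20.0],
        "1/22/20": [0, 1],
        "1/23/20": [2, 1],
    }
)

print(toy.shape)       # (2, 5): two rows, five columns

returned = toy.info()  # info() prints column names, non-null counts, and dtypes
print(returned)        # None, because info() only prints and returns nothing

# describe() summarises numeric columns only by default, so the two text
# columns do not appear in the result
summary = toy.describe()
print(list(summary.columns))  # ['Lat', '1/22/20', '1/23/20']
```

On the real dataset, describe() produces one column of statistics per date, so it is easier to read if you select a few date columns first.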


Step 4: Data Cleaning

Real-world data often requires cleaning. We’ll handle missing values and transform the data into a more usable format.

# Check for missing values

print(data.isnull().sum())

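# Drop columns with missing values only if you don't need them (Example: data.dropna(axis=1, inplace=True)).

# Note: Province/State has missing values, so dropping such columns would remove a column that the melt step below relies on.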

# Melt the dataset to have a better structure for analysis

data_melted = data.melt(id_vars=["Province/State", "Country/Region", "Lat", "Long"], var_name="Date", value_name="Confirmed")

data_melted["Date"] = pd.to_datetime(data_melted["Date"], format='%m/%d/%y')

# Display the first few rows of the cleaned data

print(data_melted.head())
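On the missing values: in this dataset, Province/State is empty for countries that are reported as a whole. Rather than dropping the column, one option is to fill the gaps with a placeholder. The sketch below uses invented data, and the label "All" is an arbitrary choice for this example. It also shows why the gaps matter: by default, groupby skips rows whose group key is missing.

```python
import pandas as pd

# Invented data: Country X is reported as a whole, so it has no Province/State
toy = pd.DataFrame(
    {
        "Province/State": [None, "Region A", "Region B"],
        "Country/Region": ["Country X", "Country Y", "Country Y"],
        "1/22/20": [5, 1, 2],
    }
)

# Count missing values per column
missing = toy.isnull().sum()
print(missing)

# Grouping on a column with missing keys silently drops those rows
by_region_default = toy.groupby("Province/State")["1/22/20"].sum()
print(by_region_default.sum())  # 3: the 5 cases from Country X are left out

# Filling the gaps with a placeholder keeps every row in the result
filled = toy.copy()
filled["Province/State"] = filled["Province/State"].fillna("All")
by_region_filled = filled.groupby("Province/State")["1/22/20"].sum()
print(by_region_filled.sum())  # 8: all rows are counted
```

The analysis in this tutorial groups by Date and Country/Region, which have no gaps, so the fill is only needed if you group by Province/State.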

Understanding data.melt

The melt method of a DataFrame (called here as data.melt) converts a wide-format DataFrame into a long-format one. In wide format, each measurement has its own column; in our dataset, every date is a separate column. In long format, the data is stacked so that each row is a single observation: one location on one date. This is particularly useful for time series data or data that needs to be grouped and aggregated.

Syntax

data.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)

Parameters

  1. id_vars:

    • This parameter specifies the columns that you want to keep as identifier variables.

    • These columns will not be unpivoted (i.e., they will remain the same).

  2. value_vars:

    • This parameter specifies the columns that you want to unpivot.

    • These columns will be converted from wide to long format. If it is not specified, every column not listed in id_vars is unpivoted.

  3. var_name:

    • This parameter sets the name of the new variable column.

    • If not specified, the default name will be variable.

  4. value_name:

    • This parameter sets the name of the new value column.

    • The default name is value.

  5. col_level:

    • If columns are multi-indexed, this parameter can specify which level to melt.

  6. ignore_index:

    • If set to True, the original index will be ignored.

    • If set to False, the original index will be retained.
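The effect of value_vars and ignore_index is easiest to see on a small example. The table below is invented, and the index labels "x" and "y" are there only to make the index behaviour visible.

```python
import pandas as pd

# Invented wide table with a labelled index
wide = pd.DataFrame(
    {
        "Country/Region": ["Country X", "Country Y"],
        "1/22/20": [0, 1],
        "1/23/20": [2, 1],
        "1/24/20": [5, 4],
    },
    index=["x", "y"],
)

# value_vars: unpivot only two of the three date columns
subset = wide.melt(
    id_vars=["Country/Region"],
    value_vars=["1/22/20", "1/23/20"],
    var_name="Date",
    value_name="Confirmed",
)
print(subset.shape)        # (4, 3): 2 rows x 2 selected date columns
print(list(subset.index))  # [0, 1, 2, 3]: a fresh index (ignore_index=True)

# ignore_index=False: keep the original index labels, repeated once per date
kept = wide.melt(
    id_vars=["Country/Region"],
    var_name="Date",
    value_name="Confirmed",
    ignore_index=False,
)
print(list(kept.index))    # ['x', 'y', 'x', 'y', 'x', 'y']
```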

Melt the COVID-19 dataset:

  • id_vars=["Province/State", "Country/Region", "Lat", "Long"]: We keep these columns as they are because they are identifiers for each record.

  • var_name="Date": We specify that the new variable column should be named Date.

  • value_name="Confirmed": We specify that the new value column should be named Confirmed.
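Here is the same melt call applied to a two-row invented table, so the before and after are small enough to read at a glance. The values are made up; only the column names match the real file.

```python
import pandas as pd

# Invented wide table: two locations, two dates
wide = pd.DataFrame(
    {
        "Province/State": [None, "Region A"],
        "Country/Region": ["Country X", "Country Y"],
        "Lat": [10.0, 20.0],
        "Long": [30.0, 40.0],
        "1/22/20": [0, 1],
        "1/23/20": [2, 1],
    }
)

# Same call as in the tutorial: keep the identifiers, stack the date columns
long = wide.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    var_name="Date",
    value_name="Confirmed",
)

# The Date column holds text such as "1/22/20" until we parse it
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")

print(long)
print(len(long))  # 4 rows: 2 locations x 2 dates
```

The rows come out grouped by date: both locations for the first date, then both locations for the second.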


Step 5: Data Analysis

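Now that our data is clean, we can start analysing it. Let’s look at the global spread of COVID-19 over time. Each value in the Confirmed column is a cumulative total for one location, so summing all rows that share a date gives the global total on that date.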

# Group by Date to see the global confirmed cases over time

global_cases = data_melted.groupby("Date")["Confirmed"].sum().reset_index()

# Display the first few rows

print(global_cases.head())
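Because the totals are cumulative, a natural follow-up is to derive the number of new cases per day. The sketch below is one way to do it, shown on invented numbers. The column names New and New_7d_avg are made up for this example.

```python
import pandas as pd

# Invented cumulative totals for four days
global_cases = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"]
        ),
        "Confirmed": [10, 15, 15, 27],
    }
)

# diff() subtracts the previous day's total from each day's total.
# The first day has no previous value, so we record 0 new cases for it.
global_cases["New"] = global_cases["Confirmed"].diff().fillna(0).astype(int)

# A 7-day rolling mean smooths out day-to-day reporting noise;
# min_periods=1 lets the first few days average over the days available.
global_cases["New_7d_avg"] = (
    global_cases["New"].rolling(window=7, min_periods=1).mean()
)

print(global_cases)
```

On the real data, a negative value in New would indicate that a total was revised downwards on that date.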


Step 6: Data Visualisation

Visualising data helps in understanding trends and patterns. We’ll create some basic plots to visualise the spread of COVID-19.

import matplotlib.pyplot as plt

import seaborn as sns

# Apply seaborn's default styling to the matplotlib plots below

sns.set_theme()

# Plot the global confirmed cases over time

plt.figure(figsize=(10, 6))

plt.plot(global_cases["Date"], global_cases["Confirmed"], marker='o', linestyle='-')

plt.title('Global COVID-19 Confirmed Cases Over Time')

plt.xlabel('Date')

plt.ylabel('Confirmed Cases')

plt.grid(True)

plt.show()

[Figure: Global COVID-19 confirmed cases over time]
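With counts in the hundreds of millions and several years of dates, the default axis labels can be hard to read. The sketch below shows one way to tidy them up and save the figure to a file. The data is invented, the output file name is an arbitrary choice, and the Agg backend is selected only so the script runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Invented cumulative totals standing in for the real global_cases table
dates = pd.date_range("2020-01-22", periods=120, freq="D")
global_cases = pd.DataFrame(
    {"Date": dates, "Confirmed": [1000 * day * day for day in range(120)]}
)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(global_cases["Date"], global_cases["Confirmed"])
ax.set_title("Global COVID-19 Confirmed Cases Over Time")
ax.set_xlabel("Date")
ax.set_ylabel("Confirmed Cases")

# Let matplotlib choose sensible date ticks and label them compactly
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))

# Show the y-axis in millions instead of scientific notation
ax.yaxis.set_major_formatter(
    FuncFormatter(lambda value, _pos: f"{value / 1e6:.1f}M")
)
ax.grid(True)

# Save to the system's temporary folder; change this to any path you prefer
out_path = Path(tempfile.gettempdir()) / "global_cases.png"
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
print(f"Saved plot to {out_path}")
```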

Additional Analysis: Country-Specific Trends

We can also analyse individual countries. Let’s look at the trend for the United States, which appears in the dataset as "US".

# Filter data for the United States

us_data = data_melted[data_melted["Country/Region"] == "US"]

# Group by Date to see the confirmed cases over time for the US

us_cases = us_data.groupby("Date")["Confirmed"].sum().reset_index()

# Plot the confirmed cases over time for the US

plt.figure(figsize=(10, 6))

plt.plot(us_cases["Date"], us_cases["Confirmed"], marker='o', linestyle='-', color='red')

plt.title('COVID-19 Confirmed Cases Over Time in the US')

plt.xlabel('Date')

plt.ylabel('Confirmed Cases')

plt.grid(True)

plt.show()

[Figure: Confirmed cases over time for the US]
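The same filter-and-group pattern works for any country, so it can be wrapped in a small function and used to compare several countries side by side. This is a sketch on invented data, and the function name cases_by_country is made up for this example. The grouping step matters because some countries are split across several Province/State rows.

```python
import pandas as pd


def cases_by_country(data_melted, country):
    """Return confirmed cases per date, summed over all of a country's rows."""
    rows = data_melted[data_melted["Country/Region"] == country]
    return rows.groupby("Date")["Confirmed"].sum()


# Invented long-format data: Country Y is split into two regions
data_melted = pd.DataFrame(
    {
        "Province/State": [None, "Region A", "Region B"] * 2,
        "Country/Region": ["Country X", "Country Y", "Country Y"] * 2,
        "Date": pd.to_datetime(["2020-01-22"] * 3 + ["2020-01-23"] * 3),
        "Confirmed": [0, 1, 2, 4, 3, 5],
    }
)

# One column per country, indexed by date
comparison = pd.DataFrame(
    {
        country: cases_by_country(data_melted, country)
        for country in ["Country X", "Country Y"]
    }
)
print(comparison)
```

Calling comparison.plot(figsize=(10, 6)) would then draw one line per country on a shared set of axes.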

Conclusion

Congratulations! You've just completed a basic data analytics project using COVID-19 data. Here’s what we covered:

  • Loading and Exploring Data: Understanding the dataset’s structure and basic statistics.

  • Data Cleaning: Handling missing values and transforming the data for analysis.

  • Data Analysis: Performing basic grouping and summing operations to extract insights.

  • Data Visualisation: Creating plots to visualise trends and patterns in the data.

By working through these steps, you’ve gained valuable skills in data manipulation, analysis, and visualisation. Keep exploring different datasets and applying these techniques to uncover more insights. Data analytics is a powerful tool, and you’re well on your way to mastering it!


Keep exploring, keep coding, and enjoy your journey into data analytics with Python! 👩‍💻👨‍💻

Stay tuned for our next exciting project in the following edition!

Happy coding!🚀📊✨


Author: Margart Wisoky
