Data Structuring #
Throughout the following exercises, you will practice structuring data in Python. Before starting on this programming exercise, we strongly recommend watching the video lecture and completing the IVQ for the associated topics.
All the information you need for solving this assignment is in this notebook, and all the code you will be implementing will take place within this notebook.
As we move forward, you can find instructions on how to install required libraries as they arise in this notebook. Before we begin with the exercises and analyzing the data, we need to import all libraries and extensions required for this programming exercise. Throughout the course, we will be using pandas for operations, and matplotlib and seaborn for plotting.
Objective #
We will be examining lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) for the year 2018.
First, we will find the locations with the greatest number of strikes within a single day.
Then, we will examine the locations that had the greatest number of days with at least one lightning strike.
Next, we will determine whether certain days of the week had more lightning strikes than others.
Finally, we will add data from 2016 and 2017 and, for each month, calculate the percentage of total lightning strikes for that year that occurred in that month. We will then plot this data on a bar graph.
# Import statements
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
from matplotlib import pyplot as plt
# Read in the 2018 data
df = pd.read_csv('eda_structuring_with_python_dataset1.csv')
df.head()
| | date | number_of_strikes | center_point_geom |
| --- | --- | --- | --- |
0 | 2018-01-03 | 194 | POINT(-75 27) |
1 | 2018-01-03 | 41 | POINT(-78.4 29) |
2 | 2018-01-03 | 33 | POINT(-73.9 27) |
3 | 2018-01-03 | 38 | POINT(-73.8 27) |
4 | 2018-01-03 | 92 | POINT(-79 28) |
Just like the data you encountered previously, the dataset has three columns: `date`, `number_of_strikes`, and `center_point_geom`. Start by converting the `date` column to datetime.
# Convert the `date` column to datetime
df['date'] = pd.to_datetime(df['date'])
Now, let’s check the shape of the dataframe.
df.shape
(3401012, 3)
Let’s do a quick check for duplicates. If the shape of the data is different after running this code, we’ll know there were duplicate rows.
df.drop_duplicates().shape
(3401012, 3)
The shape of the dataset is unchanged after dropping duplicates, so we can assume there are no duplicate rows. In other words, each row represents a unique combination of date, location, and strike count.
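If you’d rather count duplicates directly instead of comparing shapes, `duplicated()` flags every row that exactly repeats an earlier one. This is a quick optional check, not part of the original exercise:

# Count fully duplicated rows; 0 confirms the shape comparison above
df.duplicated().sum()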
Locations with most strikes in a single day #
To identify the locations with the most strikes in a single day, sort the `number_of_strikes` column in descending order, from most to least.
# Sort by number of strikes in descending order
df.sort_values(by='number_of_strikes', ascending=False).head(10)
| | date | number_of_strikes | center_point_geom |
| --- | --- | --- | --- |
302758 | 2018-08-20 | 2211 | POINT(-92.5 35.5) |
278383 | 2018-08-16 | 2142 | POINT(-96.1 36.1) |
280830 | 2018-08-17 | 2061 | POINT(-90.2 36.1) |
280453 | 2018-08-17 | 2031 | POINT(-89.9 35.9) |
278382 | 2018-08-16 | 1902 | POINT(-96.2 36.1) |
11517 | 2018-02-10 | 1899 | POINT(-95.5 28.1) |
277506 | 2018-08-16 | 1878 | POINT(-89.7 31.5) |
24906 | 2018-02-25 | 1833 | POINT(-98.7 28.9) |
284320 | 2018-08-17 | 1767 | POINT(-90.1 36) |
24825 | 2018-02-25 | 1741 | POINT(-98 29) |
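As an aside, `nlargest()` offers a slightly more concise way to get the same top-10 rows; this alternative isn’t part of the original exercise:

# Equivalent to sorting descending and keeping the first 10 rows
df.nlargest(10, 'number_of_strikes')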
Locations with most days with at least one lightning strike #
To find the number of days that a given geographic location had at least one lightning strike, we’ll use the `value_counts()` function on the `center_point_geom` column. The logic is that if each row represents a location-day, then counting the number of times each location occurs in the data will give us the number of days that location had lightning.
# Identify locations that appear most in the dataset
df.center_point_geom.value_counts()
POINT(-81.5 22.5) 108
POINT(-84.1 22.4) 108
POINT(-82.5 22.9) 107
POINT(-82.7 22.9) 107
POINT(-82.5 22.8) 106
...
POINT(-119.3 35.1) 1
POINT(-119.3 35) 1
POINT(-119.6 35.6) 1
POINT(-119.4 35.6) 1
POINT(-58.5 45.3) 1
Name: center_point_geom, Length: 170855, dtype: int64
We find that the locations with the most days of lightning strikes had at least one strike on 108 days, nearly one out of every three days of the year. They are all rather close to each other geographically, which makes sense. Notice also that the `value_counts()` function automatically sorts the results in descending order.
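If you want to check that geographic closeness numerically, one option (a minimal sketch; the `coords` name and regex are ours, not part of the exercise) is to parse the longitude and latitude out of the `POINT(lon lat)` strings:

# Split 'POINT(lon lat)' strings into numeric longitude/latitude columns
coords = df['center_point_geom'].str.extract(r'POINT\((?P<longitude>\S+) (?P<latitude>\S+)\)').astype(float)
coords.head()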
Let’s examine whether there is an even distribution of values, or whether 106+ days with lightning strikes is an unusually high number. We’ll use the `value_counts()` function again, but this time we’ll output the top 20 results. We’ll also rename the columns and apply a color gradient.
# Identify top 20 locations with most days of lightning
df.center_point_geom.value_counts()[:20].rename_axis('unique_values').reset_index(name='counts').style.background_gradient()
| | unique_values | counts |
| --- | --- | --- |
0 | POINT(-81.5 22.5) | 108 |
1 | POINT(-84.1 22.4) | 108 |
2 | POINT(-82.5 22.9) | 107 |
3 | POINT(-82.7 22.9) | 107 |
4 | POINT(-82.5 22.8) | 106 |
5 | POINT(-84.2 22.3) | 106 |
6 | POINT(-76 20.5) | 105 |
7 | POINT(-75.9 20.4) | 105 |
8 | POINT(-82.2 22.9) | 104 |
9 | POINT(-78 18.2) | 104 |
10 | POINT(-83.9 22.5) | 103 |
11 | POINT(-84 22.4) | 102 |
12 | POINT(-82 22.8) | 102 |
13 | POINT(-82 22.4) | 102 |
14 | POINT(-82.3 22.9) | 102 |
15 | POINT(-78 18.3) | 102 |
16 | POINT(-84.1 22.5) | 101 |
17 | POINT(-75.5 20.6) | 101 |
18 | POINT(-84.2 22.4) | 101 |
19 | POINT(-76 20.4) | 101 |
Lightning strikes by day of week #
One useful grouping is categorizing lightning strikes by day of the week, which will tell us whether any particular day of the week had fewer or more lightning strikes than others. To calculate this, we’ll take advantage of the fact that the data in our `date` column is of the `datetime` class. Because these entries are datetime objects, we can extract date-related information from them and create new columns.
First, we’ll create a column called `week` using `dt.isocalendar()` on the `date` column. This function is designed to be used on a pandas series, and it returns a new dataframe with year, week, and day columns. The information is formatted numerically; for example, 3 January 1950 would be represented as:
Year | Week | Day |
---|---|---|
1950 | 1 | 2 |
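You can verify this for a single date by calling `isocalendar()` on a timestamp directly; this quick check isn’t part of the original exercise:

# Returns (year=1950, week=1, weekday=2), i.e. ISO week 1, a Tuesday
pd.Timestamp('1950-01-03').isocalendar()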
Because we only want to extract the week number, we’ll add `.week` to the end. You can learn more about `dt.isocalendar()` in the dt.isocalendar() pandas documentation.

We’ll also add a `weekday` column using `dt.day_name()`. This is another pandas function designed to be used on a pandas series. It extracts the text name of the day for any given datetime date. You can learn more about this function in the dt.day_name() pandas documentation.
# Create two new columns
df['week'] = df.date.dt.isocalendar().week
df['weekday'] = df.date.dt.day_name()
df.head()
| | date | number_of_strikes | center_point_geom | week | weekday |
| --- | --- | --- | --- | --- | --- |
0 | 2018-01-03 | 194 | POINT(-75 27) | 1 | Wednesday |
1 | 2018-01-03 | 41 | POINT(-78.4 29) | 1 | Wednesday |
2 | 2018-01-03 | 33 | POINT(-73.9 27) | 1 | Wednesday |
3 | 2018-01-03 | 38 | POINT(-73.8 27) | 1 | Wednesday |
4 | 2018-01-03 | 92 | POINT(-79 28) | 1 | Wednesday |
Now, we can calculate the mean number of lightning strikes for each weekday of the year. We’ll use the `groupby()` function to do this.
# Calculate mean count of lightning strikes for each weekday
df[['weekday','number_of_strikes']].groupby(['weekday']).mean()
| weekday | number_of_strikes |
| --- | --- |
Friday | 13.349972 |
Monday | 13.152804 |
Saturday | 12.732694 |
Sunday | 12.324717 |
Thursday | 13.240594 |
Tuesday | 13.813599 |
Wednesday | 13.224568 |
Interesting! It seems that Saturday and Sunday have fewer lightning strikes on average than the other five weekdays. Let’s plot the distributions of the strike counts for each day of the week, with each distribution represented as a boxplot.
Let’s begin by defining the order of the days. We’ll begin with Monday and end with Sunday. This is how the days will be ordered in the plot we create.
# Define order of days for the plot
weekday_order = ['Monday','Tuesday', 'Wednesday', 'Thursday','Friday','Saturday','Sunday']
Now, we’ll code the plot. Remember that `showfliers` is the parameter that controls whether or not outliers are displayed in the plot. If you input `True`, outliers are included; if you input `False`, outliers are left out of the box plot. Keep in mind that we aren’t deleting any outliers from the dataset when we create this chart; we are only excluding them from the visualization.
# Create boxplots of strike counts for each day of week
g = sns.boxplot(data=df,
x='weekday',
y='number_of_strikes',
order=weekday_order,
showfliers=False
);
g.set_title('Lightning distribution per weekday (2018)');
Notice that the median remains the same across all days of the week, while the distributions for Saturday and Sunday sit lower than those for the rest of the week. We also know that the mean number of strikes on Saturday and Sunday was lower than on the other weekdays. Why might this be? Perhaps the aerosol particles emitted by factories and vehicles increase the likelihood of lightning strikes. In the U.S., Saturday and Sunday are days when many people don’t work, so there may be fewer factories operating and fewer cars on the road. This is only speculation, but it’s one possible path for further exploration.
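To back up the visual read with numbers, you can compute the medians directly; this optional check isn’t part of the original exercise:

# Median daily strike count per location, grouped by weekday
df.groupby('weekday')['number_of_strikes'].median()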
Monthly lightning strikes 2016–2018 #
Finally, we’ll examine monthly lightning strike data from 2016–2018. We’ll calculate the percentage of total lightning strikes for each year that occurred in a given month. We will then plot this data on a bar graph.
# Import 2016–2017 data
df_2 = pd.read_csv('eda_structuring_with_python_dataset2.csv')
df_2.head()
| | date | number_of_strikes | center_point_geom |
| --- | --- | --- | --- |
0 | 2016-01-04 | 55 | POINT(-83.2 21.1) |
1 | 2016-01-04 | 33 | POINT(-83.1 21.1) |
2 | 2016-01-05 | 46 | POINT(-77.5 22.1) |
3 | 2016-01-05 | 28 | POINT(-76.8 22.3) |
4 | 2016-01-05 | 28 | POINT(-77 22.1) |
The data is in the same format as the 2018 data when we first imported it above. Let’s convert the `date` column to datetime.
# Convert `date` column to datetime
df_2['date'] = pd.to_datetime(df_2['date'])
Now we can combine the 2016–2017 dataframe with the 2018 dataframe. There are several functions that can do this; we’ll use `concat()`. Remember that the 2018 data has two added columns, `week` and `weekday`. To simplify the results of our combined dataframe, we’ll drop these added columns during the concatenation. Note that the following code doesn’t permanently modify `df`; the columns are dropped only for this operation. You can learn more about the `concat()` function in the concat() pandas documentation.
# Create new dataframe combining 2016–2017 data with 2018 data
union_df = pd.concat([df.drop(['weekday','week'],axis=1), df_2], ignore_index=True)
union_df.head()
| | date | number_of_strikes | center_point_geom |
| --- | --- | --- | --- |
0 | 2018-01-03 | 194 | POINT(-75 27) |
1 | 2018-01-03 | 41 | POINT(-78.4 29) |
2 | 2018-01-03 | 33 | POINT(-73.9 27) |
3 | 2018-01-03 | 38 | POINT(-73.8 27) |
4 | 2018-01-03 | 92 | POINT(-79 28) |
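A quick optional sanity check (not part of the original exercise) confirms that the concatenation kept every row from both inputs:

# The union should have exactly as many rows as the two inputs combined
print(len(union_df), len(df) + len(df_2))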
To help us with naming the bars of the bar plot, we’ll create three new columns that isolate the year, month number, and month name.
# Add three new columns
union_df['year'] = union_df.date.dt.year
union_df['month'] = union_df.date.dt.month
union_df['month_txt'] = union_df.date.dt.month_name()
union_df.head()
| | date | number_of_strikes | center_point_geom | year | month | month_txt |
| --- | --- | --- | --- | --- | --- | --- |
0 | 2018-01-03 | 194 | POINT(-75 27) | 2018 | 1 | January |
1 | 2018-01-03 | 41 | POINT(-78.4 29) | 2018 | 1 | January |
2 | 2018-01-03 | 33 | POINT(-73.9 27) | 2018 | 1 | January |
3 | 2018-01-03 | 38 | POINT(-73.8 27) | 2018 | 1 | January |
4 | 2018-01-03 | 92 | POINT(-79 28) | 2018 | 1 | January |
Let’s check the overall lightning strike count for each year.
# Calculate total number of strikes per year
union_df[['year','number_of_strikes']].groupby(['year']).sum()
| year | number_of_strikes |
| --- | --- |
2016 | 41582229 |
2017 | 35095195 |
2018 | 44600989 |
Now we’ll work toward the percentage of each year’s total lightning strikes that occurred in a given month. As a first step, we’ll compute the monthly strike totals and assign the results to a new dataframe called `lightning_by_month`.
# Calculate total lightning strikes for each month of each year
lightning_by_month = union_df.groupby(['month_txt','year']).agg(
    number_of_strikes = pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
    ).reset_index()
lightning_by_month.head()
| | month_txt | year | number_of_strikes |
| --- | --- | --- | --- |
0 | April | 2016 | 2636427 |
1 | April | 2017 | 3819075 |
2 | April | 2018 | 1524339 |
3 | August | 2016 | 7250442 |
4 | August | 2017 | 6021702 |
By the way, we can use the `agg()` function to calculate the same yearly totals we found before, with 2017 having fewer strikes than the other two years.
# Calculate total lightning strikes for each year
lightning_by_year = union_df.groupby(['year']).agg(
    year_strikes = pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
    ).reset_index()
lightning_by_year.head()
| | year | year_strikes |
| --- | --- | --- |
0 | 2016 | 41582229 |
1 | 2017 | 35095195 |
2 | 2018 | 44600989 |
Back to our bar plot: we need to use the monthly totals to calculate percentages. For each month, we’ll need the monthly total strike count and the total strike count for that year. Let’s create another dataframe called `percentage_lightning` that adds a new column called `year_strikes`, which represents the total number of strikes for each year. We can do this using the `merge()` function, merging the `lightning_by_month` dataframe with the `lightning_by_year` dataframe on the `year` column. This means that wherever the `year` columns of the two dataframes contain the same value, a row is created in our new dataframe with all the other columns from both. To learn more about this function, refer to the merge() pandas documentation.
# Combine `lightning_by_month` and `lightning_by_year` dataframes into single dataframe
percentage_lightning = lightning_by_month.merge(lightning_by_year,on='year')
percentage_lightning.head()
| | month_txt | year | number_of_strikes | year_strikes |
| --- | --- | --- | --- | --- |
0 | April | 2016 | 2636427 | 41582229 |
1 | August | 2016 | 7250442 | 41582229 |
2 | December | 2016 | 316450 | 41582229 |
3 | February | 2016 | 312676 | 41582229 |
4 | January | 2016 | 313595 | 41582229 |
Now we’ll create a new column in our new dataframe that represents the percentage of total lightning strikes for each year that occurred during each month. We do this by dividing the `number_of_strikes` column by the `year_strikes` column and multiplying the result by 100.
# Create new `percentage_lightning_per_month` column
percentage_lightning['percentage_lightning_per_month'] = (percentage_lightning.number_of_strikes/
percentage_lightning.year_strikes * 100.0)
percentage_lightning.head()
| | month_txt | year | number_of_strikes | year_strikes | percentage_lightning_per_month |
| --- | --- | --- | --- | --- | --- |
0 | April | 2016 | 2636427 | 41582229 | 6.340273 |
1 | August | 2016 | 7250442 | 41582229 | 17.436396 |
2 | December | 2016 | 316450 | 41582229 | 0.761022 |
3 | February | 2016 | 312676 | 41582229 | 0.751946 |
4 | January | 2016 | 313595 | 41582229 | 0.754156 |
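As an optional sanity check (not part of the original exercise), the monthly percentages within each year should sum to roughly 100:

# Each year's twelve monthly percentages should total ~100
percentage_lightning.groupby('year')['percentage_lightning_per_month'].sum()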
Now we can plot the percentages by month in a bar graph!
plt.figure(figsize=(10,6));
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December']
sns.barplot(
data = percentage_lightning,
x = 'month_txt',
y = 'percentage_lightning_per_month',
hue = 'year',
order = month_order );
plt.xlabel("Month");
plt.ylabel("% of lightning strikes");
plt.title("% of lightning strikes each month (2016-2018)");
For all three years, there is a clear pattern over the course of the year, and one month stands out: more than a third of the lightning strikes in 2018 happened in August.
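To pull that standout month from the data instead of reading it off the chart, one optional approach (not part of the original exercise) uses `idxmax()`:

# For each year, select the row with the highest monthly percentage
percentage_lightning.loc[
    percentage_lightning.groupby('year')['percentage_lightning_per_month'].idxmax()
]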
If you have successfully completed the material above, congratulations! You now understand how to structure data in Python and should be able to start applying these techniques to your own datasets.