Analyzing Utah Avalanche Data¶
Participant ID: Reference
Date / Time:
Introduction¶
Welcome to our data analysis study. For this part of the study, you'll be working with a dataset sourced from the Utah Avalanche Center. The data provides insights into avalanche occurrences in Utah.
- You will use pandas to complete data cleanup and manipulation tasks.
- Carefully follow the step-by-step instructions provided for each task.
- Pandas is set up and ready for use, along with other Python libraries such as Matplotlib, Seaborn, and Altair for data visualization.
- You are allowed to use internet resources like documentation and forums, including Stack Overflow, to assist you in completing the tasks.
- In some cases, you will be asked to document your findings. Please do this in writing in a markdown cell.
- As you work through the tasks, take note of any interesting findings or challenges with the software or pandas that you may encounter, either by speaking your thoughts out loud or taking notes in a markdown cell.
- Feel free to add new code and markdown cells in the notebook as necessary to complete the tasks.
import helpers as h
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt
Data Description¶
The table below describes the different columns in the dataset. Each row in the dataset represents a reported avalanche with details on location, trigger, and aspect. The data spans multiple years, from 2004 to 2023.
Column | Description |
---|---|
Region | Region in Utah where the avalanche occurred |
Month | Month in which the avalanche was recorded |
Day | Day on which the avalanche was recorded |
Year | Year in which the avalanche was recorded |
Trigger | Cause of the avalanche |
Weak Layer | Layer of snow that was weakest and likely to fail |
Depth_inches | Depth of the avalanche in inches |
Vertical_inches | Vertical distance covered by the avalanche in inches |
Aspect | Direction of the slope where the avalanche occurred |
Elevation_feet | Elevation of the location in feet |
Coordinates | Approximate geographical coordinates of the avalanche location |
Comments 1 | Additional comments provided by the reporter |
df = pd.read_csv('./avalanches_data.csv')
df.head()
;Region | Month | Day | Year | ;Trigger | ;Weak Layer | Depth_inches | Vertical_inches | ;Aspect | Elevation_feet | Coordinates | Comments 1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 | 40.577977000000, -111.595817000000 | While it was a small avalanche that was I caug... |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 | 40.592619000000, -111.616099000000 | A North facing aspect with an exposed ridge in... |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 | 40.599291000000, -111.642315000000 | Remotely triggered all the new storm snow (abo... |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0" | 6000.0 | Southeast | 10200.0 | 40.598313000000, -111.628304000000 | Impressive fast powder cloud ran in front of t... |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 | 40.578590000000, -111.595087000000 | Three of us toured from Brighton to low saddle... |
Task 1: Column Names and Data Types¶
In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.
Task 1a: Remove Columns¶
Remove the following columns to streamline the dataset for further analysis:
- Comments 1: Contains textual comments not crucial for quantitative analysis.
- Coordinates: Detailed location data not needed for the current scope of analysis.
Instructions¶
- Column Removal:
  - Remove the specified columns using Pandas commands.
- Generate Dataframe:
  - Assign the modified dataframe to the variable `df_task_1a`.
- Show Output:
  - Print the head of `df_task_1a` to show the changes.
df_task_1a = df.drop(columns=["Comments 1", "Coordinates"])
df_task_1a.head()
;Region | Month | Day | Year | ;Trigger | ;Weak Layer | Depth_inches | Vertical_inches | ;Aspect | Elevation_feet | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0" | 6000.0 | Southeast | 10200.0 |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 |
Task 1b: Fix Column Names¶
It looks like something went wrong when reading the file and some column headers start with a `;`. Please remove the semicolon from all headers.
Instructions¶
- Rename Columns:
  - Employ Pandas commands to rename the columns, eliminating the leading ";" as specified:
    - `;Aspect` → `Aspect`
    - `;Region` → `Region`
    - `;Trigger` → `Trigger`
    - `;Weak Layer` → `Weak Layer`
- Generate Dataframe:
  - Assign the updated dataframe to the variable `df_task_1b`.
- Verify the Output:
  - Print the head of `df_task_1b` to confirm the updated column names.
df_task_1b = df_task_1a.rename(columns={
";Aspect": "Aspect",
";Region": "Region",
";Trigger": "Trigger",
";Weak Layer": "Weak Layer"
})
df_task_1b.head()
Region | Month | Day | Year | Trigger | Weak Layer | Depth_inches | Vertical_inches | Aspect | Elevation_feet | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0" | 6000.0 | Southeast | 10200.0 |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 |
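Aside: instead of enumerating each affected header, the same cleanup could be done generically by stripping a leading ";" from every column name. A minimal sketch (the `df_task_1b_alt` name is just for illustration):

# Alternative: strip a leading ";" from all headers at once.
df_task_1b_alt = df_task_1a.copy()
df_task_1b_alt.columns = df_task_1b_alt.columns.str.lstrip(";")
# Should match the explicit rename above.
assert list(df_task_1b_alt.columns) == list(df_task_1b.columns)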
Task 1c: Correcting Data Type of 'Depth_inches'¶
There is a data type issue in the `Depth_inches` column of our dataframe. This column is incorrectly formatted as an object (string) due to the presence of the inches symbol `"`. Remove any inches symbols `"` from the `Depth_inches` column and convert it to a float data type.
df_task_1b.dtypes
Region              object
Month                int64
Day                  int64
Year                 int64
Trigger             object
Weak Layer          object
Depth_inches        object
Vertical_inches    float64
Aspect              object
Elevation_feet     float64
dtype: object
Instructions¶
- Remove Inches Symbol and Correct Format:
  - Use Pandas to replace the inches symbol in the `Depth_inches` column.
- Convert Data Type:
  - Convert the `Depth_inches` column to float.
- Generate Dataframe:
  - Save the updated dataframe as `df_task_1c`.
- Show Output:
  - Print the dtypes of `df_task_1c` to confirm the changes.
df_task_1c = df_task_1b.copy()
df_task_1c["Depth_inches"] = df_task_1c["Depth_inches"].str.replace('"', "", regex=True).astype(float)
df_task_1c.head()
Region | Month | Day | Year | Trigger | Weak Layer | Depth_inches | Vertical_inches | Aspect | Elevation_feet | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0 | 6000.0 | Southeast | 10200.0 |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 |
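The instructions ask to confirm the change via dtypes; the head above shows the cleaned values, and the type itself can be verified in a separate cell:

# Depth_inches should now be float64.
df_task_1c.dtypes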
Task 2: Filtering data¶
In Task 2, we further improve the data by removing outliers and filtering out sparse records to make the dataset more consistent.
Task 2a: Remove Outliers¶
In this task, we address data accuracy by filtering out anomalies in the elevation data. We observe some records with elevations outside the plausible range for Utah, suggesting recording errors.
Remove avalanche records with elevations below ~4,000 feet or above ~15,000 feet, which are outside the realistic range for Utah.
Instructions¶
- Identify and Remove Anomalies:
  - Refer to the Seaborn scatterplot of `Elevation_feet` vs. `Vertical_inches`.
  - Use Pandas commands to filter out the anomalous records, where `Elevation_feet` is either below ~4,000 feet or above ~15,000 feet, from the dataframe.
- Generate Dataframe:
  - Save the cleaned dataframe as `df_task_2a`.
- Plot Output:
  - Recreate the scatterplot from step 1 in a new cell using `df_task_2a`.
  - Print the head of `df_task_2a`.
# Scatterplot: Elevation_feet vs. Vertical_inches (before removing outliers)
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_task_1c, x='Elevation_feet', y='Vertical_inches')
plt.xlabel('Elevation (feet)')
plt.ylabel('Vertical Distance (inches)')
# Display the plot
plt.show()
df_task_1c.head()
Region | Month | Day | Year | Trigger | Weak Layer | Depth_inches | Vertical_inches | Aspect | Elevation_feet | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0 | 6000.0 | Southeast | 10200.0 |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 |
df_task_2a = df_task_1c[(df_task_1c['Elevation_feet'] > 4000) & (df_task_1c['Elevation_feet'] < 15000)]
# Scatterplot: Elevation_feet vs. Vertical_inches (after removing outliers)
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_task_2a, x='Elevation_feet', y='Vertical_inches')
plt.xlabel('Elevation (feet)')
plt.ylabel('Vertical Distance (inches)')
# Display the plot
plt.show()
df_task_2a.head()
Region | Month | Day | Year | Trigger | Weak Layer | Depth_inches | Vertical_inches | Aspect | Elevation_feet | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0 | 6000.0 | Southeast | 10200.0 |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 |
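As an alternative, the same filter can be written with `Series.between`, which reads closer to the task description. A sketch, assuming pandas ≥ 1.3 for the `inclusive` keyword; `inclusive="neither"` reproduces the strict inequalities above:

# Equivalent filter using between(); "neither" excludes both endpoints,
# matching the strict > 4000 and < 15000 comparisons.
df_task_2a_alt = df_task_1c[df_task_1c["Elevation_feet"].between(4000, 15000, inclusive="neither")]
assert len(df_task_2a_alt) == len(df_task_2a)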
Task 2b: Filtering Out Old Data¶
The bar chart below shows the data aggregated by year. There are noticeably fewer records for the years before 2010.
In this subtask we will remove the older records, keeping only the records for the years 2010 and above.
Instructions¶
- Identify Sparse Years:
  - Refer to the Seaborn bar chart visualizing the number of avalanches per year.
  - Based on the bar chart, identify the years before 2010 with fewer avalanche records.
- Filter Out Sparse Years:
  - Write Pandas code to exclude these years from the dataset, saving the result as `df_task_2b`.
- Show Output:
  - Print the head of `df_task_2b` and recreate the bar chart to show the dataset focusing on the years 2010 and onwards.
# Bar chart: number of records per year (all years)
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_2a["Year"])
plt.xlabel('Year')
plt.ylabel('# of records')
# Display the plot
plt.show()
df_task_2a.head()
Region | Month | Day | Year | Trigger | Weak Layer | Depth_inches | Vertical_inches | Aspect | Elevation_feet | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0 | 6000.0 | Southeast | 10200.0 |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 |
df_task_2b = df_task_2a[df_task_2a['Year'] >= 2010]
# Bar chart: number of records per year (2010 onwards)
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_2b["Year"])
plt.xlabel('Year')
plt.ylabel('# of records')
# Display the plot
plt.show()
df_task_2b.head()
Region | Month | Day | Year | Trigger | Weak Layer | Depth_inches | Vertical_inches | Aspect | Elevation_feet | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0 | 6000.0 | Southeast | 10200.0 |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 |
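A quick optional sanity check that no pre-2010 records survived the filter:

# The earliest remaining year should be 2010 or later.
assert df_task_2b["Year"].min() >= 2010
df_task_2b["Year"].value_counts().sort_index()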
Task 3: Data Wrangling¶
Task 3a: Creating and assigning 'Avalanche Season'¶
Next, we'll introduce a new categorical variable named `Avalanche Season` into our dataset. This addition classifies each avalanche record into different parts of the avalanche season (Start, Middle, End) based on the month in which it occurred.
Create a new category `Avalanche Season` in the dataset and assign each record to the `Start`, `Middle`, or `End` of the avalanche season based on its month.
Instructions¶
- Create New Variable:
  - Add a new column `Avalanche Season` to the DataFrame.
- Assign Category:
  - Using the `Month` column, assign the proper values to the new category.
  - Use the following ranges for assigning the categories:
    - `Start` of season: October, November, December
    - `Middle` of season: January, February, March
    - `End` of season: April, May, June
- Generate Dataframe:
  - Save the modified DataFrame with the new `Avalanche Season` category to `df_task_3a`.
- Show Output:
  - Display the head of `df_task_3a`.
# Bar chart: number of records per month
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_2b["Month"])
plt.xlabel('Month')
plt.ylabel('# of records')
# Display the plot
plt.show()
df_task_2b.head()
Region | Month | Day | Year | Trigger | Weak Layer | Depth_inches | Vertical_inches | Aspect | Elevation_feet | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0 | 6000.0 | Southeast | 10200.0 |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 |
df_task_3a = df_task_2b.copy()
df_task_3a["Avalanche Season"] = "End"
df_task_3a.loc[df_task_3a["Month"] <= 3, "Avalanche Season"] = "Middle"
df_task_3a.loc[df_task_3a["Month"] >= 10, "Avalanche Season"] = "Start"
# Optional for particpants
## Date time
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_3a["Month"], hue=df_task_3a["Avalanche Season"])
plt.xlabel('Month')
plt.ylabel('# of records')
# Display the plot
plt.show()
df_task_3a.head()
Region | Month | Day | Year | Trigger | Weak Layer | Depth_inches | Vertical_inches | Aspect | Elevation_feet | Avalanche Season | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Salt Lake | 11 | 9 | 2012 | Snowboarder | New Snow/Old Snow Interface | 14.0 | 360.0 | North | 10400.0 | Start |
1 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow/Old Snow Interface | 30.0 | 1200.0 | North | 9700.0 | Start |
2 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 36.0 | 5400.0 | North | 10200.0 | Start |
3 | Salt Lake | 11 | 11 | 2012 | Skier | New Snow | 18.0 | 6000.0 | Southeast | 10200.0 | Start |
4 | Salt Lake | 11 | 11 | 2012 | Skier | Facets | 42.0 | 9600.0 | North | 10400.0 | Start |
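The `.loc`-based assignment above relies on "End" being the default for everything not overwritten. An equivalent, more explicit alternative maps each month directly; a sketch (note that off-season months 7-9, if any existed, would map to NaN here rather than "End"):

# Explicit month -> season mapping.
season_by_month = {
    10: "Start", 11: "Start", 12: "Start",
    1: "Middle", 2: "Middle", 3: "Middle",
    4: "End", 5: "End", 6: "End",
}
df_task_3a_alt = df_task_2b.copy()
df_task_3a_alt["Avalanche Season"] = df_task_3a_alt["Month"].map(season_by_month)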
Task 3b: Analyzing Top Avalanche Trigger by Season¶
Now we'll analyze which trigger is most prevalent for avalanches in each season phase (Start, Middle, End) using the `Avalanche Season` category created in Task 3a.
Instructions¶
- Context:
  - We have a faceted bar chart. The `x` axis encodes the `Trigger` column in the data, and the facet columns encode the newly added `Avalanche Season` category.
- Analyze Trigger Data:
  - Observe the most common trigger for each season.
  - You can hover over the bars to get the exact frequency.
- Document Findings:
  - Note down the most common trigger for each season, based on your interactive analysis, in the markdown cell.
NEW_COLUMN = "Avalanche Season"
# Legend-bound selection: clicking a season in the legend highlights its bars.
selection = alt.selection_point(name="selector", fields=[NEW_COLUMN], bind="legend")
chart = alt.Chart(df_task_3a).mark_bar().encode(
    x="Trigger:N",
    y="count():Q",
    color=alt.Color(f"{NEW_COLUMN}:N"),
    column=alt.Column(f"{NEW_COLUMN}:N").sort(["Start", "Middle", "End"]),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.3)),
    tooltip="count()",
).add_params(selection)
chart
Task 3b Notes:
- Most common Trigger for the `Start` of the season: Natural (210)
- Most common Trigger for the `Middle` of the season: Skier (572)
- Most common Trigger for the `End` of the season: Natural (88)
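These hover-based readings can also be cross-checked in pandas; a sketch that reports the most frequent trigger per season with its count:

# Top trigger per season: value_counts sorts descending within each group,
# so head(1) per group picks the most common trigger.
(
    df_task_3a.groupby("Avalanche Season")["Trigger"]
    .value_counts()
    .groupby(level=0)
    .head(1)
)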