Analyzing Utah Avalanche Data¶

Participant ID: Reference

Date / Time:

Introduction¶

Welcome to our data analysis study. For this part of the study, you'll be working with a dataset sourced from the Utah Avalanche Center. The data provides insights into avalanche occurrences in Utah.

  • You will use pandas to complete data cleanup and manipulation tasks.
  • Carefully follow the step-by-step instructions provided for each task.
  • Pandas is set up and ready for use, along with other Python libraries such as Matplotlib, Seaborn, and Altair for data visualization.
  • You are allowed to use internet resources like documentation and forums, including Stack Overflow, to assist you in completing the tasks.
  • In some cases, you will be asked to document your findings. Please do this in writing in a markdown cell.
  • As you work through the tasks, take note of any interesting findings or challenges with the software or pandas that you may encounter, either by speaking your thoughts out loud or taking notes in a markdown cell.
  • Feel free to add new code and markdown cells in the notebook as necessary to complete the tasks.
In [1]:
import helpers as h
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

Data Description¶

The table below describes the different columns in the dataset. Each row in the dataset represents a reported avalanche with details on location, trigger, and aspect. The data spans multiple years, starting from 2004 up to 2023.

  • Region: Region in Utah where the avalanche occurred
  • Month: Month in which the avalanche was recorded
  • Day: Day on which the avalanche was recorded
  • Year: Year in which the avalanche was recorded
  • Trigger: Cause of the avalanche
  • Weak Layer: Layer of snow that was weakest and most likely to fail
  • Depth_inches: Depth of the avalanche in inches
  • Vertical_inches: Vertical distance covered by the avalanche in inches
  • Aspect: Direction of the slope where the avalanche occurred
  • Elevation_feet: Elevation of the location in feet
  • Coordinates: Approximate geographical coordinates of the avalanche location
  • Comments 1: Additional comments provided by the reporter
In [2]:
df = pd.read_csv('./avalanches_data.csv')
df.head()
Out[2]:
;Region Month Day Year ;Trigger ;Weak Layer Depth_inches Vertical_inches ;Aspect Elevation_feet Coordinates Comments 1
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0 40.577977000000, -111.595817000000 While it was a small avalanche that was I caug...
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0 40.592619000000, -111.616099000000 A North facing aspect with an exposed ridge in...
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0 40.599291000000, -111.642315000000 Remotely triggered all the new storm snow (abo...
3 Salt Lake 11 11 2012 Skier New Snow 18.0" 6000.0 Southeast 10200.0 40.598313000000, -111.628304000000 Impressive fast powder cloud ran in front of t...
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0 40.578590000000, -111.595087000000 Three of us toured from Brighton to low saddle...

Task 1: Column Names and Data Types¶

In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.

Task 1a: Remove Columns¶

Remove the following columns to streamline the dataset for further analysis:

  • Comments 1: Contains textual comments not crucial for quantitative analysis.
  • Coordinates: Detailed location data not needed for the current scope of analysis.

Instructions¶

  1. Column Removal:
    • Remove the specified columns using Pandas commands.
  2. Generate dataframe:
    • Assign the modified dataframe to variable df_task_1a
  3. Show Output:
    • Print the head of df_task_1a to show the changes.
In [3]:
df_task_1a = df.drop(columns=["Comments 1", "Coordinates"])
df_task_1a.head()
Out[3]:
;Region Month Day Year ;Trigger ;Weak Layer Depth_inches Vertical_inches ;Aspect Elevation_feet
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0
3 Salt Lake 11 11 2012 Skier New Snow 18.0" 6000.0 Southeast 10200.0
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0

Task 1b: Fix Column Names¶

It looks like something went wrong when reading the file and some column headers start with a ;. Please remove the semicolon from all headers.

Instructions¶

  1. Rename Columns:
    • Employ Pandas commands to rename the columns, eliminating the leading ";" as specified:
      • ;Aspect → Aspect
      • ;Region → Region
      • ;Trigger → Trigger
      • ;Weak Layer → Weak Layer
  2. Generate dataframe:
    • Assign the updated dataframe to variable df_task_1b.
  3. Verify the Output:
    • Print the head of df_task_1b to confirm the updated column names.
In [4]:
df_task_1b = df_task_1a.rename(columns={
    ";Aspect": "Aspect",
    ";Region": "Region",
    ";Trigger": "Trigger",
    ";Weak Layer": "Weak Layer"
})
df_task_1b.head()
Out[4]:
Region Month Day Year Trigger Weak Layer Depth_inches Vertical_inches Aspect Elevation_feet
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0
3 Salt Lake 11 11 2012 Skier New Snow 18.0" 6000.0 Southeast 10200.0
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0
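The explicit rename dictionary above is the approach the instructions call for. As an aside, a more generic alternative (a sketch, not part of the study instructions) strips a leading semicolon from every column label at once:

```python
import pandas as pd

# Small synthetic frame reproducing the malformed headers
df_demo = pd.DataFrame({";Region": ["Salt Lake"], "Year": [2012], ";Trigger": ["Skier"]})

# Index.str.lstrip(';') removes leading semicolons from each column label
df_demo.columns = df_demo.columns.str.lstrip(";")
print(list(df_demo.columns))  # ['Region', 'Year', 'Trigger']
```

This scales to any number of affected columns without listing each one by hand.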

Task 1c: Correcting Data Type of 'Depth_inches'¶

There is a data type issue in the Depth_inches column of our dataframe. This column is incorrectly formatted as an object (string) due to the presence of the inches symbol ".

Remove any inches symbols " from the Depth_inches column and convert it to a float data type.

In [5]:
df_task_1b.dtypes
Out[5]:
Region              object
Month                int64
Day                  int64
Year                 int64
Trigger             object
Weak Layer          object
Depth_inches        object
Vertical_inches    float64
Aspect              object
Elevation_feet     float64
dtype: object

Instructions¶

  1. Remove Inches Symbol and Correct Format:
    • Use Pandas to replace the inches symbol in the Depth_inches column.
  2. Convert Data Type:
    • Convert the Depth_inches column to float.
  3. Generate Dataframe:
    • Save the updated dataframe as df_task_1c.
  4. Show Output:
    • Print the dtypes of df_task_1c to confirm the changes.
In [6]:
df_task_1c = df_task_1b.copy()
df_task_1c["Depth_inches"] = df_task_1c["Depth_inches"].str.replace('"', "", regex=True).astype(float)

df_task_1c.head()
Out[6]:
Region Month Day Year Trigger Weak Layer Depth_inches Vertical_inches Aspect Elevation_feet
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0
3 Salt Lake 11 11 2012 Skier New Snow 18.0 6000.0 Southeast 10200.0
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0
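As a cross-check (a sketch on synthetic values, not required by the task), the same cleanup can be routed through pd.to_numeric, which raises on any value that still fails to parse instead of passing it through silently:

```python
import pandas as pd

# Synthetic depths reproducing the stray inches symbol
depths = pd.Series(["14.0", '18.0"', "36.0"])

# Strip the inches symbol, then convert; pd.to_numeric's default
# errors="raise" flags any remaining non-numeric value
cleaned = pd.to_numeric(depths.str.replace('"', "", regex=False))
print(cleaned.dtype)  # float64
```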

Task 2: Filtering data¶

In Task 2, we further refine the data by removing outliers and filtering out sparsely recorded years to make the dataset more consistent.

Task 2a: Remove Outliers¶

In this task, we address data accuracy by filtering out anomalies in the elevation data. We observe some records with elevations outside the plausible range for Utah, suggesting recording errors.

Remove avalanche records with elevations below ~4,000 feet or above ~15,000 feet, which fall outside the realistic range for Utah.
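Before filtering, the implausible values can also be confirmed numerically (sketched here on synthetic elevations; the notebook itself uses the scatterplot):

```python
import pandas as pd

# Synthetic elevations including two out-of-range entries
elev = pd.Series([10400.0, 9700.0, 250.0, 10200.0, 32000.0])

# Count records outside the plausible 4,000-15,000 ft band
outliers = ((elev < 4000) | (elev > 15000)).sum()
print(elev.min(), elev.max(), outliers)  # 250.0 32000.0 2
```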

Instructions¶

  1. Identify and Remove Anomalies:
    • Refer to the seaborn scatterplot for Elevation_feet vs Vertical_inches
    • Use Pandas commands to filter out these anomalous records where Elevation_feet is either below ~4,000 feet or above ~15,000 feet, from the dataframe.
  2. Generate Dataframe:
    • Save the cleaned dataframe as df_task_2a.
  3. Plot Output:
    • Recreate the scatterplot from step 1 in a new cell using df_task_2a.
    • Print the head of df_task_2a.
In [7]:
### Scatterplot Code Start

plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_task_1c, x='Elevation_feet', y='Vertical_inches')

plt.xlabel('Elevation (feet)')
plt.ylabel('Vertical Distance (inches)')

# Display the plot
plt.show()

df_task_1c.head()
[Scatterplot: Elevation_feet vs. Vertical_inches]
Out[7]:
Region Month Day Year Trigger Weak Layer Depth_inches Vertical_inches Aspect Elevation_feet
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0
3 Salt Lake 11 11 2012 Skier New Snow 18.0 6000.0 Southeast 10200.0
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0
In [8]:
df_task_2a = df_task_1c[(df_task_1c['Elevation_feet'] > 4000) & (df_task_1c['Elevation_feet'] < 15000)]
In [9]:
### Scatterplot Code Start

plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_task_2a, x='Elevation_feet', y='Vertical_inches')

plt.xlabel('Elevation (feet)')
plt.ylabel('Vertical Distance (inches)')

# Display the plot
plt.show()

df_task_2a.head()
[Scatterplot: Elevation_feet vs. Vertical_inches, after outlier removal]
Out[9]:
Region Month Day Year Trigger Weak Layer Depth_inches Vertical_inches Aspect Elevation_feet
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0
3 Salt Lake 11 11 2012 Skier New Snow 18.0 6000.0 Southeast 10200.0
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0

Task 2b: Filtering Out Old Data¶

The interactive bar chart below shows the number of records per year. There are noticeably fewer records for the years before 2010.

In this subtask, we will remove the older records, keeping only those from 2010 onwards.

Instructions¶

  1. Identify Sparse Years:
    • Refer to the Seaborn bar chart visualizing the number of avalanches per year.
    • Based on the bar chart, identify years before 2010 with fewer avalanche records.
  2. Filter Out Sparse Years:
    • Write Pandas code to exclude these years from the dataset.
  3. Show Output:
    • Print the head of df_task_2b and recreate the bar chart to show the dataset focusing on years 2010 and onwards.
In [10]:
## Date time
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_2a["Year"])

plt.xlabel('Year')
plt.ylabel('# of records')

# Display the plot
plt.show()

df_task_2a.head()
[Bar chart: number of records per year]
Out[10]:
Region Month Day Year Trigger Weak Layer Depth_inches Vertical_inches Aspect Elevation_feet
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0
3 Salt Lake 11 11 2012 Skier New Snow 18.0 6000.0 Southeast 10200.0
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0
In [11]:
df_task_2b = df_task_2a[df_task_2a['Year'] >= 2010]
In [12]:
## Date time
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_2b["Year"])

plt.xlabel('Year')
plt.ylabel('# of records')

# Display the plot
plt.show()

df_task_2b.head()
[Bar chart: number of records per year, 2010 onwards]
Out[12]:
Region Month Day Year Trigger Weak Layer Depth_inches Vertical_inches Aspect Elevation_feet
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0
3 Salt Lake 11 11 2012 Skier New Snow 18.0 6000.0 Southeast 10200.0
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0

Task 3: Data Wrangling¶

Task 3a: Creating and assigning 'Avalanche Season'¶

Next, we'll introduce a new categorical variable named Avalanche Season into our dataset. This addition aims to classify each avalanche record into different parts of the avalanche season (Start, Middle, End) based on the month it occurred in.

Create a new category Avalanche Season in the dataset and assign each record to Start, Middle, or End of the avalanche season based on its month.

Instructions¶

  1. Create New Variable:
    • Add a new column Avalanche Season to the DataFrame.
  2. Assign Category:
    • Using the Month column, assign the proper value to the new category.
    • You should use the following ranges for assigning proper categories:
      • Start of Season for October, November, December
      • Middle of Season for January, February, March
      • End of Season for April, May, June
  3. Generate Dataframe:
    • Save the modified DataFrame with the new Avalanche Season category to df_task_3a.
  4. Show Output:
    • Display the head of df_task_3a.
In [13]:
## Date time
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_2b["Month"])

plt.xlabel('Month')
plt.ylabel('# of records')

# Display the plot
plt.show()

df_task_2b.head()
[Bar chart: number of records per month]
Out[13]:
Region Month Day Year Trigger Weak Layer Depth_inches Vertical_inches Aspect Elevation_feet
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0
3 Salt Lake 11 11 2012 Skier New Snow 18.0 6000.0 Southeast 10200.0
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0
In [14]:
df_task_3a = df_task_2b.copy()
# Default to "End" (April-June), then override the earlier months
df_task_3a["Avalanche Season"] = "End"
df_task_3a.loc[df_task_3a["Month"] <= 3, "Avalanche Season"] = "Middle"  # Jan-Mar
df_task_3a.loc[df_task_3a["Month"] >= 10, "Avalanche Season"] = "Start"  # Oct-Dec
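The .loc assignments above rely on the data containing no July-September records. An equivalent mapping (a sketch, not the study's required approach) makes the month-to-season rule explicit and leaves any unexpected month as NaN:

```python
import pandas as pd

# Explicit month-to-season lookup; months 7-9 intentionally absent
season_map = {10: "Start", 11: "Start", 12: "Start",
              1: "Middle", 2: "Middle", 3: "Middle",
              4: "End", 5: "End", 6: "End"}

months = pd.Series([11, 2, 5, 8])
seasons = months.map(season_map)  # month 8 maps to NaN
print(seasons.tolist())
```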
In [15]:
# Optional for participants
## Date time
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_3a["Month"], hue=df_task_3a["Avalanche Season"])

plt.xlabel('Month')
plt.ylabel('# of records')

# Display the plot
plt.show()

df_task_3a.head()
[Bar chart: number of records per month, colored by Avalanche Season]
Out[15]:
Region Month Day Year Trigger Weak Layer Depth_inches Vertical_inches Aspect Elevation_feet Avalanche Season
0 Salt Lake 11 9 2012 Snowboarder New Snow/Old Snow Interface 14.0 360.0 North 10400.0 Start
1 Salt Lake 11 11 2012 Skier New Snow/Old Snow Interface 30.0 1200.0 North 9700.0 Start
2 Salt Lake 11 11 2012 Skier Facets 36.0 5400.0 North 10200.0 Start
3 Salt Lake 11 11 2012 Skier New Snow 18.0 6000.0 Southeast 10200.0 Start
4 Salt Lake 11 11 2012 Skier Facets 42.0 9600.0 North 10400.0 Start

Task 3b: Analyzing Top Avalanche Trigger by Season¶

Now we'll analyze which trigger is most prevalent for avalanches in each season phase (Start, Middle, End) using the Avalanche Season category created in Task 3a.

Instructions¶

  1. Context:
    • We have a faceted bar chart. The x-axis encodes the Trigger column, and the facet columns encode the newly added Avalanche Season category.
  2. Analyze Trigger Data:
    • Observe the most common trigger for each season.
    • You can hover on the bars to get the exact frequency.
  3. Document Findings:
    • Note down the most common trigger for each season based on your interactive analysis in the markdown cell.
In [16]:
NEW_COLUMN = "Avalanche Season"

selection = alt.selection_point(name="selector", fields=[NEW_COLUMN], bind="legend")

chart = alt.Chart(df_task_3a).mark_bar().encode(
    x="Trigger:N",
    y="count():Q",
    color=alt.Color(f"{NEW_COLUMN}:N"),
    column=alt.Column(f"{NEW_COLUMN}:N").sort(["Start", "Middle", "End"]),
    opacity=alt.condition(selection,alt.value(1), alt.value(0.3)),
    tooltip="count()"
).add_params(selection)

chart
Out[16]:

Task 3b Notes:

  • Most common Trigger for Start of the season: Natural (210)
  • Most common Trigger for Middle of the season: Skier (572)
  • Most common Trigger for End of the season: Natural (88)
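The interactive counts can also be cross-checked in pandas. A sketch on a small synthetic frame (column names match the notebook's; values are illustrative, not the study data):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Avalanche Season": ["Start", "Start", "Middle", "Middle", "Middle", "End"],
    "Trigger": ["Natural", "Natural", "Skier", "Skier", "Natural", "Natural"],
})

# Most frequent trigger within each season phase
top = df_demo.groupby("Avalanche Season")["Trigger"].agg(lambda s: s.mode().iloc[0])
print(top.to_dict())  # {'End': 'Natural', 'Middle': 'Skier', 'Start': 'Natural'}
```

Running the same groupby on df_task_3a would reproduce the counts read off the Altair chart.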