Load packages and data output by the PolyglotDB formants tutorial script:
library(tidyverse)
library(broom)
v <- read_csv("librispeech_pitch_full_set.csv")
Save original dataset:
v_orig <- v
Quick look at columns of the dataframe:
glimpse(v)
## Rows: 30,210
## Columns: 15
## $ traj_id <chr> "fcb0eac8-fcdb-11ec-b292-fc4dd44cd2b1", "…
## $ vowel <chr> "AH1", "AH1", "AH1", "AH1", "AH0", "AH0",…
## $ consonant <chr> "B", "B", "B", "B", "DH", "DH", "DH", "DH…
## $ following_phone <chr> "T", "T", "T", "T", "R", "R", "R", "R", "…
## $ word <chr> "but", "but", "but", "but", "the", "the",…
## $ word_duration <dbl> 0.13, 0.13, 0.13, 0.13, 0.17, 0.17, 0.17,…
## $ word_transcription <chr> "B.AH1.T", "B.AH1.T", "B.AH1.T", "B.AH1.T…
## $ following_word_transcription <chr> "T.AH0", "T.AH0", "T.AH0", "T.AH0", "R.IH…
## $ begin <dbl> 12.69, 12.69, 12.69, 12.69, 5.73, 5.73, 5…
## $ end <dbl> 12.73, 12.73, 12.73, 12.73, 5.84, 5.84, 5…
## $ discourse <chr> "1089-134691-0013", "1089-134691-0013", "…
## $ speaker <dbl> 1089, 1089, 1089, 1089, 1089, 1089, 1089,…
## $ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ time <dbl> 12.695, 12.705, 12.715, 12.725, 5.740, 5.…
## $ F0 <dbl> -1.00, -1.00, -1.00, -1.00, 102.25, 102.1…
We just want data from words beginning with CV. There are some words starting with VV, which is a bug in the query. Exclude these:
v <- filter(v, !(consonant %in% c("AW1", "OW1", "ER0"))) %>% droplevels()
We are primarily interested in:
F0
: pitch (in Hz), where -1 means “undefined” (unvoiced frame of speech)sex
of speakerspeaker
consonant
: consonant preceding the vowelDefine and correct some columns:
stress
: this is e.g. 1
in AH1
stressed
: stress = 0 vs 2/1vowel
: this is e.g. AH
in AH
speaker
: should be character (or factor)rel_time
: time since beginning of vowelv <- v %>% mutate(
stress = str_split(vowel, "[A-Z]+", simplify = TRUE)[, 2],
stressed = ifelse(stress == "0", "N", "Y"),
vowel = str_split(vowel, "[0-9]", simplify = TRUE)[, 1],
speaker = as.character(speaker)
)
v <- v %>%
group_by(traj_id) %>%
summarize(min_time = min(time)) %>%
left_join(v) %>%
mutate(rel_time = time - min_time) %>%
select(-min_time)
Below we’ll look at effects of consonant type and vowel height on F0. This requires defining:
v <- v %>%
mutate(
cons_class =
fct_recode(
consonant,
voiced_obs = "B", voiced_obs = "DH", voiced_obs = "D", voiced_obs = "G", voiced_obs = "V", voiced_obs = "JH", voiced_obs = "Z",
vless_obs = "CH", vless_obs = "F", vless_obs = "CH", vless_obs = "HH", vless_obs = "K", vless_obs = "P", vless_obs = "S", vless_obs = "SH", vless_obs = "T", vless_obs = "TH",
nasal = "M", nasal = "N",
sonorant = "L", sonorant = "R", sonorant = "W", sonorant = "Y"
)
)
v <- v %>%
mutate(
vowel_height =
fct_recode(
vowel,
high = "IY", high = "UH", high = "UW", high = "IH",
mid = "EY", mid = "EH", mid = "ER", mid = "OW", mid = "AH",
low = "AE", low = "AA", low = "AO",
dipth = "OY", dipth = "AY", dipth = "AW"
)
)
Make pitch = -1 be NA:
v[v$F0 == -1, "F0"] <- NA
Speaker mean pitch, and \(z\)-score pitch within speaker:
v <- v %>%
group_by(speaker) %>%
summarize(F0_mean = mean(F0, na.rm = T)) %>%
left_join(v) %>%
group_by(speaker) %>%
mutate(F0_std = as.numeric(scale(F0)))
Add vowel duration (I think), basic stats to be used to detect trajectories where pitch tracking was off.
v <- v %>%
group_by(traj_id) %>%
summarise(
dur = 0.010 * (n() - 1),
## number and % of NA points in this trajectory
na_n = sum(is.na(F0)),
not_na_n = n() - na_n,
na_frac = na_n / n(),
) %>%
left_join(v)
All trajectories, filtering out those with non-trivial undefined F0:
v %>%
## filter out trajectories where any appreciable portion of vowel didn't have F0 detected:
filter(na_frac < 0.1 & na_n <= 2) %>%
ggplot(aes(x = rel_time, y = F0_std)) +
geom_line(aes(group = traj_id))
## Warning: Removed 73 row(s) containing missing values (geom_path).
There are some crazy trajectories there. For this demo, let’s just remove anything 3 SD from speaker mean or >200 msec long (which is a very long vowel):
v <- v %>%
filter(na_frac < 0.1 & na_n <= 2) %>%
filter(abs(F0_std) < 3) %>%
filter(dur < 0.2)
v %>%
filter(na_frac < 0.1 & na_n <= 2) %>%
ggplot(aes(x = rel_time, y = F0_std)) +
geom_line(aes(group = traj_id)) +
facet_wrap(~speaker)
Looks a bit better.
How much data was excluded?
## this percent
1 - nrow(v) / nrow(v_orig)
## [1] 0.3268454
We expect:
Examine in just stressed syllables, to avoid prosodic effects.
Voiced vs. voiceless obstruent trajectories as a function of % through the vowel:
v %>%
filter(stressed == "Y" & cons_class %in% c("voiced_obs", "vless_obs")) %>%
ggplot(aes(x = rel_time / dur, y = F0_std)) +
geom_smooth(aes(color = cons_class))
Beautiful. In Hz, faceting by gender:
v %>%
filter(stressed == "Y" & cons_class %in% c("voiced_obs", "vless_obs")) %>%
ggplot(aes(x = rel_time / dur, y = F0)) +
geom_smooth(aes(color = cons_class)) +
facet_wrap(~sex)
The effect is larger (in Hz) for F speakers than for M speakers, as expected.
Now by-speaker:
v %>%
filter(stressed == "Y" & cons_class %in% c("voiced_obs", "vless_obs")) %>%
ggplot(aes(x = rel_time / dur, y = F0_std)) +
geom_smooth(aes(color = cons_class)) +
facet_wrap(~speaker, scales = "free_y") +
theme(legend.position = "none")
Here there is some interesting variability, but most speakers show the expected pattern: blue \(>\) red, at least at beginning of the vowel.
Trajectories as a function of vowel height, just for stressed syllables, omitting diphthongs (which don’t have a well-defined height):
v %>%
filter(stressed == "Y" & vowel_height != "dipth") %>%
droplevels() %>%
ggplot(aes(x = rel_time / dur, y = F0_std)) +
geom_smooth(aes(color = vowel_height))
Beautiful. In Hz, faceting by gender:
v %>%
filter(stressed == "Y" & vowel_height != "dipth") %>%
droplevels() %>%
ggplot(aes(x = rel_time / dur, y = F0)) +
geom_smooth(aes(color = vowel_height)) +
facet_wrap(~sex)
The effect is larger (in Hz) for F speakers than for M speakers, as expected.