library(tidyverse)
library(broom)
v <- read_csv("tutorial_formants.csv")
In this analysis file we process the output of Polyglot from the formants tutorial (formants_tutorial.csv
) and show how it can be used to address (simple) research questions involving variation between speakers in vowel space:
This analysis is partly adapted from course materials by Márton Sóskuthy (UBC).
Quick look at columns of the dataframe:
glimpse(v)
## Rows: 74,001
## Columns: 13
## $ speaker <dbl> 1089, 1089, 1089, 1089, 1089, 1089, 1089, 1089, 1089, 1089…
## $ speaker_sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
## $ file <chr> "1089-134691-0013", "1089-134691-0011", "1089-134691-0013"…
## $ speech_rate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ word <chr> "us", "contrite", "him", "all", "only", "becomes", "elabor…
## $ phone <chr> "AH1", "AY1", "IH1", "AO1", "OW1", "AH1", "ER0", "IH1", "I…
## $ previous <chr> "D", "R", "HH", "V", "Z", "K", "B", "HH", "T", "D", "S", "…
## $ following <chr> "S", "T", "M", "L", "N", "M", "EY2", "Z", "W", "V", "L", "…
## $ phone_start <dbl> 7.32, 9.42, 13.15, 6.46, 11.20, 5.67, 18.65, 13.76, 1.15, …
## $ phone_end <dbl> 7.40, 9.50, 13.22, 6.82, 11.31, 5.74, 18.76, 13.81, 1.33, …
## $ F1 <dbl> 501.02, 653.54, 396.60, 580.47, 549.19, 512.96, 615.42, 33…
## $ F2 <dbl> 1517.61, 1345.15, 1785.44, 899.85, 955.67, 1215.04, 1214.0…
## $ F3 <dbl> 2356.44, 2022.31, 2498.21, 2306.28, 2606.67, 2247.82, 1826…
These mostly look like we’d expect: phone_start
and phone_end
are times, previous
and following
are the phones following the vowel (phone
), F1/F2/F3 are formants, and o on. speech_rate
should not be all NAs; there must be a bug.
Define and correct some columns:
stress
: this is e.g. 1
in AH1
stressed
: stress = 0 vs 2/1vowel
: this is e.g. AH
in AH
speaker
: should be character (or factor)v <- v %>% mutate(
duration = phone_end - phone_start,
stress = str_split(phone, '[A-Z]+', simplify = TRUE)[,2],
stressed = ifelse(stress=='0', 'N', 'Y'),
vowel = str_split(phone, '[0-9]', simplify = TRUE)[,1],
speaker = as.character(speaker)
)
Let’s also label which vowels are diphtohngs, since we’ll want to just plot monophthongs (= all other vowels). This means vowels that just have one target (“heed”, “head”, but not “hide”), so it makes sense to just show one point for them in a vowel plot.
diphthongs <- c("AY", "AW", 'OY')
v <- v %>% mutate(
vowel_class = ifelse(vowel %in% diphthongs, 'dip', 'mono')
)
Let’s call that the raw dataset:
v_raw <- v
Examine vowel duration distributions, just for monophthongs, for stressed vowels. Our goal here is to assess what good cutoffs to exclude vowels that are reduced, or whose durations are so long as to be force-alignment errors:
filter(v, vowel_class == 'mono' & stressed == 'Y') %>%
ggplot(aes(y=duration, x=vowel)) +
# faceting by speaker
#facet_wrap(.~speaker) +
# using violins to show distributions per vowel
geom_violin() +
# adding the median separately as a point
stat_summary(fun=median, geom="point", shape=16) +
scale_y_log10() +
annotation_logticks(sides='l')
We can see that:
duration
is very left-skewedTo make plots of the formant data, get mean F1 and F2 by speaker/vowel/stress, add back to original dataframe:
v_summ <- v %>% group_by(speaker, speaker_sex, stressed, vowel, vowel_class) %>%
summarise(F1_mean = mean(F1),
F2_mean = mean(F2),
F3_mean = mean(F3),
n = n()
)
v <- v_summ %>% left_join(v)
Now let’s plot the raw formant data, by speaker, for stressed monophthongs, showing just points based on 10+ observations:
# setting up plot; separate phones will be shown
# by colour
v %>% filter(stressed == 'Y' & vowel_class=='mono' & n>=10) %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
# faceting by speaker
facet_wrap(.~speaker, scales='free') +
# using phone labels in plot
stat_ellipse(level = 0.67) +
geom_point(size=0.5)+
# geom_label(data = filter(v_summ, stressed == 'Y' & vowel_class=='mono' & n>5), aes(F2_mean, F1_mean, label=vowel)) +
# setting up vowel-chart-like F1 / F2 axes
scale_y_reverse() +
scale_x_reverse()
It’s hard to see what’s going on across all speakers—let’s just look at a representative few:
# setting up plot; separate phones will be shown
# by colour
some_speakers <- c('1089', '1284', '1580', '4446', '8224')
v %>% filter(speaker %in% some_speakers & stressed == 'Y' & vowel_class=='mono' & n>=10) %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
# faceting by speaker
facet_wrap(.~speaker, scales='free') +
# using phone labels in plot
stat_ellipse(level = 0.67) +
geom_point(size=0.5, alpha=0.5)+
geom_point(data = filter(v_summ, speaker %in% some_speakers & stressed == 'Y' & vowel_class=='mono' & n>5), aes(F2_mean, F1_mean, color=vowel)) +
# setting up vowel-chart-like F1 / F2 axes
scale_y_reverse() +
scale_x_reverse()
For most speakers we see roughly the expected vowel quadrilateral, but with significant noise. Most clearly, there are some tokens for each speaker far outside the quadrilateral, with measurements that can’t be right.
We’ll try to find a subset of measurements which are more reliable, by doing some filtering:
## From Marton Soskuthy, edited
## min_size: if there are fewer than this many observations,
## don't exclude any
mvt_outlier <- function (x, y, level_sd= 3, min_size = 10) {
if(length(x) >= min_size){
mat <- as.matrix(cbind(x,y))
mu <- c(mean(mat[,1]), mean(mat[,2]))
sigma <- cov(mat)
out <- mahalanobis(mat, center=mu, cov=sigma) > level_sd
} else{
out <-rep(FALSE, length(x))
}
}
v_filt <- v %>%
# grouping by vowel / speaker
group_by(vowel, speaker) %>%
filter(duration > 0.051 & duration < 0.251) %>%
# removing obvious outliers
filter(F1 < 1250,
!mvt_outlier(F1, F2)
) %>%
ungroup()
How much data did we lose?
nrow(v)
## [1] 74001
nrow(v_filt)
## [1] 43758
Vowel plot with filtered data (not not restricting by number of examples):
# setting up plot; separate phones will be shown
# by colour
v_filt %>% filter(stressed == 'Y' & vowel_class=='mono') %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
# faceting by speaker
facet_wrap(.~speaker, scales='free') +
# using phone labels in plot
stat_ellipse(level = 0.67) +
geom_point(size=0.5)+
# geom_label(data = filter(v_summ, stressed == 'Y' & vowel_class=='mono' & n>5), aes(F2_mean, F1_mean, label=vowel)) +
# setting up vowel-chart-like F1 / F2 axes
scale_y_reverse() +
scale_x_reverse()
This looks more sensible, though there are still tokens with issues, and a couple problematic speakers (8455, maybe 672).
Let’s examine the F1/F2 centroids for stressed monophthongs (now not restricting by number of tokens):
v_filt %>% filter(vowel_class == 'mono' & stressed == 'Y') %>%
group_by(speaker, vowel) %>%
summarise(F1=mean(F1),
F2=mean(F2)) %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
facet_wrap(~speaker) +
geom_text(aes(label=vowel)) +
scale_y_reverse() +
scale_x_reverse()
Let’s plot a subset of vowels to point out some features that could reflect dialect variation:
vowel_subset_1 <- c('AA', 'AO', 'AE', 'EY', 'UW', 'IY')
v_filt %>% filter(vowel %in% vowel_subset_1 & stressed == 'Y') %>%
group_by(speaker, vowel) %>%
summarise(F1=mean(F1),
F2=mean(F2)) %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
facet_wrap(~speaker) +
geom_text(aes(label=vowel)) +
scale_y_reverse() +
scale_x_reverse()
American English dialects differ in whether AO (“caught”) and AA (“cot”) are merged. It looks like:
Vowel space by gender:
v_filt_summ <- v_filt %>% group_by(speaker, speaker_sex, stressed, vowel, vowel_class) %>%
summarise(F1_mean = mean(F1),
F2_mean = mean(F2),
F3_mean = mean(F3),
n = n()
)
hull_df <- v_filt_summ %>% ungroup() %>% filter(vowel_class=='mono' & stressed=='Y') %>% group_by(speaker) %>% slice(chull(F2_mean, F1_mean))
hull_df %>% ggplot(aes(x=F2_mean, y=F1_mean)) + geom_polygon(aes(color=speaker_sex), fill=NA) + facet_grid(~speaker_sex, labeller='label_both') + scale_x_reverse() + scale_y_reverse() +
theme(legend.position = 'none')
We clearly see that vowel space is shifted to lower F2 and F1 on average for male speakers—which is expected, given physiology—but also that female speakers have a “broader” vowel space, on average.
Same thing, by gender and vowel stress:
hull_df_2 <- v_filt_summ %>% ungroup() %>% filter(vowel_class=='mono') %>% group_by(speaker, stressed) %>% slice(chull(F2_mean, F1_mean))
hull_df_2 %>% ggplot(aes(x=F2_mean, y=F1_mean)) + geom_polygon(aes(color=speaker_sex), fill=NA) + facet_grid(stressed~speaker_sex, labeller='label_both') + scale_x_reverse() + scale_y_reverse() +
theme(legend.position = 'none')
Same thing, by speaker:
hull_df_2 %>% ggplot(aes(x=F2_mean, y=F1_mean)) + geom_polygon(aes(color=stressed), fill=NA) + facet_wrap(~speaker) + scale_x_reverse() + scale_y_reverse()