On this page you can configure a morphological parser for your language. A
morphological parser is a program that takes a word as input and returns a
morphological analysis as output.
Configuring a morphological parser on the OLD entails specifying a phonology
and a morphotactics as finite state transducers (FSTs) and composing them into a
single morphophonology FST. A morphophonology specified as an FST functions
symmetrically as both a parser and a generator: it can return the set of morpheme
sequences compatible with a phonetic representation as well as the set of
phonetic representations compatible with a sequence of morphemes. When parsing,
a probability calculator (based on a bigram language model of morpheme
sequences) chooses the best parse from the set of candidates.
As concerns language documentation and analysis, a morphological parser and
its subcomponents can be used to:
test phonological theories against a dynamic corpus, i.e., inform you in
real time whether your string of phonemically transcribed morphemes
phonologizes to your phonetic representation.
automate the task of morphological analysis, i.e., fill in the morpheme
break and morpheme gloss fields based on the transcription entered.
There are three primary components to a morphological parser as implemented
on an OLD application:
a phonology (encoded as a finite state transducer, i.e., FST)
a morphotactics (encoded as an FST)
a morphological language model (assigns probabilities to morphological
analyses)
The first two components (phonology and morphotactics) are composed into a
single FST that we might call the morphophonology. The morphophonology FST is
a program that takes a word as input and returns all morphological analyses
compatible with the morphotactics and the phonology.
The morphological language model takes the output of the morphophonology,
i.e., a list of possible morphological analyses, and returns the most probable
such analysis.
Only the phonology must be specified by the users of the system. The
morphotactics and morphological language model (i.e., the word probability
calculator) can be induced from the data present in the database, i.e., the
Forms that contain morphologically analyzed words. When no phonology is
specified, the parser assumes that there are no phonological alternations.
Also, when the database lacks words morphologically analyzed by the users, there
will be neither morphotactics nor a language model.
Parser requirements
The morphological parser requires that foma (and its command line utility
counterpart, flookup) be installed on the server. It also requires that the
files 'phonology.foma', 'morphotactics.foma', 'morphophonology.foma',
'morphophonology.foma.bin' and 'probabilityCalculator.pickle' be present in the
'parser' directory. This section indicates whether these requirements are met.
(See below for how to install foma and generate the requisite files.)
% if c.fomaIsInstalled:
foma is installed.
% else:
foma is not installed.
% endif
% if c.flookupIsInstalled:
flookup is installed.
% else:
flookup is not installed.
% endif
% if c.phonologyExists:
phonology file is present.
% else:
phonology file is not present.
% endif
% if c.morphotacticsExists:
morphotactics file is present.
% else:
morphotactics file is not present.
% endif
% if c.morphophonologyBinaryExists:
morphophonology binary file is present.
% else:
morphophonology binary file is not present.
% endif
% if c.probabilityCalculatorExists:
probability calculator file is present.
% else:
probability calculator file is not present.
% endif
Installing foma
The program foma must be installed on the server in order for the
morphological parser to function.
Your system administrator must install foma. This involves installing
libreadline, zlib1g-dev, foma and flookup. On a debian-based system, try the
following
apt-get install libreadline-dev
apt-get install zlib1g-dev
wget http://dingo.sbs.arizona.edu/~mhulden/foma-0.9.15alpha.tar.gz
tar -xzvf foma-0.9.15alpha.tar.gz
cd foma
touch *.c *.h
make
make install
The above installed foma but not the flookup utility. In order to install
flookup, I downloaded the flookup binary (available
here) and copied
it to /usr/local/bin.
Phonology
The "phonology" of your language on an OLD application is the finite state
transducer (FST) that represents the relation between the underlying phonemic
shape of your lexical items and their surface realization in context.
The system assumes lexical items to be those Forms that lack space characters
and are not tagged with a morphosyntactic category in the set {'S', 's',
'sent', 'sentence', 'Sentence'}. It further assumes that these lexical items
are transcribed (in the transcription field) with an underlying phonemic
transcription. In contrast, sentences, phrases, words and any other
poly-morphemic Forms are assumed to have surface transcriptions.
A finite state transducer (FST) is a computational formalism that can be
thought of as representing the relation between two languages, where "language"
is understood in the computationo-logical sense as a set of strings. FSTs can
be used to, among other things, represent the phonology of a language, i.e., the
relationship between the underlying phonemic shape of a word and its surface
realization.
There are several computer programs that can be used to create FSTs and do
conversions on strings. Probably the most widely known of these is XFST, the
Xerox Finite State Tool. The OLD uses the open source program foma. Other
similar tools are SFST, HFST & PC-Kimmo. Programs like XFST and foma define
languages that facilitate the specification of FSTs. The FST specification
language implemented in foma allows one to generate FSTs via scripts containing
SPE-style context-sensitive rewrite rules. It is in writing such a script that
one specifies a phonology for a language being analyzed and documented on an OLD
application.
Specifying a phonology
Use the text box below to encode the phonology of your language as an FST
using foma's FST creation language. Requirements of the phonology:
it must be a valid foma script
it must define an FST named 'phonology'
it should not contain the 'regex' command
See the 'foma scripting basics' and 'example phonology script' subsections
below.
Clicking "Save & Compile Phonology" saves the phonology script to a file and
generates a binary foma FST file so that the phonology can be tested.
Foma scripting basics
A -> B || C _ D means "rewrite A as B only when it occurs
between C and D"
A (->) B || C _ D means "optionally rewrite A as B
only when it occurs between C and D"
[..] denotes the empty symbol in a rewrite rule, e.g.,
[..] -> s || i _ t means "insert an 's' between an 'i' and a
't'
define name expression assigns the FSM/FST generated by
expression to name, e.g., define
iLoss i -> 0 || y _ "-" [a | o];
.o. is used to denote the composition operation that forms
a single FST from two or more, e.g., define phonology assimilation .o.
devoicing
reserved characters (e.g., "-") need to be enclosed in quotes
statements are terminated by ;
Use # to comment out lines
Use "#" to reference word delimiter, i.e., the left or
right side of a word. Before using the parser to analyze a word, an OLD
application will first enclose it in "#" symbols.
Example phonology script
Here is an example foma script which implements a phonology of the Blackfoot
language. The rules are taken (with some modification and interpretation) from
Frantz (1997).
################################################################################
# The phonological rules of Frantz (1997) as FSTs
################################################################################
# How to understand this file
################################################################################
# - "A -> B || C _ D" means "rewrite A as B only when it occurs between C and D"
# - "(->)" is optional rewrite
# - "define name expression" assigns the FSM/FST generated by "expression" to
# "name"
# - Special characters (e.g., "-") need to be enclosed in quotes
# - ".o." denotes the composition operation
# - "[..]" denotes the empty symbol in a rewrite rule (using "0" in an insertion
# rule will result in 1 or more (!) insertions
# Comments
################################################################################
# Some interpretation of the ordered rewrite rules of Frantz (1997) was
# required:
# - what to do with the morpheme segmentation symbol "-" in the rules
# - Frantz (1997) provides a partial ordering: some decisions had to be made
#test nit-waanIt-k-wa nitaanikka
#test nit-waanIt-aa-wa nitaanistaawa
#test nit-siksipawa nitssiksipawa
#test nit-ssikópii nitsssikópii
#test á-sínaaki-wa áísínaakiwa
#test nikáá-ssikópii nikáíssikópii
#test káta'-simi-wa kátai'simiwa
#test áak-oto-apinnii-wa áakotaapinniiwa áakotapinniiwa
#test w-ínni ónni
#test w-iihsíssi ohsíssi
#test áak-Ipiima áaksipiima
#test kitsí'powata-oaawa kitsí'powatawaawa
#test á-Io'kaa-wa áyo'kaawa
#test yaatóó-t aatóót
#test waaníí-t aaníít
#test w-óko'si óko'si
#test á-yo'kaa-o'pa áyo'kao'pa
#test imitáá-iksi imitáíksi
#test á-yo'kaa-yi-aawa áyo'kaayaawa
#test á-ihpiyi-o'pa áíhpiyo'pa
#test á-okstaki-yi-aawa áókstakiiyaawa áókstakiyaawa
#test á-okska'si-o'pa áókska'so'pa
#test nit-Ioyi nitsoyi
#test otokska'si-hsi otokska'ssi
#test otá'po'taki-hsi otá'po'takssi
#test pii-hsini pissini
#test áak-yaatoowa áakaatoowa
#test nit-waanii nitaanii
#test kikáta'-waaniihpa kikáta'waaniihpa
#test áíhpiyi-yináyi áíhpiiyináyi áíhpiyiyináyi
#test áókska'si-hpinnaan áókska'sspinnaan
#test nit-it-itsiniki nitsitsitsiniki
#test á'-omai'taki-wa áó'mai'takiwa
#test káta'-ookaawaatsi kátaookaawaatsi
#test káta'-ottakiwaatsi kátaoottakiwaatsi
#test á'-isttohkohpiy'ssi áíisttohkohpiy'ssi
#test á'-o'tooyiniki áó'tooyiniki
#test káta'-ohto'toowa kátao'ohto'toowa kátaohto'toowa
#test nit-ssksinoawa nitssksinoawa
#test á-okska'siwa áókska'siwa
#test atsikí-istsi atsikíístsi
#test kakkóó-iksi kakkóíksi
#test nit-ihpiyi nitsspiyi
define phonemes [p|t|k|m|n|s|w|y|h|"'"|a|i|o|á|í|ó];
define vowels [a|i|o|á|í|ó];
define accentedVowels [á|í|ó];
define consonants [p|t|k|m|n|s|w|y];
define obstruents [p|t|k|m|n|s];
define stops [p|t|k|m|n];
define plosives [p|t|k];
define glides [w|y];
# 1. C1-C2 -> C2C2
# Gemination
define pGem plosives "-" -> p || _ p;
define tGem plosives "-" -> t || _ t;
define kGem plosives "-" -> k || _ k;
define gemination pGem .o. tGem .o. kGem;
# 2. It -> Ist
# s-Insertion (assumes that "breaking I" is a phoneme)
define sInsertion [..] -> s || I _ t;
# 3.a. C-s -> Css
# s-Connection A
define sConnectionA "-" -> s || stops _ s;
# 3.b. V(')-s -> V(')-is
# s-Connection B
# condition: where 's' is not part of a suffix
# present implementation: rule is optional
define sConnectionB [..] (->) i || vowels ("'") "-" _ s;
# 4. o-a -> aa
# o-Replacement
# note: for some speakers the o is deleted
# condition: where 'a' is not part of a suffix
# present implementation: rule is optional
define oReplacementA o (->) [a | 0] || _ "-" a;
define oReplacementB ó (->) [á | 0] || _ "-" a;
define oReplacementC [o | ó] (->) [á | 0] || _ "-" á;
define oReplacement oReplacementA .o. oReplacementB .o. oReplacementC;
# 5. w-i(i) -> o
# Coalescence
define coalescenceA w "-" i (i) -> o || _ [p|t|k|m|n|s|w|y|h|"'"];
define coalescenceB w "-" í (í) -> ó || _ [p|t|k|m|n|s|w|y|h|"'"];
define coalescence coalescenceA .o. coalescenceB;
# 6. k-I -> ksi
# Breaking
define breaking "-" -> s || k _ I;
# 7. I -> i
# Neutralization
define neutralization I -> i;
# 8.a. V-iV -> VyV
# Desyllabification A
define desyllabificationA "-" i -> y || vowels _ vowels;
# 8.b. V-oV -> VwV
# Desyllabification B
define desyllabificationB "-" o -> w || vowels _ vowels;
# 9. #G -> 0
# Semivowel Drop
define semivowelDrop glides -> 0 || "#" _;
# 10. V1V1-V -> V1V
# Vowel Shortening
define vowelShorteningA [a | á] -> 0 || [a | á] _ "-" vowels;
define vowelShorteningI [i | í] -> 0 || [i | í] _ "-" vowels;
define vowelShorteningO [o | ó] -> 0 || [o | ó] _ "-" vowels;
define vowelShortening vowelShorteningA .o. vowelShorteningI .o. vowelShorteningO;
# 11. Vyi-{a,o} -> Vy{a,o}
# i-Loss
define iLossA [i|í] -> 0 || [a|á|o|ó] y _ [a|á|o|ó];
define iLossB i y [i|í] -> i (i) y || _ [a|á|o|ó];
define iLossC í y [i|í] -> í (í) y || _ [a|á|o|ó];
define iLoss iLossA .o. iLossB .o. iLossC;
# 12. si{a,o} -> s{a,o}
# i-Absorption
define iAbsorption [i|í] ("-") -> 0 || s _ [a|á|o|ó];
# 13. sihs -> ss
# ih-Loss
define ihLoss [i|í] "-" h -> 0 || s _ s;
# 14. ihs -> ss
# Presibilation
define presibilation [i|í] "-" h -> s || _ s;
# 15. CG -> C , where C ne "'"
# Semivowel Loss
define semivowelLoss "-" glides -> 0 || obstruents _;
# 16. Ciyiy -> Ciiy
# y-Reduction (optional)
define yReduction y (->) 0 || [obstruents | "'"] [i|í] _ [i|í] ("-") y;
# 17. sih -> ss
# Postsibilation
define postsibilation [i|í] ("-") h -> s || s _;
# 18. ti -> tsi
# t-Affrication
define tAffrication "-" -> s || t _ [i|í];
# 19. V'VC -> VV'C
# Glottal Metathesis
define glottalMetathesisA "'" "-" a -> "-" a "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccA "'" "-" á -> "-" á "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisALong "'" "-" a a -> "-" a a "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccALong "'" "-" á á -> "-" á á "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisI "'" "-" i -> "-" i "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccI "'" "-" í -> "-" í "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisILong "'" "-" i i -> "-" i i "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccILong "'" "-" í í -> "-" í í "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisO "'" "-" o -> "-" o "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccO "'" "-" ó -> "-" ó "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisOLong "'" "-" o o -> "-" o o "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccOLong "'" "-" ó ó -> "-" ó ó "'" || vowels _ [consonants|"'"|h];
define glottalMetathesis glottalMetathesisA .o. glottalMetathesisAccA .o.
glottalMetathesisALong .o. glottalMetathesisAccALong .o.
glottalMetathesisI .o. glottalMetathesisAccI .o. glottalMetathesisILong .o.
glottalMetathesisAccILong .o. glottalMetathesisO .o.
glottalMetathesisAccO .o. glottalMetathesisOLong .o.
glottalMetathesisAccOLong;
# 20. VV1V1'C -> VV1V1C
# Glottal Loss
define glottalLossA a a "'" -> a a || vowels ("-") _ consonants;
define glottalLossAccA á á "'" -> á á || vowels ("-") _ consonants;
define glottalLossI i i "'" -> i i || vowels ("-") _ consonants;
define glottalLossAccI í í "'" -> í í || vowels ("-") _ consonants;
define glottalLossO o o "'" -> o o || vowels ("-") _ consonants;
define glottalLossAccO ó ó "'" -> ó ó || vowels ("-") _ consonants;
define glottalLoss glottalLossA .o. glottalLossAccA .o. glottalLossI .o.
glottalLossAccI .o. glottalLossO .o. glottalLossAccO;
# 21. V'(s)CC -> VV(s)CC , where C ne 's'
# Glottal Assimilation
define glottalAssimilationA a "'" -> a a || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationAAcc á "'" -> á á || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationI i "'" -> i i || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationIAcc í "'" -> í í || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationO o "'" -> o o || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationOAcc ó "'" -> ó ó || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilation glottalAssimilationA .o. glottalAssimilationAAcc .o.
glottalAssimilationI .o. glottalAssimilationIAcc .o. glottalAssimilationO .o.
glottalAssimilationOAcc;
# 22. '' -> '
# Glottal Reduction
define glottalReduction "'" "'" -> "'";
# 23. V1'h -> V1'V1h
# Vowel Epenthesis
# note: In place of this rule, some speakers have the following rule:
# ' -> 0 / _ h
define vowelEpenthesisA a "'" -> [a "'" a | a] || _ h;
define vowelEpenthesisAAcc á "'" -> [á "'" á | á] || _ h;
define vowelEpenthesisI i "'" -> [i "'" i | i] || _ h;
define vowelEpenthesisIAcc í "'" -> [í "'" í | í] || _ h;
define vowelEpenthesisO o "'" -> [o "'" o | o] || _ h;
define vowelEpenthesisOAcc ó "'" -> [ó "'" ó | ó] || _ h;
define vowelEpenthesis vowelEpenthesisA .o. vowelEpenthesisAAcc .o.
vowelEpenthesisI .o. vowelEpenthesisIAcc .o.
vowelEpenthesisO .o. vowelEpenthesisOAcc;
# 24. sssC -> ssC
# sss-Shortening
define sssShortening s -> 0 || _ s s [stops | glides];
# 25.
# Accent Spread
define accentSpreadA a -> á || accentedVowels "-" _;
define accentSpreadI i -> í || accentedVowels "-" _;
define accentSpreadO o -> ó || accentedVowels "-" _;
define accentSpread accentSpreadO .o. accentSpreadA .o. accentSpreadI;
# 26. - -> 0
# Break-Delete
define breakDelete "-" -> 0;
define phonology semivowelLoss .o.
gemination .o.
coalescence .o.
sInsertion .o.
sConnectionB .o.
yReduction .o.
breaking .o.
oReplacement .o.
ihLoss .o.
sConnectionA .o.
presibilation .o.
sssShortening .o.
semivowelDrop .o.
vowelShortening .o.
neutralization .o.
tAffrication .o.
postsibilation .o.
iAbsorption .o.
desyllabificationB .o.
desyllabificationA .o.
glottalMetathesis .o.
vowelEpenthesis .o.
glottalReduction .o.
glottalLoss .o.
glottalAssimilation .o.
accentSpread .o.
breakDelete .o.
iLoss;
Phonology tester script
Below is provided a simple Python script that can be used to test a phonology
against a series of word/analysis pairs. To get the script, click the "toggle
script" link below, copy and paste its contents into a text editor and save
the resulting file as 'phonology_tester.py'. The script contains its own usage
instructions.
import subprocess
import os
import codecs
import re
import pickle
import sys
"""Script takes a foma script and tests its phonology regex against its tests.
The tests are comments prefixed by "#test". The tests are of the format
#test input output1 (output2)
For example:
#test imitaa-iksi imitaiksi
Assuming the above example, this script would report whether "imitaa-iksi"
phonologizes to "imitaiksi".
How to use this script
================================================================================
1. Make sure you have foma and Python installed.
2. Place this script in the same directory as your foma script.
3. Make sure you have some test comments in your foma script (see above).
4. Run "python phonology_tester.py" at the command line.
"""
phonologyFileName = 'phonology.foma'
phonologyTestingFileName = 'phonology_testing.foma'
phonologyTestingShellScriptName = 'phonology_testing.sh'
phonologyTestingBinaryFileName = 'phonology_testing.foma.bin'
# Foma reserved symbols. See
# http://code.google.com/p/foma/wiki/RegularExpressionReference#Reserved_symbols
fomaReserved = [u'\u0021', u'\u0022', u'\u0023', u'\u0024', u'\u0025', u'\u0026',
u'\u0028', u'\u0029', u'\u002A', u'\u002B', u'\u002C', u'\u002D', u'\u002E',
u'\u002F', u'\u0030', u'\u003A', u'\u003B', u'\u003C', u'\u003E', u'\u003F',
u'\u005B', u'\u005C', u'\u005D', u'\u005E', u'\u005F', u'\u0060', u'\u007B',
u'\u007C', u'\u007D', u'\u007E', u'\u00AC', u'\u00B9', u'\u00D7', u'\u03A3',
u'\u03B5', u'\u207B', u'\u2081', u'\u2082', u'\u2192', u'\u2194', u'\u2200',
u'\u2203', u'\u2205', u'\u2208', u'\u2218', u'\u2225', u'\u2227', u'\u2228',
u'\u2229', u'\u222A', u'\u2264', u'\u2265', u'\u227A', u'\u227B']
def escapeFomaReserved(i):
def escape(ii):
if ii in fomaReserved:
ii = u'"%s"' % ii
return ii
return ''.join([escape(x) for x in i])
def getTests(phonologyList):
tests = []
for line in phonologyList:
if line[:5] == "#test":
tests.append((line.split()[1], line.split()[2:]))
#for test in tests:
# expecteds = '"%s"' % '", "'.join(test[1])
# print 'Test "%s" against %s' % (test[0], expecteds)
return tests
def getPhonologyTestingFile(phonologyList, tests):
regexes = [u' '.join([escapeFomaReserved(x) for x in t[0]]) for t in tests]
morphotactics = u'define morphotactics "#" [%s] "#";' % u' | \n '.join(
regexes)
return u'%s\n\n\n%s\n\n\n%s\n\n\n%s' % (
''.join(phonologyList),
morphotactics,
u'define morphophonology morphotactics .o. phonology;',
u'regex morphophonology;')
def phonologize(underlyingWord):
binaryFilePath = os.path.join(
os.getcwd(), phonologyTestingBinaryFileName)
process = subprocess.Popen(
['flookup', '-x', '-i', binaryFilePath],
shell=False,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE)
process.stdin.write(underlyingWord.encode('utf-8'))
result = unicode(process.communicate()[0], 'utf-8').split('\n')
return [x.replace('#', '') for x in result if x]
def generateFST():
phonologyFile = codecs.open(phonologyFileName, 'r', 'utf-8')
phonologyTestingFile = codecs.open(phonologyTestingFileName, 'w', 'utf-8')
phonologyList = phonologyFile.readlines()
# Get tests to perform on the phonology
tests = getTests(phonologyList)
# Write the phonology foma script to be used in the testing
testingFile = getPhonologyTestingFile(phonologyList, tests)
phonologyTestingFile.write(testingFile)
# Create the shell script for generating the binary FST
shellScript = open(phonologyTestingShellScriptName, 'w')
cmd = 'foma -e "source %s" -e "save stack %s" -e "quit"' % (
phonologyTestingFileName, phonologyTestingBinaryFileName)
shellScript.write(cmd)
shellScript.close()
os.chmod(phonologyTestingShellScriptName, 0755)
# Generate the binary file FST for the phonology
scriptFullPath = os.path.join(os.getcwd(), phonologyTestingShellScriptName)
process = subprocess.Popen([scriptFullPath], shell=True,
stdout=subprocess.PIPE)
output = unicode(process.communicate()[0], 'utf-8')
print output
def performTests():
if phonologyTestingBinaryFileName not in os.listdir(os.getcwd()):
print 'HAVE TO GENERATE'
print os.listdir(os.getcwd())
generateFST()
phonologyFile = codecs.open(phonologyFileName, 'r', 'utf-8')
phonologyList = phonologyFile.readlines()
tests = getTests(phonologyList)
report = u''
failures = u''
for test in tests:
passed = u'GOOD'
details = u''
result = phonologize(u'#' + test[0] + u'#')
expecteds = test[1]
for e in expecteds:
if e in result:
details += u'\t%s is a surface realization\n' % e
else:
details += u'\t%s IS NOT A SURFACE REALIZATION\n' % e
passed = u'BAD'
details += u'\tsurface realizations: %s' % ', '.join(result)
tmp = u'%s: %s\n%s\n' % (passed, test[0], details)
report += tmp
if passed == u'BAD':
failures += tmp
print report
if failures:
print 'TESTS FAILED:\n%s' % failures
if __name__ == '__main__':
try:
option = sys.argv[1]
except IndexError:
option = None
if option and option == '-g':
generateFST()
else:
performTests()
Test the phonology
Here you can test your phonology FST in two ways:
apply phonology to a token
apply phonology to subset of the Forms in the database
Apply the phonology to a token
Enter a string of morphemes (i.e., a morphological analysis of a word) in the
text box below and click 'Phonologize Token'. This will use your phonology FST
script to map an underlying morphophonemic representation to a surface phonetic
one and is a good way of seeing whether your phonology is doing what you want it
to. (This is the output of running apply down morpheme-string from
within foma or running echo "morpheme-string" | flookup -x -i
phonology.foma.bin at the command line.)
Apply the phonology to a subset
of the database
To specify the set of Forms to which the phonology will be applied, enter
search criteria by clicking 'Toggle interface for filtering Forms'. Then click
the 'Phonologize DB Subset' button and the system will apply the phonology to
the morpheme break line of each word of each Form in your search results.
The morphotactics is the set of valid words, that is, strings of morphemes,
of your language.
The morphotactics is deduced from the analyzed words and lexical items that
the users of this OLD application enter. Specifically, the system searches the
syntactic category string field of all Forms for valid syntactic category
words. For example, a Form might have the syntactic category string
"D N-? V-Agr". The syntactic category words in this example would be "D",
"N-?" and "V-Agr" and the "N-?" one would be invalid (OLD applications use "?"
as the category for unrecognized morphemes). From such a Form the system would
conclude that a morpheme of category D can form a valid word as can a morpheme
of category V followed by one of category Agr.
Using this morphotactic and lexical information, an OLD application can
automatically generate a finite state transducer that represents the set of
valid words of your language.
Generate the morphotactics
Click the Generate Morphotactics button to have the system generate a
morphotactic FST for your language based on the data in the database.
Morphophonology
The morphophonology is an FST created by composing the morphotactics FST with
the phonology FST. This morphophonology FST is then written as a binary file
that the flookup command line program can use to parse words.
Generate the morphophonology
Click the Generate Morphophonology button to have the system generate the FST
encoding the morphophonology of your language.
Warning: depending on the size of your lexicon and the complexity of your
phonology, this can take quite a long time (up to 5 minutes) and can monopolize
your server's resources. It is suggested that you (a) be patient and (b) run
this command during a low usage period.
Probability calculator
The morphological language model assigns a probability to each word, i.e., to
each sequence of morphemes. These probabilities are used to create a
probability calculator that can rank morphological analyses.
Generate the probability calculator
Click the Generate Probability Calculator button to have the system generate
the probability calculator based on the morphological language model.
Test the parser
Here you can test the parser that you have configured in a number of ways.
First, you can test it on words that you enter. Second, you can have the system
divide the analyzed words in the database into training and test sets and
generate an F-score for the parser. Third, you can have the system run the
parser on the unanalyzed words present in the database.
Parse a word
Enter a word in the text box and click 'Parse' to parse it.
Evaluate the parser
Click the Evaluate Parser button to have the system divide the analyzed
words in the database into training and test sets and generate an F-score for
the parser as a measure of its accuracy.
(Note: this function creates a new probability calculator (based on a bigram
language model) that draws on a randomly selected 90% of the analyzed words,
i.e., the training set, and tests accuracy against the other 10%, i.e., the test
set.)
Issues
Here I list some problems with the morphological parser implementation, some
being general and some specific to my test language, Blackfoot.
General: language model seems to be faulty in that it appears that
analyses involving fewer morphemes are disproportionately estimated at a
higher probability. Consider the two most probable parses generated for
"nitssiksipaawa": "nit-siksip-aawa" and "nit-siksip-aa-wa". The first is
shorter and is calculated to be most probable by the system while the
second, though it is the correct analysis, is longer and is therefore
calculated to be less probable. This might not actually be a problem...
What to do about default accenting/pitch prominence? Should this be
encoded as a morphophonological rule? How could this be done?
Currently, orthographic variation is not accurately anticipating
variations in pitch accent. Consider that "otsk00skiwa" was not generating
the appropriate variant "0tsskosskiwa".
What to do about Frantz's underlying "I". He does not seem to use it
in the dictionary, yet the analyses of his grammar seem to require it.
Consider that "nitánikka" will not parse to "nit-wanIt-k-wa" without it.
Also, the above analysis requires positing "k" "INV" or else some kind of
rule that removes the "o" of "ok" "INV"...
"nitssiksipaawa" parses to "nit-siksip-aawa" and the correct analysis,
"nit-siksip-aa-wa" is the second most probable. The sparseness of the
language model might be to blame for this inaccurate ranking. However,
Frantz's (1997: 152) "nítssiksipawa" will not parse for two reasons: (1) the
pitch accent on the first syllable and (2) the short "a" in the penultimate
syllable (the direct theme suffix is stored as "aa" in the database.)
"nitsssikopii" correctly parses to "nit-ssikopii". "nítsssikópii" does
not parse.