cutlet

cutlet by Irasutoya

Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.

(You don't need to write issues in English.)

Features:

  • support for Modified Hepburn, Kunreisiki, Nihonsiki systems
  • custom overrides for individual mappings
  • custom overrides for specific words
  • built-in exceptions list (Tokyo, Osaka, etc.)
  • uses foreign spelling when available in UniDic
  • proper nouns are capitalized
  • slug mode for URL generation

Things not supported:

  • traditional Hepburn n-to-m: Shimbashi
  • macrons or circumflexes: Tōkyō, Tôkyô
  • passport Hepburn: Satoh (but you can use an exception)
  • hyphenating words
  • Traditional Hepburn in general is not supported

Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.

Installation

Cutlet can be installed through pip as usual.

pip install cutlet

Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.

pip install unidic-lite

Usage

A command-line script is included for quick testing. Just use cutlet and each line of stdin will be treated as a sentence. You can specify the system to use (hepburn, kunrei, nippon, or nihon) as the first argument.

$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.

In code:

import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'

Alternatives

  • kakasi: Historically important, but not updated since 2014.
  • pykakasi: self-contained; it does its own segmentation and uses its own dictionary.
  • kuroshiro: Javascript based.
  • kana: Go based.
"""
.. include:: ../README.md
"""

from .cutlet import *

__all__ = ("Cutlet",)
class Cutlet:
    def __init__(
            self,
            system = 'hepburn',
            use_foreign_spelling = True,
            ensure_ascii = True,
            mecab_args = "",
    ):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == 'nippon': system = 'nihon'
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = (self.system in ('hepburn',))
        self.use_wa  = (self.system in ('hepburn', 'kunrei'))
        self.use_he  = (self.system in ('nihon',))
        self.use_wo  = (self.system in ('hepburn', 'nihon'))

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii

    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val

    def update_mapping(self, key, val):
        """Update the mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け") # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け") # Ochaduke
        ```
        """
        self.table[key] = val

    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non-alphanumeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r'[^a-z0-9]+', '-', roma).strip('-')
        return slug

    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will
        be capitalized. This is typically the desired behavior if the input is
        a complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (word.surface == "'" and
                    (nw and nw.char_type == CHAR_ALPHA and not nw.white_space) and
                    not word.white_space):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == 'っ':
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == '固有名詞':
                roma = roma.title()
            if (title and
                word.feature.pos1 not in ('助詞', '助動詞', '接尾辞') and
                not (pw and pw.feature.pos1 == '接頭辞')):
                roma = roma.title()

            tok = Token(roma, False)
            # handle punctuation with atypical spacing
            if word.surface in '「『':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in '([':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == '/':
                out.append(tok)
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == '接頭辞': continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
            # special case for half-width commas
            if nw and nw.surface == ',': continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 in ('接続助詞',): continue
            # 333 -> 333 ; this should probably be handled in mecab
            if (word.surface.isdigit() and
                    nw and nw.surface.isdigit()):
                continue
            # そうでした -> sou deshita
            if (nw and word.feature.pos1 in ('動詞', '助動詞', '形容詞')
                    and nw.feature.pos1 == '助動詞'
                    and nw.surface != 'です'):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out

    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ''

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = ''.join([str(tok) for tok in tokens]).strip()
        return out

    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk.
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow pass-through.
            if self.ensure_ascii:
                out = '?' * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == '補助記号':
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, '')
        elif (self.use_wa and
                word.feature.pos1 == '助詞' and word.feature.pron == 'ワ'):
            return 'wa'
        elif (not self.use_he and
                word.feature.pos1 == '助詞' and word.feature.pron == 'エ'):
            return 'e'
        elif (not self.use_wo and
                word.feature.pos1 == '助詞' and word.feature.pron == 'オ'):
            return 'o'
        elif (self.use_foreign_spelling and
                has_foreign_lemma(word)):
            # this is a foreign word with known spelling
            return word.feature.lemma.split('-')[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface

    def map_kana(self, kana):
        """Given a sequence of kana, convert it to romaji.

        The exact romaji resulting from a kana sequence depends on the
        preceding or following kana, so this handles that conversion.
        """
        out = ''
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out

    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji (iteration marks)
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in 'ゝヽ':
                if pk: return pk
                else: return '' # invalid but be nice
            if kk in 'ゞヾ': # repeat with voicing
                if not pk: return ''
                vv = add_dakuten(pk)
                if vv: return self.table[vv]
                else: return ''
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ''

        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ''

        if nk and nk in SUTEGANA:
            if kk == 'っ': return '' # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ''

        if kk == 'ー': # 長音符 (long vowel mark)
            if pk and pk in self.table: return self.table[pk][-1]
            else: return '-'

        if kk == 'っ':
            if nk:
                if self.use_tch and nk == 'ち': return 't'
                elif nk in 'あいうえおっ': return '-'
                else: return self.table[nk][0] # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return 'っ'

        if kk == 'ん':
            if nk and nk in 'あいうえおやゆよ': return "n'"
            else: return 'n'

        return self.table[kk]