cutlet
Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.
(Issues don't need to be written in English.)
Features:
- support for Modified Hepburn, Kunreisiki, Nihonsiki systems
- custom overrides for individual mappings
- custom overrides for specific words
- built in exceptions list (Tokyo, Osaka, etc.)
- uses foreign spelling when available in UniDic
- proper nouns are capitalized
- slug mode for url generation
Things not supported:
- traditional Hepburn n-to-m: Shimbashi
- macrons or circumflexes: Tōkyō, Tôkyô
- passport Hepburn: Satoh (but you can use an exception)
- hyphenating words
- Traditional Hepburn in general is not supported
Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.
Installation
Cutlet can be installed through pip as usual.
```
pip install cutlet
```
Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.
```
pip install unidic-lite
```
Usage
A command-line script is included for quick testing. Just run `cutlet`, and each line of stdin will be treated as a sentence. You can specify the system to use (`hepburn`, `kunrei`, `nippon`, or `nihon`) as the first argument.
```
$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.
```
In code:
```python
import cutlet

katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'
```
````python
class Cutlet:
    def __init__(
        self,
        system='hepburn',
        use_foreign_spelling=True,
        ensure_ascii=True,
        mecab_args="",
    ):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == 'nippon': system = 'nihon'
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = (self.system in ('hepburn',))
        self.use_wa = (self.system in ('hepburn', 'kunrei'))
        self.use_he = (self.system in ('nihon',))
        self.use_wo = (self.system in ('hepburn', 'nihon'))

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii
````
Create a Cutlet object, which holds configuration as well as tokenizer state.

`system` is `hepburn` by default, and may also be `kunrei` or `nihon`. `nippon` is permitted as a synonym for `nihon`.

If `use_foreign_spelling` is true, output will use the foreign spelling provided in a UniDic lemma when available. For example, "カツ" will become "cutlet" instead of "katsu".

If `ensure_ascii` is true, any non-ASCII characters that can't be romanized will be replaced with `?`. If false, they will be passed through.

Typical usage:

```python
katsu = Cutlet()
roma = katsu.romaji("カツカレーを食べた")
# "Cutlet curry wo tabeta"
```
```python
    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val
```
Add an exception to the internal list.
An exception overrides a whole token, for example to replace "Toukyou" with "Tokyo". Note that it must match the tokenizer output and be a single token to work. To replace longer phrases, you'll need to use a different strategy, like string replacement.
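Since exceptions are stored as a plain surface-to-string dict that is consulted before any kana mapping, only exact full-token matches fire. A minimal sketch (the `東京` entry is illustrative, not the built-in list):

```python
# Exceptions map the exact token surface form to replacement romaji
# (cf. romaji_word, which checks this dict before anything else).
exceptions = {'東京': 'Tokyo'}

def lookup(surface):
    # Only a full-token match works; substrings are not replaced.
    return exceptions.get(surface)

lookup('東京')    # 'Tokyo'
lookup('東京都')  # None: a longer surface form doesn't match
```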
````python
    def update_mapping(self, key, val):
        """Update mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け") # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け") # Ochaduke
        ```
        """
        self.table[key] = val
````
Update mapping table for a single kana.

This can be used to mix common systems, or to modify particular details. For example, you can use `update_mapping("ぢ", "di")` to differentiate ぢ and じ in Hepburn.

Example usage:

```python
cut = Cutlet()
cut.romaji("お茶漬け") # Ochazuke
cut.update_mapping("づ", "du")
cut.romaji("お茶漬け") # Ochaduke
```
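A design note: because the constructor copies its system table (`dict(SYSTEMS[system])`), `update_mapping` only affects one instance. A sketch with plain dicts (the `SYSTEMS` content here is a toy stand-in for the real tables):

```python
# Each instance gets its own copy of the shared system table, so
# per-instance overrides never leak into other instances.
SYSTEMS = {'kunrei': {'づ': 'zu'}}  # toy stand-in

a = dict(SYSTEMS['kunrei'])  # table of instance a
b = dict(SYSTEMS['kunrei'])  # table of instance b
a['づ'] = 'du'               # like a.update_mapping('づ', 'du')

a['づ']                  # 'du'
b['づ']                  # still 'zu'
SYSTEMS['kunrei']['づ']  # shared table also untouched
```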
```python
    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non alpha-numeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r'[^a-z0-9]+', '-', roma).strip('-')
        return slug
```
Generate a URL-friendly slug.

After converting the input to romaji using `Cutlet.romaji` and making the result lower-case, any runs of non-alphanumeric characters are replaced with a single hyphen. Any leading or trailing hyphens are stripped.
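The post-processing is just a regex over the lower-cased romaji; it can be reproduced standalone (hypothetical `slugify` helper, with the romaji step already done):

```python
import re

def slugify(roma):
    # Lower-case, collapse runs of non-alphanumeric characters into a
    # single hyphen, and strip leading/trailing hyphens, mirroring the
    # transform in Cutlet.slug.
    return re.sub(r'[^a-z0-9]+', '-', roma.lower()).strip('-')

slugify("Cutlet curry wa oishii")  # 'cutlet-curry-wa-oishii'
```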
```python
    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (word.surface == "'" and
                    (nw and nw.char_type == CHAR_ALPHA and not nw.white_space) and
                    not word.white_space):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == 'っ':
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == '固有名詞':
                roma = roma.title()
            if (title and
                    word.feature.pos1 not in ('助詞', '助動詞', '接尾辞') and
                    not (pw and pw.feature.pos1 == '接頭辞')):
                roma = roma.title()

            tok = Token(roma, False)
            # handle punctuation with atypical spacing
            if word.surface in '「『':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in '([':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == '/':
                out.append(tok)
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == '接頭辞': continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
            # special case for half-width commas
            if nw and nw.surface == ',': continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 in ('接続助詞'): continue
            # 333 -> 333 ; this should probably be handled in mecab
            if (word.surface.isdigit() and
                    nw and nw.surface.isdigit()):
                continue
            # そうでした -> sou deshita
            if (nw and word.feature.pos1 in ('動詞', '助動詞', '形容詞')
                    and nw.feature.pos1 == '助動詞'
                    and nw.surface != 'です'):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out
```
Build a list of tokens from input nodes.

If `capitalize` is true, then the first letter of the first token will be capitalized. This is typically the desired behavior if the input is a complete sentence.

If `title` is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.

If the text was not normalized before being tokenized, the output is undefined. For details of normalization, see `normalize_text`.

The number of output tokens will equal the number of input nodes.
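Each output token carries its romaji surface and a trailing-space flag, and `romaji` simply concatenates their string forms. A sketch with a hypothetical `Token` (the real class's string conversion may differ in detail):

```python
from dataclasses import dataclass

@dataclass
class Token:
    surface: str
    space: bool  # whether a space should follow this token

    def __str__(self):
        return self.surface + (' ' if self.space else '')

# Join as in Cutlet.romaji: ''.join(str(tok) ...).strip()
toks = [Token('Kanojo', True), Token('wa', True), Token('tegami', True),
        Token('wo', True), Token('yonda', False), Token('.', False)]
out = ''.join(str(t) for t in toks).strip()
# 'Kanojo wa tegami wo yonda.'
```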
```python
    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ''

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = ''.join([str(tok) for tok in tokens]).strip()
        return out
```
Build a complete string from input text.

If `capitalize` is true, then the first letter of the text will be capitalized. This is typically the desired behavior if the input is a complete sentence.

If `title` is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.
```python
    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow pass-through.
            if self.ensure_ascii:
                out = '?' * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == '補助記号':
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, '')
        elif (self.use_wa and
                word.feature.pos1 == '助詞' and word.feature.pron == 'ワ'):
            return 'wa'
        elif (not self.use_he and
                word.feature.pos1 == '助詞' and word.feature.pron == 'エ'):
            return 'e'
        elif (not self.use_wo and
                word.feature.pos1 == '助詞' and word.feature.pron == 'オ'):
            return 'o'
        elif (self.use_foreign_spelling and
                has_foreign_lemma(word)):
            # this is a foreign word with known spelling
            return word.feature.lemma.split('-')[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface
```
Return the romaji for a single word (node).
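The particle branches above reduce to the four flags set in `__init__` (`use_wa`, `use_he`, `use_wo`). A standalone sketch of just that dispatch (`particle_romaji` is a hypothetical helper, not part of cutlet):

```python
def particle_romaji(pron, use_wa, use_he, use_wo):
    # pron is the UniDic pronunciation of a particle (助詞).
    if use_wa and pron == 'ワ':
        return 'wa'       # は as topic marker
    if not use_he and pron == 'エ':
        return 'e'        # へ as direction marker
    if not use_wo and pron == 'オ':
        return 'o'        # を as object marker
    return None           # fall through to normal kana mapping

# Hepburn sets use_wa=True, use_he=False, use_wo=True:
particle_romaji('ワ', True, False, True)  # 'wa'
particle_romaji('エ', True, False, True)  # 'e'
particle_romaji('オ', True, False, True)  # None: を stays 'wo'
```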
```python
    def map_kana(self, kana):
        """Given a list of kana, convert them to romaji.

        The exact romaji resulting from a kana sequence depend on the preceding
        or following kana, so this handles that conversion.
        """
        out = ''
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out
```
Given a sequence of kana, convert them to romaji.

The exact romaji resulting from a kana sequence depends on the preceding or following kana, so this handles that conversion.
```python
    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in 'ゝヽ':
                if pk: return pk
                else: return ''  # invalid but be nice
            if kk in 'ゞヾ':  # repeat with voicing
                if not pk: return ''
                vv = add_dakuten(pk)
                if vv: return self.table[vv]
                else: return ''
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ''

        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ''

        if nk and nk in SUTEGANA:
            if kk == 'っ': return ''  # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ''

        if kk == 'ー':  # 長音符
            if pk and pk in self.table: return self.table[pk][-1]
            else: return '-'

        if kk == 'っ':
            if nk:
                if self.use_tch and nk == 'ち': return 't'
                elif nk in 'あいうえおっ': return '-'
                else: return self.table[nk][0]  # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return 'っ'

        if kk == 'ん':
            if nk and nk in 'あいうえおやゆよ': return "n'"
            else: return 'n'

        return self.table[kk]
```
Given a single kana and its neighbors, return the mapped romaji.
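The sokuon (っ) and ん rules above can be seen in isolation with a toy table. A simplified sketch, assuming Hepburn-style `use_tch`, and omitting digraphs, sutegana, odoriji, and long-vowel handling:

```python
# Toy excerpt of a kana -> romaji table; the real tables cover every
# kana and digraph, and vary by system (hepburn/kunrei/nihon).
TABLE = {'か': 'ka', 'た': 'ta', 'ち': 'chi', 'あ': 'a'}

def single_mapping(pk, kk, nk, use_tch=True):
    # Sokuon doubles the next consonant; Hepburn writes 't' before ち.
    if kk == 'っ':
        if not nk:
            return 'っ'  # left in place, resolved at the word level
        if use_tch and nk == 'ち':
            return 't'
        return TABLE[nk][0]
    # ん takes an apostrophe before vowels and y- kana to stay unambiguous.
    if kk == 'ん':
        return "n'" if nk and nk in 'あいうえおやゆよ' else 'n'
    return TABLE[kk]

def map_kana(kana):
    # Pass each kana with its neighbors, as Cutlet.map_kana does.
    return ''.join(
        single_mapping(kana[i - 1] if i > 0 else None,
                       ch,
                       kana[i + 1] if i < len(kana) - 1 else None)
        for i, ch in enumerate(kana))

map_kana('かった')    # 'katta'
map_kana('かんあん')  # "kan'an"
```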