cutlet
Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo!
Issues don't have to be written in English.
Features:
- support for Modified Hepburn, Kunreisiki, Nihonsiki systems
- custom overrides for individual mappings
- custom overrides for specific words
- built-in exceptions list (Tokyo, Osaka, etc.)
- uses foreign spelling when available in UniDic
- proper nouns are capitalized
- slug mode for URL generation
Things not supported:
- traditional Hepburn n-to-m: Shimbashi
- macrons or circumflexes: Tōkyō, Tôkyô
- passport Hepburn: Satoh (but you can add an exception; see the sketch after this list)
- hyphenating words
- Traditional Hepburn in general is not supported
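For example, if you want a passport-style spelling for a particular name, you can register it as an exception. This is only a sketch; the key must match a single token's surface form as produced by the tokenizer.

```python
import cutlet

katsu = cutlet.Cutlet()
# Hypothetical override: spell this surname passport-style.
katsu.add_exception("佐藤", "Satoh")
katsu.romaji("佐藤です")
# => 'Satoh desu.' (assuming 佐藤 is segmented as a single token)
```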
Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.
Installation
Cutlet can be installed through pip as usual.
pip install cutlet
Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.
pip install unidic-lite
Usage
A command-line script is included for quick testing. Just use `cutlet` and each line of stdin will be treated as a sentence. You can specify the system to use (`hepburn`, `kunrei`, `nippon`, or `nihon`) as the first argument.
```
$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.
```
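Passing a system name as the first argument switches the mapping; for the same input, Kunrei-shiki output would look roughly like this (a sketch, assuming the same tokenization):

```
$ cutlet kunrei
ローマ字変換プログラム作ってみた。
Roma zi henkan program tukutte mita.
```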
In code:
```python
import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')
sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'
```
Alternatives
class Cutlet
````python
def __init__(
    self,
    system = 'hepburn',
    use_foreign_spelling = True,
    ensure_ascii = True,
):
    """Create a Cutlet object, which holds configuration as well as
    tokenizer state.

    `system` is `hepburn` by default, and may also be `kunrei` or
    `nihon`. `nippon` is permitted as a synonym for `nihon`.

    If `use_foreign_spelling` is true, output will use the foreign spelling
    provided in a UniDic lemma when available. For example, "カツ" will
    become "cutlet" instead of "katsu".

    If `ensure_ascii` is true, any non-ASCII characters that can't be
    romanized will be replaced with `?`. If false, they will be passed
    through.

    Typical usage:

    ```python
    katsu = Cutlet()
    roma = katsu.romaji("カツカレーを食べた")
    # "Cutlet curry wo tabeta"
    ```
    """
    # allow 'nippon' for 'nihon'
    if system == 'nippon': system = 'nihon'
    self.system = system
    try:
        # make a copy so we can modify it
        self.table = dict(SYSTEMS[system])
        # TODO fix this
    except KeyError:
        print("unknown system: {}".format(system))
        raise

    self.tagger = fugashi.Tagger()
    self.exceptions = load_exceptions()

    # these are too minor to be worth exposing as arguments
    self.use_tch = (self.system in ('hepburn',))
    self.use_wa = (self.system in ('hepburn', 'kunrei'))
    self.use_he = (self.system in ('nihon',))
    self.use_wo = (self.system in ('hepburn', 'nihon'))

    self.use_foreign_spelling = use_foreign_spelling
    self.ensure_ascii = ensure_ascii
````
Create a Cutlet object, which holds configuration as well as tokenizer state.

`system` is `hepburn` by default, and may also be `kunrei` or `nihon`. `nippon` is permitted as a synonym for `nihon`.

If `use_foreign_spelling` is true, output will use the foreign spelling provided in a UniDic lemma when available. For example, "カツ" will become "cutlet" instead of "katsu".

If `ensure_ascii` is true, any non-ASCII characters that can't be romanized will be replaced with `?`. If false, they will be passed through.

Typical usage:

```python
katsu = Cutlet()
roma = katsu.romaji("カツカレーを食べた")
# "Cutlet curry wo tabeta"
```
```python
def add_exception(self, key, val):
    """Add an exception to the internal list.

    An exception overrides a whole token, for example to replace "Toukyou"
    with "Tokyo". Note that it must match the tokenizer output and be a
    single token to work. To replace longer phrases, you'll need to use a
    different strategy, like string replacement.
    """
    self.exceptions[key] = val
```
Add an exception to the internal list.
An exception overrides a whole token, for example to replace "Toukyou" with "Tokyo". Note that it must match the tokenizer output and be a single token to work. To replace longer phrases, you'll need to use a different strategy, like string replacement.
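A sketch of adding your own override; the key has to be a single token's surface form, and the value is used verbatim:

```python
katsu = Cutlet()
# Hypothetical entry: spell this station name without the long vowel.
katsu.add_exception("新大阪", "Shin-Osaka")
katsu.romaji("新大阪で降りた")
# => 'Shin-Osaka de orita' (assuming 新大阪 comes out of the tagger as one token)
```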
````python
def update_mapping(self, key, val):
    """Update mapping table for a single kana.

    This can be used to mix common systems, or to modify particular
    details. For example, you can use `update_mapping("ぢ", "di")` to
    differentiate ぢ and じ in Hepburn.

    Example usage:

    ```
    cut = Cutlet()
    cut.romaji("お茶漬け") # Ochazuke
    cut.update_mapping("づ", "du")
    cut.romaji("お茶漬け") # Ochaduke
    ```
    """
    self.table[key] = val
````
Update mapping table for a single kana.

This can be used to mix common systems, or to modify particular details. For example, you can use `update_mapping("ぢ", "di")` to differentiate ぢ and じ in Hepburn.

Example usage:

```python
cut = Cutlet()
cut.romaji("お茶漬け")  # Ochazuke
cut.update_mapping("づ", "du")
cut.romaji("お茶漬け")  # Ochaduke
```
```python
def slug(self, text):
    """Generate a URL-friendly slug.

    After converting the input to romaji using `Cutlet.romaji` and making
    the result lower-case, any runs of non alpha-numeric characters are
    replaced with a single hyphen. Any leading or trailing hyphens are
    stripped.
    """
    roma = self.romaji(text).lower()
    slug = re.sub(r'[^a-z0-9]+', '-', roma).strip('-')
    return slug
```
Generate a URL-friendly slug.

After converting the input to romaji using `Cutlet.romaji` and making the result lower-case, any runs of non alpha-numeric characters are replaced with a single hyphen. Any leading or trailing hyphens are stripped.
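For example, as in the README:

```python
katsu = Cutlet()
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'
```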
```python
def romaji(self, text, capitalize=True, title=False):
    """Build a complete string from input text.

    If `capitalize` is true, then the first letter of the text will be
    capitalized. This is typically the desired behavior if the input is a
    complete sentence.

    If `title` is true, then words will be capitalized as in a book title.
    This means most words will be capitalized, but some parts of speech
    (particles, endings) will not.
    """
    if not text:
        return ''

    # perform unicode normalization
    text = unicodedata.normalize('NFKC', text)
    # convert all full-width alphanum to half-width, since it can go out as-is
    text = mojimoji.zen_to_han(text, kana=False)
    # replace half-width katakana with full-width
    text = mojimoji.han_to_zen(text, digit=False, ascii=False)

    words = self.tagger(text)

    # TODO make a list and join to avoid weirdness with string building
    out = ''

    for wi, word in enumerate(words):
        pw = words[wi - 1] if wi > 0 else None
        nw = words[wi + 1] if wi < len(words) - 1 else None

        # handle possessive apostrophe as a special case
        if (word.surface == "'" and
                (nw and nw.char_type == 5 and not nw.white_space) and
                not word.white_space):
            # remove preceding space
            out = out[:-1]
            out += word.surface
            continue

        # resolve split verbs / adjectives
        roma = self.romaji_word(word)
        if roma and out and out[-1] == 'っ':
            out = out[:-1] + roma[0]
        if word.feature.pos2 == '固有名詞':
            roma = roma.title()
        if (title and
                word.feature.pos1 not in ('助詞', '助動詞', '接尾辞') and
                not (pw and pw.feature.pos1 == '接頭辞')):
            roma = roma.title()
        # handle punctuation with atypical spacing
        if word.surface in '「『':
            out += ' ' + roma
            continue
        if roma in '([':
            out += ' ' + roma
            continue
        if roma == '/':
            out += '/'
            continue
        out += roma

        # no space sometimes
        # お酒 -> osake
        if word.feature.pos1 == '接頭辞': continue
        # 今日、 -> kyou, ; 図書館 -> toshokan
        if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
        # special case for half-width commas
        if nw and nw.surface == ',': continue
        # 思えば -> omoeba
        if nw and nw.feature.pos2 in ('接続助詞'): continue
        # 333 -> 333 ; this should probably be handled in mecab
        if (word.surface.isdigit() and
                nw and nw.surface.isdigit()):
            continue
        # そうでした -> sou deshita
        if (nw and word.feature.pos1 in ('動詞', '助動詞', '形容詞')
                and nw.feature.pos1 == '助動詞'
                and nw.surface != 'です'):
            continue
        out += ' '
    # remove any leftover っ
    out = out.replace('っ', '').strip()
    # capitalize the first letter
    if capitalize and len(out) > 0:
        tmp = out[0].capitalize()
        if len(out) > 1:
            tmp += out[1:]
        out = tmp
    return out
```
Build a complete string from input text.

If `capitalize` is true, then the first letter of the text will be capitalized. This is typically the desired behavior if the input is a complete sentence.

If `title` is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.
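A sketch of the two flags, reusing the sentence from the README; exact output depends on the dictionary in use:

```python
katsu = Cutlet()
katsu.romaji("彼女は王への手紙を読み上げた。", capitalize=False)
# => 'kanojo wa ou e no tegami wo yomiageta.'
katsu.romaji("彼女は王への手紙を読み上げた。", title=True)
# => 'Kanojo wa Ou e no Tegami wo Yomiageta.'
```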
```python
def romaji_word(self, word):
    """Return the romaji for a single word (node)."""

    if word.surface in self.exceptions:
        return self.exceptions[word.surface]

    if word.surface.isdigit():
        return word.surface

    if is_ascii(word.surface):
        return word.surface

    # deal with unks first
    if word.is_unk:
        # at this point it is presumably an unk
        # Check character type using the values defined in char.def.
        # This is constant across unidic versions so far but not guaranteed.
        if word.char_type == 6 or word.char_type == 7:  # hiragana/katakana
            kana = jaconv.kata2hira(word.surface)
            return self.map_kana(kana)

        # At this point this is an unknown word and not kana. Could be
        # unknown kanji, could be hangul, cyrillic, something else.
        # By default ensure ascii by replacing with ?, but allow pass-through.
        if self.ensure_ascii:
            out = '?' * len(word.surface)
            return out
        else:
            return word.surface

    if word.feature.pos1 == '補助記号':
        # If it's punctuation we don't recognize, just discard it
        return self.table.get(word.surface, '')
    elif (self.use_wa and
            word.feature.pos1 == '助詞' and word.feature.pron == 'ワ'):
        return 'wa'
    elif (not self.use_he and
            word.feature.pos1 == '助詞' and word.feature.pron == 'エ'):
        return 'e'
    elif (not self.use_wo and
            word.feature.pos1 == '助詞' and word.feature.pron == 'オ'):
        return 'o'
    elif (self.use_foreign_spelling and
            has_foreign_lemma(word)):
        # this is a foreign word with known spelling
        return word.feature.lemma.split('-')[-1]
    elif word.feature.kana:
        # for known words
        kana = jaconv.kata2hira(word.feature.kana)
        return self.map_kana(kana)
    else:
        # unclear when we would actually get here
        return word.surface
```
Return the romaji for a single word (node).
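If you want to call this directly, you need a fugashi word node rather than a string. A rough sketch:

```python
katsu = Cutlet()
# The tagger returns a list of word nodes; romaji_word operates on one node.
word = katsu.tagger("カツ")[0]
katsu.romaji_word(word)
# => 'cutlet' with default settings, assuming UniDic provides the foreign spelling
```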
```python
def map_kana(self, kana):
    """Given a list of kana, convert them to romaji.

    The exact romaji resulting from a kana sequence depend on the preceding
    or following kana, so this handles that conversion.
    """
    out = ''
    for ki, char in enumerate(kana):
        nk = kana[ki + 1] if ki < len(kana) - 1 else None
        pk = kana[ki - 1] if ki > 0 else None
        out += self.get_single_mapping(pk, char, nk)
    return out
Given a list of kana, convert them to romaji.
The exact romaji resulting from a kana sequence depend on the preceding or following kana, so this handles that conversion.
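A small sketch; the input is hiragana, as produced upstream by `jaconv.kata2hira`:

```python
katsu = Cutlet()
katsu.map_kana("きょう")
# => 'kyou'
katsu.map_kana("ほんや")
# => "hon'ya"
```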
```python
def get_single_mapping(self, pk, kk, nk):
    """Given a single kana and its neighbors, return the mapped romaji."""
    # handle odoriji
    # NOTE: This is very rarely useful at present because odoriji are not
    # left in readings for dictionary words, and we can't follow kana
    # across word boundaries.
    if kk in ODORI:
        if kk in 'ゝヽ':
            if pk: return pk
            else: return ''  # invalid but be nice
        if kk in 'ゞヾ':  # repeat with voicing
            if not pk: return ''
            vv = add_dakuten(pk)
            if vv: return self.table[vv]
            else: return ''
        # remaining are 々 for kanji and 〃 for symbols, but we can't
        # infer their span reliably (or handle rendaku)
        return ''

    # handle digraphs
    if pk and (pk + kk) in self.table:
        return self.table[pk + kk]
    if nk and (kk + nk) in self.table:
        return ''

    if nk and nk in SUTEGANA:
        if kk == 'っ': return ''  # never valid, just ignore
        return self.table[kk][:-1] + self.table[nk]
    if kk in SUTEGANA:
        return ''

    if kk == 'ー':  # 長音符
        if pk and pk in self.table: return self.table[pk][-1]
        else: return '-'

    if kk == 'っ':
        if nk:
            if self.use_tch and nk == 'ち': return 't'
            elif nk in 'あいうえおっ': return '-'
            else: return self.table[nk][0]  # first character
        else:
            # seems like it should never happen, but 乗っ|た is two tokens
            # so leave this as is and pick it up at the word level
            return 'っ'

    if kk == 'ん':
        if nk and nk in 'あいうえおやゆよ': return "n'"
        else: return 'n'

    return self.table[kk]
```
Given a single kana and its neighbors, return the mapped romaji.
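For instance (a sketch; the arguments are the previous kana, the kana itself, and the next kana, any of which may be None):

```python
katsu = Cutlet()
katsu.get_single_mapping('き', 'ょ', 'う')
# => 'kyo'
katsu.get_single_mapping(None, 'ん', 'ば')
# => 'n'
```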