######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
#          Shy Shalom
# Portions created by the Initial Developer are Copyright (C) 2005
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
#   Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301  USA
######################### END LICENSE BLOCK #########################

from .charsetprober import CharSetProber
from .enums import ProbingState

# This prober doesn't actually recognize a language or a charset.
# It is a helper prober for the use of the Hebrew model probers

### General ideas of the Hebrew charset recognition ###
#
# Four main charsets exist in Hebrew:
# "ISO-8859-8" - Visual Hebrew
# "windows-1255" - Logical Hebrew
# "ISO-8859-8-I" - Logical Hebrew
# "x-mac-hebrew" - ?? Logical Hebrew ??
#
# Both "ISO" charsets use a completely identical set of code points, whereas
# "windows-1255" and "x-mac-hebrew" are two different proper supersets of
# these code points. windows-1255 defines additional characters in the range
# 0x80-0x9F as some misc punctuation marks as well as some Hebrew-specific
# diacritics and additional 'Yiddish' ligature letters in the range 0xc0-0xd6.
# x-mac-hebrew defines similar additional code points but with a different
# mapping.
#
# As far as an average Hebrew text with no diacritics is concerned, all four
# charsets are identical with respect to code points. That is, for the main
# Hebrew alphabet, all four map the same values to all 27 Hebrew letters
# (including final letters).
#
# The dominant difference between these charsets is their directionality.
# "Visual" directionality means that the text is ordered as if the renderer is
# not aware of a BIDI rendering algorithm. The renderer sees the text and
# draws it from left to right. The text itself when ordered naturally is read
# backwards. A buffer of Visual Hebrew generally looks like so:
# "[last word of first line spelled backwards] [whole line ordered backwards
# and spelled backwards] [first word of first line spelled backwards]
# [end of line] [last word of second line] ... etc."
# Punctuation marks, numbers and English text added to visual text are
# naturally also "visual" and run from left to right.
#
# "Logical" directionality means the text is ordered "naturally" according to
# the order it is read. It is the responsibility of the renderer to display
# the text from right to left. A BIDI algorithm is used to place general
# punctuation marks, numbers and English text in the text.
#
# Texts in x-mac-hebrew are almost impossible to find on the Internet. From
# what little evidence I could find, it seems that its general directionality
# is Logical.
#
# To sum up all of the above, the Hebrew probing mechanism knows about two
# charsets:
# Visual Hebrew - "ISO-8859-8" - backwards text - Words and sentences are
#    backwards while line order is natural. For charset recognition purposes
#    the line order is unimportant (in fact, for this implementation, even
#    word order is unimportant).
# Logical Hebrew - "windows-1255" - normal, naturally ordered text.
#
# "ISO-8859-8-I" is a subset of windows-1255 and doesn't need to be
#    specifically identified.
# "x-mac-hebrew" is also identified as windows-1255. A text in x-mac-hebrew
#    that contains special punctuation marks or diacritics is displayed with
#    some unconverted characters showing as question marks. This problem might
#    be corrected using another model prober for x-mac-hebrew. Because
#    x-mac-hebrew texts are so rare, writing another model prober isn't
#    worth the effort and performance hit.
#
#### The Prober ####
#
# The prober is divided between two SBCharSetProbers and a HebrewProber,
# all of which are managed, created, fed data, queried and deleted by the
# SBCSGroupProber. The two SBCharSetProbers identify that the text is in
# fact some kind of Hebrew, Logical or Visual. The HebrewProber makes the
# final decision between the two by combining its final-letter scores with
# the scores of the two SBCharSetProbers.
#
# The SBCSGroupProber is responsible for stripping the original text of HTML
# tags, English characters, numbers, low-ASCII punctuation characters, spaces
# and new lines. It reduces any sequence of such characters to a single space.
# The buffer fed to each prober in the SBCS group prober is pure text in
# high-ASCII.
# The two SBCharSetProbers (model probers) share the same language model:
# Win1255Model.
# The first SBCharSetProber uses the model normally as any other
# SBCharSetProber does, to recognize windows-1255, upon which this model was
# built. The second SBCharSetProber is told to make the pair-of-letter
# lookup in the language model backwards. This in practice exactly simulates
# a visual Hebrew model using the windows-1255 logical Hebrew model.
#
# The HebrewProber does not use any language model. All it does is look for
# final-letter evidence suggesting the text is either logical Hebrew or visual
# Hebrew. In isolation from the model probers, the results of the HebrewProber
# alone are meaningless. HebrewProber always returns 0.00 as confidence
# since it never identifies a charset by itself. Instead, a reference to the
# HebrewProber is passed to the model probers as a helper "Name Prober".
# When the Group prober receives a positive identification from any prober,
# it asks for the name of the charset identified. If the prober queried is a
# Hebrew model prober, the model prober forwards the call to the
# HebrewProber to make the final decision. In the HebrewProber, the
# decision is made according to the final-letter scores it maintains and
# both model probers' scores. The answer is returned in the form of the name
# of the charset identified, either "windows-1255" or "ISO-8859-8".
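

# Illustrative sketch, not part of the original module. It shows two points
# made in the general ideas above using Python's built-in "cp1255"
# (windows-1255) and "iso8859_8" (ISO-8859-8) codecs: plain Hebrew letters
# get identical bytes in both charsets, and a Visual buffer holds the same
# bytes in reverse order. The helper name logical_to_visual_line is
# hypothetical, and plain reversal is a simplifying assumption that only
# holds for a single line of Hebrew letters and spaces; digits and Latin
# text keep their left-to-right order in real Visual Hebrew.
def logical_to_visual_line(logical_line):
    """Reverse one line of Logical Hebrew bytes to mimic Visual order."""
    return bytes(logical_line)[::-1]


# Usage (the word is "shalom", which ends in a final Mem, 0xed):
#   word = u"\u05e9\u05dc\u05d5\u05dd"
#   word.encode("cp1255") == word.encode("iso8859_8")  # same code points
#   logical_to_visual_line(word.encode("cp1255"))      # final Mem comes first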

class HebrewProber(CharSetProber):
    # windows-1255 / ISO-8859-8 code points of interest
    FINAL_KAF = 0xea
    NORMAL_KAF = 0xeb
    FINAL_MEM = 0xed
    NORMAL_MEM = 0xee
    FINAL_NUN = 0xef
    NORMAL_NUN = 0xf0
    FINAL_PE = 0xf3
    NORMAL_PE = 0xf4
    FINAL_TSADI = 0xf5
    NORMAL_TSADI = 0xf6

    # Minimum Visual vs Logical final letter score difference.
    # If the difference is below this, don't rely solely on the final letter
    # score distance.
    MIN_FINAL_CHAR_DISTANCE = 5

    # Minimum Visual vs Logical model score difference.
    # If the difference is below this, don't rely at all on the model score
    # distance.
    MIN_MODEL_DISTANCE = 0.01

    VISUAL_HEBREW_NAME = "ISO-8859-8"
    LOGICAL_HEBREW_NAME = "windows-1255"

    def __init__(self):
        super(HebrewProber, self).__init__()
        self._final_char_logical_score = None
        self._final_char_visual_score = None
        self._prev = None
        self._before_prev = None
        self._logical_prober = None
        self._visual_prober = None
        self.reset()

    # Byte value of an ASCII space. Iterating over a bytes or bytearray
    # buffer yields integers, so word delimiters are compared as integers,
    # just like the letter code points above.
    SPACE = 0x20

    def reset(self):
        self._final_char_logical_score = 0
        self._final_char_visual_score = 0
        # The two last characters seen in the previous buffer,
        # self._prev and self._before_prev, are initialized to space in
        # order to simulate a word delimiter at the beginning of the data
        self._prev = self.SPACE
        self._before_prev = self.SPACE
        # These probers are owned by the group prober.

    def set_model_probers(self, logicalProber, visualProber):
        self._logical_prober = logicalProber
        self._visual_prober = visualProber

    def is_final(self, c):
        return c in [self.FINAL_KAF, self.FINAL_MEM, self.FINAL_NUN,
                     self.FINAL_PE, self.FINAL_TSADI]

    def is_non_final(self, c):
        # The normal Tsadi is not a good Non-Final letter due to words like
        # 'lechotet' (to chat) containing an apostrophe after the tsadi. This
        # apostrophe is converted to a space in FilterWithoutEnglishLetters
        # causing the Non-Final tsadi to appear at an end of a word even
        # though this is not the case in the original text.
        # The letters Pe and Kaf rarely display a related behavior of not
        # being a good Non-Final letter. Words like 'Pop', 'Winamp' and
        # 'Mubarak' for example legally end with a Non-Final Pe or Kaf.
        # However, the benefit of these letters as Non-Final letters
        # outweighs the damage since these words are quite rare.
        return c in [self.NORMAL_KAF, self.NORMAL_MEM,
                     self.NORMAL_NUN, self.NORMAL_PE]

    def feed(self, byte_str):
        # Final letter analysis for logical-visual decision.
        # Look for evidence that the received buffer is either logical Hebrew
        # or visual Hebrew.
        # The following cases are checked:
        # 1) A word longer than 1 letter, ending with a final letter. This is
        #    an indication that the text is laid out "naturally" since the
        #    final letter really appears at the end. +1 for logical score.
        # 2) A word longer than 1 letter, ending with a Non-Final letter. In
        #    normal Hebrew, words ending with Kaf, Mem, Nun, Pe or Tsadi,
        #    should not end with the Non-Final form of that letter. Exceptions
        #    to this rule are mentioned above in is_non_final(). This is an
        #    indication that the text is laid out backwards. +1 for visual
        #    score.
        # 3) A word longer than 1 letter, starting with a final letter. Final
        #    letters should not appear at the beginning of a word. This is an
        #    indication that the text is laid out backwards. +1 for visual
        #    score.
        #
        # The visual score and logical score are accumulated throughout the
        # text and are finally checked against each other in charset_name.
        # No checking for final letters in the middle of words is done since
        # that case is not an indication for either Logical or Visual text.
        #
        # We automatically filter out all 7-bit characters (replace them with
        # spaces) so the word boundary detection works properly. [MAP]

        if self.state == ProbingState.NOT_ME:
            # Both model probers say it's not them. No reason to continue.
            return ProbingState.NOT_ME

        byte_str = self.filter_high_byte_only(byte_str)

        for cur in byte_str:
            if cur == self.SPACE:
                # We stand on a space - a word just ended
                if self._before_prev != self.SPACE:
                    # next-to-last char was not a space so self._prev is not
                    # a 1 letter word
                    if self.is_final(self._prev):
                        # case (1) [-2:not space][-1:final letter][cur:space]
                        self._final_char_logical_score += 1
                    elif self.is_non_final(self._prev):
                        # case (2) [-2:not space][-1:Non-Final letter][
                        #  cur:space]
                        self._final_char_visual_score += 1
            else:
                # Not standing on a space
                if ((self._before_prev == self.SPACE) and
                        (self.is_final(self._prev)) and
                        (cur != self.SPACE)):
                    # case (3) [-2:space][-1:final letter][cur:not space]
                    self._final_char_visual_score += 1
            self._before_prev = self._prev
            self._prev = cur

        # Forever detecting, till the end or until both model probers return
        # ProbingState.NOT_ME (handled above)
        return ProbingState.DETECTING

    @property
    def charset_name(self):
        # Make the decision: is it Logical or Visual?
        # If the final letter score distance is dominant enough, rely on it.
        finalsub = (self._final_char_logical_score
                    - self._final_char_visual_score)
        if finalsub >= self.MIN_FINAL_CHAR_DISTANCE:
            return self.LOGICAL_HEBREW_NAME
        if finalsub <= -self.MIN_FINAL_CHAR_DISTANCE:
            return self.VISUAL_HEBREW_NAME

        # It's not dominant enough, try to rely on the model scores instead.
        modelsub = (self._logical_prober.get_confidence()
                    - self._visual_prober.get_confidence())
        if modelsub > self.MIN_MODEL_DISTANCE:
            return self.LOGICAL_HEBREW_NAME
        if modelsub < -self.MIN_MODEL_DISTANCE:
            return self.VISUAL_HEBREW_NAME

        # Still no good, back to final letter distance, maybe it'll save the
        # day.
        if finalsub < 0.0:
            return self.VISUAL_HEBREW_NAME

        # (finalsub > 0 - Logical) or (don't know what to do) default to
        # Logical.
        return self.LOGICAL_HEBREW_NAME

    @property
    def language(self):
        return 'Hebrew'

    @property
    def state(self):
        # Remain active as long as any of the model probers are active.
        if (self._logical_prober.state == ProbingState.NOT_ME) and \
           (self._visual_prober.state == ProbingState.NOT_ME):
            return ProbingState.NOT_ME
        return ProbingState.DETECTING

287 def state(self): 

288 # Remain active as long as any of the model probers are active. 

289 if (self._logical_prober.state == ProbingState.NOT_ME) and \ 

290 (self._visual_prober.state == ProbingState.NOT_ME): 

291 return ProbingState.NOT_ME 

292 return ProbingState.DETECTING