Coverage for nlp_manager/parse_medex.py: 35%

222 statements  

coverage.py v7.8.0, created at 2025-08-27 10:34 -0500

1r""" 

2crate_anon/nlp_manager/parse_medex.py 

3 

4=============================================================================== 

5 

6 Copyright (C) 2015, University of Cambridge, Department of Psychiatry. 

7 Created by Rudolf Cardinal (rnc1001@cam.ac.uk). 

8 

9 This file is part of CRATE. 

10 

11 CRATE is free software: you can redistribute it and/or modify 

12 it under the terms of the GNU General Public License as published by 

13 the Free Software Foundation, either version 3 of the License, or 

14 (at your option) any later version. 

15 

16 CRATE is distributed in the hope that it will be useful, 

17 but WITHOUT ANY WARRANTY; without even the implied warranty of 

18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 

19 GNU General Public License for more details. 

20 

21 You should have received a copy of the GNU General Public License 

22 along with CRATE. If not, see <https://www.gnu.org/licenses/>. 

23 

24=============================================================================== 

25 

26**NLP handler for the external MedEx-UIMA tool, to find references to 

27drugs (medication).** 

28 

29- MedEx-UIMA 

30 

31 - can't find Python version of MedEx (which preceded MedEx-UIMA) 

32 - paper on Python version is 

33 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995636/; uses Python NLTK 

34 - see notes in Documents/CRATE directory 

35 - MedEx-UIMA is in Java, and resolutely uses a file-based processing system; 

36 ``Main.java`` calls ``MedTagger.java`` (``MedTagger.run_batch_medtag``), 

37 and even in its core ``MedTagger.medtagging()`` function it's making files 

38 in directories; that's deep in the core of its NLP thinking so we can't 

39 change that behaviour without creating a fork. So the obvious way to turn 

40 this into a proper "live" pipeline would be for the calling code to 

41 

42 - fire up a receiving process - Python launching custom Java 

43 - create its own temporary directory - Python 

44 - receive data - Python 

45 - stash it on disk - Python 

46 - call the MedEx function - Python -> stdout -> custom Java -> MedEx 

47 - return the results - custom Java signals "done" -> Python reads stdin? 

48 - and clean up - Python 

49 

50 Not terribly elegant, but might be fast enough (and almost certainly much 

51 faster than reloading Java regularly!). 

52 

53 - output comes from its ``MedTagger.print_result()`` function 

54 - would need a per-process-unique temporary directory, since it scans all 

55 files in the input directory (and similarly one output directory); would do 

56 that in Python 

57 

58MedEx-UIMA is firmly (and internally) wedded to a file-based processing 

59system. So we need to: 

60 

61- create a process-specific pair of temporary directories; 

62- fire up a receiving process 

63- pass data (1) to file and (2) signal that there's data available; 

64- await a "data ready" reply and read the data from disk; 

65- clean up (delete files) in readiness for next data chunk. 
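
A minimal sketch of one such cycle, seen from the Python side, might look like this. All names here (``exchange``, ``DATA_READY``, ``RESULTS_READY``, ``FILENAME``) are hypothetical placeholders rather than the values CRATE uses, and a stub function stands in for the Java receiving process.

```python
import os
import tempfile
from typing import Callable

DATA_READY = "DATA_READY"        # hypothetical signal strings
RESULTS_READY = "RESULTS_READY"
FILENAME = "chunk.txt"           # hypothetical; output reuses the input's name


def exchange(
    text: str,
    inputdir: str,
    outputdir: str,
    send_and_wait: Callable[[str], str],
) -> str:
    # One request/response cycle: write, signal, wait, read, clean up.
    infile = os.path.join(inputdir, FILENAME)
    outfile = os.path.join(outputdir, FILENAME)
    with open(infile, "w", encoding="utf8") as f:
        f.write(text)                        # pass data to file
    try:
        reply = send_and_wait(DATA_READY)    # signal, then block for the reply
        if reply != RESULTS_READY:
            raise RuntimeError(f"unexpected reply: {reply!r}")
        with open(outfile, encoding="utf8") as f:
            return f.read()                  # read the results from disk
    finally:
        # Delete files in readiness for the next chunk.
        for path in (infile, outfile):
            if os.path.exists(path):
                os.remove(path)


def demo() -> str:
    # The stub "receiver" upper-cases the input; the real one would be Java.
    with tempfile.TemporaryDirectory() as indir:
        with tempfile.TemporaryDirectory() as outdir:

            def fake_receiver(signal: str) -> str:
                assert signal == DATA_READY
                with open(os.path.join(indir, FILENAME), encoding="utf8") as f:
                    data = f.read()
                with open(
                    os.path.join(outdir, FILENAME), "w", encoding="utf8"
                ) as f:
                    f.write(data.upper())
                return RESULTS_READY

            return exchange("aspirin 75mg od", indir, outdir, fake_receiver)
```

Passing the signalling step in as a callable keeps the file handling testable without a JVM; in real use it would write to the child's ``stdin`` and read from its ``stdout``.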

66 

67NOTE ALSO that MedEx's ``MedTagger`` class writes to ``stdout`` (though not 

68``stderr``). Option 1: move our logs to ``stdout`` and use ``stderr`` for 

69signalling. Option 2: keep things as they are and just use a ``stdout`` signal 

70that's not used by MedEx. Went with option 2; simpler and more consistent esp. 

71for logging. 
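
Option 2 amounts to scanning the child's ``stdout`` line by line and ignoring everything except the agreed signal. A rough sketch, assuming a made-up signal string and using in-memory byte streams in place of a real pipe:

```python
import io
from typing import BinaryIO

# Hypothetical signal; it must be a line that MedEx itself never prints.
RESULTS_READY = "RESULTS_READY_SIGNAL"


def await_signal(stream: BinaryIO, signal: str, encoding: str = "utf8") -> bool:
    # Read lines until the signal arrives; return False at end-of-stream.
    for raw in iter(stream.readline, b""):
        line = raw.decode(encoding).rstrip("\r\n")
        if line == signal:
            return True
        # Any other line is the tool's own output on stdout; skip it.
    return False


# Made-up streams: one that eventually signals, one that ends without doing so.
chatty = io.BytesIO(b"unrelated line 1\nunrelated line 2\nRESULTS_READY_SIGNAL\n")
silent = io.BytesIO(b"unrelated line 1\n")
```

Here ``await_signal(chatty, RESULTS_READY)`` returns ``True`` and ``await_signal(silent, RESULTS_READY)`` returns ``False``.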

72 

73How do we clean up the temporary directories? 

74 

75- ``__del__`` is not the opposite of ``__init__``; 

76 https://www.algorithm.co.il/blogs/programming/python-gotchas-1-__del__-is-not-the-opposite-of-__init__/ 

77- https://eli.thegreenplace.net/2009/06/12/safely-using-destructors-in-python 
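
One way to avoid relying on ``__del__`` is ``weakref.finalize``, which is also what ``tempfile.TemporaryDirectory`` uses internally. The class below is a hypothetical illustration of the idea, not part of CRATE:

```python
import os
import shutil
import tempfile
import weakref


class Workspace:
    # Hypothetical holder for a per-process scratch directory.

    def __init__(self) -> None:
        self.path = tempfile.mkdtemp(prefix="medex_sketch_")
        # The finalizer runs when this object is garbage-collected or at
        # interpreter exit, whichever comes first. It keeps no strong
        # reference to self, so it cannot keep the object alive.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self.path, ignore_errors=True
        )

    def close(self) -> None:
        # Explicit early cleanup; calling a finalizer twice is harmless.
        self._finalizer()


ws = Workspace()
path = ws.path
existed_before = os.path.isdir(path)
ws.close()
exists_after = os.path.isdir(path)
```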

78 

79PROBLEMS: 

80 

81- NLP works fine, but UK-style abbreviations e.g. "qds" not recognized where 

82 "q.i.d." is. US abbreviations: e.g. 

83 https://www.d.umn.edu/medweb/Modules/Prescription/Abbreviations.html 

84 

85 - Places to look, and things to try adding: 

86 

87 .. code-block:: none 

88 

89 resources/TIMEX/norm_patterns/NormFREQword 

90 

91 qds=>R1P6H 

92 

93 resources/TIMEX/rules/frequency_rules 

94 

95 //QID ( 4 times a day 

96 expression="[Qq]\.?[Ii]\.?[Dd]\.?[ ]*\((.*?)\)",val="R1P6H" 

97 

98 // RNC: qds 

99 expression="[Qq]\.?[Dd]\.?[Ss]\.?[ ]*\((.*?)\)",val="R1P6H" 

100 

101 ... looked like it was correct, but not working 

102 ... are these files compiled in, rather than being read live? 

103 ... do I have the user or the developer version? 

104 

105 ... not there yet. 

106 Probably need to recompile. See MedEx's Readme.txt 
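
It can help to check in isolation what the added rule is able to match. The sketch below uses Python's ``re`` module as a stand-in for Java's regex engine (an assumption; the two agree on this simple syntax), and the sample strings are made up. It shows only that the pattern, as written, requires a parenthesised group after the abbreviation; whether that accounts for the behaviour seen in MedEx is not established here.

```python
import re

# The "RNC: qds" expression from frequency_rules, copied verbatim.
QDS_RULE = re.compile(r"[Qq]\.?[Dd]\.?[Ss]\.?[ ]*\((.*?)\)")

# Made-up inputs, with and without a trailing parenthesised group.
samples = ["qds", "q.d.s.", "qds (with food)", "QDS(after meals)"]
matches = {s: bool(QDS_RULE.search(s)) for s in samples}
```

With these inputs, only the two strings containing ``(...)`` match.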

107 

108 - reference to expression/val (as in frequency_rules): 

109 

110 .. code-block:: none 

111 

112 TIMEX.Rule._add_rule() 

113 ... from TIMEX.Rule.Rule via a directory walker 

114 ... from TIMEX.ProcessingEngine.ProcessingEngine() 

115 ... via semi-hardcoded file location relative to class's location 

116 ... via rule_dir, set to .../TIMEX/rules 

117 

118 - Detect a file being accessed: 

119 

120 .. code-block:: bash 

121 

122 sudo apt install inotify-tools 

123 inotifywait -m FILE 

124 

125 ... frequency_rules IS opened. 

126 

127 - OVERALL SEQUENCE: 

128 

129 .. code-block:: none 

130 

131 org.apache.medex.Main [OR: CrateMedexPipeline.java] 

132 org.apache.medex.MedTagger.run_batch_medtag 

133 ... creates an org.apache.NLPTools.Document 

134 ... not obviously doing frequency stuff, or drug recognition 

135 ... then runs org.apache.medex.MedTagger.medtagging(doc) 

136 ... this does most of the heavy lifting, I think 

137 ... uses ProcessingEngine freq_norm_engine 

138 ... org.apache.TIMEX.ProcessingEngine 

139 ... but it may be that this just does frequency NORMALIZATION, not frequency finding 

140 ... uses SemanticRuleEngine rule_engine 

141 ... which is org.apache.medex.SemanticRuleEngine 

142 ... see all the regexlist.put(..., "FREQ") calls 

143 ... note double-escaping \\ for Java's benefit 

144 

145- Rebuilding MedEx: 

146 

147 .. code-block:: bash 

148 

149 export MEDEX_DIR=~/dev/MedEx_UIMA_1.3.6 # or similar 

150 cd ${MEDEX_DIR} 

151 # OPTIONAL # find . -name "*.class" -exec rm {} \; # remove old compiled files 

152 javac \ 

153 -classpath "${MEDEX_DIR}/src:${MEDEX_DIR}/lib/*" \ 

154 src/org/apache/medex/Main.java \ 

155 -d bin 

156 

157 # ... will also compile dependencies 

158 

159 See build_medex_itself.py 

160 

161- YES. If you add to ``org.apache.medex.SemanticRuleEngine``, with extra 

162 entries in the ``regexlist.put(...)`` sequence, new frequencies appear in the 

163 output. 

164 

165 To get them normalized as well, add them to frequency_rules. 

166 

167 Specifics: 

168 

169 (a) SemanticRuleEngine.java 

170 

171 .. code-block:: java 

172 

173 // EXTRA FOR UK FREQUENCIES (see https://www.evidence.nhs.uk/formulary/bnf/current/general-reference/latin-abbreviations) 

174 // NB case-insensitive regexes in SemanticRuleEngine.java, so ignore case here 

175 regexlist.put("^(q\\.?q\\.?h\\.?)( |$)", "FREQ"); // qqh, quarta quaque hora (RNC) 

176 regexlist.put("^(q\\.?d\\.?s\\.?)( |$)", "FREQ"); // qds, quater die sumendum (RNC); must go before existing competing expression: regexlist.put("^q(\\.|)\\d+( |$)","FREQ"); 

177 regexlist.put("^(t\\.?d\\.?s\\.?)( |$)", "FREQ"); // tds, ter die sumendum (RNC) 

178 regexlist.put("^(b\\.?d\\.?)( |$)", "FREQ"); // bd, bis die (RNC) 

179 regexlist.put("^(o\\.?d\\.?)( |$)", "FREQ"); // od, omni die (RNC) 

180 regexlist.put("^(mane)( |$)", "FREQ"); // mane (RNC) 

181 regexlist.put("^(o\\.?m\\.?)( |$)", "FREQ"); // om, omni mane (RNC) 

182 regexlist.put("^(nocte)( |$)", "FREQ"); // nocte (RNC) 

183 regexlist.put("^(o\\.?n\\.?)( |$)", "FREQ"); // on, omni nocte (RNC) 

184 regexlist.put("^(fortnightly)( |$)", "FREQ"); // fortnightly (RNC) 

185 regexlist.put("^((?:2|two)\\s+weekly)\\b", "FREQ"); // fortnightly (RNC) 

186 regexlist.put("argh", "FREQ"); // fortnightly (RNC) 

187 // ALREADY IMPLEMENTED BY MedEx: tid (ter in die) 

188 // NECESSITY, NOT FREQUENCY: prn (pro re nata) 

189 // TIMING, NOT FREQUENCY: ac (ante cibum); pc (post cibum) 

190 

191 (b) frequency_rules 

192 

193 .. code-block:: none 

194 

195 // EXTRA FOR UK FREQUENCIES (see https://www.evidence.nhs.uk/formulary/bnf/current/general-reference/latin-abbreviations) 

196 // NB case-sensitive regexes in Rule.java, so offer upper- and lower-case alternatives here 

197 // qqh, quarta quaque hora (RNC) 

198 expression="\b[Qq]\.?[Qq]\.?[Hh]\.?\b",val="R1P4H" 

199 // qds, quater die sumendum (RNC); MUST BE BEFORE COMPETING "qd" (= per day) expression: expression="[Qq]\.?[ ]?[Dd]\.?",val="R1P24H" 

200 expression="\b[Qq]\.?[Dd]\.?[Ss]\.?\b",val="R1P6H" 

201 // tds, ter die sumendum (RNC) 

202 expression="\b[Tt]\.?[Dd]\.?[Ss]\.?\b",val="R1P8H" 

203 // bd, bis die (RNC) 

204 expression="\b[Bb]\.?[Dd]\.?\b",val="R1P12H" 

205 // od, omni die (RNC) 

206 expression="\b[Oo]\.?[Dd]\.?\b",val="R1P24H" 

207 // mane (RNC) 

208 expression="\b[Mm][Aa][Nn][Ee]\b",val="R1P24H" 

209 // om, omni mane (RNC) 

210 expression="\b[Oo]\.?[Mm]\.?\b",val="R1P24H" 

211 // nocte (RNC) 

212 expression="\b[Nn][Oo][Cc][Tt][Ee]\b",val="R1P24H" 

213 // on, omni nocte (RNC) 

214 expression="\b[Oo]\.?[Nn]\.?\b",val="R1P24H" 

215 // fortnightly and variants (RNC); unsure if TIMEX3 format is right 

216 expression="\b[Ff][Oo][Rr][Tt][Nn][Ii][Gg][Hh][Tt][Ll][Yy]\b",val="R1P2WEEK" 

217 expression="\b(?:2|[Tt][Ww][Oo])\s+[Ww][Ee][Ee][Kk][Ll][Yy]\b",val="R1P2WEEK" 

218 // monthly (RNC) 

219 expression="\b[Mm][Oo][Nn][Tt][Hh][Ll][Yy]\b",val="R1P1MONTH" 

220 // 

221 // ALREADY IMPLEMENTED BY MedEx: tid (ter in die) 

222 // NECESSITY, NOT FREQUENCY: prn (pro re nata) 

223 // TIMING, NOT FREQUENCY: ac (ante cibum); pc (post cibum) 

224 

225 (c) source: 

226 

227 - https://www.evidence.nhs.uk/formulary/bnf/current/general-reference/latin-abbreviations 
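
The ordering constraint on "qds" versus "qd" can be illustrated outside MedEx. The sketch below assumes a first-match-wins rule list (an assumption about how the rules are applied, not something verified in the MedEx source) and uses Python's ``re`` in place of Java's regex engine; ``normalize`` is a hypothetical helper.

```python
import re
from typing import List, Optional, Tuple

Rule = Tuple[str, str]  # (regular expression, TIMEX3-style value)

# Both expressions are taken from the frequency_rules notes above.
QDS: Rule = (r"\b[Qq]\.?[Dd]\.?[Ss]\.?\b", "R1P6H")
QD: Rule = (r"[Qq]\.?[ ]?[Dd]\.?", "R1P24H")


def normalize(text: str, rules: List[Rule]) -> Optional[str]:
    # Assumed behaviour: the first rule that matches anywhere in text wins.
    for pattern, value in rules:
        if re.search(pattern, text):
            return value
    return None
```

Under that assumption, ``normalize("qds", [QDS, QD])`` gives ``"R1P6H"``, whereas ``normalize("qds", [QD, QDS])`` gives ``"R1P24H"``, because the "qd" expression matches the first two letters of "qds".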

228 

229- How about routes of administration? 

230 

231 .. code-block:: none 

232 

233 MedTagger.printResult() 

234 route is in FStr_list[5] 

235 ... called from MedTagger.medtagging() 

236 route is in FStr_list_final[5] 

237 before that, is in FStr (separated by \n) 

238 ... from formatDruglist 

239 ... 

240 ... from logs, appears first next to "input for tagger" at 

241 which point it's in 

242 sent_token_array[j] (e.g. "po") 

243 sent_tag_array[j] (e.g. "RUT" = route) 

244 ... from tag_dict 

245 ... from filter_tags 

246 ... from (Document) doc.filtered_drug_tag() 

247 ... 

248 ... ?from MedTagger.medtagging() calling doc.add_drug_tag() 

249 ... no, not really; is in this bit: 

250 SuffixArray sa = new SuffixArray(...); 

251 Vector<SuffixArrayResult> result = sa.search(); 

252 ... and then each element of result has a "semantic_type" 

253 member that can be "RUT" 

254 ... SuffixArray.search() 

255 semantic_type=this.lex.sem_list().get(i); 

256 

257 ... where lex comes from MedTagger: 

258 this.lex = new Lexicon(this.lex_fname); 

259 ... Lexicon.sem_list() returns Lexicon.semantic_list 

260 ... Lexicon.Lexicon() constructs using MedTagger's this.lex_fname 

261 ... which is lexicon.cfg 

262 

263 ... aha! There it is. If a line in lexicon.cfg has a RUT tag, it'll 

264 appear as a route. So: 

265 grep "RUT$" lexicon.cfg | sort # and replace tabs with spaces 

266 

267 bedside RUT 

268 by mouth RUT 

269 drip RUT 

270 gt RUT 

271 g tube RUT 

272 g-tube RUT 

273 gtube RUT 

274 im injection RUT 

275 im RUT 

276 inhalation RUT 

277 inhalatn RUT 

278 inhaled RUT 

279 intramuscular RUT 

280 intravenously RUT 

281 intravenous RUT 

282 iv RUT 

283 j tube RUT 

284 j-tube RUT 

285 jtube RUT 

286 nare RUT 

287 nares RUT 

288 naris RUT 

289 neb RUT 

290 nostril RUT 

291 orally RUT 

292 oral RUT 

293 ou RUT 

294 patch DDF-DOSEUNIT-RUT 

295 per gt RUT 

296 per mouth RUT 

297 per os RUT 

298 per rectum RUT 

299 per tube RUT 

300 p. g RUT 

301 pgt RUT 

302 png RUT 

303 pnj RUT 

304 p.o RUT 

305 po RUT 

306 sc RUT 

307 sl RUT 

308 sq RUT 

309 subc RUT 

310 subcu RUT 

311 subcutaneously RUT 

312 subcutaneous RUT 

313 subcut RUT 

314 subling RUT 

315 sublingual RUT 

316 sub q RUT 

317 subq RUT 

318 swallow RUT 

319 swish and spit RUT 

320 sw&spit RUT 

321 sw&swall RUT 

322 topically RUT 

323 topical RUT 

324 topical tp RUT 

325 trans RUT 

326 with spacer RUT 

327 

328 Looks like these are not using synonyms. Note also format is ``route\tRUT`` 

329 

330 Note also that the first element is always forced to lower case (in 

331 Lexicon.Lexicon()), so presumably it's case-insensitive. 

332 

333 There's no specific comment format (though any line that doesn't resolve to 

334 two items when split on a tab looks like it's ignored). 

335 

336 So we might want to add more; use 

337 

338 .. code-block:: bash 

339 

340 build_medex_itself.py --extraroutes >> lexicon.cfg 
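
The lexicon format described above (tab-separated pair, first element lower-cased, anything else ignored) can be mimicked in a few lines. This is a sketch of the described behaviour, not a port of ``Lexicon.Lexicon()``; ``parse_lexicon`` and ``routes`` are hypothetical helpers and the sample lines are made up.

```python
from typing import Dict, Iterable, List


def parse_lexicon(lines: Iterable[str]) -> Dict[str, str]:
    # Map lower-cased term -> semantic tag, skipping malformed lines.
    entries: Dict[str, str] = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # not "term<TAB>tag", so effectively a comment
        term, tag = parts
        entries[term.lower()] = tag
    return entries


def routes(entries: Dict[str, str]) -> List[str]:
    # Mirrors grep "RUT$": any tag ending in RUT counts as a route.
    return sorted(term for term, tag in entries.items() if tag.endswith("RUT"))


sample = [
    "po\tRUT\n",
    "Per Os\tRUT\n",
    "patch\tDDF-DOSEUNIT-RUT\n",
    "this line has no tab and is skipped\n",
]
found = routes(parse_lexicon(sample))
```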

341 

342- Note that all frequencies and routes must be in the lexicon. 

343 And all frequencies must be in ``SemanticRuleEngine.java`` (and, to be 

344 normalized, frequency_rules). 

345 

346- USEFUL BIT FOR CHECKING RESULTS: 

347 

348 .. code-block:: sql 

349 

350 SELECT 

351 sentence_text, 

352 drug, generic_name, 

353 form, strength, dose_amount, 

354 route, frequency, frequency_timex3, 

355 duration, necessity 

356 FROM anonymous_output.drugs; 

357 

358- ENCODING 

359 

360 - Pipe encoding (to Java's ``stdin``, from Java's ``stdout``) is the 

361 less important as we're only likely to send/receive ASCII. It's hard-coded 

362 to UTF-8. 

363 

364 - File encoding is vital and is hard-coded to UTF-8 here and in the 

365 receiving Java. 

366 

367 - We have no direct influence over the MedTagger code for output (unless we 

368 modify it). The output function is ``MedTagger.print_result()``, which 

369 (line 2040 of ``MedTagger.java``) calls ``out.write(stuff)``. 

370 

371 The out variable is set by 

372 

373 .. code-block:: java 

374 

375 this.out = new BufferedWriter(new FileWriter(output_dir 

376 + File.separator + doc.fname())); 

377 

378 That form of the FileWriter constructor, ``FileWriter(String fileName)``, 

379 uses the "default character encoding", as per 

380 https://docs.oracle.com/javase/7/docs/api/java/io/FileWriter.html 

381 

382 That default is given by ``System.getProperty("file.encoding")``. However, 

383 we don't have to do something daft like asking the Java to report its file 

384 encoding to Python through a pipe; instead, we can set the Java default 

385 encoding. It can't be done dynamically, but it can be done at JVM launch: 

386 https://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding. 

387 

388 Therefore, we should have a Java parameter specified in the config file as 

389 ``-Dfile.encoding=UTF-8``. 
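
As an illustration only, a launcher could also check that the flag is present before starting the JVM. This is a defensive variant, not what the text above prescribes (which is to put the flag in the config file); ``java_command`` is a hypothetical helper and the paths in the example are invented.

```python
import shlex
from typing import List


def java_command(progargs: str, encoding: str = "UTF-8") -> List[str]:
    # Split a config-style command line and ensure -Dfile.encoding is set.
    args = shlex.split(progargs)
    if not any(arg.startswith("-Dfile.encoding=") for arg in args):
        # JVM options must come before the main class, so insert the flag
        # directly after the "java" executable.
        args.insert(1, f"-Dfile.encoding={encoding}")
    return args


# Invented example command line.
cmd = java_command(
    'java -classpath "/opt/medex/bin:/opt/medex/lib/*" CrateMedexPipeline'
)
```

The Python side then has to read the output files with the same encoding that the flag names.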

390 

391""" # noqa: E501 

392 

393import logging 

394import os 

395import shlex 

396import subprocess 

397import tempfile 

398from typing import Any, Dict, Generator, List, Optional, Tuple 

399 

400from cardinal_pythonlib.cmdline import cmdline_quote 

401from cardinal_pythonlib.fileops import mkdir_p 

402from sqlalchemy import Column, Index, Integer, String, Text 

403 

404from crate_anon.nlp_manager.base_nlp_parser import ( 

405 BaseNlpParser, 

406 TextProcessingFailed, 

407) 

408from crate_anon.nlp_manager.constants import ( 

409 MEDEX_DATA_READY_SIGNAL, 

410 MEDEX_RESULTS_READY_SIGNAL, 

411 ProcessorConfigKeys, 

412) 

413from crate_anon.nlp_manager.nlp_definition import ( 

414 NlpDefinition, 

415) 

416 

417log = logging.getLogger(__name__) 

418 

419 

420# ============================================================================= 

421# Constants 

422# ============================================================================= 

423 

424DATA_FILENAME = "crate_medex.txt" 

425DATA_FILENAME_KEEP = "crate_medex_{}.txt" 

426 

427USE_TEMP_DIRS = True 

428# ... True for production; False to see e.g. logs afterwards, by keeping 

429# everything in a subdirectory of the user's home directory (see hard-coded 

430# nastiness -- for debugging only) 

431 

432SKIP_IF_NO_GENERIC = True 

433# ... Probably should be True. MedEx returns hits for drug "Thu" with no 

434# generic drug; this from its weekday lexicon, I think. 

435 

436# ----------------------------------------------------------------------------- 

437# Maximum field lengths 

438# ----------------------------------------------------------------------------- 

439# https://phekb.org/sites/phenotype/files/MedEx_UIMA_eMERGE_short.pdf 

440# 

441# RxNorm: https://www.nlm.nih.gov/research/umls/rxnorm/overview.html 

442# 

443# UMLS: https://www.nlm.nih.gov/research/umls/new_users/glossary.html 

444# UMLS CUI max length: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/columns_data_elements.html # noqa: E501 

445UMLS_CUI_MAX_LENGTH = 8 # definite 

446 

447# TIMEX3: 

448# - http://www.timeml.org/tempeval2/tempeval2-trial/guidelines/timex3guidelines-072009.pdf # noqa: E501 

449# - http://www.timeml.org/publications/timeMLdocs/timeml_1.2.1.html#timex3 # noqa: E501 

450TIMEX3_MAX_LENGTH = 50 # guess 

451 

452# Drug length: 

453# There are long ones, like 

454# "influenza virus vaccine, inactivated a-brisbane-59-2007, ivr-148 (h1n1) strain" (78) # noqa: E501 

455# See e.g. resources/rxcui_generic.cfg, and: 

456# $ wc -L filename # shows length of longest line 

457# $ egrep -n "^.{$(wc -L < filename)}$" filename # shows longest line 

458# ... possibly this gets a bit confused by tabs, can also put a length in: 

459# $ egrep -n "^.{302}$" filename # shows lines of length 302 

460# And we find a drug of length 286: 

461# 1-oxa-7-azacyclopentadecan-15-one,13-((2,6-dideoy-3-c-methyl-3-o-methyl-alpha-l-ribo-hexopyranosyl)oxy)-2-ethyl-3,4,10-trihydroxy 3,5,8,10,12,14-hexamethyl-7-propyl-11-((3,4,6-trideoxy-3-(dimethylamino)-beta-d-xylo-hexopyranosyl)oxy)-, ((2r*, 3s*,4r*,5s*,8r*,10r*,11r*,12s*,13s*,14r*))- # noqa: E501 

462# Then there are multivitamin things in brand_generic with length >600. 

463# So we should use an unlimited field; SQLAlchemy helpfully seems to translate 

464# Text to VARCHAR(MAX) under SQL Server, which is the more efficient: 

465# https://stackoverflow.com/questions/834788/using-varcharmax-vs-text-on-sql-server # noqa: E501 

466MEDEX_MAX_FORM_LENGTH = 255 # guess; "Powder For Oral Suspension" (26) is one 

467MEDEX_MAX_STRENGTH_LENGTH = 50 # guess 

468MEDEX_MAX_DOSE_AMOUNT_LENGTH = 50 # guess 

469MEDEX_MAX_ROUTE_LENGTH = 50 # guess 

470MEDEX_MAX_FREQUENCY_LENGTH = 50 # guess 

471MEDEX_MAX_DURATION_LENGTH = 50 # guess 

472MEDEX_MAX_NECESSITY_LENGTH = 50 # guess 

473 

474 

475# ============================================================================= 

476# Medex 

477# ============================================================================= 

478 

479 

480class PseudoTempDir: 

481 """ 

482 This class exists so that a TemporaryDirectory and a manually specified 

483 directory can be addressed via the same (very simple!) interface. 

484 """ 

485 

486 def __init__(self, name: str) -> None: 

487 self.name = name 

488 

489 

490class Medex(BaseNlpParser): 

491 """ 

492 EXTERNAL. 

493 

494 Class controlling a Medex-UIMA external process, via our custom Java 

495 interface, ``CrateMedexPipeline.java``. 

496 

497 MedEx-UIMA is a medication-finding tool: 

498 https://www.ncbi.nlm.nih.gov/pubmed/25954575. 

499 """ 

500 

501 uses_external_tool = True 

502 

503 def __init__( 

504 self, 

505 nlpdef: NlpDefinition, 

506 cfg_processor_name: str, 

507 commit: bool = False, 

508 ) -> None: 

509 """ 

510 Args: 

511 nlpdef: 

512 a :class:`crate_anon.nlp_manager.nlp_definition.NlpDefinition` 

513 cfg_processor_name: 

514 the name of a CRATE NLP config file section (from which we may 

515 choose to get extra config information) 

516 commit: 

517 force a COMMIT whenever we insert data? You should specify this 

518 in multiprocess mode, or you may get database deadlocks. 

519 """ 

520 super().__init__( 

521 nlpdef=nlpdef, 

522 cfg_processor_name=cfg_processor_name, 

523 commit=commit, 

524 friendly_name="MedEx", 

525 ) 

526 

527 if nlpdef is None: # only None for debugging! 

528 self._debug_mode = True 

529 self._tablename = self.classname().lower() 

530 self._max_external_prog_uses = 1 

531 self._progenvsection = "" 

532 self._env = {} # type: Dict[str, str] 

533 progargs = "" 

534 else: 

535 self._debug_mode = False 

536 

537 self._tablename = self._cfgsection.opt_str( 

538 ProcessorConfigKeys.DESTTABLE, required=True 

539 ) 

540 

541 self._max_external_prog_uses = self._cfgsection.opt_int_positive( 

542 ProcessorConfigKeys.MAX_EXTERNAL_PROG_USES, default=0 

543 ) 

544 

545 self._progenvsection = self._cfgsection.opt_str( 

546 ProcessorConfigKeys.PROGENVSECTION 

547 ) 

548 

549 if self._progenvsection: 

550 # noinspection PyTypeChecker 

551 self._env = nlpdef.get_env_dict( 

552 self._progenvsection, os.environ 

553 ) 

554 else: 

555 self._env = os.environ.copy() 

556 self._env["NLPLOGTAG"] = nlpdef.logtag or "." 

557 # ... because passing a "-lt" switch with no parameter will make 

558 # CrateMedexPipeline.java complain and stop 

559 

560 progargs = self._cfgsection.opt_str( 

561 ProcessorConfigKeys.PROGARGS, required=True 

562 ) 

563 

564 if USE_TEMP_DIRS: 

565 self._inputdir = tempfile.TemporaryDirectory() 

566 self._outputdir = tempfile.TemporaryDirectory() 

567 self._workingdir = tempfile.TemporaryDirectory() 

568 # ... these are autodeleted when the object goes out of scope; see 

569 # https://docs.python.org/3/library/tempfile.html 

570 # ... which manages it using weakref.finalize 

571 else: 

572 homedir = os.path.expanduser("~") 

573 self._inputdir = PseudoTempDir( 

574 os.path.join(homedir, "medextemp", "input") 

575 ) 

576 mkdir_p(self._inputdir.name) 

577 self._outputdir = PseudoTempDir( 

578 os.path.join(homedir, "medextemp", "output") 

579 ) 

580 mkdir_p(self._outputdir.name) 

581 self._workingdir = PseudoTempDir( 

582 os.path.join(homedir, "medextemp", "working") 

583 ) 

584 mkdir_p(self._workingdir.name) 

585 

586 formatted_progargs = progargs.format(**self._env) 

587 self._progargs = shlex.split(formatted_progargs) 

588 self._progargs.extend( 

589 [ 

590 "-data_ready_signal", 

591 MEDEX_DATA_READY_SIGNAL, 

592 "-results_ready_signal", 

593 MEDEX_RESULTS_READY_SIGNAL, 

594 "-i", 

595 self._inputdir.name, 

596 "-o", 

597 self._outputdir.name, 

598 ] 

599 ) 

600 

601 self._n_uses = 0 

602 self._pipe_encoding = "utf8" 

603 self._file_encoding = "utf8" 

604 self._p = None # the subprocess 

605 self._started = False 

606 

607 # ------------------------------------------------------------------------- 

608 # External process control 

609 # ------------------------------------------------------------------------- 

610 

611 def _start(self) -> None: 

612 """ 

613 Launch the external process. We will save and retrieve data via files, 

614 and send signals ("data ready", "results ready") via stdin/stdout. 

615 """ 

616 if self._started or self._debug_mode: 

617 return 

618 args = self._progargs 

619 

620 # Nasty MedEx hacks 

621 cwd = os.getcwd() 

622 log.info( 

623 f"For MedEx's benefit, changing to directory: " 

624 f"{self._workingdir.name}" 

625 ) 

626 os.chdir(self._workingdir.name) 

627 sentsdir = os.path.join(self._workingdir.name, "sents") 

628 log.info(f"Making temporary sentences directory: {sentsdir}") 

629 mkdir_p(sentsdir) 

630 logdir = os.path.join(self._workingdir.name, "log") 

631 log.info(f"Making temporary log directory: {logdir}") 

632 mkdir_p(logdir) 

633 

634 log.info(f"Launching command: {cmdline_quote(args)}") 

635 self._p = subprocess.Popen( 

636 args, 

637 stdin=subprocess.PIPE, 

638 stdout=subprocess.PIPE, 

639 # stderr=subprocess.PIPE, 

640 shell=False, 

641 bufsize=1, 

642 ) 

643 # ... don't ask for stderr to be piped if you don't want it; firstly, 

644 # there's a risk that if you don't consume it, something hangs, and 

645 # secondly if you don't consume it, you see it on the console, which is 

646 # helpful. 

647 self._started = True 

648 log.info(f"Returning to working directory {cwd}") 

649 os.chdir(cwd) 

650 

651 def _encode_to_subproc_stdin(self, text: str) -> None: 

652 """ 

653 Send text to the external program (via its stdin), encoding it in 

654 the process (typically to UTF-8). 

655 """ 

656 log.debug("SENDING: " + text) 

657 bytes_ = text.encode(self._pipe_encoding) 

658 self._p.stdin.write(bytes_) 

659 

660 def _flush_subproc_stdin(self) -> None: 

661 """ 

662 Flushes what we're sending to the external program via its stdin. 

663 """ 

664 self._p.stdin.flush() 

665 

666 def _decode_from_subproc_stdout(self) -> str: 

667 """ 

668 Decode what we've received from the external program's stdout, 

669 from its specific encoding (usually UTF-8) to a Python string. 

670 """ 

671 bytes_ = self._p.stdout.readline() 

672 text = bytes_.decode(self._pipe_encoding) 

673 log.debug("RECEIVING: " + repr(text)) 

674 return text 

675 

676 def _finish(self) -> None: 

677 """ 

678 Close down the external process. 

679 """ 

680 if not self._started: 

681 return 

682 self._p.communicate() # close p.stdout, wait for the subprocess to exit # noqa: E501 

683 self._started = False 

684 

685 def _signal_data_ready(self) -> bool: 

686 """ 

687 Signals to the child process that we have written data to files, and 

688 it's now ready for reading by MedEx. 

689 

690 Returns: OK? 

691 """ 

692 if self._finished(): 

693 return False 

694 self._encode_to_subproc_stdin(MEDEX_DATA_READY_SIGNAL + os.linesep) 

695 self._flush_subproc_stdin() 

696 return True 

697 

698 def _await_results_ready(self) -> bool: 

699 """ 

700 Waits until MedEx has signalled us that results are ready. 

701 

702 Returns: OK? 

703 """ 

704 while True: 

705 if self._finished(): 

706 return False 

707 line = self._decode_from_subproc_stdout() 

708 if line == MEDEX_RESULTS_READY_SIGNAL + os.linesep: 

709 return True 

710 

711 def _finished(self) -> bool: 

712 """ 

713 Has MedEx finished? 

714 """ 

715 if not self._started: 

716 return True 

717 self._p.poll() 

718 finished = self._p.returncode is not None 

719 if finished: 

720 self._started = False 

721 return finished 

722 

723 def _restart(self) -> None: 

724 """ 

725 Close down the external process and restart it. 

726 """ 

727 self._finish() 

728 self._start() 

729 

730 # ------------------------------------------------------------------------- 

731 # Input processing 

732 # ------------------------------------------------------------------------- 

733 

734 def parse( 

735 self, text: str 

736 ) -> Generator[Tuple[str, Dict[str, Any]], None, None]: 

737 """ 

738 - Send text to the external process, and receive the result. 

739 - Note that associated data is not passed into this function, and is 

740 kept in the Python environment, so we can't run into any problems 

741 with the transfer to/from the Java program garbling important data. 

742 All we send to the subprocess is the text (and an input_terminator). 

743 Then, we may receive MULTIPLE sets of data back ("your text contains 

744 the following 7 people/drug references/whatever"), followed 

745 eventually by the output_terminator, at which point this set is 

746 complete. 

747 """ 

748 self._n_uses += 1 

749 self._start() # ensure started 

750 if USE_TEMP_DIRS: 

751 basefilename = DATA_FILENAME 

752 else: 

753 basefilename = DATA_FILENAME_KEEP.format(self._n_uses) 

754 inputfilename = os.path.join(self._inputdir.name, basefilename) 

755 outputfilename = os.path.join(self._outputdir.name, basefilename) 

756 # ... MedEx gives output files the SAME NAME as input files. 

757 

758 try: 

759 with open( 

760 inputfilename, mode="w", encoding=self._file_encoding 

761 ) as infile: 

762 # log.info(f"text: {text!r}") 

763 infile.write(text) 

764 

765 if ( 

766 not self._signal_data_ready()  # send 

767 or not self._await_results_ready()  # receive 

768 ): 

769 log.critical("Subprocess terminated unexpectedly") 

770 os.remove(inputfilename) 

771 # We were using "log.critical()" and "return", but if the Medex 

772 # processor is misconfigured, the failed processor can be run 

773 # over thousands of records over many hours before the failure 

774 # is obvious. Changed 2017-03-17. 

775 raise ValueError( 

776 "Java interface to Medex failed - misconfigured?" 

777 ) 

778 

779 with open( 

780 outputfilename, mode="r", encoding=self._file_encoding 

781 ) as infile: 

782 resultlines = infile.readlines() 

783 for line in resultlines: 

784 # log.critical(f"received: {line}") 

785 # Output code, from MedTagger.print_result(): 

786 # out.write( 

787 # index + 1 + "\t" + sent_text + "|" + 

788 # drug + "|" + brand + "|" + dose_form + "|" + 

789 # strength + "|" + dose_amt + "|" + 

790 # route + "|" + frequency + "|" + duration + "|" + 

791 # necessity + "|" + 

792 # umls_code + "|" + rx_code + "|" + generic_code + "|" + 

793 # generic_name + "\n"); 

794 # NOTE that the text can contain | characters. So work from the 

795 # right. 

796 line = line.rstrip() # remove any trailing newline 

797 fields = line.split("|") 

798 if len(fields) < 14: 

799 log.warning(f"Bad result received: {line!r}") 

800 continue 

801 generic_name = self.str_or_none(fields[-1]) 

802 if not generic_name and SKIP_IF_NO_GENERIC: 

803 continue 

804 generic_code = self.int_or_none(fields[-2]) 

805 rx_code = self.int_or_none(fields[-3]) 

806 umls_code = self.str_or_none(fields[-4]) 

807 ( 

808 necessity, 

809 necessity_startpos, 

810 necessity_endpos, 

811 ) = self.get_text_start_end(fields[-5]) 

812 ( 

813 duration, 

814 duration_startpos, 

815 duration_endpos, 

816 ) = self.get_text_start_end(fields[-6]) 

817 ( 

818 _freq_text, 

819 frequency_startpos, 

820 frequency_endpos, 

821 ) = self.get_text_start_end(fields[-7]) 

822 frequency, frequency_timex = self.frequency_and_timex( 

823 _freq_text 

824 ) 

825 ( 

826 route, 

827 route_startpos, 

828 route_endpos, 

829 ) = self.get_text_start_end(fields[-8]) 

830 ( 

831 dose_amount, 

832 dose_amount_startpos, 

833 dose_amount_endpos, 

834 ) = self.get_text_start_end(fields[-9]) 

835 ( 

836 strength, 

837 strength_startpos, 

838 strength_endpos, 

839 ) = self.get_text_start_end(fields[-10]) 

840 (form, form_startpos, form_endpos) = self.get_text_start_end( 

841 fields[-11] 

842 ) 

843 ( 

844 brand, 

845 brand_startpos, 

846 brand_endpos, 

847 ) = self.get_text_start_end(fields[-12]) 

848 (drug, drug_startpos, drug_endpos) = self.get_text_start_end( 

849 fields[-13] 

850 ) 

851 _start_bit = "|".join(fields[0:-13]) 

852 _index_text, sent_text = _start_bit.split("\t", maxsplit=1) 

853 index = self.int_or_none(_index_text) 

854 yield self._tablename, { 

855 "sentence_index": index, 

856 "sentence_text": sent_text, 

857 "drug": drug, 

858 "drug_startpos": drug_startpos, 

859 "drug_endpos": drug_endpos, 

860 "brand": brand, 

861 "brand_startpos": brand_startpos, 

862 "brand_endpos": brand_endpos, 

863 "form": form, 

864 "form_startpos": form_startpos, 

865 "form_endpos": form_endpos, 

866 "strength": strength, 

867 "strength_startpos": strength_startpos, 

868 "strength_endpos": strength_endpos, 

869 "dose_amount": dose_amount, 

870 "dose_amount_startpos": dose_amount_startpos, 

871 "dose_amount_endpos": dose_amount_endpos, 

872 "route": route, 

873 "route_startpos": route_startpos, 

874 "route_endpos": route_endpos, 

875 "frequency": frequency, 

876 "frequency_startpos": frequency_startpos, 

877 "frequency_endpos": frequency_endpos, 

878 "frequency_timex3": frequency_timex, 

879 "duration": duration, 

880 "duration_startpos": duration_startpos, 

881 "duration_endpos": duration_endpos, 

882 "necessity": necessity, 

883 "necessity_startpos": necessity_startpos, 

884 "necessity_endpos": necessity_endpos, 

885 "umls_code": umls_code, 

886 "rx_code": rx_code, 

887 "generic_code": generic_code, 

888 "generic_name": generic_name, 

889 } 

890 

891 # MedEx scans all files in the input directory, so if we're 

892 # not using temporary directories (and are therefore using a new 

893 # filename per item), we should remove the old one. 

894 os.remove(inputfilename) 

895 

896 # Restart subprocess? 

897 if ( 

898 self._max_external_prog_uses > 0 

899 and self._n_uses % self._max_external_prog_uses == 0 

900 ): 

901 log.info( 

902 f"relaunching app after " 

903 f"{self._max_external_prog_uses} uses" 

904 ) 

905 self._restart() 

906 

907 except BrokenPipeError: 

908 log.error("Broken pipe; relaunching app") 

909 self._restart() 

910 raise TextProcessingFailed() 

911 

912 @staticmethod 

913 def get_text_start_end( 

914 medex_str: Optional[str], 

915 ) -> Tuple[Optional[str], Optional[int], Optional[int]]: 

916 """ 

917 MedEx returns "drug", "strength", etc. as ``aspirin[7,14]``, where the 

918 text is followed by the zero-indexed start position and the 

919 zero-indexed end position (one beyond the last character). This function 

920 converts a string like ``aspirin[7,14]`` to a tuple like ``"aspirin", 

921 7, 14``. 

922 

923 Args: 

924 medex_str: string from MedEx 

925 

926 Returns: 

927 tuple: ``text, start_pos, end_pos``; values may be ``None`` 

928 """ 

929 if not medex_str: 

930 return None, None, None 

931 lbracket = medex_str.rfind("[") # -1 for not found 

932 comma = medex_str.rfind(",") 

933 rbracket = medex_str.rfind("]") 

934 try: 

935 if lbracket == -1 or not (lbracket < comma < rbracket): 

936 raise ValueError() 

937 text = medex_str[:lbracket] 

938 lpos = int(medex_str[lbracket + 1 : comma]) 

939 rpos = int(medex_str[comma + 1 : rbracket]) 

940 return text, lpos, rpos 

941 except (TypeError, ValueError): 

942 log.warning(f"Bad string[left, right] format: {medex_str!r}") 

943 return None, None, None 

944 

945 @staticmethod 

946 def int_or_none(text: Optional[str]) -> Optional[int]: 

947 """ 

948 Takes text and returns an integer version or ``None``. 

949 """ 

950 try: 

951 return int(text) 

952 except (TypeError, ValueError): 

953 return None 

954 

955 @staticmethod 

956 def str_or_none(text: Optional[str]) -> Optional[str]: 

957 """ 

958 If the string is non-empty, return the string; otherwise return 

959 ``None``. 

960 """ 

961 return None if not text else text 

962 

963 @staticmethod 

964 def frequency_and_timex(text: str) -> Tuple[Optional[str], Optional[str]]: 

965 """ 

966 Splits a MedEx frequency/TIMEX string into its frequency and TIMEX 

967 parts; e.g. splits ``b.i.d.(R1P12H)`` to ``"b.i.d.", "R1P12H"``. 

968 """ 

969 if not text: 

970 return None, None 

971 lbracket = text.rfind("(") 

972 rbracket = text.rfind(")") 

973 if ( 

974 lbracket == -1 

975 or not (lbracket < rbracket) 

976 or rbracket != len(text) - 1 

977 ): 

978 return None, None 

979 return text[0:lbracket], text[lbracket + 1 : rbracket] 

980 

981 # ------------------------------------------------------------------------- 

982 # Test 

983 # ------------------------------------------------------------------------- 

984 

985 def test(self, verbose: bool = False) -> None: 

986 """ 

987 Test the parser on sample sentences (skipped in debug mode). 

988 """ 

989 if self._debug_mode: 

990 return 

991 self.test_parser( 

992 [ 

993 "Bob Hope visited Seattle and took venlafaxine M/R 375mg od.", 

994 "James Joyce wrote Ulysses whilst taking aspirin 75mg mane.", 

995 ] 

996 ) 

997 

998 # ------------------------------------------------------------------------- 

999 # Database structure 

1000 # ------------------------------------------------------------------------- 

1001 

1002 def dest_tables_columns(self) -> Dict[str, List[Column]]: 

1003 # docstring in superclass 

1004 startposdef = "Start position (zero-based) of " 

1005 endposdef = ( 

1006 "End position (zero-based index of one beyond last character) of " 

1007 ) 

1008 return { 

1009 self._tablename: [ 

1010 Column( 

1011 "sentence_index", 

1012 Integer, 

1013 comment="One-based index of sentence in text", 

1014 ), 

1015 Column( 

1016 "sentence_text", 

1017 Text, 

1018 comment="Text recognized as a sentence by MedEx", 

1019 ), 

1020 Column("drug", Text, comment="Drug name, as in the text"), 

1021 Column("drug_startpos", Integer, comment=startposdef + "drug"), 

1022 Column("drug_endpos", Integer, comment=endposdef + "drug"), 

1023 Column( 

1024 "brand", 

1025 Text, 

1026 comment="Drug brand name (?lookup ?only if given)", 

1027 ), 

1028 Column( 

1029 "brand_startpos", Integer, comment=startposdef + "brand" 

1030 ), 

1031 Column("brand_endpos", Integer, comment=endposdef + "brand"), 

1032 Column( 

1033 "form", 

1034 String(MEDEX_MAX_FORM_LENGTH), 

1035 comment="Drug/dose form (e.g. 'tablet')", 

1036 ), 

1037 Column("form_startpos", Integer, comment=startposdef + "form"), 

1038 Column("form_endpos", Integer, comment=endposdef + "form"), 

1039 Column( 

1040 "strength", 

1041 String(MEDEX_MAX_STRENGTH_LENGTH), 

1042 comment="Strength (e.g. '75mg')", 

1043 ), 

1044 Column( 

1045 "strength_startpos", 

1046 Integer, 

1047 comment=startposdef + "strength", 

1048 ), 

1049 Column( 

1050 "strength_endpos", Integer, comment=endposdef + "strength" 

1051 ), 

1052 Column( 

1053 "dose_amount", 

1054 String(MEDEX_MAX_DOSE_AMOUNT_LENGTH), 

1055 comment="Dose amount (e.g. '2 tablets')", 

1056 ), 

1057 Column( 

1058 "dose_amount_startpos", 

1059 Integer, 

1060 comment=startposdef + "dose_amount", 

1061 ), 

1062 Column( 

1063 "dose_amount_endpos", 

1064 Integer, 

1065 comment=endposdef + "dose_amount", 

1066 ), 

1067 Column( 

1068 "route", 

1069 String(MEDEX_MAX_ROUTE_LENGTH), 

1070 comment="Route (e.g. 'by mouth')", 

1071 ), 

1072 Column( 

1073 "route_startpos", Integer, comment=startposdef + "route" 

1074 ), 

1075 Column("route_endpos", Integer, comment=endposdef + "route"), 

1076 Column( 

1077 "frequency", 

1078 String(MEDEX_MAX_FREQUENCY_LENGTH), 

1079 comment="Frequency (e.g. 'b.i.d.')", 

1080 ), 

1081 Column( 

1082 "frequency_startpos", 

1083 Integer, 

1084 comment=startposdef + "frequency", 

1085 ), 

1086 Column( 

1087 "frequency_endpos", 

1088 Integer, 

1089 comment=endposdef + "frequency", 

1090 ), 

1091 Column( 

1092 "frequency_timex3", 

1093 String(TIMEX3_MAX_LENGTH), 

1094 comment=( 

1095 "Normalized frequency in TIMEX3 format " 

1096 "(e.g. 'R1P12H')" 

1097 ), 

1098 ), 

1099 Column( 

1100 "duration", 

1101 String(MEDEX_MAX_DURATION_LENGTH), 

1102 comment="Duration (e.g. 'for 10 days')", 

1103 ), 

1104 Column( 

1105 "duration_startpos", 

1106 Integer, 

1107 comment=startposdef + "duration", 

1108 ), 

1109 Column( 

1110 "duration_endpos", Integer, comment=endposdef + "duration" 

1111 ), 

1112 Column( 

1113 "necessity", 

1114 String(MEDEX_MAX_NECESSITY_LENGTH), 

1115 comment="Necessity (e.g. 'prn')", 

1116 ), 

1117 Column( 

1118 "necessity_startpos", 

1119 Integer, 

1120 comment=startposdef + "necessity", 

1121 ), 

1122 Column( 

1123 "necessity_endpos", 

1124 Integer, 

1125 comment=endposdef + "necessity", 

1126 ), 

1127 Column( 

1128 "umls_code", 

1129 String(UMLS_CUI_MAX_LENGTH), 

1130 comment="UMLS CUI", 

1131 ), 

1132 Column("rx_code", Integer, comment="RxNorm RxCUI for drug"), 

1133 Column( 

1134 "generic_code", 

1135 Integer, 

1136 comment="RxNorm RxCUI for generic name", 

1137 ), 

1138 Column( 

1139 "generic_name", 

1140 Text, 

1141 comment="Generic drug name (associated with RxCUI code)", 

1142 ), 

1143 ] 

1144 } 

1145 

1146 def dest_tables_indexes(self) -> Dict[str, List[Index]]: 

1147 # docstring in superclass 

1148 return {} 

1149 # return { 

1150 # self._tablename: [ 

1151 # Index('idx_generic_name', 'generic_name'), 

1152 # ] 

1153 # }