Coverage for nlp_manager/parse_medex.py: 35%
222 statements
coverage.py v7.8.0, created at 2025-08-27 10:34 -0500
1r"""
2crate_anon/nlp_manager/parse_medex.py
4===============================================================================
6 Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
7 Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
9 This file is part of CRATE.
11 CRATE is free software: you can redistribute it and/or modify
12 it under the terms of the GNU General Public License as published by
13 the Free Software Foundation, either version 3 of the License, or
14 (at your option) any later version.
16 CRATE is distributed in the hope that it will be useful,
17 but WITHOUT ANY WARRANTY; without even the implied warranty of
18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19 GNU General Public License for more details.
21 You should have received a copy of the GNU General Public License
22 along with CRATE. If not, see <https://www.gnu.org/licenses/>.
24===============================================================================
26**NLP handler for the external MedEx-UIMA tool, to find references to
27drugs (medication).**
29- MedEx-UIMA
31 - can't find Python version of MedEx (which preceded MedEx-UIMA)
32 - paper on Python version is
33 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995636/; uses Python NLTK
34 - see notes in Documents/CRATE directory
35 - MedEx-UIMA is in Java, and resolutely uses a file-based processing system;
36 ``Main.java`` calls ``MedTagger.java`` (``MedTagger.run_batch_medtag``),
37 and even in its core ``MedTagger.medtagging()`` function it's making files
38 in directories; that's deep in the core of its NLP thinking so we can't
39 change that behaviour without creating a fork. So the obvious way to turn
40 this into a proper "live" pipeline would be for the calling code to
42 - fire up a receiving process - Python launching custom Java
43 - create its own temporary directory - Python
44 - receive data - Python
45 - stash it on disk - Python
46 - call the MedEx function - Python -> stdout -> custom Java -> MedEx
47 - return the results - custom Java signals "done" -> Python reads stdin?
48 - and clean up - Python
50 Not terribly elegant, but might be fast enough (and almost certainly much
51 faster than reloading Java regularly!).
53 - output comes from its ``MedTagger.print_result()`` function
54 - would need a per-process-unique temporary directory, since it scans all
55 files in the input directory (and similarly one output directory); would do
56 that in Python
58MedEx-UIMA is firmly (and internally) wedded to a file-based processing
59system. So we need to:
61- create a process-specific pair of temporary directories;
62- fire up a receiving process
63- pass data (1) to file and (2) signal that there's data available;
64- await a "data ready" reply and read the data from disk;
65- clean up (delete files) in readiness for next data chunk.
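The steps above can be sketched as a round trip. This is a minimal illustration, not the CRATE implementation: the child process is a small Python stand-in for the Java side, and the signal strings, the file name ``chunk.txt`` and the upper-casing "NLP" are all assumptions made for the example.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical signal strings; the real ones live in
# crate_anon.nlp_manager.constants.
DATA_READY = "DATA_READY"
RESULTS_READY = "RESULTS_READY"

# Stand-in for the Java side (an assumption, for illustration only): on each
# "data ready" line, copy every input file to the output directory in upper
# case, then announce "results ready" on stdout.
CHILD_CODE = r'''
import os, sys
indir, outdir = sys.argv[1], sys.argv[2]
for line in sys.stdin:
    if line.strip() != "DATA_READY":
        continue
    for name in os.listdir(indir):
        with open(os.path.join(indir, name), encoding="utf8") as f:
            text = f.read()
        with open(os.path.join(outdir, name), "w", encoding="utf8") as f:
            f.write(text.upper())
    print("RESULTS_READY", flush=True)
'''


def round_trip(text: str) -> str:
    # 1. Process-specific pair of temporary directories.
    with tempfile.TemporaryDirectory() as indir, \
            tempfile.TemporaryDirectory() as outdir:
        # 2. Fire up the receiving process.
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, indir, outdir],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        try:
            # 3. Pass data to file, then signal that data is available.
            infile = os.path.join(indir, "chunk.txt")
            with open(infile, "w", encoding="utf8") as f:
                f.write(text)
            child.stdin.write((DATA_READY + "\n").encode("utf8"))
            child.stdin.flush()
            # 4. Await the reply, then read the results from disk.
            while True:
                line = child.stdout.readline()
                if not line:
                    raise RuntimeError("child exited before replying")
                if line.decode("utf8").strip() == RESULTS_READY:
                    break
            with open(os.path.join(outdir, "chunk.txt"), encoding="utf8") as f:
                result = f.read()
            # 5. Clean up in readiness for the next chunk.
            os.remove(infile)
            return result
        finally:
            child.communicate()  # closes stdin and waits for the child to exit


if __name__ == "__main__":
    print(round_trip("aspirin 75mg od"))
```

Keeping the child alive between chunks is the point of the design: only the files and one line on each pipe change per chunk, so the expensive start-up happens once.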
67NOTE ALSO that MedEx's ``MedTagger`` class writes to ``stdout`` (though not
68``stderr``). Option 1: move our logs to ``stdout`` and use ``stderr`` for
69signalling. Option 2: keep things as they are and just use a ``stdout`` signal
70that's not used by MedEx. Went with option 2; simpler and more consistent esp.
71for logging.
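Option 2 amounts to reading the child's ``stdout`` line by line and ignoring everything except the agreed signal. A minimal sketch follows; the helper name ``await_signal`` and the signal string are assumptions, and an in-memory byte stream stands in for the pipe.

```python
import io
from typing import BinaryIO, List

RESULTS_READY = "RESULTS_READY"  # hypothetical; must never be printed by MedEx


def await_signal(
    stream: BinaryIO, signal: str, encoding: str = "utf8"
) -> List[str]:
    # Read lines until one equals the signal; return the lines skipped on
    # the way (e.g. MedEx's own chatter), so they can be logged if wanted.
    skipped = []
    while True:
        raw = stream.readline()
        if not raw:  # EOF: the child has closed its stdout
            raise EOFError("stream ended before the signal arrived")
        line = raw.decode(encoding).rstrip("\r\n")
        if line == signal:
            return skipped
        skipped.append(line)


if __name__ == "__main__":
    fake_stdout = io.BytesIO(b"MedEx chatter\nRESULTS_READY\nlater output\n")
    print(await_signal(fake_stdout, RESULTS_READY))
```

The comparison is against the whole line, so a signal string that merely appears inside a longer line of MedEx output is not mistaken for the signal.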
73How do we clean up the temporary directories?
75- ``__del__`` is not the opposite of ``__init__``;
76 https://www.algorithm.co.il/blogs/programming/python-gotchas-1-__del__-is-not-the-opposite-of-__init__/
77- https://eli.thegreenplace.net/2009/06/12/safely-using-destructors-in-python
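One way to avoid ``__del__`` is ``weakref.finalize``, which runs a callback when the owner is garbage-collected or at interpreter exit, and at most once. The sketch below is illustrative only; the class name ``ScratchDirs`` and the directory prefixes are hypothetical.

```python
import os
import shutil
import tempfile
import weakref
from typing import List


class ScratchDirs:
    # Hypothetical helper owning an input/output directory pair.

    def __init__(self) -> None:
        self.inputdir = tempfile.mkdtemp(prefix="medex_in_")
        self.outputdir = tempfile.mkdtemp(prefix="medex_out_")
        # The callback must not hold a reference to self, or the object
        # could never be collected; pass the paths instead.
        self._finalizer = weakref.finalize(
            self, self._cleanup, [self.inputdir, self.outputdir]
        )

    @staticmethod
    def _cleanup(paths: List[str]) -> None:
        for path in paths:
            shutil.rmtree(path, ignore_errors=True)

    def close(self) -> None:
        # Explicit early cleanup; safe to call more than once.
        self._finalizer()


if __name__ == "__main__":
    dirs = ScratchDirs()
    print(os.path.isdir(dirs.inputdir))
    dirs.close()
    print(os.path.isdir(dirs.inputdir))
```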
79PROBLEMS:
81- NLP works fine, but UK-style abbreviations e.g. "qds" not recognized where
82 "q.i.d." is. US abbreviations: e.g.
83 https://www.d.umn.edu/medweb/Modules/Prescription/Abbreviations.html
85 - Places to look, and things to try adding:
87 .. code-block:: none
89 resources/TIMEX/norm_patterns/NormFREQword
91 qds=>R1P6H
93 resources/TIMEX/rules/frequency_rules
95 //QID ( 4 times a day
96 expression="[Qq]\.?[Ii]\.?[Dd]\.?[ ]*\((.*?)\)",val="R1P6H"
98 // RNC: qds
99 expression="[Qq]\.?[Dd]\.?[Ss]\.?[ ]*\((.*?)\)",val="R1P6H"
101 ... looked like it was correct, but not working
102 ... are these files compiled in, rather than being read live?
103 ... do I have the user or the developer version?
105 ... not there yet.
106 Probably need to recompile. See MedEx's Readme.txt
108 - reference to expression/val (as in frequency_rules):
110 .. code-block:: none
112 TIMEX.Rule._add_rule()
113 ... from TIMEX.Rule.Rule via a directory walker
114 ... from TIMEX.ProcessingEngine.ProcessingEngine()
115 ... via semi-hardcoded file location relative to class's location
116 ... via rule_dir, set to .../TIMEX/rules
118 - Detect a file being accessed:
120 .. code-block:: bash
122 sudo apt install inotify-tools
123 inotifywait -m FILE
125 ... frequency_rules IS opened.
127 - OVERALL SEQUENCE:
129 .. code-block:: none
131 org.apache.medex.Main [OR: CrateMedexPipeline.java]
132 org.apache.medex.MedTagger.run_batch_medtag
133 ... creates an org.apache.NLPTools.Document
134 ... not obviously doing frequency stuff, or drug recognition
135 ... then runs org.apache.medex.MedTagger.medtagging(doc)
136 ... this does most of the heavy lifting, I think
137 ... uses ProcessingEngine freq_norm_engine
138 ... org.apache.TIMEX.ProcessingEngine
139 ... but it may be that this just does frequency NORMALIZATION, not frequency finding
140 ... uses SemanticRuleEngine rule_engine
141 ... which is org.apache.medex.SemanticRuleEngine
142 ... see all the regexlist.put(..., "FREQ") calls
143 ... note double-escaping \\ for Java's benefit
145- Rebuilding MedEx:
147 .. code-block:: bash
149 export MEDEX_DIR=~/dev/MedEx_UIMA_1.3.6 # or similar
150 cd ${MEDEX_DIR}
151 # OPTIONAL # find . -name "*.class" -exec rm {} \; # remove old compiled files
152 javac \
153 -classpath "${MEDEX_DIR}/src:${MEDEX_DIR}/lib/*" \
154 src/org/apache/medex/Main.java \
155 -d bin
157 # ... will also compile dependencies
159 See build_medex_itself.py
161- YES. If you add to ``org.apache.medex.SemanticRuleEngine``, with extra
162 entries in the ``regexlist.put(...)`` sequence, new frequencies appear in the
163 output.
165 To get them normalized as well, add them to frequency_rules.
167 Specifics:
169 (a) SemanticRuleEngine.java
171 .. code-block:: java
173 // EXTRA FOR UK FREQUENCIES (see https://www.evidence.nhs.uk/formulary/bnf/current/general-reference/latin-abbreviations)
174 // NB case-insensitive regexes in SemanticRuleEngine.java, so ignore case here
175 regexlist.put("^(q\\.?q\\.?h\\.?)( |$)", "FREQ"); // qqh, quarta quaque hora (RNC)
176 regexlist.put("^(q\\.?d\\.?s\\.?)( |$)", "FREQ"); // qds, quater die sumendum (RNC); must go before existing competing expression: regexlist.put("^q(\\.|)\\d+( |$)","FREQ");
177 regexlist.put("^(t\\.?d\\.?s\\.?)( |$)", "FREQ"); // tds, ter die sumendum (RNC)
178 regexlist.put("^(b\\.?d\\.?)( |$)", "FREQ"); // bd, bis die (RNC)
179 regexlist.put("^(o\\.?d\\.?)( |$)", "FREQ"); // od, omni die (RNC)
180 regexlist.put("^(mane)( |$)", "FREQ"); // mane (RNC)
181 regexlist.put("^(o\\.?m\\.?)( |$)", "FREQ"); // om, omni mane (RNC)
182 regexlist.put("^(nocte)( |$)", "FREQ"); // nocte (RNC)
183 regexlist.put("^(o\\.?n\\.?)( |$)", "FREQ"); // on, omni nocte (RNC)
184 regexlist.put("^(fortnightly)( |$)", "FREQ"); // fortnightly (RNC)
185 regexlist.put("^((?:2|two)\\s+weekly)\\b", "FREQ"); // fortnightly (RNC)
186 regexlist.put("argh", "FREQ"); // fortnightly (RNC)
187 // ALREADY IMPLEMENTED BY MedEx: tid (ter in die)
188 // NECESSITY, NOT FREQUENCY: prn (pro re nata)
189 // TIMING, NOT FREQUENCY: ac (ante cibum); pc (post cibum)
191 (b) frequency_rules
193 .. code-block:: none
195 // EXTRA FOR UK FREQUENCIES (see https://www.evidence.nhs.uk/formulary/bnf/current/general-reference/latin-abbreviations)
196 // NB case-sensitive regexes in Rule.java, so offer upper- and lower-case alternatives here
197 // qqh, quarta quaque hora (RNC)
198 expression="\b[Qq]\.?[Qq]\.?[Hh]\.?\b",val="R1P4H"
199 // qds, quater die sumendum (RNC); MUST BE BEFORE COMPETING "qd" (= per day) expression: expression="[Qq]\.?[ ]?[Dd]\.?",val="R1P24H"
200 expression="\b[Qq]\.?[Dd]\.?[Ss]\.?\b",val="R1P6H"
201 // tds, ter die sumendum (RNC)
202 expression="\b[Tt]\.?[Dd]\.?[Ss]\.?\b",val="R1P8H"
203 // bd, bis die (RNC)
204 expression="\b[Bb]\.?[Dd]\.?\b",val="R1P12H"
205 // od, omni die (RNC)
206 expression="\b[Oo]\.?[Dd]\.?\b",val="R1P24H"
207 // mane (RNC)
208 expression="\b[Mm][Aa][Nn][Ee]\b",val="R1P24H"
209 // om, omni mane (RNC)
210 expression="\b[Oo]\.?[Mm]\.?\b",val="R1P24H"
211 // nocte (RNC)
212 expression="\b[Nn][Oo][Cc][Tt][Ee]\b",val="R1P24H"
213 // on, omni nocte (RNC)
214 expression="\b[Oo]\.?[Nn]\.?\b",val="R1P24H"
215 // fortnightly and variants (RNC); unsure if TIMEX3 format is right
216 expression="\b[Ff][Oo][Rr][Tt][Nn][Ii][Gg][Hh][Tt][Ll][Yy]\b",val="R1P2WEEK"
217 expression="\b(?:2|[Tt][Ww][Oo])\s+[Ww][Ee][Ee][Kk][Ll][Yy]\b",val="R1P2WEEK"
218 // monthly (RNC)
219 expression="\b[Mm][Oo][Nn][Tt][Hh][Ll][Yy]\b",val="R1P1MONTH"
220 //
221 // ALREADY IMPLEMENTED BY MedEx: tid (ter in die)
222 // NECESSITY, NOT FREQUENCY: prn (pro re nata)
223 // TIMING, NOT FREQUENCY: ac (ante cibum); pc (post cibum)
225 (c) source:
227 - https://www.evidence.nhs.uk/formulary/bnf/current/general-reference/latin-abbreviations
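The ordering constraint on "qds" versus "qd" can be checked outside MedEx. The sketch below assumes that rules are tried in file order and that the first match wins (which is what the ordering note above implies, but is not verified against MedEx's Java code here). It uses Python's ``re`` rather than Java's regex engine; the two agree for these simple patterns. The function name ``normalize`` is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# (expression, val) pairs copied from the frequency_rules entries above,
# in the recommended order: "qds" before the competing "qd".
RULES: List[Tuple[str, str]] = [
    (r"\b[Qq]\.?[Dd]\.?[Ss]\.?\b", "R1P6H"),   # qds
    (r"[Qq]\.?[ ]?[Dd]\.?", "R1P24H"),         # qd (existing MedEx rule)
    (r"\b[Bb]\.?[Dd]\.?\b", "R1P12H"),         # bd
]


def normalize(text: str, rules: List[Tuple[str, str]]) -> Optional[str]:
    # Return the val of the first rule whose expression is found in text.
    for expression, val in rules:
        if re.search(expression, text):
            return val
    return None


if __name__ == "__main__":
    print(normalize("qds", RULES))
    # With the order swapped, "qd" swallows the start of "qds":
    print(normalize("qds", [RULES[1], RULES[0]]))
```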
229- How about routes of administration?
231 .. code-block:: none
233 MedTagger.printResult()
234 route is in FStr_list[5]
235 ... called from MedTagger.medtagging()
236 route is in FStr_list_final[5]
237 before that, is in FStr (separated by \n)
238 ... from formatDruglist
239 ...
240 ... from logs, appears first next to "input for tagger" at
241 which point it's in
242 sent_token_array[j] (e.g. "po")
243 sent_tag_array[j] (e.g. "RUT" = route)
244 ... from tag_dict
245 ... from filter_tags
246 ... from (Document) doc.filtered_drug_tag()
247 ...
248 ... ?from MedTagger.medtagging() calling doc.add_drug_tag()
249 ... no, not really; is in this bit:
250 SuffixArray sa = new SuffixArray(...);
251 Vector<SuffixArrayResult> result = sa.search();
252 ... and then each element of result has a "semantic_type"
253 member that can be "RUT"
254 ... SuffixArray.search()
255 semantic_type=this.lex.sem_list().get(i);
257 ... where lex comes from MedTagger:
258 this.lex = new Lexicon(this.lex_fname);
259 ... Lexicon.sem_list() returns Lexicon.semantic_list
260 ... Lexicon.Lexicon() constructs using MedTagger's this.lex_fname
261 ... which is lexicon.cfg
263 ... aha! There it is. If a line in lexicon.cfg has a RUT tag, it'll
264 appear as a route. So:
265 grep "RUT$" lexicon.cfg | sort # and replace tabs with spaces
267 bedside RUT
268 by mouth RUT
269 drip RUT
270 gt RUT
271 g tube RUT
272 g-tube RUT
273 gtube RUT
274 im injection RUT
275 im RUT
276 inhalation RUT
277 inhalatn RUT
278 inhaled RUT
279 intramuscular RUT
280 intravenously RUT
281 intravenous RUT
282 iv RUT
283 j tube RUT
284 j-tube RUT
285 jtube RUT
286 nare RUT
287 nares RUT
288 naris RUT
289 neb RUT
290 nostril RUT
291 orally RUT
292 oral RUT
293 ou RUT
294 patch DDF-DOSEUNIT-RUT
295 per gt RUT
296 per mouth RUT
297 per os RUT
298 per rectum RUT
299 per tube RUT
300 p. g RUT
301 pgt RUT
302 png RUT
303 pnj RUT
304 p.o RUT
305 po RUT
306 sc RUT
307 sl RUT
308 sq RUT
309 subc RUT
310 subcu RUT
311 subcutaneously RUT
312 subcutaneous RUT
313 subcut RUT
314 subling RUT
315 sublingual RUT
316 sub q RUT
317 subq RUT
318 swallow RUT
319 swish and spit RUT
320 sw&spit RUT
321 sw&swall RUT
322 topically RUT
323 topical RUT
324 topical tp RUT
325 trans RUT
326 with spacer RUT
328 Looks like these are not using synonyms. Note also format is ``route\tRUT``
330 Note also that the first element is always forced to lower case (in
331 Lexicon.Lexicon()), so presumably it's case-insensitive.
333 There's no specific comment format (though any line that doesn't resolve to
334 two items when split on a tab looks like it's ignored).
336 So we might want to add more; use
338 .. code-block:: bash
340 build_medex_itself.py --extraroutes >> lexicon.cfg
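The lexicon format described above (tab-separated pairs, first element lower-cased, other lines ignored) can be mimicked in a few lines to check what a modified ``lexicon.cfg`` would yield. This is a sketch of that description, not a port of ``Lexicon.java``; the function names and the sample ``DRUG`` tag are assumptions.

```python
from typing import Dict, Iterable, List


def parse_lexicon(lines: Iterable[str]) -> Dict[str, str]:
    # Keep only lines that split into exactly two items on a tab.
    entries = {}
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 2:
            continue  # effectively a comment or malformed line
        term, tag = parts
        entries[term.lower()] = tag  # first element forced to lower case
    return entries


def routes(entries: Dict[str, str]) -> List[str]:
    # Same selection as: grep "RUT$" lexicon.cfg | sort
    return sorted(t for t, tag in entries.items() if tag.endswith("RUT"))


if __name__ == "__main__":
    sample = [
        "PO\tRUT\n",
        "by mouth\tRUT\n",
        "patch\tDDF-DOSEUNIT-RUT\n",
        "aspirin\tDRUG\n",        # hypothetical non-route entry
        "a stray line with no tab\n",
    ]
    print(routes(parse_lexicon(sample)))
```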
342- Note that all frequencies and routes must be in the lexicon.
343 And all frequencies must be in ``SemanticRuleEngine.java`` (and, to be
344 normalized, frequency_rules).
346- USEFUL BIT FOR CHECKING RESULTS:
348 .. code-block:: sql
350 SELECT
351 sentence_text,
352 drug, generic_name,
353 form, strength, dose_amount,
354 route, frequency, frequency_timex3,
355 duration, necessity
356 FROM anonymous_output.drugs;
358- ENCODING
360 - Pipe encoding (to Java's ``stdin``, from Java's ``stdout``) is the
361 less important, as we're only likely to send/receive ASCII. It's hard-coded
362 to UTF-8.
364 - File encoding is vital and is hard-coded to UTF-8 here and in the
365 receiving Java.
367 - We have no direct influence over the MedTagger code for output (unless we
368 modify it). The output function is ``MedTagger.print_result()``, which
369 (line 2040 of ``MedTagger.java``) calls ``out.write(stuff)``.
371 The out variable is set by
373 .. code-block:: java
375 this.out = new BufferedWriter(new FileWriter(output_dir
376 + File.separator + doc.fname()));
378 That form of the FileWriter constructor, ``FileWriter(String fileName)``,
379 uses the "default character encoding", as per
380 https://docs.oracle.com/javase/7/docs/api/java/io/FileWriter.html
382 That default is given by ``System.getProperty("file.encoding")``. However,
383 we don't have to do something daft like asking the Java to report its file
384 encoding to Python through a pipe; instead, we can set the Java default
385 encoding. It can't be done dynamically, but it can be done at JVM launch:
386 https://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding.
388 Therefore, we should have a Java parameter specified in the config file as
389 ``-Dfile.encoding=UTF-8``.
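A launcher could guard against a config file that omits the parameter. The sketch below is an assumption about how one might do that, not something CRATE does: the function name is hypothetical, and it assumes the first argument is the ``java`` executable (JVM options must precede the main class, so the flag is inserted directly after it).

```python
import shlex
from typing import List

ENCODING_FLAG = "-Dfile.encoding="


def ensure_file_encoding(progargs: str, encoding: str = "UTF-8") -> List[str]:
    # Split the configured command line the same way the parser does.
    args = shlex.split(progargs)
    if not any(arg.startswith(ENCODING_FLAG) for arg in args):
        # Assumes args[0] is the java executable.
        args.insert(1, ENCODING_FLAG + encoding)
    return args


if __name__ == "__main__":
    # Hypothetical command line, for illustration only.
    print(ensure_file_encoding("java -classpath /some/path CrateMedexPipeline"))
```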
391""" # noqa: E501
393import logging
394import os
395import shlex
396import subprocess
397import tempfile
398from typing import Any, Dict, Generator, List, Optional, Tuple
400from cardinal_pythonlib.cmdline import cmdline_quote
401from cardinal_pythonlib.fileops import mkdir_p
402from sqlalchemy import Column, Index, Integer, String, Text
404from crate_anon.nlp_manager.base_nlp_parser import (
405 BaseNlpParser,
406 TextProcessingFailed,
407)
408from crate_anon.nlp_manager.constants import (
409 MEDEX_DATA_READY_SIGNAL,
410 MEDEX_RESULTS_READY_SIGNAL,
411 ProcessorConfigKeys,
412)
413from crate_anon.nlp_manager.nlp_definition import (
414 NlpDefinition,
415)
417log = logging.getLogger(__name__)
420# =============================================================================
421# Constants
422# =============================================================================
424DATA_FILENAME = "crate_medex.txt"
425DATA_FILENAME_KEEP = "crate_medex_{}.txt"
427USE_TEMP_DIRS = True
428# ... True for production; False to see e.g. logs afterwards, by keeping
429# everything in a subdirectory of the user's home directory (see hard-coded
430# nastiness -- for debugging only)
432SKIP_IF_NO_GENERIC = True
433# ... Probably should be True. MedEx returns hits for drug "Thu" with no
434# generic drug; this from its weekday lexicon, I think.
436# -----------------------------------------------------------------------------
437# Maximum field lengths
438# -----------------------------------------------------------------------------
439# https://phekb.org/sites/phenotype/files/MedEx_UIMA_eMERGE_short.pdf
440#
441# RxNorm: https://www.nlm.nih.gov/research/umls/rxnorm/overview.html
442#
443# UMLS: https://www.nlm.nih.gov/research/umls/new_users/glossary.html
444# UMLS CUI max length: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/columns_data_elements.html # noqa: E501
445UMLS_CUI_MAX_LENGTH = 8 # definite
447# TIMEX3:
448# - http://www.timeml.org/tempeval2/tempeval2-trial/guidelines/timex3guidelines-072009.pdf # noqa: E501
449# - http://www.timeml.org/publications/timeMLdocs/timeml_1.2.1.html#timex3 # noqa: E501
450TIMEX3_MAX_LENGTH = 50 # guess
452# Drug length:
453# There are long ones, like
454# "influenza virus vaccine, inactivated a-brisbane-59-2007, ivr-148 (h1n1) strain" (78) # noqa: E501
455# See e.g. resources/rxcui_generic.cfg, and:
456# $ wc -L filename # shows length of longest line
457# $ egrep -n "^.{$(wc -L < filename)}$" filename # shows longest line
458# ... possibly this gets a bit confused by tabs, can also put a length in:
459# $ egrep -n "^.{302}$" filename # shows lines of length 302
460# And we find a drug of length 286:
461# 1-oxa-7-azacyclopentadecan-15-one,13-((2,6-dideoy-3-c-methyl-3-o-methyl-alpha-l-ribo-hexopyranosyl)oxy)-2-ethyl-3,4,10-trihydroxy 3,5,8,10,12,14-hexamethyl-7-propyl-11-((3,4,6-trideoxy-3-(dimethylamino)-beta-d-xylo-hexopyranosyl)oxy)-, ((2r*, 3s*,4r*,5s*,8r*,10r*,11r*,12s*,13s*,14r*))- # noqa: E501
462# Then there are multivitamin things in brand_generic with length >600.
463# So we should use an unlimited field; SQLAlchemy helpfully seems to translate
464# Text to VARCHAR(MAX) under SQL Server, which is the more efficient:
465# https://stackoverflow.com/questions/834788/using-varcharmax-vs-text-on-sql-server # noqa: E501
466MEDEX_MAX_FORM_LENGTH = 255 # guess; "Powder For Oral Suspension" (26) is one
467MEDEX_MAX_STRENGTH_LENGTH = 50 # guess
468MEDEX_MAX_DOSE_AMOUNT_LENGTH = 50 # guess
469MEDEX_MAX_ROUTE_LENGTH = 50 # guess
470MEDEX_MAX_FREQUENCY_LENGTH = 50 # guess
471MEDEX_MAX_DURATION_LENGTH = 50 # guess
472MEDEX_MAX_NECESSITY_LENGTH = 50 # guess
475# =============================================================================
476# Medex
477# =============================================================================
480class PseudoTempDir:
481 """
482 This class exists so that a TemporaryDirectory and a manually specified
483 directory can be addressed via the same (very simple!) interface.
484 """
486 def __init__(self, name: str) -> None:
487 self.name = name
490class Medex(BaseNlpParser):
491 """
492 EXTERNAL.
494 Class controlling a MedEx-UIMA external process, via our custom Java
495 interface, ``CrateMedexPipeline.java``.
497 MedEx-UIMA is a medication-finding tool:
498 https://www.ncbi.nlm.nih.gov/pubmed/25954575.
499 """
501 uses_external_tool = True
503 def __init__(
504 self,
505 nlpdef: NlpDefinition,
506 cfg_processor_name: str,
507 commit: bool = False,
508 ) -> None:
509 """
510 Args:
511 nlpdef:
512 a :class:`crate_anon.nlp_manager.nlp_definition.NlpDefinition`
513 cfg_processor_name:
514 the name of a CRATE NLP config file section (from which we may
515 choose to get extra config information)
516 commit:
517 force a COMMIT whenever we insert data? You should specify this
518 in multiprocess mode, or you may get database deadlocks.
519 """
520 super().__init__(
521 nlpdef=nlpdef,
522 cfg_processor_name=cfg_processor_name,
523 commit=commit,
524 friendly_name="MedEx",
525 )
527 if nlpdef is None: # only None for debugging!
528 self._debug_mode = True
529 self._tablename = self.classname().lower()
530 self._max_external_prog_uses = 1
531 self._progenvsection = ""
532 self._env = {} # type: Dict[str, str]
533 progargs = ""
534 else:
535 self._debug_mode = False
537 self._tablename = self._cfgsection.opt_str(
538 ProcessorConfigKeys.DESTTABLE, required=True
539 )
541 self._max_external_prog_uses = self._cfgsection.opt_int_positive(
542 ProcessorConfigKeys.MAX_EXTERNAL_PROG_USES, default=0
543 )
545 self._progenvsection = self._cfgsection.opt_str(
546 ProcessorConfigKeys.PROGENVSECTION
547 )
549 if self._progenvsection:
550 # noinspection PyTypeChecker
551 self._env = nlpdef.get_env_dict(
552 self._progenvsection, os.environ
553 )
554 else:
555 self._env = os.environ.copy()
556 self._env["NLPLOGTAG"] = nlpdef.logtag or "."
557 # ... because passing a "-lt" switch with no parameter will make
558 # CrateMedexPipeline.java complain and stop
560 progargs = self._cfgsection.opt_str(
561 ProcessorConfigKeys.PROGARGS, required=True
562 )
564 if USE_TEMP_DIRS:
565 self._inputdir = tempfile.TemporaryDirectory()
566 self._outputdir = tempfile.TemporaryDirectory()
567 self._workingdir = tempfile.TemporaryDirectory()
568 # ... these are autodeleted when the object goes out of scope; see
569 # https://docs.python.org/3/library/tempfile.html
570 # ... which manages it using weakref.finalize
571 else:
572 homedir = os.path.expanduser("~")
573 self._inputdir = PseudoTempDir(
574 os.path.join(homedir, "medextemp", "input")
575 )
576 mkdir_p(self._inputdir.name)
577 self._outputdir = PseudoTempDir(
578 os.path.join(homedir, "medextemp", "output")
579 )
580 mkdir_p(self._outputdir.name)
581 self._workingdir = PseudoTempDir(
582 os.path.join(homedir, "medextemp", "working")
583 )
584 mkdir_p(self._workingdir.name)
586 formatted_progargs = progargs.format(**self._env)
587 self._progargs = shlex.split(formatted_progargs)
588 self._progargs.extend(
589 [
590 "-data_ready_signal",
591 MEDEX_DATA_READY_SIGNAL,
592 "-results_ready_signal",
593 MEDEX_RESULTS_READY_SIGNAL,
594 "-i",
595 self._inputdir.name,
596 "-o",
597 self._outputdir.name,
598 ]
599 )
601 self._n_uses = 0
602 self._pipe_encoding = "utf8"
603 self._file_encoding = "utf8"
604 self._p = None # the subprocess
605 self._started = False
607 # -------------------------------------------------------------------------
608 # External process control
609 # -------------------------------------------------------------------------
611 def _start(self) -> None:
612 """
613 Launch the external process. We will save and retrieve data via files,
614 and send signals ("data ready", "results ready") via stdin/stdout.
615 """
616 if self._started or self._debug_mode:
617 return
618 args = self._progargs
620 # Nasty MedEx hacks
621 cwd = os.getcwd()
622 log.info(
623 f"For MedEx's benefit, changing to directory: "
624 f"{self._workingdir.name}"
625 )
626 os.chdir(self._workingdir.name)
627 sentsdir = os.path.join(self._workingdir.name, "sents")
628 log.info(f"Making temporary sentences directory: {sentsdir}")
629 mkdir_p(sentsdir)
630 logdir = os.path.join(self._workingdir.name, "log")
631 log.info(f"Making temporary log directory: {logdir}")
632 mkdir_p(logdir)
634 log.info(f"Launching command: {cmdline_quote(args)}")
635 self._p = subprocess.Popen(
636 args,
637 stdin=subprocess.PIPE,
638 stdout=subprocess.PIPE,
639 # stderr=subprocess.PIPE,
640 shell=False,
641 bufsize=1,
642 )
643 # ... don't ask for stderr to be piped if you don't want it; firstly,
644 # there's a risk that if you don't consume it, something hangs, and
645 # secondly if you don't consume it, you see it on the console, which is
646 # helpful.
647 self._started = True
648 log.info(f"Returning to working directory {cwd}")
649 os.chdir(cwd)
651 def _encode_to_subproc_stdin(self, text: str) -> None:
652 """
653 Send text to the external program (via its stdin), encoding it in
654 the process (typically to UTF-8).
655 """
656 log.debug("SENDING: " + text)
657 bytes_ = text.encode(self._pipe_encoding)
658 self._p.stdin.write(bytes_)
660 def _flush_subproc_stdin(self) -> None:
661 """
662 Flushes what we're sending to the external program via its stdin.
663 """
664 self._p.stdin.flush()
666 def _decode_from_subproc_stdout(self) -> str:
667 """
668 Decode what we've received from the external program's stdout,
669 from its specific encoding (usually UTF-8) to a Python string.
670 """
671 bytes_ = self._p.stdout.readline()
672 text = bytes_.decode(self._pipe_encoding)
673 log.debug("RECEIVING: " + repr(text))
674 return text
676 def _finish(self) -> None:
677 """
678 Close down the external process.
679 """
680 if not self._started:
681 return
682 self._p.communicate() # close p.stdout, wait for the subprocess to exit # noqa: E501
683 self._started = False
685 def _signal_data_ready(self) -> bool:
686 """
687 Signals to the child process that we have written data to files, and
688 it's now ready for reading by MedEx.
690 Returns: OK?
691 """
692 if self._finished():
693 return False
694 self._encode_to_subproc_stdin(MEDEX_DATA_READY_SIGNAL + os.linesep)
695 self._flush_subproc_stdin()
696 return True
698 def _await_results_ready(self) -> bool:
699 """
700 Waits until MedEx has signalled us that results are ready.
702 Returns: OK?
703 """
704 while True:
705 if self._finished():
706 return False
707 line = self._decode_from_subproc_stdout()
708 if line == MEDEX_RESULTS_READY_SIGNAL + os.linesep:
709 return True
711 def _finished(self) -> bool:
712 """
713 Has MedEx finished?
714 """
715 if not self._started:
716 return True
717 self._p.poll()
718 finished = self._p.returncode is not None
719 if finished:
720 self._started = False
721 return finished
723 def _restart(self) -> None:
724 """
725 Close down the external process and restart it.
726 """
727 self._finish()
728 self._start()
730 # -------------------------------------------------------------------------
731 # Input processing
732 # -------------------------------------------------------------------------
734 def parse(
735 self, text: str
736 ) -> Generator[Tuple[str, Dict[str, Any]], None, None]:
737 """
738 - Send text to the external process, and receive the result.
739 - Note that associated data is not passed into this function, and is
740 kept in the Python environment, so we can't run into any problems
741 with the transfer to/from the Java program garbling important data.
742 All we send to the subprocess is the text (and an input_terminator).
743 Then, we may receive MULTIPLE sets of data back ("your text contains
744 the following 7 people/drug references/whatever"), followed
745 eventually by the output_terminator, at which point this set is
746 complete.
747 """
748 self._n_uses += 1
749 self._start() # ensure started
750 if USE_TEMP_DIRS:
751 basefilename = DATA_FILENAME
752 else:
753 basefilename = DATA_FILENAME_KEEP.format(self._n_uses)
754 inputfilename = os.path.join(self._inputdir.name, basefilename)
755 outputfilename = os.path.join(self._outputdir.name, basefilename)
756 # ... MedEx gives output files the SAME NAME as input files.
758 try:
759 with open(
760 inputfilename, mode="w", encoding=self._file_encoding
761 ) as infile:
762 # log.info(f"text: {text!r}")
763 infile.write(text)
765 if (
766 not self._signal_data_ready() # send
767 or not self._await_results_ready() # receive
768 ):
769 log.critical("Subprocess terminated unexpectedly")
770 os.remove(inputfilename)
771 # We were using "log.critical()" and "return", but if the Medex
772 # processor is misconfigured, the failed processor can be run
773 # over thousands of records over many hours before the failure
774 # is obvious. Changed 2017-03-17.
775 raise ValueError(
776 "Java interface to Medex failed - misconfigured?"
777 )
779 with open(
780 outputfilename, mode="r", encoding=self._file_encoding
781 ) as infile:
782 resultlines = infile.readlines()
783 for line in resultlines:
784 # log.critical(f"received: {line}")
785 # Output code, from MedTagger.print_result():
786 # out.write(
787 # index + 1 + "\t" + sent_text + "|" +
788 # drug + "|" + brand + "|" + dose_form + "|" +
789 # strength + "|" + dose_amt + "|" +
790 # route + "|" + frequency + "|" + duration + "|" +
791 # necessity + "|" +
792 # umls_code + "|" + rx_code + "|" + generic_code + "|" +
793 # generic_name + "\n");
794 # NOTE that the text can contain | characters. So work from the
795 # right.
796 line = line.rstrip() # remove any trailing newline
797 fields = line.split("|")
798 if len(fields) < 14:
799 log.warning(f"Bad result received: {line!r}")
800 continue
801 generic_name = self.str_or_none(fields[-1])
802 if not generic_name and SKIP_IF_NO_GENERIC:
803 continue
804 generic_code = self.int_or_none(fields[-2])
805 rx_code = self.int_or_none(fields[-3])
806 umls_code = self.str_or_none(fields[-4])
807 (
808 necessity,
809 necessity_startpos,
810 necessity_endpos,
811 ) = self.get_text_start_end(fields[-5])
812 (
813 duration,
814 duration_startpos,
815 duration_endpos,
816 ) = self.get_text_start_end(fields[-6])
817 (
818 _freq_text,
819 frequency_startpos,
820 frequency_endpos,
821 ) = self.get_text_start_end(fields[-7])
822 frequency, frequency_timex = self.frequency_and_timex(
823 _freq_text
824 )
825 (
826 route,
827 route_startpos,
828 route_endpos,
829 ) = self.get_text_start_end(fields[-8])
830 (
831 dose_amount,
832 dose_amount_startpos,
833 dose_amount_endpos,
834 ) = self.get_text_start_end(fields[-9])
835 (
836 strength,
837 strength_startpos,
838 strength_endpos,
839 ) = self.get_text_start_end(fields[-10])
840 (form, form_startpos, form_endpos) = self.get_text_start_end(
841 fields[-11]
842 )
843 (
844 brand,
845 brand_startpos,
846 brand_endpos,
847 ) = self.get_text_start_end(fields[-12])
848 (drug, drug_startpos, drug_endpos) = self.get_text_start_end(
849 fields[-13]
850 )
851 _start_bit = "|".join(fields[0:-13])
852 _index_text, sent_text = _start_bit.split("\t", maxsplit=1)
853 index = self.int_or_none(_index_text)
854 yield self._tablename, {
855 "sentence_index": index,
856 "sentence_text": sent_text,
857 "drug": drug,
858 "drug_startpos": drug_startpos,
859 "drug_endpos": drug_endpos,
860 "brand": brand,
861 "brand_startpos": brand_startpos,
862 "brand_endpos": brand_endpos,
863 "form": form,
864 "form_startpos": form_startpos,
865 "form_endpos": form_endpos,
866 "strength": strength,
867 "strength_startpos": strength_startpos,
868 "strength_endpos": strength_endpos,
869 "dose_amount": dose_amount,
870 "dose_amount_startpos": dose_amount_startpos,
871 "dose_amount_endpos": dose_amount_endpos,
872 "route": route,
873 "route_startpos": route_startpos,
874 "route_endpos": route_endpos,
875 "frequency": frequency,
876 "frequency_startpos": frequency_startpos,
877 "frequency_endpos": frequency_endpos,
878 "frequency_timex3": frequency_timex,
879 "duration": duration,
880 "duration_startpos": duration_startpos,
881 "duration_endpos": duration_endpos,
882 "necessity": necessity,
883 "necessity_startpos": necessity_startpos,
884 "necessity_endpos": necessity_endpos,
885 "umls_code": umls_code,
886 "rx_code": rx_code,
887 "generic_code": generic_code,
888 "generic_name": generic_name,
889 }
891 # Since MedEx scans all files in the input directory, then if we're
892 # not using temporary directories (and are therefore using a new
893 # filename per item), we should remove the old one.
894 os.remove(inputfilename)
896 # Restart subprocess?
897 if (
898 self._max_external_prog_uses > 0
899 and self._n_uses % self._max_external_prog_uses == 0
900 ):
901 log.info(
902 f"relaunching app after "
903 f"{self._max_external_prog_uses} uses"
904 )
905 self._restart()
907 except BrokenPipeError:
908 log.error("Broken pipe; relaunching app")
909 self._restart()
910 raise TextProcessingFailed()
912 @staticmethod
913 def get_text_start_end(
914 medex_str: Optional[str],
915 ) -> Tuple[Optional[str], Optional[int], Optional[int]]:
916 """
917 MedEx returns "drug", "strength", etc. as ``aspirin[7,14]``, where the
918 text is followed by the start position (zero-indexed) and the end
919 position (one beyond the last character) (zero-indexed). This function
920 converts a string like ``aspirin[7,14]`` to a tuple like ``"aspirin",
921 7, 14``.
923 Args:
924 medex_str: string from MedEx
926 Returns:
927 tuple: ``text, start_pos, end_pos``; values may be ``None``
928 """
929 if not medex_str:
930 return None, None, None
931 lbracket = medex_str.rfind("[") # -1 for not found
932 comma = medex_str.rfind(",")
933 rbracket = medex_str.rfind("]")
934 try:
935 if lbracket == -1 or not (lbracket < comma < rbracket):
936 raise ValueError()
937 text = medex_str[:lbracket]
938 lpos = int(medex_str[lbracket + 1 : comma])
939 rpos = int(medex_str[comma + 1 : rbracket])
940 return text, lpos, rpos
941 except (TypeError, ValueError):
942 log.warning(f"Bad string[left, right] format: {medex_str!r}")
943 return None, None, None
945 @staticmethod
946 def int_or_none(text: Optional[str]) -> Optional[int]:
947 """
948 Converts text to an integer, or returns ``None`` if that fails.
949 """
950 try:
951 return int(text)
952 except (TypeError, ValueError):
953 return None
955 @staticmethod
956 def str_or_none(text: Optional[str]) -> Optional[str]:
957 """
958 If the string is non-empty, return the string; otherwise return
959 ``None``.
960 """
961 return None if not text else text
963 @staticmethod
964 def frequency_and_timex(text: str) -> Tuple[Optional[str], Optional[str]]:
965 """
966 Splits a MedEx frequency/TIMEX string into its frequency and TIMEX
967 parts; e.g. splits ``b.i.d.(R1P12H)`` to ``"b.i.d.", "R1P12H"``.
968 """
969 if not text:
970 return None, None
971 lbracket = text.rfind("(")
972 rbracket = text.rfind(")")
973 if (
974 lbracket == -1
975 or not (lbracket < rbracket)
976 or rbracket != len(text) - 1
977 ):
978 return None, None
979 return text[0:lbracket], text[lbracket + 1 : rbracket]
981 # -------------------------------------------------------------------------
982 # Test
983 # -------------------------------------------------------------------------
985 def test(self, verbose: bool = False) -> None:
986 """
987 Test the send function.
988 """
989 if self._debug_mode:
990 return
991 self.test_parser(
992 [
993 "Bob Hope visited Seattle and took venlafaxine M/R 375mg od.",
994 "James Joyce wrote Ulysses whilst taking aspirin 75mg mane.",
995 ]
996 )
998 # -------------------------------------------------------------------------
999 # Database structure
1000 # -------------------------------------------------------------------------
1002 def dest_tables_columns(self) -> Dict[str, List[Column]]:
1003 # docstring in superclass
1004 startposdef = "Start position (zero-based) of "
1005 endposdef = (
1006 "End position (zero-based index of one beyond last character) of "
1007 )
1008 return {
1009 self._tablename: [
1010 Column(
1011 "sentence_index",
1012 Integer,
1013 comment="One-based index of sentence in text",
1014 ),
1015 Column(
1016 "sentence_text",
1017 Text,
1018 comment="Text recognized as a sentence by MedEx",
1019 ),
1020 Column("drug", Text, comment="Drug name, as in the text"),
1021 Column("drug_startpos", Integer, comment=startposdef + "drug"),
1022 Column("drug_endpos", Integer, comment=endposdef + "drug"),
1023 Column(
1024 "brand",
1025 Text,
1026 comment="Drug brand name (?lookup ?only if given)",
1027 ),
1028 Column(
1029 "brand_startpos", Integer, comment=startposdef + "brand"
1030 ),
1031 Column("brand_endpos", Integer, comment=endposdef + "brand"),
1032 Column(
1033 "form",
1034 String(MEDEX_MAX_FORM_LENGTH),
1035 comment="Drug/dose form (e.g. 'tablet')",
1036 ),
1037 Column("form_startpos", Integer, comment=startposdef + "form"),
1038 Column("form_endpos", Integer, comment=endposdef + "form"),
1039 Column(
1040 "strength",
1041 String(MEDEX_MAX_STRENGTH_LENGTH),
1042 comment="Strength (e.g. '75mg')",
1043 ),
1044 Column(
1045 "strength_startpos",
1046 Integer,
1047 comment=startposdef + "strength",
1048 ),
1049 Column(
1050 "strength_endpos", Integer, comment=endposdef + "strength"
1051 ),
1052 Column(
1053 "dose_amount",
1054 String(MEDEX_MAX_DOSE_AMOUNT_LENGTH),
1055 comment="Dose amount (e.g. '2 tablets')",
1056 ),
1057 Column(
1058 "dose_amount_startpos",
1059 Integer,
1060 comment=startposdef + "dose_amount",
1061 ),
1062 Column(
1063 "dose_amount_endpos",
1064 Integer,
1065 comment=endposdef + "dose_amount",
1066 ),
1067 Column(
1068 "route",
1069 String(MEDEX_MAX_ROUTE_LENGTH),
1070 comment="Route (e.g. 'by mouth')",
1071 ),
1072 Column(
1073 "route_startpos", Integer, comment=startposdef + "route"
1074 ),
1075 Column("route_endpos", Integer, comment=endposdef + "route"),
1076 Column(
1077 "frequency",
1078 String(MEDEX_MAX_FREQUENCY_LENGTH),
1079 comment="Frequency (e.g. 'b.i.d.')",
1080 ),
1081 Column(
1082 "frequency_startpos",
1083 Integer,
1084 comment=startposdef + "frequency",
1085 ),
1086 Column(
1087 "frequency_endpos",
1088 Integer,
1089 comment=endposdef + "frequency",
1090 ),
1091 Column(
1092 "frequency_timex3",
1093 String(TIMEX3_MAX_LENGTH),
1094 comment=(
1095 "Normalized frequency in TIMEX3 format "
1096 "(e.g. 'R1P12H')"
1097 ),
1098 ),
1099 Column(
1100 "duration",
1101 String(MEDEX_MAX_DURATION_LENGTH),
1102 comment="Duration (e.g. 'for 10 days')",
1103 ),
1104 Column(
1105 "duration_startpos",
1106 Integer,
1107 comment=startposdef + "duration",
1108 ),
1109 Column(
1110 "duration_endpos", Integer, comment=endposdef + "duration"
1111 ),
1112 Column(
1113 "necessity",
1114 String(MEDEX_MAX_NECESSITY_LENGTH),
1115 comment="Necessity (e.g. 'prn')",
1116 ),
1117 Column(
1118 "necessity_startpos",
1119 Integer,
1120 comment=startposdef + "necessity",
1121 ),
1122 Column(
1123 "necessity_endpos",
1124 Integer,
1125 comment=endposdef + "necessity",
1126 ),
1127 Column(
1128 "umls_code",
1129 String(UMLS_CUI_MAX_LENGTH),
1130 comment="UMLS CUI",
1131 ),
1132 Column("rx_code", Integer, comment="RxNorm RxCUI for drug"),
1133 Column(
1134 "generic_code",
1135 Integer,
1136 comment="RxNorm RxCUI for generic name",
1137 ),
1138 Column(
1139 "generic_name",
1140 Text,
1141 comment="Generic drug name (associated with RxCUI code)",
1142 ),
1143 ]
1144 }
1146 def dest_tables_indexes(self) -> Dict[str, List[Index]]:
1147 # docstring in superclass
1148 return {}
1149 # return {
1150 # self._tablename: [
1151 # Index('idx_generic_name', 'generic_name'),
1152 # ]
1153 # }
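The two string-splitting helpers above (``get_text_start_end`` and ``frequency_and_timex``) can be tried without launching MedEx. What follows is a minimal standalone sketch, assuming the logic shown in the listing: the functions are re-declared at module level purely for illustration, the logging call is omitted, and the sample sentence is invented for the example. It is not the class's actual code path.

```python
from typing import Optional, Tuple


def get_text_start_end(
    medex_str: Optional[str],
) -> Tuple[Optional[str], Optional[int], Optional[int]]:
    """Split ``text[start,end]`` into ``(text, start, end)``."""
    if not medex_str:
        return None, None, None
    lbracket = medex_str.rfind("[")  # -1 if not found
    comma = medex_str.rfind(",")
    rbracket = medex_str.rfind("]")
    if lbracket == -1 or not (lbracket < comma < rbracket):
        return None, None, None
    try:
        return (
            medex_str[:lbracket],
            int(medex_str[lbracket + 1 : comma]),
            int(medex_str[comma + 1 : rbracket]),
        )
    except ValueError:
        # Positions were not integers; the original logs a warning here.
        return None, None, None


def frequency_and_timex(
    text: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Split ``frequency(TIMEX)`` into ``(frequency, timex)``."""
    if not text:
        return None, None
    lbracket = text.rfind("(")
    rbracket = text.rfind(")")
    if (
        lbracket == -1
        or not (lbracket < rbracket)
        or rbracket != len(text) - 1  # the ")" must be the final character
    ):
        return None, None
    return text[:lbracket], text[lbracket + 1 : rbracket]


# Hypothetical sample sentence, for illustration only.
source = "Taking aspirin 75mg"
drug, start, end = get_text_start_end("aspirin[7,14]")

# The end position is exclusive, so a plain slice recovers the text.
recovered = source[start:end]

frequency, timex = frequency_and_timex("b.i.d.(R1P12H)")
```

Because the end position is one beyond the last character, the stored ``*_startpos``/``*_endpos`` pairs can be used directly as Python slice bounds. Malformed input yields ``None`` values rather than an exception.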