Coverage for preprocess/systmone_ddgen.py: 49%
685 statements
« prev ^ index » next coverage.py v7.8.0, created at 2026-02-05 06:46 -0600
« prev ^ index » next coverage.py v7.8.0, created at 2026-02-05 06:46 -0600
1r"""
2crate_anon/preprocess/systmone_ddgen.py
4===============================================================================
6 Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
7 Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
9 This file is part of CRATE.
11 CRATE is free software: you can redistribute it and/or modify
12 it under the terms of the GNU General Public License as published by
13 the Free Software Foundation, either version 3 of the License, or
14 (at your option) any later version.
16 CRATE is distributed in the hope that it will be useful,
17 but WITHOUT ANY WARRANTY; without even the implied warranty of
18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19 GNU General Public License for more details.
21 You should have received a copy of the GNU General Public License
22 along with CRATE. If not, see <https://www.gnu.org/licenses/>.
24===============================================================================
26**Generate a CRATE data dictionary for SystmOne data.**
29Notes
30-----
32- SystmOne is a general-purpose electronic health record (EHR) system from TPP
33 (The Phoenix Partnership): https://tpp-uk.com/products/.
35- It's widely used in general practice (GP), and in
36 Cambridgeshire/Peterborough, ~80% of GP surgeries use it (2018 data,
37 https://pubmed.ncbi.nlm.nih.gov/29490968/, Figure 2).
39- Cambridgeshire & Peterborough NHS Foundation Trust (CPFT) used to use
40 SystmOne for community services, and then moved nearly all the rest of its
41 services to SystmOne (from RiO, in the case of mental health services):
42 Children's Directorate (12 Oct 2020), Community Hospital wards (30 Nov 2020),
43 the rest of the Older People, Adults, and Community Directorate (7 Dec 2020),
44 and finally the Adult and Specialist Directorate (14 Jun 2021).
46- SystmOne is centrally hosted by TPP.
48- TPP provide a nightly "Strategic Reporting extract" (SRE) of SystmOne data.
50- Its primary coding mechanisms are (1) CTV3 (Read) codes, and (2) SNOMED
51 codes (see e.g. https://termbrowser.nhs.uk/) -- the latter are gradually
52 taking over (as of 2021). Coded values can be numeric. For example, one entry
53 might include:
55 - SNOMED code 718087004
56 - SNOMED text "QRISK2 cardiovascular disease 10 year risk score"
57 - CTV3 code "XaQVY"
58 - CTV3 text "QRISK2 cardiovascular disease 10 year risk score"
59 - Numeric unit "%"
60 - Numeric value 10.4
62- SystmOne collects data mostly via "templates" and "questionnaires". Templates
63 are perhaps closer to the heart of SystmOne (e.g. better presented in the
64 long-form journal view) and values entered into templates are (always?)
65 coded. Questionnaires are more free-form. Both can have free text attached to
66 coded values.
69Strategic Reporting extract
70---------------------------
72``SpecificationDirectory.zip`` (e.g. 2021-02-18) contains e.g. ``Specification
73v123.csv``, which is a full description of the SRE. Principles:
75- All these tables start ``SR``, e.g. ``SR18WeekWait``, ``SRAAndEAttendance``.
77- Columns in that spreadsheet are:
79 .. code-block:: none
81 TableName
82 TableDescription
83 ColumnName
84 ColumnDescription
85 ColumnDataType -- possible values include:
86 Boolean
87 Date
88 Date and Time
89 Numeric - Integer
90 Numeric - Real
91 Text - Fixed
92 Text - Variable
93 ColumnLength -- possible values include:
94 empty (e.g. boolean, date, date/time)
95 8 for integer
96 4 for real
97 the VARCHAR length -- for both "variable" and "fixed" text types
98 DateDefining
99 ColumnOrdinal -- sequence number of column within table
100 LinkedTable }
101 LinkedColumn1 }-+
102 LinkedColumn2 } |
103 +-- e.g.
104 SROrganisation, ID
105 SRStaffMember, RowIdentifier
106 SRPatient, RowIdentifier, IDOrganisationVisibleTo
108- To get a table list:
110 .. code-block:: bash
112 # Poor for CSVs with newlines within their strings:
113 tail -n+2 "Specification v123.csv" | cut -d, -f1 | sort | uniq
115 # Much better:
116 python3 -c 'import csv; print("\n".join(row[0] for i, row in enumerate(csv.reader(open("Specification v123.csv"))) if i > 0))' | sort | uniq
118- Tables and their descriptions:
120 .. code-block:: python
122 import csv
123 s = set()
124 for i, row in enumerate(csv.reader(open("Specification v123.csv"))):
125 if i > 0:
126 s.add(f"{row[0]} - {row[1]}")
128 print("\n".join((x for x in sorted(s))))
130 Translating that to a single line: https://www.python.org/dev/peps/pep-0289/
131 ... meh, hard.
133- ``SRPatient`` looks to be the master patient table here -- including names,
134 dates of birth/death, NHS number.
136- ``Tpp Strategic Reporting Table Specification v123.rtf`` contains a nicer
137 version of (exactly?) the same information.
139- Strategic Reporting downloads can be configured. Options include:
141 - Whether to include the shared record. (I'm not sure if that means a
142 national thing or data from SystmOne that each patient may have consented
143 to sharing "'out' from another organization, then 'in' to mine".)
145- When a download is set up, the recipient gets one CSV file per table
146 selected, such as ``SRPatient.csv`` for the ``SRPatient`` table, plus some
147 ever-present system tables:
149 - ``SRManifest.csv``, describing what you've received;
150 - ``SRMapping.csv`` and ``SRMappingGroup.csv``, providing text for built-in
151 lists.
153- The date format is e.g. "29 Sep 2011 14:53:28". Unknown times are marked as
154 "00:00:00". Unknown dates give an empty string. Boolean values are ``TRUE``
155 or ``FALSE``.
158Free-text data
159--------------
161The SRE does not contain free text data or binary documents by default. For
162some Trusts, an augmented SRE is provided also, with that information.
164From ``FreeText Model.xlsx``, 2021-04-15, some of this data comes in the
165following format:
167.. code-block:: none
169 Field Name Type Description
170 RowIdentifier bigint The unique identifier of the
171 record
172 IDPatient bigint Links to patient ID in
173 demographics
174 IDReferralIn bigint ID of referral
175 IDEvent bigint Links to activity event ID
176 Question varchar(MAX) The questionnaire question
177 [FreeText] varchar(MAX) The answer given to the above
178 question
179 EventDate datetime The data/time of the
180 questionnaire
181 SRTable varchar(100) Which SR table the record
182 relates to
183 IDSRTable bigint The ID of the above table
184 QuestionnaireName varchar(255) The name of the questionnaire
185 IDAnsweredQuestionnaire bigint The ID of the above
186 questionnaire
187 QuestionnaireVersionNumber int The version number of the above
188 questionnaire
189 IDOrganisation bigint Organisation ID of the
190 questionnaire record
191 CPFTGroup int Group (directorate)
192 Directorate varchar(50) Directorate name
193 TeamName varchar(100) Name of team linked to the
194 referral
195 IsMentalHealth int Mental or physical health
196 Imported date Date imported to the database
198 (SR = Strategic Reporting.)
200Specimen values:
202- SRTable: 'SRAnsweredQuestionnaire'
203- IDSRTable: this varies for rows with SRTable = 'SRAnsweredQuestionnaire', so
204 I think it's the PK within the table indicated by SRTable.
205- QuestionnaireName = 'CPFT Risk Assessment'
206- IDAnsweredQuestionnaire = this is unique for rows with QuestionnaireName =
207 'CPFT Risk Assessment', so I think it's the ID of the Questionnaire, and is
208 probably a typo.
210(This ends up (in our environment) in the S1_FreeText table, as below, so it
211likely arrives as SRFreeText.)
213However, note that ``RowIdentifier`` is **not** unique in this table. Whatever
214they mean by "record", it isn't that. For example, there are 7 rows with one
215common value of ``RowIdentifier`` that are clearly the 7 questions (in
216``Question``) and textually coded answers (in ``FreeText``) to a SWEMWBS
217questionnaire. That means that to apply a FULLTEXT index, which requires an
218indexed unique value, we have to add one.
221Key fields
222----------
224- ``IDPatient`` -- the SystmOne patient number, in all patient tables (PID,
225 in CRATE terms).
226- ``SRPatient.NHSNumber`` -- the NHS number (MPID, in CRATE terms).
229Notable tables in the SRE
230-------------------------
232- [SR]Patient, as above
234- Patient identifiers and relationship/third-party details:
236 - [SR]PatientAddressHistory
237 - [SR]PatientContactDetails
238 - [SR]HospitalAAndENumber
240- Relationship/third-party details:
242 - [SR]PatientRelationship
243 - some of the safeguarding tables
245- [SR]NDOptOutPreference, re NHS national data opt out (for NHS Act s251 use)
247 - This has an IDPatient column; presumably presence indicates an active
248 opt-out.
250- Full text and binary:
252 - [SR]Media -- contains filenames and some metadata
253 - [SR]FreeText -- if supplied
256Notable additional tables/columns in the CPFT environment
257---------------------------------------------------------
259- S1_FreeText -- this includes all answers to Questionnaires (linked via
260 ``IDAnsweredQuestionnaire`` etc.). Comes from the "upgraded" SRE.
262- Several tables have identifiers linked in. For example, try:
264 .. code-block:: sql
266 SELECT * FROM information_schema.columns WHERE column_name = 'FirstName'
269Notable tables omitted from the CPFT environment
270------------------------------------------------
272- Questionnaire -- data is linked into to AnsweredQuestionnaire (which still
273 contains the column ``IDQuestionnaire``).
276CPFT copy
277---------
279This broadly follows the SRE, but is expanded. Some notable differences:
281- Tables named ``SR*`` in the SRE are named ``S1_*`` in the CPFT version (e.g.
282 ``SRPatient`` becomes ``S1_Patient``).
284- There is a ``S1_Patient.NationalDataOptOut`` column (0 or 1).
286- The local opt-out information appears in S1_ClinicalOutcome_ConsentResearch
287 (as the OptOut field, a text field) but is clearer in
288 S1_ClinicalOutcome_ConsentResearch_OptOutCheck, which only contains patients
289 opting out and has:
291 .. code-block:: none
293 IDPatient = <ID_of_patient_opting_out>
294 SNOMEDCode = 1091881000000109
295 CTV3Code = 'XaaDb'
296 CTV3Text = 'Declined invitation to participate in research study'
298 So for CPFT, we will autodetect this table/column
299 (S1_ClinicalOutcome_ConsentResearch_OptOutCheck.SNOMEDCode) and the config
300 file should contain:
302 .. code-block:: ini
304 optout_col_values = [1091881000000109]
306- There seem to be quite a few extra tables, such as:
308 .. code-block:: none
310 S1_ClinicalMeasure_QRisk
311 S1_ClinicalMeasure_SWEMWBS
312 S1_ClinicalMeasure_Section58
314 These look like CPFT-created tables pulling data from questionnaires or
315 similar.
317- There is ``S1_FreeText``, where someone (NP!) has helpfully imported that
318 additional data.
320- There is ``S1_ClinicalOutcome_ConsentResearch``, which is the traffic-light
321 system for the CPFT Research Database.
323In more detail:
325- All data is loaded via stored procedures, available via Microsoft SQL Server
326 Management Studio in :menuselection:`[server] --> [database] -->
327 Programmability --> Stored Procedures`. Right-click any and choose "Modify"
328 to view the source. For example, the stored procedure named
329 ``dbo.load_S1_Patient`` creates the ``S1_Patient`` table.
331- ``RwNo`` or ``RwNo_Patient`` is frequently used, typically via:
333 .. code-block:: none
335 SELECT
336 -- stuff,
337 ROW_NUMBER() OVER (
338 PARTITION BY IDPatient
339 ORDER BY DateEventRecorded DESC
340 ) AS RwNo
341 FROM
342 -- somewhere
343 WHERE
344 RwNo = 1
345 ;
347 SELECT
348 -- stuff,
349 ROW_NUMBER() OVER (
350 PARTITION BY IDPatient
351 ORDER BY DateEvent DESC
352 ) AS RwNo_Patient
353 FROM
354 -- somewhere
355 ;
357 ... in other words, picking the most recent for each patient (or, without
358 the WHERE clause, showing its sequencing within each patient).
361Test patients in the live system?
362---------------------------------
364There are some test patients in our live system.
366.. code-block:: sql
368 SELECT COUNT(*) -- or DISTINCT firstname, surname
369 FROM S1_Patient
370 WHERE firstname LIKE '%test%' AND surname LIKE '%test%';
372 -- Several present. However, in the CPFT copy, column "TestPatient" from
373 -- this table (BOOLEAN in SRE docs) is missing. How to distinguish?
375There are several present. They should be distinguished by the ``TestPatient``
376column (BOOLEAN, as per the SRE docs). Our code looks for the "TestPatient"
377column and marks it as an opt-out flag.
379.. todo:: TestPatient column missing in CPFT copy. [A/w NP 2022-03-21.]
382Manual review after first draft
383-------------------------------
385Reviewing CPFT de-identified output for patient-related content only (not
386staff-related), per local ethics approvals.
388.. code-block:: sql
390 -- Tables in the de-identified database:
391 SELECT table_name FROM information_schema.tables WHERE table_catalog = 'S1' ORDER BY table_name;
393All reviewed and this code tweaked accordingly.
396Related tools
397-------------
399- The OpenSAFELY research tool runs on SystmOne data (with other data linked
400 in); it therefore provides helpful code lists. See
402 - https://github.com/opensafely
403 - https://github.com/opensafely-core
405 This tool is one that separates researchers from data (by allowing queries,
406 not researchers, access); it made its debut during COVID-19 with Williamson
407 et al. (2020), https://pubmed.ncbi.nlm.nih.gov/32640463/.
409- Other such tools like DataSHIELD (https://www.datashield.org/, or e.g. Gaye
410 et al. 2014, https://pubmed.ncbi.nlm.nih.gov/25261970/) perform similar
411 researcher/data separation via other methods.
413""" # noqa: E501
415# todo: SystmOne (CRATE traffic-light system): implement S1_ClinicalOutcome_ConsentResearch # noqa: E501
417# =============================================================================
418# Imports
419# =============================================================================
421import csv
422from dataclasses import dataclass, field
423from enum import Enum
424import logging
425import re
426from typing import Any, Dict, Iterable, List, Optional, Set, Tuple
428from cardinal_pythonlib.dicts import reversedict
429from cardinal_pythonlib.enumlike import CaseInsensitiveEnumMeta
430from cardinal_pythonlib.sql.validation import SQLTYPE_DATE
432from crate_anon.anonymise.altermethod import AlterMethod
433from crate_anon.anonymise.constants import (
434 Decision,
435 IndexType,
436 ScrubMethod,
437 ScrubSrc,
438 SrcFlag,
439)
440from crate_anon.common.logfunc import warn_once
441from crate_anon.anonymise.dd import DataDictionary, DataDictionaryRow
442from crate_anon.preprocess.constants import CRATE_COL_PK
444log = logging.getLogger(__name__)
447# =============================================================================
448# Constants
449# =============================================================================
451# -----------------------------------------------------------------------------
452# Typing
453# -----------------------------------------------------------------------------
455TABLE_TRANSLATION_DICT_TYPE = Dict["SystmOneContext", Dict[str, str]]
456# ... maps a SystmOneContext to a dictionary mapping one tablename to another
458COLUMN_TRANSLATION_DICT_TYPE = Dict[
459 "SystmOneContext", Dict[Tuple[str, str], str]
460]
461# ... maps a SystmOneContext to a dictionary mapping (table, col) to newcol
464# -----------------------------------------------------------------------------
465# Helper functions for constants
466# -----------------------------------------------------------------------------
469def _flip_coldict(d: Dict[Tuple[str, str], str]) -> Dict[Tuple[str, str], str]:
470 """
471 Flips a mapping from (tablename, col1): col2 to (tablename, col2): col1.
472 """
473 return {(table, newcol): srccol for (table, srccol), newcol in d.items()}
476# -----------------------------------------------------------------------------
477# Cosmetic
478# -----------------------------------------------------------------------------
480COMMENT_SEP = " // " # for combining parts of column comments
483# -----------------------------------------------------------------------------
484# Generic regular expression
485# -----------------------------------------------------------------------------
487ANYTHING = ".+" # at least one character
490def not_just_at_start(x: str) -> str:
491 """
492 Apply a prefix so that a regex string doesn't just work at the start of a
493 string.
494 """
495 return ".*" + x
498def terminate(x: str) -> str:
499 """
500 Apply an end-of-string terminator to a regex string.
501 """
502 return x + "$"
505# -----------------------------------------------------------------------------
506# Contexts and table naming
507# -----------------------------------------------------------------------------
510class SystmOneContext(Enum, metaclass=CaseInsensitiveEnumMeta):
511 """
512 Environments in which we might have SystmOne data.
513 """
515 TPP_SRE = "TPP Strategic Reporting Extract"
516 CPFT_DW = "CPFT Data Warehouse"
519DEFAULT_SYSTMONE_CONTEXT = SystmOneContext.CPFT_DW
520TABLE_PREFIXES = {
521 SystmOneContext.TPP_SRE: "SR",
522 SystmOneContext.CPFT_DW: "S1_",
523}
526# -----------------------------------------------------------------------------
527# Table names
528# -----------------------------------------------------------------------------
529# "Core" tablename, without the SR/S1_/... prefix.
532class S1Table:
533 """
534 SystmOne "core" table names, with no prefix.
535 """
537 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
538 # Tables containing a range of patient identifiers
539 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
541 PATIENT = "Patient" # e.g. SRPatient (SRE), S1_Patient (CPFT)
542 ADDRESS_HISTORY = "PatientAddressHistory"
543 CARE_PLAN_REVIEW = "CarePlanReview"
544 CONTACT_DETAILS = "PatientContactDetails"
545 RELATIONSHIPS = "PatientRelationship"
547 # Other tables starting "Patient":
548 # - SRPatientContactProperty: seems not relevant; describes visits/diary
549 # events
550 # - SRPatientGroups: e.g. to group patients in a residential home; also
552 # family, but contains no direct identifiers.
553 # - SRPatientLeave: leave from hospital.
554 # - SRPatientLocation: location within A&E departments, I think.
555 # - SRPatientRegistration: registration status (and who did it); also their
556 # preferred pharmacy; but no direct identifiers.
558 HOSP_AE_NUMBERS = "HospitalAAndENumber"
559 SAFEGUARDING_PERSON_AT_RISK = "SafeguardingPersonAtRisk"
561 # See also OMIT_TABLENAME_COLNAME_PAIRS below.
562 #
563 # Other tables whose name might suggest patient identifiers:
564 # - SRAddressBookEntry: institutional addresses only? (FK to this from
565 # SRSafeguardingIncidentDetails, for example.) todo: check -- institutional addresses only? # noqa: E501
566 # - SRHospitalAdmissionAndDischarge, etc. -- no external identifiers linked
567 # to HospitalAAndENumber, just SystmOne IDs.
568 # - SROohEmergencyCall -- no contact numbers.
569 # - SROohTransport -- very structured.
570 # - SROohVisit -- very structured.
572 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
573 # Tables containing free text
574 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
576 FREETEXT = "FreeText"
577 MEDIA = "Media" # todo: binary documents -- how?
579 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
580 # Others requiring special treatment
581 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
583 ACTIVITY_EVENT = "ActivityEvent"
584 BED_CLOSURE = "BedClosure"
585 CONTACTS = "Contacts"
586 DISCHARGE_DELAY = "DischargeDelay"
587 DOCUMENTS = "Documents"
588 MENTAL_HEALTH_ACT_APPEAL = "SectionAppeal"
589 MENTAL_HEALTH_ACT_AWOL = "MHAWOL"
590 NOMIS_NUMBER = "NomisNumber" # National Offender Management Info. System
591 OUT_OF_HOURS_ACTION = "OohAction"
592 OUT_OF_HOURS_THIRD_PARTY_CALL = "OohThirdPartyCall"
593 SAFEGUARDING_ALLEGATION_DETAILS = "SafeguardingAllegationDetails"
594 SAFEGUARDING_INCIDENT_DETAILS = "SafeguardingIncidentDetails"
595 TASK = "Task"
597 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
598 # Not used here, but used for the consent-for-contact system
599 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
601 CLINICAL_CODE = "ClinicalCode"
602 GP_PRACTICE_HISTORY = "GPPracticeHistory"
603 HOSPITAL_CONSULTANT_EVENT = "HospitalConsultantEvent"
604 ORGANISATION = "Organisation"
605 REFERRAL_ALLOCATION = "ReferralAllocation"
606 REFERRAL_ALLOCATION_STAFF = "ReferralAllocationStaff"
607 RESPONSIBLE_PARTY = "ResponsibleParty"
608 TEAM = "Team"
611class CPFTTable:
612 """
613 Selected tables that CPFT have renamed or created.
614 """
616 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
617 # Tables containing a range of patient identifiers
618 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
620 ADDRESS = "PatientAddress"
621 CONTACT_DETAILS = "PatientContact"
622 DEMOGRAPHICS = "Demographics"
623 REL_MOTHER = "PatientRelationshipMother"
625 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
626 # Tables containing free text
627 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
629 CYP_FRS_TELEPHONE_TRIAGE = "CYPFRS_TelephoneTriage"
631 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
632 # Others requiring special treatment
633 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
635 BED_CLOSURE = "InpatientBedClosure"
636 CONTACTS_ARCHIVE = "ContactsArchive"
637 CONTACTS_ARCHIVE_CLIENT_SEQUENCE = "ContactsArchive_ClientSequence"
638 MENTAL_HEALTH_ACT_APPEAL = "MentalHealthAct_SectionAppeal"
639 MENTAL_HEALTH_ACT_AWOL = "MentalHealthAct_Awol"
641 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
642 # Not used here, but used for the consent-for-contact system
643 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
645 DIAGNOSIS = "Diagnosis"
646 GP_PRACTICE = "PatientGPPractice"
647 INPATIENT_CONSULTANT_EPISODE = "InpatientSpells_ConsultantEpisode"
648 CARE_COORDINATOR = "CPACareCoordinator"
650 # These created as custom tables (quasi-views) by CPFT, with no direct
651 # equivalent in the original (the data probably coming from SRCode or
652 # similar):
653 CPFTRD_PREFS = "ClinicalOutcome_ConsentResearch"
654 CPFTRD_EMAIL = "ClinicalOutcome_ConsentResearch_EmailCheck"
655 CPFTRD_OPT_OUT = "ClinicalOutcome_ConsentResearch_OptOutCheck"
658class CrateView:
659 """
660 Views created by CRATE, which do not have contextual prefixes.
661 """
663 CRATE_VIEW_PREFIX = "vw_crate_"
665 GEOGRAPHY_VIEW = CRATE_VIEW_PREFIX + "PatientAddressWithResearchGeography"
666 TESTPATIENT_VIEW = CRATE_VIEW_PREFIX + "FindExtraTestPatients"
669# -----------------------------------------------------------------------------
670# Table collections
671# -----------------------------------------------------------------------------
672# Tables are referred to here by their "core" name, i.e. after removal of
673# prefixes like "SR" or "S1_", if they have one.
675# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
676# Tables to include
677# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
679_INCLUDE_TABLES_REGEX_S1 = (
680 # Include even if --systmone_allow_unprefixed_tables is not used.
681 CrateView.CRATE_VIEW_PREFIX,
682)
683_INCLUDE_TABLES_REGEX_CPFT = ("vw",) # some other views
684INCLUDE_TABLES_REGEX = {
685 SystmOneContext.TPP_SRE: _INCLUDE_TABLES_REGEX_S1,
686 SystmOneContext.CPFT_DW: _INCLUDE_TABLES_REGEX_S1
687 + _INCLUDE_TABLES_REGEX_CPFT,
688}
690# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
691# Tables to omit
692# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
694_OMIT_AND_IGNORE_TABLES_S1 = (
695 S1Table.NOMIS_NUMBER, # NOMIS (prison) numbers
696 S1Table.SAFEGUARDING_ALLEGATION_DETAILS,
697 # ... sensitive and no patient ID column
698)
699_OMIT_AND_IGNORE_TABLES_CPFT = (
700 # CPFT extras:
701 "Deaths", # has NHS number then multi-system ID but not consistent SystmOne patient ID # noqa: E501
702 "gr_workings", # no idea
703 "InpatientAvailableBeds", # RowIdentifier very far from unique; ?no PK; no patient info # noqa: E501
704)
705OMIT_AND_IGNORE_TABLES = {
706 SystmOneContext.TPP_SRE: _OMIT_AND_IGNORE_TABLES_S1,
707 SystmOneContext.CPFT_DW: _OMIT_AND_IGNORE_TABLES_S1
708 + _OMIT_AND_IGNORE_TABLES_CPFT,
709}
711_OMIT_AND_IGNORE_TABLES_REGEX_S1 = ()
712_OMIT_AND_IGNORE_TABLES_REGEX_CPFT = (
713 # CPFT extras:
714 "Accommodation_",
715 # ... e.g. Accommodation_20210329, Accommodation_Wendy
716 "AuditLog", # may have gone now! Was there for a while. Not relevant.
717 # KEPT: ClinicalOutcome_NHS_Staff_LongCovid.
718 # This one filters for CTV3 codes "Y2c49" and "Y2ca". As d/w NP 2021-12-14:
719 # - These are "local" codes.
720 # ("Official" Read codes have been frozen since 2016, according to
721 # https://datadictionary.nhs.uk/supporting_information/read_coded_clinical_terms.html.) # noqa: E501
722 # - They indicate whether someone is a healthcare worker (HCW).
723 # - Introduced for COVID-19, since HCW status was a clear risk factor for
724 # infection.
725 # - No detail suggesting identification (and they don't indicate e.g.
726 # membership of a specific Trust, or a specific job role).
727 "Inpatients",
728 # S1_Inpatients, S1_Inpatients_20201020: current inpatients -- but these
729 # tables have NHSNumber as FLOAT. Exclude them.
730 "Mortality", # includes S1_Mortality, S1_MortalityAdditionalInfo
731 # These contain (a) age (rather than DOB) information, and (b) information
732 # from multiple systems -- some risk of including RiO patients with
733 # coincidentally the same ClientID as a SystmOne IDPatient, unless
734 # filtered, and is derived information anyway -- for now, we'll omit.
735 "ReferralsOpen$",
736 # This CPFT table is a non-patient table (but with potentially identifiable
737 # information about referral reason? -- maybe not) -- skip it.
738 "WaitList_", # S1_Waitlist_*
739 # Waiting list tables use a confusing blend of SystmOne "IDPatient" and
740 # RiO "ClientID" columns, and it's not clear they add much.
741 "UserSmartCard",
742 # Not relevant clinically.
743 # I considered excluding "vw.*" (views) and "zzz.*" (scratch tables) here,
744 # but the user has the option to exclude all such tables via
745 # --systmone_allow_unprefixed_tables if they desire. Views may be useful;
746 # see also INCLUDE_TABLES_REGEX above. However, "zz" or "zzz" tables in
747 # CPFT are scratch tables that should not be used:
748 "zz",
749 # Some have suffixes e.g. "S1_ReferralsIn_20200917", i.e. end with an
750 # underscore then 8 digits. These are temporary copies that we should not
751 # use. Some have more after that date.
752 r"\w+_\d{8}",
753 # If a table has the suffix "_old", we probably don't want it!
754 r"\w+_old",
755)
756OMIT_AND_IGNORE_TABLES_REGEX = {
757 SystmOneContext.TPP_SRE: _OMIT_AND_IGNORE_TABLES_REGEX_S1,
758 SystmOneContext.CPFT_DW: _OMIT_AND_IGNORE_TABLES_REGEX_S1
759 + _OMIT_AND_IGNORE_TABLES_REGEX_CPFT,
760}
762# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
763# Tables that have been renamed
764# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
766_S1_TO_CPFT_TABLE_TRANSLATION = {
767 # Where CPFT has renamed a S1 SRE table directly.
768 S1Table.ADDRESS_HISTORY: CPFTTable.ADDRESS,
769 # ... i.e. CPFT have renamed SRPatientAddressHistory to S1_PatientAddress.
770 S1Table.BED_CLOSURE: CPFTTable.BED_CLOSURE,
771 S1Table.CLINICAL_CODE: CPFTTable.DIAGNOSIS,
772 S1Table.CONTACT_DETAILS: CPFTTable.CONTACT_DETAILS,
773 S1Table.HOSPITAL_CONSULTANT_EVENT: CPFTTable.INPATIENT_CONSULTANT_EPISODE,
774 S1Table.MENTAL_HEALTH_ACT_APPEAL: CPFTTable.MENTAL_HEALTH_ACT_APPEAL,
775 S1Table.MENTAL_HEALTH_ACT_AWOL: CPFTTable.MENTAL_HEALTH_ACT_AWOL,
776 S1Table.GP_PRACTICE_HISTORY: CPFTTable.GP_PRACTICE,
777 S1Table.RESPONSIBLE_PARTY: CPFTTable.CARE_COORDINATOR,
778}
779CORE_TO_CONTEXT_TABLE_TRANSLATIONS = {
780 # Key: destination context.
781 # Value: translation dictionary, mapping "core" tablename to target.
782 # Absent values lead to no translation.
783 SystmOneContext.TPP_SRE: {},
784 SystmOneContext.CPFT_DW: _S1_TO_CPFT_TABLE_TRANSLATION,
785} # type: TABLE_TRANSLATION_DICT_TYPE
786CONTEXT_TO_CORE_TABLE_TRANSLATIONS = {
787 SystmOneContext.TPP_SRE: {},
788 SystmOneContext.CPFT_DW: reversedict(_S1_TO_CPFT_TABLE_TRANSLATION),
789} # type: TABLE_TRANSLATION_DICT_TYPE
791# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
792# Tables that look like they have a proper PK, but don't, and we very much want
793# them to.
794# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
796TABLES_REQUIRING_CRATE_PK_REGEX = (
797 # Tables go here if we have to add a PK-style column/index -- usually
798 # because we want to apply a FULLTEXT index.
799 S1Table.FREETEXT,
800 # ... unterminated, so includes FreeText (S1) and FreeText_* (CPFT)
801)
804# -----------------------------------------------------------------------------
805# Column names
806# -----------------------------------------------------------------------------
807# We work internally with TPP SRE column names. Any renaming (e.g. in CPFT) is
808# explicitly noted.
811class S1GenericCol:
812 """
813 Columns used in many SystmOne tables.
814 """
816 CTV3_CODE = "CTV3Code" # Read code
817 CTV3_TEXT = "CTV3Text" # ... and corresponding description
818 EVENT_ID = "IDEvent" # FK to SREvent.RowIdentifier
819 EVENT_OCCURRED_WHEN = "DateEvent" # when event happened
820 EVENT_RECORDED_WHEN = "DateEventRecorded" # when event recorded
821 FREETEXT = "FreeText"
822 NOTES_SUFFIX = "Notes"
823 # ... "Notes", but also "ends with 'Notes'", e.g. AdmissionNotes,
824 # IncidentNotes, LocationNotes
825 ORG_ID_DONE_AT = "IDOrganisationDoneAt" # FK to SROrganisation.ID
826 ORG_ID_ENTERED_AT = "IDOrganisation" # org at which the data was entered
827 ORG_ID_VISIBLE_TO = "IDOrganisationVisibleTo" # FK to SROrganisation.ID
828 ORG_REGISTERED_AT = "IDOrganisationRegisteredAt"
829 # ... org where the patient was registered when the data was entered
830 PATIENT_ID = "IDPatient" # FK to SRPatient.RowIdentifier
831 ROW_ID = "RowIdentifier" # PK for nearly all SystmOne original tables
832 QUESTIONNAIRE_ID = "IDAnsweredQuestionnaire"
833 # ... FK to SRAnsweredQuestionnaire.RowIdentifier
834 REFERRAL_ID = "IDReferralIn" # FK to SRReferralIn.RowIdentifier
835 SNOMED_CODE = "SNOMEDCode" # SNOMED-CT code
836 SNOMED_TEXT = "SNOMEDText" # ... and corresponding description
837 STAFF_ID_DONE_BY = "IDDoneBy" # FK to SRStaffMember.RowIdentifier
838 STAFF_PROFILE_ID_RECORDED_BY = "IDProfileEnteredBy"
839 # ... FK to SRStaffMemberProfile.RowIdentifier
842class CPFTGenericCol:
843 """ "
844 CPFT variants for generic column names.
845 """
847 AGE_YEARS = "AgeInYears"
848 # ... usually "at the time of calculation, or death", i.e. unhelpful if you
849 # are unsure when the data was extracted; see stored procedure
850 # load_S1_Patient.
851 ASSIGNMENT_NUMBER = "AssignmentNumber"
852 # ... payroll number of member of staff, I think -- too sensitive, and I am
853 # surprised it is there in the first place.
854 HOME_CONTACT_NUMBER = "HomeContactNumber"
855 MOBILE_CONTACT_NUMBER = "MobileContactNumber"
856 NHSNUM2 = "NHSNumber2"
857 NHSNUM_MOTHER = "CYPHS_NHSNumber_Mother"
858 PATIENT_ADDRESS = "PatientAddress"
859 PATIENT_NAME = "PatientName"
860 POSTCODE = "PostCode"
861 PATIENT_ID_SYNONYM_1 = "Patient_ID"
864class CrateS1ViewCol:
865 """
866 Additional columns added by CRATE's preprocessor
867 """
869 IS_TEST_PATIENT = "is_test_patient"
872class S1PatientCol:
873 """
874 Columns in the Patient table.
875 """
877 PK = S1GenericCol.ROW_ID # RowIdentifier
878 NHSNUM = "NHSNumber"
879 TITLE = "Title"
880 FORENAME = "FirstName"
881 MIDDLE_NAMES = "MiddleNames"
882 SURNAME = "Surname"
883 PREV_SURNAME = "PreviousSurname"
884 EMAIL = "EmailAddress"
885 DOB = "DateBirth"
886 DOD = "DateDeath"
887 BIRTHPLACE = "BirthPlace"
888 GENDER = "Gender"
889 SPEAKS_ENGLISH = "SpeaksEnglish" # curious that this is a specific flag
890 TESTPATIENT = "TestPatient"
891 SOCIAL_SERVICES_REF = "SSRef"
892 SPINE_MATCHED = "SpineMatched"
895class CPFTPatientCol:
896 """
897 CPFT variants for the patient table.
898 """
900 MIDDLE_NAMES = "GivenName2"
901 DOB = "DOB"
904class S1AddressCol:
905 """
906 Columns in the PatientAddressHistory table.
907 """
909 ADDRESS_TYPE = "AddressType"
910 CCG_OF_RESIDENCE = "CcgOfResidence"
911 DATE_TO = "DateTo"
912 BUILDING_NAME = "NameOfBuilding"
913 BUILDING_NUMBER = "NumberOfBuilding"
914 ROAD = "NameOfRoad"
915 LOCALITY = "NameOfLocality"
916 TOWN = "NameOfTown"
917 COUNTY = "NameOfCounty"
918 POSTCODE = "FullPostCode"
921class CPFTAddressCol:
922 """
923 CPFT variants for the address table.
924 """
926 POSTCODE_NOSPACE = "PostCode_NoSpaces"
929class S1ContactCol:
930 """
931 Columns in the PatientContactDetails table.
932 """
934 NUMBER = "ContactNumber"
937class S1RelCol:
938 """
939 Columns in the PatientRelationship table.
940 (This is also one for which we specify everything in detail, since CPFT add
941 in extra identifiers.)
942 """
944 RELATED_ID_DEPRECATED = "IDRelationshipWithPatient"
945 # ... replaced by IDPatientRelationshipWith
946 RELATED_ID = "IDPatientRelationshipWith"
947 RELATED_STAFFCODE_OR_RELNHSNUM = "CodeRelationshipWithUser"
948 # ... SRE help says "The ODS code for the staff member the relationship is
949 # with". However, it seems that it sometimes contains the NHS number of the
950 # relative (certainly an NHS number that differs from the patient's!).
951 NAME = "RelationshipWithName"
952 DOB = "RelationshipWithDateOfBirth"
953 ADDRESS_HOUSE_NAME = "RelationshipWithHouseName"
954 ADDRESS_HOUSE_NUMBER = "RelationshipWithHouseNumber"
955 ADDRESS_ROAD = "RelationshipWithRoad"
956 ADDRESS_LOCALITY = "RelationshipWithLocality"
957 ADDRESS_POST_TOWN = "RelationshipWithPostTown"
958 ADDRESS_COUNTY = "RelationshipWithCounty"
959 ADDRESS_POSTCODE = "RelationshipWithPostCode"
960 ADDRESS_TELEPHONE = "RelationshipWithTelephone"
961 ADDRESS_WORK_TELEPHONE = "RelationshipWithWorkTelephone"
962 ADDRESS_MOBILE_TELEPHONE = "RelationshipWithMobileTelephone"
963 ADDRESS_FAX = "RelationshipWithFax"
964 ADDRESS_EMAIL = "RelationshipWithEmailAddress"
965 # Fields about the timing/nature of the relationship:
966 DATE_ENDED = "DateEnded"
967 REL_TYPE = "RelationshipType"
968 GUARDIAN_PROXY = "PersonalGuardianOrProxy"
969 NEXT_OF_KIN = "NextOfKin"
970 CARER = "CaresForPatient"
971 PRINCIPAL_CARER = "PrincipalCarerForPatient"
972 KEYHOLDER = "KeyHolder"
973 PARENTAL_RESPONSIBILITY = "HasParentalResponsibility"
974 FINANCIAL_REP = "FinancialRepresentative"
975 ADVOCATE = "Advocate"
976 MAIN_VISITOR = "MainVisitor"
977 CALLBACK_CONSENT = "CallCentreCallBackConsent"
978 COPY_CORRESPONDENCE = "CopyCorrespondence"
979 CONTACT_ORDER = "ContactOrder"
980 CONTACT_METHOD = "ContactMethod"
981 COMMS_FORMAT = "CommunicationFormat"
982 INTERPRETER_REQUIRED = "InterpreterRequired"
983 # and things that are about the relative but not directly identifying:
984 SEX = "RelationshipWithSex"
985 LANGUAGE = "RelationshipWithSpokenLanguage"
986 ORG = "RelationshipWithOrganisation"
989class CPFTOtherCol:
990 """
991 Other CPFT variants.
992 """
994 REL_MOTHER_COL_NHSNUM = S1PatientCol.NHSNUM
997class S1HospNumCol:
998 """
999 Columns in the HospitalAAndENumber table.
1000 """
1002 HOSPNUM = "HospitalNumber"
1003 COMMENTS = "Comments"
1006# -----------------------------------------------------------------------------
1007# Column collections
1008# -----------------------------------------------------------------------------
1010# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1011# Columns that have been renamed
1012# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1014_S1_TO_CPFT_COLUMN_TRANSLATION = {
1015 # Where CPFT has renamed a column.
1016 # - Key: (core_tablename, colname) tuple.
1017 # - Value: new CPFT column name.
1018 (S1Table.PATIENT, S1PatientCol.MIDDLE_NAMES): CPFTPatientCol.MIDDLE_NAMES,
1019 (S1Table.PATIENT, S1PatientCol.DOB): CPFTPatientCol.DOB,
1020 (
1021 S1Table.RELATIONSHIPS,
1022 S1RelCol.RELATED_STAFFCODE_OR_RELNHSNUM,
1023 ): S1PatientCol.NHSNUM,
1024}
1025CORE_TO_CONTEXT_COLUMN_TRANSLATIONS = {
1026 # Key: destination context.
1027 # Value: translation dictionary -- see e.g. S1_TO_CPFT_COLUMN_TRANSLATION.
1028 SystmOneContext.TPP_SRE: {},
1029 SystmOneContext.CPFT_DW: _S1_TO_CPFT_COLUMN_TRANSLATION,
1030} # type: COLUMN_TRANSLATION_DICT_TYPE
1031CONTEXT_TO_CORE_CONTEXT_COLUMN_TRANSLATIONS = {
1032 SystmOneContext.TPP_SRE: {},
1033 SystmOneContext.CPFT_DW: _flip_coldict(_S1_TO_CPFT_COLUMN_TRANSLATION),
1034} # type: COLUMN_TRANSLATION_DICT_TYPE
1036# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1037# PID column names
1038# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1040_PID_SYNONYMS_S1 = (S1GenericCol.PATIENT_ID,)
1041_PID_SYNONYMS_CPFT = (
1042 "PatientID", # e.g. in CPFT: S1_eDSM (a CPFT table)
1043 "PatID", # e.g. in CPFT: ASCRIBE_Statin
1044)
1045PID_SYNONYMS = {
1046 SystmOneContext.TPP_SRE: _PID_SYNONYMS_S1,
1047 SystmOneContext.CPFT_DW: _PID_SYNONYMS_S1 + _PID_SYNONYMS_CPFT,
1048}
1050# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1051# MPID column names
1052# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1054_MPID_SYNONYMS_S1 = (S1PatientCol.NHSNUM,)
1055_MPID_SYNONYMS_CPFT = (CPFTGenericCol.NHSNUM2,)
1056MPID_SYNONYMS = {
1057 SystmOneContext.TPP_SRE: _MPID_SYNONYMS_S1,
1058 SystmOneContext.CPFT_DW: _MPID_SYNONYMS_S1 + _MPID_SYNONYMS_CPFT,
1059}
1061# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1062# Other system identifiers
1063# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1065_OTHER_SYSTEM_IDS_CPFT = (
1066 "ClientID",
1067 # ... seen in CPFT -- although often these tables should be excluded and
1068 # this field contains RiO (not SystmOne) IDs. However, some look at least
1069 # partially valid (e.g. S1_Deaths.ClientID, S1_Mortality.ClientID,
1070 # S1_MortalityAdditionalInfo.ClientID). For the last of those, there is an
1071 # explicit "SourceSystem" column; sometimes this is SystmOne, sometimes
1072 # RiO, etc. Using ClientID as a primary patient ID will mean that only IDs
1073 # present in the SystmOne master table will be taken. However, there is
1074 # also potential for confusion, so we exclude these tables above.
1075 #
1076 # Empirically: S1_Deaths.ClientID contains IDs that look neither like
1077 # SystmOne IDs (integer) or RiO ones (integer) or CRS/CDL ("M" prefix) but
1078 # hav e.g. an "AP<integer>" format -- ah, it's PCMIS.
1079)
1080OTHER_SYSTEM_IDS = {
1081 SystmOneContext.TPP_SRE: (),
1082 SystmOneContext.CPFT_DW: _OTHER_SYSTEM_IDS_CPFT,
1083}
1085# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1086# Columns to treat as safe
1087# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1089COLS_GENERIC_OK_UNMODIFIED_S1 = (
1090 S1GenericCol.ORG_ID_DONE_AT,
1091 S1GenericCol.STAFF_ID_DONE_BY,
1092 S1GenericCol.EVENT_ID,
1093 S1GenericCol.EVENT_OCCURRED_WHEN,
1094 S1GenericCol.EVENT_RECORDED_WHEN,
1095 S1GenericCol.ORG_ID_ENTERED_AT,
1096 S1GenericCol.ORG_ID_VISIBLE_TO,
1097 S1GenericCol.ORG_REGISTERED_AT,
1098 S1GenericCol.STAFF_PROFILE_ID_RECORDED_BY,
1099)
1100_COLS_PATIENT_TABLE_OK_UNMODIFIED_S1 = COLS_GENERIC_OK_UNMODIFIED_S1 + (
1101 # This list exists because we don't assume that things in the Patient table
1102 # are safe -- we assume they are unsafe, and let them through only if we
1103 # know about them. So we add "safe" things (that are not direct
1104 # identifiers) here.
1105 S1PatientCol.GENDER,
1106 S1PatientCol.SPEAKS_ENGLISH,
1107 S1PatientCol.SPINE_MATCHED,
1108)
1109_COLS_PATIENT_TABLE_OK_UNMODIFIED_CPFT = (
1110 # Added by CPFT:
1111 "DeathIndicator", # binary version (0 alive, 1 dead)
1112 "NationalDataOptOut", # Added by CPFT (from NDOptOutPreference)?)
1113 # - CPFT also add "RwNo" (bigint), but it's always 1 here. See above.
1114 # - We ignore "AgeInYears" (added by CPFT) since that is dangerous and
1115 # depends on when you ask.
1116)
1117COLS_PATIENT_TABLE_OK_UNMODIFIED = {
1118 SystmOneContext.TPP_SRE: _COLS_PATIENT_TABLE_OK_UNMODIFIED_S1,
1119 SystmOneContext.CPFT_DW: _COLS_PATIENT_TABLE_OK_UNMODIFIED_S1
1120 + _COLS_PATIENT_TABLE_OK_UNMODIFIED_CPFT,
1121}
1123_COLS_RELATIONSHIP_OK_UNMODIFIED_S1 = (
1124 S1RelCol.DATE_ENDED,
1125 S1RelCol.REL_TYPE,
1126 S1RelCol.GUARDIAN_PROXY,
1127 S1RelCol.NEXT_OF_KIN,
1128 S1RelCol.CARER,
1129 S1RelCol.PRINCIPAL_CARER,
1130 S1RelCol.KEYHOLDER,
1131 S1RelCol.PARENTAL_RESPONSIBILITY,
1132 S1RelCol.FINANCIAL_REP,
1133 S1RelCol.ADVOCATE,
1134 S1RelCol.MAIN_VISITOR,
1135 S1RelCol.CALLBACK_CONSENT,
1136 S1RelCol.COPY_CORRESPONDENCE,
1137 S1RelCol.CONTACT_ORDER,
1138 S1RelCol.CONTACT_METHOD,
1139 S1RelCol.COMMS_FORMAT,
1140 S1RelCol.INTERPRETER_REQUIRED,
1141 S1RelCol.SEX,
1142 S1RelCol.LANGUAGE,
1143 S1RelCol.ORG,
1144)
1145_COLS_RELATIONSHIP_OK_UNMODIFIED_CPFT = (
1146 # Added by CPFT:
1147 "DateDeath", # date of death of relative
1148)
1149COLS_RELATIONSHIP_OK_UNMODIFIED = {
1150 SystmOneContext.TPP_SRE: _COLS_RELATIONSHIP_OK_UNMODIFIED_S1,
1151 SystmOneContext.CPFT_DW: _COLS_RELATIONSHIP_OK_UNMODIFIED_S1
1152 + _COLS_RELATIONSHIP_OK_UNMODIFIED_CPFT,
1153}
1155CPFT_REL_MOTHER_OK_UNMODIFIED = ()
1157# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1158# Columns to exclude from the output
1159# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1161_COLS_GENERIC_EXCLUDE_S1 = (
1162 # Columns to exclude, regardless of table.
1163 #
1164 # This is primarily for when CPFT put identifiers all over the place (e.g.
1165 # joining them from the patient table). We can simply exclude (rather than
1166 # using them for scrubbing) since they are duplicates (for convenience) of
1167 # information in other tables like Patient, PatientContactDetails,
1168 # PatientAddressHistory.
1169 S1PatientCol.DOB,
1170 S1PatientCol.EMAIL,
1171 S1PatientCol.FORENAME,
1172 S1PatientCol.MIDDLE_NAMES,
1173 S1PatientCol.PREV_SURNAME,
1174 S1PatientCol.SURNAME,
1175 S1PatientCol.TITLE,
1176 # ... unnecessary -- and might be rare, e.g. Lord High Admiral
1177 S1AddressCol.BUILDING_NAME,
1178 S1AddressCol.BUILDING_NUMBER,
1179 S1AddressCol.COUNTY,
1180 S1AddressCol.LOCALITY,
1181 S1AddressCol.POSTCODE,
1182 S1AddressCol.ROAD,
1183 S1AddressCol.TOWN,
1184 S1RelCol.ADDRESS_COUNTY,
1185 S1RelCol.ADDRESS_EMAIL,
1186 S1RelCol.ADDRESS_FAX,
1187 S1RelCol.ADDRESS_HOUSE_NAME,
1188 S1RelCol.ADDRESS_HOUSE_NUMBER,
1189 S1RelCol.ADDRESS_LOCALITY,
1190 S1RelCol.ADDRESS_MOBILE_TELEPHONE,
1191 S1RelCol.ADDRESS_POST_TOWN,
1192 S1RelCol.ADDRESS_POSTCODE,
1193 S1RelCol.ADDRESS_ROAD,
1194 S1RelCol.ADDRESS_TELEPHONE,
1195 S1RelCol.ADDRESS_WORK_TELEPHONE,
1196 S1RelCol.DOB,
1197 S1RelCol.NAME,
1198 S1HospNumCol.HOSPNUM, # just in case
1199)
1200_COLS_GENERIC_EXCLUDE_CPFT = (
1201 CPFTAddressCol.POSTCODE_NOSPACE,
1202 CPFTGenericCol.AGE_YEARS,
1203 # ... age when? Unhelpful. (And also, potentially leads to DOB discovery;
1204 # a blurred age near a birthday might be un-blurred by this.)
1205 CPFTGenericCol.ASSIGNMENT_NUMBER, # could hash it, but still, sensitive.
1206 CPFTGenericCol.HOME_CONTACT_NUMBER,
1207 CPFTGenericCol.MOBILE_CONTACT_NUMBER,
1208 CPFTGenericCol.NHSNUM2,
1209 CPFTGenericCol.NHSNUM_MOTHER,
1210 CPFTGenericCol.PATIENT_ADDRESS,
1211 CPFTGenericCol.PATIENT_NAME,
1212 CPFTGenericCol.POSTCODE,
1213 CPFTPatientCol.DOB,
1214 CPFTPatientCol.MIDDLE_NAMES,
1215)
1216COLS_GENERIC_EXCLUDE = {
1217 SystmOneContext.TPP_SRE: _COLS_GENERIC_EXCLUDE_S1,
1218 SystmOneContext.CPFT_DW: _COLS_GENERIC_EXCLUDE_S1
1219 + _COLS_GENERIC_EXCLUDE_CPFT,
1220}
1222OMIT_TABLENAME_COLNAME_PAIRS_S1 = (
1223 # Other specific fields to omit.
1224 (S1Table.CONTACTS, "ContactWith"),
1225 (S1Table.HOSP_AE_NUMBERS, S1HospNumCol.COMMENTS),
1226 (S1Table.OUT_OF_HOURS_ACTION, "Details"),
1227 # ... out-of-hours calls; details can sometimes contain phone numbers
1228 (S1Table.OUT_OF_HOURS_THIRD_PARTY_CALL, "Contact"), # free text
1229 (S1Table.SAFEGUARDING_INCIDENT_DETAILS, "PoliceReference"),
1230)
1232# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1233# Columns containing scrub-source information
1234# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1236COLS_PATIENT_WORDS = (
1237 # Scrub as words.
1238 S1PatientCol.FORENAME,
1239 S1PatientCol.MIDDLE_NAMES,
1240 S1PatientCol.SURNAME,
1241 S1PatientCol.PREV_SURNAME,
1242 S1PatientCol.EMAIL,
1243 S1PatientCol.BIRTHPLACE, # Unusual. But: scrub birthplace.
1244)
1246COLS_REQUIRED_SCRUBBERS = (
1247 # Information that must be present in the master patient table.
1248 S1PatientCol.PK, # likely redundant! It's the PID "definer".
1249 S1PatientCol.FORENAME,
1250 S1PatientCol.SURNAME,
1251 S1PatientCol.DOB,
1252)
1254COLS_ADDRESS_PHRASES = (
1255 # Scrub as phrases.
1256 # - For S1AddressCol.BUILDING_NAME, see below.
1257 # - For S1AddressCol.BUILDING_NUMBER, see below.
1258 S1AddressCol.ROAD,
1259 S1AddressCol.LOCALITY,
1260 S1AddressCol.TOWN,
1261 S1AddressCol.COUNTY,
1262)
1263COLS_ADDRESS_PHRASE_UNLESS_NUMBER = (
1264 S1AddressCol.BUILDING_NAME,
1265 S1AddressCol.BUILDING_NUMBER,
1266 # S1AddressCol.BUILDING_NUMBER poses a new problem for us: this is meant
1267 # to be e.g. "5", which is by itself non-identifying and would be a big
1268 # problem if we scrub (consider e.g. "5 mg"). However, sometimes it is "5
1269 # Tree Road", because it's not forced to be numeric. So we extend CRATE
1270 # (2021-12-01) to add ScrubMethod.PHRASE_UNLESS_NUMERIC.
1271 # (S1AddressCol.BUILDING_NAME occasionally contains numbers only, so we
1272 # do the same thing.)
1273)
1275COLS_RELATIONSHIP_XREF_ID = (
1276 # Scrub (third-party) as full cross-references to another record.
1277 S1RelCol.RELATED_ID_DEPRECATED,
1278 S1RelCol.RELATED_ID,
1279)
1281_COLS_RELATIONSHIP_WORDS_S1 = (
1282 # Scrub (third-party) as words.
1283 S1RelCol.NAME,
1284 S1RelCol.ADDRESS_EMAIL,
1285)
1286_COLS_RELATIONSHIP_WORDS_CPFT = (
1287 # Added by CPFT:
1288 "Surname", # surname of relative
1289 "FirstName", # surname of relative
1290)
1291COLS_RELATIONSHIP_WORDS = {
1292 SystmOneContext.TPP_SRE: _COLS_RELATIONSHIP_WORDS_S1,
1293 SystmOneContext.CPFT_DW: _COLS_RELATIONSHIP_WORDS_S1
1294 + _COLS_RELATIONSHIP_WORDS_CPFT,
1295}
1297_COLS_RELATIONSHIP_DATES_S1 = (
1298 # Scrub (third-party) as dates.
1299 S1RelCol.DOB,
1300)
1301_COLS_RELATIONSHIP_DATES_CPFT = (
1302 # Added by CPFT:
1303 "DOB", # likely duplicate of S1RelCol.DOB
1304)
1305COLS_RELATIONSHIP_DATES = {
1306 SystmOneContext.TPP_SRE: _COLS_RELATIONSHIP_DATES_S1,
1307 SystmOneContext.CPFT_DW: _COLS_RELATIONSHIP_DATES_S1
1308 + _COLS_RELATIONSHIP_WORDS_CPFT,
1309}
1311COLS_RELATIONSHIP_PHRASES = (
1312 # Scrub (third-party) as phrases.
1313 # Same principles as for the patient address, above.
1314 # - For S1RelCol.ADDRESS_HOUSE_NAME, see below.
1315 # - For S1RelCol.ADDRESS_HOUSE_NUMBER, see below.
1316 S1RelCol.ADDRESS_ROAD,
1317 S1RelCol.ADDRESS_LOCALITY,
1318 S1RelCol.ADDRESS_POST_TOWN,
1319 S1RelCol.ADDRESS_COUNTY,
1320)
1321COLS_RELATIONSHIP_PHRASE_UNLESS_NUMERIC = (
1322 S1RelCol.ADDRESS_HOUSE_NAME,
1323 S1RelCol.ADDRESS_HOUSE_NUMBER,
1324)
1326COLS_RELATIONSHIP_CODES = (
1327 # Scrub (third-party) as codes.
1328 S1RelCol.ADDRESS_POSTCODE,
1329)
1331COLS_RELATIONSHIP_NUMBERS = (
1332 # Scrub (third-party) as numbers.
1333 S1RelCol.RELATED_STAFFCODE_OR_RELNHSNUM,
1334 S1RelCol.ADDRESS_TELEPHONE,
1335 S1RelCol.ADDRESS_WORK_TELEPHONE,
1336 S1RelCol.ADDRESS_MOBILE_TELEPHONE,
1337 S1RelCol.ADDRESS_FAX,
1338)
1340_COLS_TRUNCATE_DATE_S1 = (S1PatientCol.DOB,)
1341_COLS_TRUNCATE_DATE_CPFT = (CPFTPatientCol.DOB,)
1342COLS_TRUNCATE_DATE = {
1343 SystmOneContext.TPP_SRE: _COLS_TRUNCATE_DATE_S1,
1344 SystmOneContext.CPFT_DW: _COLS_TRUNCATE_DATE_S1 + _COLS_TRUNCATE_DATE_CPFT,
1345}
1347# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1348# Columns containing free text, which need to be scrubbed
1349# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1350# ... assuming they are of string type.
1352_FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_S1 = (
1353 (
1354 terminate(S1Table.FREETEXT),
1355 terminate(S1GenericCol.FREETEXT),
1356 ), # the bulk of free text; VARCHAR(MAX)
1357 (ANYTHING, not_just_at_start(S1GenericCol.NOTES_SUFFIX)),
1358 ("PersonAtRisk$", "ReasonForPlan$"), # free text re safeguarding
1359 # ("ReferralIn$", "PrimaryReason$"), # only 200 chars; may be OK -- yes
1360 (
1361 "SafeguardingAllegationDetails$",
1362 "Outcome$",
1363 ), # only 100 chars -- but OMIT whole table, as above
1364 ("SpecialNotes$", "Note$"), # 8000 char free text
1365)
1366_FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = (
1367 # CPFT:
1368 # - SRReferralIn renamed to S1_ReferralsIn with extra columns,
1369 # PrimaryReason, but there are several others, like
1370 # ReferralReasonDescription1.
1371 # - Actually, however, on review, e.g.
1372 # - S1_ReferralInReferralReason.ReferralReason is numeric
1373 # - S1_ReferralInReferralReason.ReferralReasonDescription is pick-list,
1374 # not free text
1375 # - Similarly for referrals out.
1376 # and "Other" is scrubbed by the generic scrubber, so that messes up
1377 # useful data in the descriptions. So not this:
1378 #
1379 # (".*Referral", ".*Reason"),
1380 # A bunch of explicitly free-text fields:
1381 # - any not-otherwise-handled textual field in a table named "FreeText_..."
1382 (
1383 S1Table.FREETEXT,
1384 ANYTHING,
1385 ), # table name not terminated so allows anything starting with this
1386 # - any field named "FreeText..." (e.g. S1_Honos_Scores.FreeText)
1387 (ANYTHING, S1GenericCol.FREETEXT),
1388 # - S1_CYPFRS_TelephoneTriage links in a bunch of things from S1_FreeText.
1389 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Assessment"),
1390 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Presentation"),
1391 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Risk"),
1392 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Strength"),
1393 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Intervention"),
1394 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Problem"),
1395 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Social"),
1396 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Advice"),
1397 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Clinical"),
1398 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Outcome"),
1399 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*NextOfKin"),
1400)
1401FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS = {
1402 SystmOneContext.TPP_SRE: _FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_S1,
1403 SystmOneContext.CPFT_DW: _FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_S1
1404 + _FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_CPFT,
1405}
1407_EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_S1 = (
1408 # Things that look like free text, but aren't.
1409 ("AnsweredQuestionnaire$", "QuestionnaireName$"),
1410 (ANYTHING, S1GenericCol.CTV3_CODE), # common clinical coding
1411 (ANYTHING, S1GenericCol.CTV3_TEXT), # common clinical coding
1412)
1413_EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = (
1414 # These contain non-patient data -- instead, the stock text of
1415 # questionnaires:
1416 ("FreeText_Honos_Scoring_Answers", ANYTHING),
1417 ("FreeText_Honos_Scoring_Questions", ANYTHING),
1418 ("FreeText_SWEMWBS", ANYTHING),
1419 ("FreeText_SWEMWBS_Scores", ANYTHING),
1420 ("FreeText_WEMWBS", ANYTHING),
1421 # Should not be identifying:
1422 (ANYTHING, "Ethnicity$"),
1423)
1424EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS = {
1425 SystmOneContext.TPP_SRE: (
1426 _EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_S1
1427 ),
1428 SystmOneContext.CPFT_DW: (
1429 _EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_S1
1430 + _EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_CPFT
1431 ),
1432}
1434# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1435# Columns to index
1436# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1438FULLTEXT_INDEX_TABLENAME_COLNAME_REGEX_PAIRS_S1 = (
1439 # Subset of free-text columns where we would want to implement a FULLTEXT
1440 # index.
1441 ("FreeText$", "FreeText$"), # the bulk of free text; VARCHAR(MAX)
1442)
1444_NORMAL_INDEX_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = (
1445 # S1_Patient.IDPatient: Added by CPFT. Duplicate of RowIdentifier.
1446 # But likely to be used by researchers, so should be indexed.
1447 (terminate(S1Table.PATIENT), terminate(S1GenericCol.PATIENT_ID)),
1448)
1449NORMAL_INDEX_TABLENAME_COLNAME_REGEX_PAIRS = {
1450 SystmOneContext.TPP_SRE: (),
1451 SystmOneContext.CPFT_DW: _NORMAL_INDEX_TABLENAME_COLNAME_REGEX_PAIRS_CPFT,
1452}
1454_GENERIC_COLS_TO_INDEX_S1 = (
1455 # Generically sensible things to index.
1456 S1GenericCol.CTV3_CODE, # common clinical coding
1457 S1GenericCol.CTV3_TEXT, # common clinical coding
1458 S1GenericCol.EVENT_ID, # a common FK
1459 S1GenericCol.EVENT_OCCURRED_WHEN, # when did it happen?
1460 S1GenericCol.QUESTIONNAIRE_ID, # a common FK
1461 S1GenericCol.REFERRAL_ID, # a common FK
1462 S1GenericCol.SNOMED_CODE, # common clinical coding
1463 S1GenericCol.SNOMED_TEXT, # common clinical coding
1464)
1465_GENERIC_COLS_TO_INDEX_CPFT = (CPFTGenericCol.PATIENT_ID_SYNONYM_1,)
1466GENERIC_COLS_TO_INDEX = {
1467 SystmOneContext.TPP_SRE: _GENERIC_COLS_TO_INDEX_S1,
1468 SystmOneContext.CPFT_DW: _GENERIC_COLS_TO_INDEX_S1
1469 + _GENERIC_COLS_TO_INDEX_CPFT,
1470}
1472# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1473# Columns that are (or are not) PKs
1474# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1476_PK_TABLENAME_COLNAME_REGEX_PAIRS_S1 = (
1477 # If CRATE inserts its PK field somewhere, it's a PK.
1478 (ANYTHING, terminate(CRATE_COL_PK)),
1479)
1480_PK_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = (
1481 # Primary key fields with non-standard names.
1482 # Note that some are CPFT-created tables, which is why they don't follow
1483 # standard conventions.
1484 ("3DayFollowUp", "IDHospitalAdmissionAndDischarge"),
1485 ("ActivityEvent_EventDuration", S1GenericCol.EVENT_ID),
1486 # AQ_* -- no obvious PK.
1487 ("_CGAS_PairedScore_By_Referral", S1GenericCol.REFERRAL_ID),
1488 (
1489 "CarePlan",
1490 "CarePlanID",
1491 ), # includes CarePlanDetail, modified from the source
1492 # ... CarePlanReview has IDCarePlan but that is not unique.
1493 ("Caseload", "IDReferralIn"),
1494 # ClinicalDashboard*: a bunch of CPFT things; no obvious PK.
1495 # ClinicalMeasure*: a bunch of CPFT derived things; no obvious PK.
1496 # ClinicalOutcome*: a bunch of CPFT derived things; no obvious PK.
1497 # ContactsArchive_LatestStaffContact: no obvious PK.
1498 ("ContactsArchive_NonLegalContacts", S1GenericCol.EVENT_ID),
1499 ("ContactsArchive_TotalsByReferral", S1GenericCol.REFERRAL_ID),
1500 # Contacts_CYP_PlanMetric_7: no obvious PK.
1501 ("Contacts_CarerStatus_MH", S1GenericCol.PATIENT_ID),
1502 ("Contacts_CarerStatus_PH", S1GenericCol.PATIENT_ID),
1503 # CoronaVirus: no obvious PK (maybe combination of IDPatient, IDReferralIn)
1504 # CurrentInpatientDashboard_Doctors: *almost* IDPatient, IDReferralIn but
1505 # one extra row
1506 # Deaths: no obvious PK, despite "ClientID" (not unique)
1507 (CPFTTable.DEMOGRAPHICS, S1GenericCol.PATIENT_ID),
1508 # DischargeDelay_Fact: no obvious PK.
1509 # EuroQol*: no PKs
1510 # FACT_Inp_Data: no obvious PK.
1511 # Falls_AtRiskState*: no PKs
1512 # FreeText_* [CPFT derived free text tables]: none in the first, not all
1513 # explored
1514 # GateKeeping: no obvious PK.
1515 # Honos_Scores: no PK
1516 # ICW_PTL: no PK
1517 # Immunisation: no obvious PK
1518 # InpatientLeave: no obvious PK
1519 # Inpatient_NorthwickParkIndex: no obvious PK
1520 # LADSAdults_Output: no obvious PK
1521 ("LADSAdultsQuestionnaires", S1GenericCol.QUESTIONNAIRE_ID),
1522 # LADSCYPHS_Output: no obvious PK
1523 ("LADSCYPQuestionnaires", S1GenericCol.QUESTIONNAIRE_ID),
1524 ("MDT_Caseload", S1GenericCol.REFERRAL_ID),
1525 # OutOfHoursSRCodeInformation: no PK
1526 # PRISM_ReReferral: no PK
1527 # PatientAnsweredQuestionnaireInformation: IDPatient currently unique but I
1528 # strongly suspect only temporarily
1529 # PatientContact: no PK
1530 # PatientEthnicity: no PK
1531 ("PatientGPPractice", S1GenericCol.PATIENT_ID),
1532 # PatientLanguageDeathOptions: no PK
1533 # PatientLetterInformation: IDPatient currently unique but I strongly
1534 # suspect only temporarily
1535 ("PatientOverview", S1GenericCol.REFERRAL_ID),
1536 # PatientRelationship: no PK (modified from the source)
1537 ("PatientRelationshipMother", S1GenericCol.PATIENT_ID),
1538 # PatientSRCodeInformation: no PK
1539 # PhysicalHealthChecks*: no PK (IDPatient currently unique in
1540 # S1_PhysicalHealthChecks_CQUIN but unlikely to remain so)
1541 # QRisk: no PK
1542 (
1543 "ReferralInIntervention",
1544 S1GenericCol.REFERRAL_ID,
1545 ), # modified from the source
1546 ("Vanguard", "ReferralNumber"),
1547 # eDSM: no PK
1548)
1549PK_TABLENAME_COLNAME_REGEX_PAIRS = {
1550 SystmOneContext.TPP_SRE: _PK_TABLENAME_COLNAME_REGEX_PAIRS_S1,
1551 SystmOneContext.CPFT_DW: _PK_TABLENAME_COLNAME_REGEX_PAIRS_S1
1552 + _PK_TABLENAME_COLNAME_REGEX_PAIRS_CPFT,
1553}
1555_NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_S1 = tuple(
1556 # Tables in which "RowIdentifier" (S1GenericCol.ROW_ID) is not unique.
1557 (terminate(t), S1GenericCol.ROW_ID)
1558 for t in (
1559 S1Table.ACTIVITY_EVENT,
1560 S1Table.CARE_PLAN_REVIEW,
1561 S1Table.DISCHARGE_DELAY,
1562 S1Table.FREETEXT, # see also TABLES_REQUIRING_CRATE_PK_REGEX above
1563 S1Table.BED_CLOSURE,
1564 S1Table.MENTAL_HEALTH_ACT_APPEAL,
1565 S1Table.MENTAL_HEALTH_ACT_AWOL,
1566 S1Table.TASK,
1567 )
1568) + tuple(
1569 # Also, if we insert a CRATE PK, then the "RowIdentifier" can't be the PK.
1570 (t_regex, S1GenericCol.ROW_ID)
1571 for t_regex in TABLES_REQUIRING_CRATE_PK_REGEX
1572)
1573_NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = tuple(
1574 # Similarly. These tables have things that look like PKs, but gave rise to
1575 # a "Violation of PRIMARY KEY constraint" error, so they aren't. This
1576 # happens when someone in CPFT maps e.g. "RowIdentifier" in an unusual way.
1577 (t, S1GenericCol.ROW_ID)
1578 for t in (
1579 terminate("Child_At_Risk"),
1580 "InpatientBedStay", # includes InpatientBedStay_Old, etc.
1581 terminate(CPFTTable.CONTACTS_ARCHIVE),
1582 terminate(CPFTTable.CONTACTS_ARCHIVE_CLIENT_SEQUENCE),
1583 )
1584)
1585NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS = {
1586 SystmOneContext.TPP_SRE: _NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_S1,
1587 SystmOneContext.CPFT_DW: _NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_S1
1588 + _NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_CPFT,
1589}
1592# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1593# Columns containing opt-out information
1594# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1596_OPT_OUT_TABLENAME_COLNAME_PAIRS_S1 = (
1597 # Test patients -- not an "opt out" but an "invalid patient" marker:
1598 (CrateView.TESTPATIENT_VIEW, CrateS1ViewCol.IS_TEST_PATIENT),
1599 # Note that the main patient table is handled separately/explicitly.
1600)
1601_OPT_OUT_TABLENAME_COLNAME_PAIRS_CPFT = (
1602 # CPFT Research Database local opt-out:
1603 ("ClinicalOutcome_ConsentResearch_OptOutCheck", "SNOMEDCode"),
1604)
1605OPT_OUT_TABLENAME_COLNAME_PAIRS = {
1606 SystmOneContext.TPP_SRE: _OPT_OUT_TABLENAME_COLNAME_PAIRS_S1,
1607 SystmOneContext.CPFT_DW: _OPT_OUT_TABLENAME_COLNAME_PAIRS_S1
1608 + _OPT_OUT_TABLENAME_COLNAME_PAIRS_CPFT,
1609}
1612# =============================================================================
1613# String comparison helper functions
1614# =============================================================================
1616# -----------------------------------------------------------------------------
1617# Plain strings
1618# -----------------------------------------------------------------------------
1621def eq(x: str, y: str) -> bool:
1622 """
1623 Case-insensitive string comparison.
1624 """
1625 return x.lower() == y.lower()
1628def tcmatch(table1: str, column1: str, table2: str, column2: str) -> bool:
1629 """
1630 Equal (in case-insensitive fashion) for table and column?
1631 """
1632 return eq(table1, table2) and eq(column1, column2)
1635def is_in(x: str, y: Iterable[str]) -> bool:
1636 """
1637 Case-insensitive version of "in", to replace "if x in y".
1638 """
1639 return any(eq(x, test) for test in y)
1642def is_pair_in(a: str, b: str, y: Iterable[Tuple[str, str]]) -> bool:
1643 """
1644 Case-insensitive version of "in", to replace "if a, b in y".
1645 """
1646 return any(eq(a, test_a) and eq(b, test_b) for test_a, test_b in y)
1649# -----------------------------------------------------------------------------
1650# Regular expressions
1651# -----------------------------------------------------------------------------
1654def eq_re(x: str, y_regex: str) -> bool:
1655 """
1656 Returns True if the regex matches at the start of the string.
1657 """
1658 return bool(re.match(y_regex, x, flags=re.IGNORECASE))
1661def is_in_re(x: str, y_regexes: Iterable[str]) -> bool:
1662 """
1663 Case-insensitive regex-based version of "in", to replace "if x in y".
1664 """
1665 return any(eq_re(x, test) for test in y_regexes)
1668def is_pair_in_re(
1669 a: str, b: str, y_regexes: Iterable[Tuple[str, str]]
1670) -> bool:
1671 """
1672 Case-insensitive regex-based version of "in", to replace "if a, b in y".
1673 """
1674 return any(
1675 eq_re(a, test_a) and eq_re(b, test_b) for test_a, test_b in y_regexes
1676 )
1679# =============================================================================
1680# Table/column name interpretation
1681# =============================================================================
1684def tablename_prefix(context: SystmOneContext) -> str:
1685 """
1686 The tablename prefix in the given context.
1687 """
1688 try:
1689 return TABLE_PREFIXES[context]
1690 except KeyError:
1691 raise KeyError(f"Unknown SystmOne context: {context}")
1694def core_tablename(
1695 tablename: str,
1696 from_context: SystmOneContext,
1697 allow_unprefixed: bool = False,
1698) -> str:
1699 """
1700 Is this a table of an expected format that we will consider?
1701 - If so, returns the "core" part of the tablename, in the given context.
1702 - Otherwise, if ``allow_unprefixed`` return the input.
1703 - Otherwise, return an empty string.
1704 """
1705 prefix = tablename_prefix(from_context)
1706 if not tablename.startswith(prefix):
1707 if is_in_re(tablename, INCLUDE_TABLES_REGEX[from_context]):
1708 return tablename
1709 else:
1710 warn_once(f"Unrecognized table name style: {tablename}", log)
1711 if allow_unprefixed:
1712 return tablename
1713 else:
1714 return ""
1715 rest = tablename[len(prefix) :]
1716 if not rest:
1717 raise ValueError(f"Table name {tablename!r} only contains its prefix")
1718 xlate = CONTEXT_TO_CORE_TABLE_TRANSLATIONS[from_context]
1719 return xlate.get(rest) or rest
1722def contextual_tablename(
1723 tablename_core: str, to_context: SystmOneContext
1724) -> str:
1725 """
1726 Prefixes the "core" table name for a given context, and sometimes
1727 translates it too.
1728 """
1729 prefix = tablename_prefix(to_context)
1730 xlate = CORE_TO_CONTEXT_TABLE_TRANSLATIONS[to_context]
1731 translated = xlate.get(tablename_core)
1732 tablename = translated if translated else tablename_core
1733 return f"{prefix}{tablename}"
1736def translate_tablename(
1737 from_tablename: str,
1738 from_context: SystmOneContext,
1739 to_context: SystmOneContext,
1740):
1741 """
1742 Translates a table name from one S1 context to another.
1743 """
1744 coretab = core_tablename(from_tablename, from_context=from_context)
1745 return contextual_tablename(coretab, to_context=to_context)
1748def cpft_s1_tablename(core_tablename: str) -> str:
1749 """
1750 Helper function for the consent-for-contact system, but conceptually it
1751 sits reasonably well here.
1753 Args:
1754 core_tablename:
1755 Table name in S1 "core" format (devoid of any prefix).
1757 Returns:
1758 Returns the local CPFT table name.
1759 """
1760 return contextual_tablename(
1761 core_tablename,
1762 to_context=SystmOneContext.CPFT_DW,
1763 )
1766def core_columnname(
1767 tablename_core: str, columnname_context: str, from_context: SystmOneContext
1768) -> str:
1769 """
1770 Some contexts rename their column names. This function puts them back into
1771 the "core" (TPP SRE) name space.
1772 """
1773 xlate = CONTEXT_TO_CORE_CONTEXT_COLUMN_TRANSLATIONS[from_context]
1774 return (
1775 xlate.get((tablename_core, columnname_context)) or columnname_context
1776 )
1779def contextual_columnname(
1780 tablename_core: str, columnname_core: str, to_context: SystmOneContext
1781) -> str:
1782 """
1783 Translates a "core" column name to its contextual variant, if applicable.
1784 """
1785 xlate = CORE_TO_CONTEXT_COLUMN_TRANSLATIONS[to_context]
1786 return xlate.get((tablename_core, columnname_core)) or columnname_core
1789# =============================================================================
1790# Comments
1791# =============================================================================
1794def join_comments(comments: List[str]) -> str:
1795 """
1796 Joins comment elements, skipping any blanks.
1797 """
1798 return COMMENT_SEP.join(filter(None, comments))
1801# =============================================================================
1802# Helper classes
1803# =============================================================================
1806class SystmOneSRESpecRow:
1807 """
1808 Represents a row in the SystmOne SRE specification CSV file.
1809 """
1811 def __init__(self, d: Dict[str, Any]) -> None:
1812 """
1813 Initialize with a row dictionary from a :class:`csv.DictReader`.
1814 """
1815 self.table_name = d["TableName"] # type: str
1816 self.table_description = d["TableDescription"] # type: str
1817 self.column_name = d["ColumnName"] # type: str
1818 self.column_description = d["ColumnDescription"] # type: str
1819 self.column_data_type = d["ColumnDataType"] # type: str
1820 _col_length = d["ColumnLength"]
1821 self.column_length = int(_col_length) if _col_length else None
1822 _date_def = d["DateDefining"] # type: str
1823 if _date_def == "Yes":
1824 self.date_defining = True
1825 elif _date_def == "No":
1826 self.date_defining = False
1827 else:
1828 raise ValueError(f"Bad DateDefining field: {_date_def!r}")
1829 self.column_ordinal = int(d["ColumnOrdinal"])
1830 self.linked_table = d["LinkedTable"] # type: str
1831 self.linked_column_1 = d["LinkedColumn1"] # type: str
1832 self.linked_column_2 = d["LinkedColumn2"] # type: str
1834 @property
1835 def tablename_core(self) -> str:
1836 """
1837 Core part of the tablename.
1838 """
1839 return core_tablename(
1840 self.table_name, from_context=SystmOneContext.TPP_SRE
1841 )
1843 @property
1844 def linked_table_core(self) -> str:
1845 """
1846 Core part of the linked table name.
1847 """
1848 return core_tablename(
1849 self.linked_table, from_context=SystmOneContext.TPP_SRE
1850 )
1852 def comment(
1853 self, context: SystmOneContext, with_table: bool = True
1854 ) -> str:
1855 """
1856 Used to generate a comment for the CRATE data dictionary.
1858 Args:
1859 context:
1860 The SystmOneContext in which data is being processed.
1861 with_table:
1862 Include information about the table.
1863 """
1864 if with_table:
1865 elements = [
1866 f"TABLE: {self.table_description}",
1867 f"COLUMN: {self.column_description}",
1868 ]
1869 else:
1870 elements = [f"{self.column_description}"]
1871 if self.linked_table:
1872 context_prefix = tablename_prefix(context)
1873 linked_table_ctx = f"{context_prefix}{self.linked_table_core}"
1874 links = [] # type: List[str]
1875 if self.linked_column_1:
1876 links.append(
1877 f"FOREIGN KEY TO "
1878 f"{linked_table_ctx}.{self.linked_column_1}"
1879 )
1880 if self.linked_column_2:
1881 links.append(f"WITH {linked_table_ctx}.{self.linked_column_2}")
1882 elements.append(" ".join(links))
1883 return join_comments(elements)
1885 def description(
1886 self, context: SystmOneContext, with_table: bool = True
1887 ) -> str:
1888 """
1889 Full description line.
1891 Args:
1892 context:
1893 The SystmOneContext in which data is being processed.
1894 with_table:
1895 Include information about the table.
1896 """
1897 tname = contextual_tablename(self.tablename_core, context)
1898 cname = contextual_columnname(
1899 self.tablename_core, self.column_name, context
1900 )
1901 elements = [
1902 f"{tname}.{cname}",
1903 self.comment(context, with_table=with_table),
1904 ]
1905 return join_comments(elements)
1908class SystmOneSRESpecs:
1909 """
1910 Loads and represents the SystmOne SRE specifications.
1911 """
1913 def __init__(self, context: SystmOneContext, filename: str) -> None:
1914 """
1915 Initialize by reading a SystmOne SRE specification CSV file.
1917 context:
1918 The context from which SystmOne data is being extracted (e.g. the
1919 raw TPP Strategic Reporting Extract (SRE), or a local version
1920 processed into CPFT's Data Warehouse).
1921 filename:
1922 Optional filename for the TPP SRE specification file, in
1923 comma-separated value (CSV) format.
1924 """
1925 self.context = context
1926 self.filename = filename
1927 # All specifications:
1928 self.specs = [] # type: List[SystmOneSRESpecRow]
1929 # Map from (tablename_core, columnname) to SystmOneSRESpecRow:
1930 self.tabcolmap = {} # type: Dict[Tuple[str, str], SystmOneSRESpecRow]
1931 # Map from tablename_core to table comment:
1932 self.tabcomments = {} # type: Dict[str, str]
1933 if filename:
1934 log.info(f"Reading SystmOne SRE specs from: {filename}")
1935 with open(filename, "r") as f:
1936 reader = csv.DictReader(f)
1937 for rowdict in reader:
1938 s = SystmOneSRESpecRow(rowdict)
1939 self.specs.append(s)
1940 self.tabcolmap[s.tablename_core, s.column_name] = s
1941 self.tabcomments.setdefault(
1942 s.tablename_core, s.table_description
1943 )
1945 def debug_specs(self) -> None:
1946 """
1947 Print the specs to the debugging log.
1948 """
1949 specs_str = "\n".join(
1950 spec.description(self.context) for spec in self.specs
1951 )
1952 log.debug(f"SystmOne specs:\n{specs_str}")
1954 def get_spec_row(
1955 self, tablename_core: str, columnname: str
1956 ) -> SystmOneSRESpecRow:
1957 """
1958 Look up a row specification.
1959 """
1960 return self.tabcolmap.get((tablename_core, columnname))
1962 def table_comment(self, tablename_core: str) -> str:
1963 """
1964 Returns the table description/comment for a given table, if known, or
1965 a blank string.
1966 """
1967 return self.tabcomments.get(tablename_core) or ""
1970@dataclass
1971class ScrubSrcAlterMethodInfo:
1972 """
1973 For describing scrub-source and alter-method information.
1974 """
1976 change_comment_and_indexing_only: bool = False
1977 src_flags: str = ""
1978 scrub_src: Optional[ScrubSrc] = None
1979 scrub_method: Optional[ScrubMethod] = None
1980 decision: Decision = Decision.OMIT
1981 alter_methods: List[AlterMethod] = field(default_factory=list)
1982 dest_datatype: str = None
1983 dest_field: str = None
1985 def __post_init__(self) -> None:
1986 """
1987 Validation.
1988 """
1989 for char in self.src_flags:
1990 assert any(char == enum.value for enum in SrcFlag)
1992 def add_src_flag(self, flag: SrcFlag) -> None:
1993 """
1994 Add a flag, if it doesn't exist already.
1995 """
1996 flagchar = flag.value
1997 if flagchar not in self.src_flags:
1998 self.src_flags += flagchar
2000 def add_alter_method(self, alter_method: AlterMethod) -> None:
2001 """
2002 Adds an alteration method.
2003 """
2004 self.alter_methods.append(alter_method)
2006 def include(self) -> None:
2007 """
2008 Sets the decision to "include".
2009 """
2010 self.decision = Decision.INCLUDE
2012 def omit(self) -> None:
2013 """
2014 Sets the decision to "omit".
2015 """
2016 self.decision = Decision.OMIT
2019# =============================================================================
2020# Feature detection
2021# =============================================================================
2024def is_master_patient_table(tablename: str) -> bool:
2025 """
2026 Is this the master patient table?
2027 """
2028 return eq(tablename, S1Table.PATIENT)
2031def is_pid(colname: str, context: SystmOneContext) -> bool:
2032 """
2033 Is this column the SystmOne primary patient identifier (PID)?
2035 It's nearly always S1GenericCol.PID. But occasionally something else
2036 (e.g. in CPFT-created tables).
2038 This works for all tables EXCEPT the main "Patient" table, where the PK
2039 takes its place.
2041 Occasionally, CPFT tables blend SystmOne patients with other patients using
2042 IDs from other EHR systems. However, those patients won't be in our master
2043 patient index, so their data won't be brought through.
2044 """
2045 return is_in(colname, PID_SYNONYMS[context])
2048def is_mpid(colname: str, context: SystmOneContext) -> bool:
2049 """
2050 Is this column the master patient identifier (MPID), i.e. the NHS number?
2051 """
2052 return is_in(colname, MPID_SYNONYMS[context])
2055def is_other_system_id(colname: str, context: SystmOneContext) -> bool:
2056 """
2057 Is this column an ID from another system (e.g. RiO, PCMIS)?
2058 """
2059 return is_in(colname, OTHER_SYSTEM_IDS[context])
2062def is_pk(
2063 tablename: str,
2064 colname: str,
2065 context: SystmOneContext,
2066 ddr: DataDictionaryRow = None,
2067) -> bool:
2068 """
2069 Is this a primary key (PK) column within its table?
2070 """
2071 # This check is debatable. It's possible that the source database has
2072 # columns that are NULLable but are in fact never null and are PKs. Indeed,
2073 # that is the case; e.g. S1_FreeText.RowIdentifier is shown as "bigint,
2074 # null" in SQL Server Manager, but "SELECT COUNT(*) FROM S1_FreeText WHERE
2075 # RowIdentifier IS NULL" gives 0 rows, out of ~1M rows in total.
2076 #
2077 # if ddr.src_reflected_nullable:
2078 # return False # can't be a PK if it can be NULL
2080 # 1. If it's explicitly ruled out as a PK (e.g. it has the name that should
2081 # mean it's a PK, but it's been messed with locally, or the TPP design
2082 # team were having an off day), then it's not a PK.
2083 if is_pair_in_re(
2084 tablename, colname, NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS[context]
2085 ):
2086 return False
2087 # 2. If the source database says so (ours never does).
2088 if ddr and ddr.pk:
2089 return True
2090 # 3. If it has the standard column name, i.e. RowIdentifier, then it's
2091 # a PK.
2092 if eq(colname, S1GenericCol.ROW_ID):
2093 return True
2094 # 4. If it's a specifically noted PK.
2095 return is_pair_in_re(
2096 tablename, colname, PK_TABLENAME_COLNAME_REGEX_PAIRS[context]
2097 )
2100def is_free_text(
2101 tablename: str,
2102 colname: str,
2103 context: SystmOneContext,
2104 ddr: DataDictionaryRow = None,
2105) -> bool:
2106 """
2107 Is this a free-text field requiring scrubbing?
2109 Unusually, there is not very much free text, and it is mostly collated.
2110 (We haven't added binary support yet. Do we have the binary documents?)
2111 """
2112 if ddr and not ddr.src_is_textual:
2113 return False
2114 return is_pair_in_re(
2115 tablename, colname, FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS[context]
2116 ) and not is_pair_in_re(
2117 tablename,
2118 colname,
2119 EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS[context],
2120 )
2123def should_be_fulltext_indexed(tablename: str, colname: str) -> bool:
2124 """
2125 Is this a field that should get a FULLTEXT index? That's not just "a column
2126 that contains free text and should be scrubbed", that is "a column with a
2127 lot of interesting free text that should get a special index".
2128 """
2129 return is_pair_in_re(
2130 tablename, colname, FULLTEXT_INDEX_TABLENAME_COLNAME_REGEX_PAIRS_S1
2131 )
2134# =============================================================================
2135# Deciding about columns
2136# =============================================================================
2139def process_generic_table_column(
2140 tablename: str,
2141 colname: str,
2142 ddr: DataDictionaryRow,
2143 ssi: ScrubSrcAlterMethodInfo,
2144 context: SystmOneContext,
2145) -> bool:
2146 """
2147 Performs operations applicable to columns any SystmOne table, except a few
2148 very special ones like Patient. Modifies ``ssi`` in place.
2150 Returns: recognized and dealt with?
2151 """
2152 # -------------------------------------------------------------------------
2153 # Generic table
2154 # -------------------------------------------------------------------------
2156 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2157 # PKs, PIDs, MPIDs (which can overlap, e.g. a PK that is a PID)
2158 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2159 handled = False
2160 if is_pk(tablename, colname, context, ddr):
2161 # PK for all tables.
2162 ssi.add_src_flag(SrcFlag.PK)
2163 ssi.add_src_flag(SrcFlag.NOT_NULL)
2164 ssi.add_src_flag(SrcFlag.ADD_SRC_HASH)
2165 ssi.include()
2166 handled = True
2168 # PKs can also be other things:
2170 if is_pair_in(
2171 tablename, colname, OPT_OUT_TABLENAME_COLNAME_PAIRS[context]
2172 ):
2173 ssi.add_src_flag(SrcFlag.OPT_OUT)
2174 handled = True
2176 if is_pid(colname, context):
2177 # FK to Patient.RowIdentifier for all other patient-related tables.
2178 ssi.add_src_flag(SrcFlag.PRIMARY_PID)
2179 ssi.dest_field = ddr.config.research_id_fieldname
2180 ssi.include()
2181 handled = True
2183 elif is_mpid(colname, context):
2184 # An NHS number in a random table. OK, as long as we hash it.
2185 ssi.add_src_flag(SrcFlag.MASTER_PID)
2186 ssi.dest_field = ddr.config.master_research_id_fieldname
2187 ssi.include()
2188 handled = True
2190 elif is_other_system_id(colname, context):
2191 # Something like a RiO or PCMIS ID. Use it to scrub, but it's not a
2192 # PID or MPID.
2193 ssi.scrub_src = ScrubSrc.PATIENT
2194 ssi.scrub_method = ScrubMethod.CODE
2195 ssi.omit()
2196 handled = True
2198 if handled:
2199 return True
2201 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2202 # Other things
2203 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2204 if is_in(colname, COLS_GENERIC_EXCLUDE[context]):
2205 # Columns that are never OK in a generic table, such as duplicated
2206 # direct identifiers (handled specially in the master patient ID table
2207 # etc.).
2208 ssi.omit()
2209 # ... likely redundant but that's not obvious within this function
2211 elif is_in(colname, COLS_GENERIC_OK_UNMODIFIED_S1):
2212 # Generic columns that are always OK (e.g. organization ID).
2213 ssi.include()
2215 elif is_free_text(tablename, colname, context, ddr):
2216 # Free text to be scrubbed.
2217 ssi.add_alter_method(AlterMethod(config=ddr.config, scrub=True))
2218 ssi.include()
2220 elif is_in(colname, COLS_TRUNCATE_DATE[context]):
2221 # Truncate date?
2222 # We don't do the scrub_src and scrub_method here; we already know the
2223 # patient's DOB from the master patient table. This code is about
2224 # making sure that DOBs (for example) elsewhere are truncated, and
2225 # time information doesn't leak through.
2226 ssi.add_alter_method(
2227 AlterMethod(config=ddr.config, truncate_date=True)
2228 )
2229 ssi.dest_datatype = SQLTYPE_DATE
2230 ssi.include()
2232 else:
2233 # Unrecognized.
2234 return False
2236 return True
2239def get_scrub_alter_details(
2240 tablename: str,
2241 colname: str,
2242 ddr: DataDictionaryRow,
2243 context: SystmOneContext,
2244 include_generic: bool = False,
2245) -> ScrubSrcAlterMethodInfo:
2246 """
2247 The main "thinking" function.
2249 Is this a sensitive field that should be used for scrubbing?
2250 Should it be modified in transit?
2252 Args:
2253 tablename:
2254 The "core" tablename being considered, without any prefix (e.g.
2255 "Patient", not "SRPatient" or "S1_Patient").
2256 colname:
2257 The database column name.
2258 ddr:
2259 Data dictionary row.
2260 context:
2261 The context from which SystmOne data is being extracted (e.g. the
2262 raw TPP Strategic Reporting Extract (SRE), or a local version
2263 processed into CPFT's Data Warehouse).
2264 include_generic:
2265 Include all fields that are not known about by this code and
2266 treated specially? If False, the config file settings are used
2267 (which may omit or include). If True, all such fields are included.
2268 """
2269 ssi = ScrubSrcAlterMethodInfo(decision=Decision.OMIT) # omit by default
2271 # -------------------------------------------------------------------------
2272 # Omit table entirely (and ignore its contents for scrubbing)?
2273 # -------------------------------------------------------------------------
2274 if is_in(tablename, OMIT_AND_IGNORE_TABLES[context]) or is_in_re(
2275 tablename, OMIT_AND_IGNORE_TABLES_REGEX[context]
2276 ):
2277 return ssi
2279 # -------------------------------------------------------------------------
2280 # Deal with the core patient table. Many details here.
2281 # -------------------------------------------------------------------------
2282 if eq(tablename, S1Table.PATIENT):
2283 if eq(colname, S1GenericCol.ROW_ID):
2284 # RowIdentifier: SystmOne patient ID in the master patient table.
2285 # Hash and scrub SystmOne IDs.
2286 ssi.add_src_flag(SrcFlag.PK)
2287 ssi.add_src_flag(SrcFlag.NOT_NULL)
2288 ssi.add_src_flag(SrcFlag.PRIMARY_PID) # automatically hashed
2289 ssi.add_src_flag(SrcFlag.DEFINES_PRIMARY_PIDS)
2290 ssi.scrub_src = ScrubSrc.PATIENT
2291 ssi.scrub_method = ScrubMethod.NUMERIC
2292 ssi.include()
2294 elif eq(colname, S1GenericCol.PATIENT_ID):
2295 # IDPatient: Added by CPFT to the master patient table.
2296 # Needs to be hashed. Is a duplicate of RowIdentifier.
2297 ssi.add_src_flag(SrcFlag.PRIMARY_PID) # automatically hashed
2298 ssi.scrub_src = ScrubSrc.PATIENT
2299 ssi.scrub_method = ScrubMethod.NUMERIC
2300 ssi.include()
2302 elif is_mpid(colname, context): # NHS number
2303 # Hash and scrub NHS numbers. (Present as a text field, but is
2304 # numeric.)
2305 ssi.add_src_flag(SrcFlag.MASTER_PID) # automatically hashed
2306 ssi.scrub_src = ScrubSrc.PATIENT
2307 ssi.scrub_method = ScrubMethod.NUMERIC
2308 ssi.dest_field = ddr.config.master_research_id_fieldname
2309 ssi.include()
2311 elif is_in(colname, COLS_PATIENT_WORDS):
2312 # Scrub and omit all names.
2313 ssi.scrub_src = ScrubSrc.PATIENT
2314 ssi.scrub_method = ScrubMethod.WORDS
2316 elif eq(colname, S1PatientCol.DOB):
2317 # Truncate and scrub dates of birth.
2318 ssi.scrub_src = ScrubSrc.PATIENT
2319 ssi.scrub_method = ScrubMethod.DATE
2320 ssi.add_alter_method(
2321 AlterMethod(config=ddr.config, truncate_date=True)
2322 )
2323 ssi.dest_datatype = SQLTYPE_DATE
2324 ssi.include()
2326 elif eq(colname, S1PatientCol.DOD):
2327 # Include dates of death.
2328 ssi.include()
2330 elif eq(colname, S1PatientCol.BIRTHPLACE):
2331 # Unusual. But: scrub birthplace.
2332 ssi.scrub_src = ScrubSrc.PATIENT
2333 ssi.scrub_method = ScrubMethod.WORDS
2335 elif eq(colname, S1PatientCol.TESTPATIENT):
2336 # Exclude test patients.
2337 ssi.add_src_flag(SrcFlag.OPT_OUT)
2338 ssi.include()
2340 elif eq(colname, S1PatientCol.SOCIAL_SERVICES_REF):
2341 # Scrub and omit Social Services ID (text).
2342 ssi.scrub_src = ScrubSrc.PATIENT
2343 ssi.scrub_method = ScrubMethod.CODE
2344 ssi.omit() # just to be explicit
2346 elif is_in(colname, COLS_PATIENT_TABLE_OK_UNMODIFIED[context]):
2347 # These are OK.
2348 ssi.include()
2350 else:
2351 # If anything else is put into this table, it may be sensitive.
2352 pass # omit anything else in the master patient table
2354 # Via a separate "if" statement:
2355 if is_in(colname, COLS_REQUIRED_SCRUBBERS):
2356 ssi.add_src_flag(SrcFlag.REQUIRED_SCRUBBER)
2358 return ssi
2360 # -------------------------------------------------------------------------
2361 # Proceed for all other tables.
2362 # -------------------------------------------------------------------------
2363 handled = process_generic_table_column(
2364 tablename=tablename, colname=colname, ddr=ddr, ssi=ssi, context=context
2365 )
2366 if handled:
2367 # Recognized and handled as a generic column.
2368 return ssi
2370 if eq(tablename, S1Table.ADDRESS_HISTORY):
2371 # ---------------------------------------------------------------------
2372 # Address table.
2373 # ---------------------------------------------------------------------
2374 if is_in(colname, COLS_ADDRESS_PHRASES):
2375 ssi.scrub_src = ScrubSrc.PATIENT
2376 ssi.scrub_method = ScrubMethod.PHRASE
2378 elif is_in(colname, COLS_ADDRESS_PHRASE_UNLESS_NUMBER):
2379 ssi.scrub_src = ScrubSrc.PATIENT
2380 ssi.scrub_method = ScrubMethod.PHRASE_UNLESS_NUMERIC
2382 elif eq(colname, S1AddressCol.POSTCODE):
2383 ssi.scrub_src = ScrubSrc.PATIENT
2384 ssi.scrub_method = ScrubMethod.CODE
2386 else:
2387 # omit anything else in the address table, e.g.
2388 # CPFTAddressCol.POSTCODE_NOSPACE
2389 pass
2391 elif eq(tablename, S1Table.CONTACT_DETAILS):
2392 # ---------------------------------------------------------------------
2393 # Contact details table.
2394 # ---------------------------------------------------------------------
2395 if eq(colname, S1ContactCol.NUMBER):
2396 # Could be patient; ?could be third party; mostly patient?
2397 ssi.scrub_src = ScrubSrc.PATIENT
2398 ssi.scrub_method = ScrubMethod.NUMERIC
2400 else:
2401 pass # omit anything else in the contact details table
2403 elif eq(tablename, S1Table.RELATIONSHIPS):
2404 # ---------------------------------------------------------------------
2405 # Third-party (relationships) table.
2406 # ---------------------------------------------------------------------
2407 if is_in(colname, COLS_RELATIONSHIP_XREF_ID):
2408 # "Go fetch that linked patient, and use their identity information
2409 # as a third-party scrubber for our index patient."
2410 ssi.scrub_src = ScrubSrc.THIRDPARTY_XREF_PID
2411 ssi.scrub_method = ScrubMethod.NUMERIC
2412 ssi.include()
2414 elif is_in(colname, COLS_RELATIONSHIP_WORDS[context]):
2415 ssi.scrub_src = ScrubSrc.THIRDPARTY
2416 ssi.scrub_method = ScrubMethod.WORDS
2418 elif is_in(colname, COLS_RELATIONSHIP_DATES[context]):
2419 ssi.scrub_src = ScrubSrc.THIRDPARTY
2420 ssi.scrub_method = ScrubMethod.DATE
2422 elif is_in(colname, COLS_RELATIONSHIP_PHRASES):
2423 ssi.scrub_src = ScrubSrc.THIRDPARTY
2424 ssi.scrub_method = ScrubMethod.PHRASE
2426 elif is_in(colname, COLS_RELATIONSHIP_PHRASE_UNLESS_NUMERIC):
2427 ssi.scrub_src = ScrubSrc.THIRDPARTY
2428 ssi.scrub_method = ScrubMethod.PHRASE_UNLESS_NUMERIC
2430 elif is_in(colname, COLS_RELATIONSHIP_CODES):
2431 ssi.scrub_src = ScrubSrc.THIRDPARTY
2432 ssi.scrub_method = ScrubMethod.CODE
2434 elif is_in(colname, COLS_RELATIONSHIP_NUMBERS):
2435 ssi.scrub_src = ScrubSrc.THIRDPARTY
2436 ssi.scrub_method = ScrubMethod.NUMERIC
2438 elif is_in(colname, COLS_RELATIONSHIP_OK_UNMODIFIED[context]):
2439 ssi.include()
2441 else:
2442 pass # omit anything unknown in the relationship table
2444 elif context == SystmOneContext.CPFT_DW and eq(
2445 tablename, CPFTTable.REL_MOTHER
2446 ):
2447 # ---------------------------------------------------------------------
2448 # A CPFT partial duplicate table: from the relationship table where
2449 # that relationship is "Mother".
2450 # ---------------------------------------------------------------------
2451 if is_in(colname, COLS_RELATIONSHIP_XREF_ID):
2452 ssi.scrub_src = ScrubSrc.THIRDPARTY_XREF_PID
2453 ssi.scrub_method = ScrubMethod.NUMERIC
2454 ssi.include()
2456 elif eq(colname, CPFTOtherCol.REL_MOTHER_COL_NHSNUM):
2457 # Likely a duplicate as a scrubber. But that's not a problem for
2458 # CRATE and this also marks it as something to remove.
2459 ssi.scrub_src = ScrubSrc.THIRDPARTY
2460 ssi.scrub_method = ScrubMethod.NUMERIC
2462 elif is_in(colname, CPFT_REL_MOTHER_OK_UNMODIFIED):
2463 ssi.include()
2465 else:
2466 pass # omit anything unown
2468 elif tcmatch(
2469 tablename, colname, S1Table.HOSP_AE_NUMBERS, S1HospNumCol.HOSPNUM
2470 ):
2471 # ---------------------------------------------------------------------
2472 # A hospital number.
2473 # ---------------------------------------------------------------------
2474 ssi.scrub_src = ScrubSrc.PATIENT
2475 ssi.scrub_method = ScrubMethod.CODE # can contain text
2477 elif tcmatch(
2478 tablename,
2479 colname,
2480 S1Table.SAFEGUARDING_PERSON_AT_RISK,
2481 S1PatientCol.NHSNUM,
2482 ):
2483 # ---------------------------------------------------------------------
2484 # Another person's NHS number.
2485 # ---------------------------------------------------------------------
2486 ssi.scrub_src = ScrubSrc.THIRDPARTY
2487 ssi.scrub_method = ScrubMethod.NUMERIC
2489 elif is_pair_in(tablename, colname, OMIT_TABLENAME_COLNAME_PAIRS_S1):
2490 # ---------------------------------------------------------------------
2491 # A column to omit specifically.
2492 # ---------------------------------------------------------------------
2493 pass # omit
2495 else:
2496 # ---------------------------------------------------------------------
2497 # A generic field in a generic table.
2498 # ---------------------------------------------------------------------
2499 if include_generic:
2500 ssi.include()
2501 else:
2502 # Don't change anything except the comment:
2503 ssi.change_comment_and_indexing_only = True
2505 return ssi
2508def get_index_flag(
2509 tablename: str,
2510 colname: str,
2511 ddr: DataDictionaryRow,
2512 context: SystmOneContext,
2513) -> Optional[IndexType]:
2514 """
2515 Should this be indexed? Returns an indexing flag, or ``None`` if it should
2516 not be indexed.
2517 """
2518 if is_pk(tablename, colname, context, ddr):
2519 # PKs should have a unique index.
2520 return IndexType.UNIQUE
2521 pid = is_pid(colname, context)
2522 if is_master_patient_table(tablename) and pid:
2523 # In the master patient table, PIDs are unique.
2524 # (MPIDs aren't -- they can be NULL.)
2525 return IndexType.UNIQUE
2526 if pid or is_mpid(colname, context):
2527 # We index all patient IDs.
2528 return IndexType.NORMAL
2529 if is_pair_in_re(
2530 tablename, colname, NORMAL_INDEX_TABLENAME_COLNAME_REGEX_PAIRS[context]
2531 ):
2532 # Additional columns to index
2533 return IndexType.NORMAL
2534 if colname in GENERIC_COLS_TO_INDEX[context]:
2535 return IndexType.NORMAL
2536 if should_be_fulltext_indexed(tablename, colname):
2537 # Full-text indexes
2538 return IndexType.FULLTEXT
2539 return IndexType.NONE
2542# =============================================================================
2543# Modify a data dictionary row according to detected features
2544# =============================================================================
2547class TableCommentWorking:
2548 """
2549 Class used to store data temporarily about table comments, during SystmOne
2550 data dictionary annotation. Slightly complex because
2551 """
2553 def __init__(
2554 self,
2555 dd: DataDictionary,
2556 specifications: SystmOneSRESpecs,
2557 append_comments: bool = False,
2558 allow_unprefixed_tables: bool = False,
2559 ) -> None:
2560 """
2561 Args:
2562 dd:
2563 The data dictionary.
2564 specifications:
2565 Details of the TPP SRE specifications.
2566 append_comments:
2567 Append comments to any that were autogenerated, rather than
2568 replacing them. (If you use the SRE specifications, you may as
2569 well set this to False as the SRE specification comments are
2570 much better.)
2571 allow_unprefixed_tables:
2572 Permit tables that don't start with the expected contextual
2573 prefix? Discouraged; you may get odd tables and views.
2574 """
2575 self.dd = dd
2576 self.specifications = specifications
2577 self.append_comments = append_comments
2578 self.allow_unprefixed_tables = allow_unprefixed_tables
2580 # "Core" tablenames for which the DD has a table comment already:
2581 self.coretable_comments_already_exist = set() # type: Set[str]
2582 for ddr in dd.rows:
2583 if not ddr.is_table_comment:
2584 continue
2585 tablename_core = self._get_core_tablename(ddr.src_table)
2586 if not tablename_core:
2587 continue
2588 self.coretable_comments_already_exist.add(tablename_core)
2590 # We maintain a list of extra DDRs to add:
2591 self.extra_table_comment_rows = [] # type: List[DataDictionaryRow]
2593 def _get_core_tablename(self, dd_tablename: str) -> str:
2594 """
2595 Returns the equivalent "core" (unprefixed) table name, which we use
2596 with our SystmOne specifications file.
2598 If the input doesn't have the right prefix and allow_unprefixed_tables
2599 is False, this will return a blank string.
2600 """
2601 return core_tablename(
2602 dd_tablename,
2603 from_context=self.specifications.context,
2604 allow_unprefixed=self.allow_unprefixed_tables,
2605 )
2607 def maybe_add_table_comment(self, ddr: DataDictionaryRow):
2608 """
2609 We scan each data dictionary row via this function.
2611 - If we already have seen a comment for this table in the data
2612 dictionary, we don't do anything, UNLESS this row is itself that
2613 comment, and then if
2615 - Otherwise, we add the SystmOne comment, if found, as an extra DDR,
2616 storing it in our "extra_table_comment_rows" list.
2617 """
2618 tablename_core = self._get_core_tablename(ddr.src_table)
2619 if not tablename_core:
2620 return
2621 s1_table_comment = self.specifications.table_comment(tablename_core)
2622 if not s1_table_comment:
2623 return
2624 if tablename_core in self.coretable_comments_already_exist:
2625 if ddr.is_table_comment:
2626 # This DDR is itself the table comment. Modify it.
2627 if self.append_comments:
2628 ddr.comment = join_comments(
2629 [ddr.comment, s1_table_comment]
2630 )
2631 else:
2632 ddr.comment = s1_table_comment
2633 return
2634 # Add a new DDR for the table comment:
2635 tc_ddr = DataDictionaryRow(self.dd.config)
2636 tc_ddr.set_as_table_comment(
2637 ddr.src_db, ddr.src_table, s1_table_comment
2638 )
2639 self.extra_table_comment_rows.append(tc_ddr)
2640 self.coretable_comments_already_exist.add(tablename_core)
2643def annotate_systmone_dd_row(
2644 ddr: DataDictionaryRow,
2645 context: SystmOneContext,
2646 specifications: SystmOneSRESpecs,
2647 append_comments: bool = False,
2648 include_generic: bool = False,
2649 allow_unprefixed_tables: bool = False,
2650 table_info_in_comments: bool = True,
2651) -> None:
2652 """
2653 Modifies (in place) a data dictionary row for SystmOne.
2655 Args:
2656 ddr:
2657 The data dictionary row to amend.
2658 context:
2659 The context from which SystmOne data is being extracted (e.g. the
2660 raw TPP Strategic Reporting Extract (SRE), or a local version
2661 processed into CPFT's Data Warehouse).
2662 specifications:
2663 Details of the TPP SRE specifications.
2664 append_comments:
2665 Append comments to any that were autogenerated, rather than
2666 replacing them. (If you use the SRE specifications, you may as well
2667 set this to False as the SRE specification comments are much
2668 better.)
2669 include_generic:
2670 Include all fields that are not known about by this code and
2671 treated specially? If False, the config file settings are used
2672 (which may omit or include). If True, all such fields are included.
2673 allow_unprefixed_tables:
2674 Permit tables that don't start with the expected contextual prefix?
2675 Discouraged; you may get odd tables and views.
2676 A few (see INCLUDE_TABLES_REGEX) are explicitly included anyway.
2677 table_info_in_comments:
2678 Include table descriptions in column comments?
2679 """
2680 tablename = core_tablename(
2681 ddr.src_table,
2682 from_context=context,
2683 allow_unprefixed=allow_unprefixed_tables,
2684 )
2685 if not tablename:
2686 # It didn't have the right prefix and allow_unprefixed_tables is False.
2687 ddr.decision = Decision.OMIT
2688 return
2689 colname = core_columnname(tablename, ddr.src_field, from_context=context)
2691 debugmsg = f"Considering: {ddr.src_table}.{ddr.src_field}"
2692 if tablename != ddr.src_table or colname != ddr.src_field:
2693 debugmsg += f" [translated to 'core' version: {tablename}.{colname}]"
2694 log.debug(debugmsg)
2696 # Do our thinking
2697 ssi = get_scrub_alter_details(
2698 tablename=tablename,
2699 colname=colname,
2700 ddr=ddr,
2701 include_generic=include_generic,
2702 context=context,
2703 )
2705 if not ssi.change_comment_and_indexing_only:
2706 # Source information
2707 ddr.src_flags = ssi.src_flags
2708 ddr.scrub_src = ssi.scrub_src
2709 ddr.scrub_method = ssi.scrub_method
2711 # Output decision
2712 ddr.decision = ssi.decision
2714 # Alterations
2715 ddr.set_alter_methods_directly(ssi.alter_methods)
2717 # Destination -- mostly automatic
2718 ddr.dest_field = ssi.dest_field or ddr.dest_field
2719 ddr.dest_datatype = ssi.dest_datatype or ddr.dest_datatype
2721 # Indexing
2722 ddr.index = get_index_flag(tablename, colname, ddr, context)
2724 # Improve comment
2725 spec = specifications.get_spec_row(tablename, colname)
2726 if spec:
2727 spec_comment = spec.comment(context, with_table=table_info_in_comments)
2728 # If we have no new comment, leave the old one alone.
2729 if spec_comment:
2730 if append_comments:
2731 ddr.comment = join_comments([ddr.comment, spec_comment])
2732 else:
2733 ddr.comment = spec_comment
2736# =============================================================================
2737# Modify a data dictionary
2738# =============================================================================
2741def modify_dd_for_systmone(
2742 dd: DataDictionary,
2743 context: SystmOneContext,
2744 sre_spec_csv_filename: str = "",
2745 debug_specs: bool = False,
2746 append_comments: bool = False,
2747 include_generic: bool = False,
2748 allow_unprefixed_tables: bool = False,
2749 alter_loaded_rows: bool = False,
2750 table_info_in_comments: bool = True,
2751) -> None:
2752 """
2753 Modifies a data dictionary in place.
2755 Args:
2756 dd
2757 The data dictionary to amend.
2758 context:
2759 The context from which SystmOne data is being extracted (e.g. the
2760 raw TPP Strategic Reporting Extract (SRE), or a local version
2761 processed into CPFT's Data Warehouse).
2762 sre_spec_csv_filename:
2763 Optional filename for the TPP SRE specification file, in
2764 comma-separated value (CSV) format. If present, this will be used
2765 to add proper descriptive comments to all known fields. Highly
2766 recommended.
2767 debug_specs:
2768 Report the SRE specifications to the log.
2769 append_comments:
2770 Append comments to any that were autogenerated, rather than
2771 replacing them. (If you use the SRE specifications, you may as well
2772 set this to False as the SRE specification comments are much
2773 better.)
2774 include_generic:
2775 Include all fields that are not known about by this code and
2776 treated specially? If False, the config file settings are used
2777 (which may omit or include). If True, all such fields are included.
2778 allow_unprefixed_tables:
2779 Permit tables that don't start with the expected contextual prefix?
2780 Discouraged; you may get odd tables and views.
2781 alter_loaded_rows:
2782 Alter rows that were loaded from disk (not read from a database)?
2783 The default is to leave such rows untouched.
2784 table_info_in_comments:
2785 Include table descriptions in column comments?
2786 """
2787 log.info(f"Modifying data dictionary for SystmOne. Context = {context}")
2788 specs = SystmOneSRESpecs(context, sre_spec_csv_filename)
2789 if debug_specs:
2790 specs.debug_specs()
2791 table_comment_working = TableCommentWorking(
2792 dd=dd,
2793 specifications=specs,
2794 append_comments=append_comments,
2795 allow_unprefixed_tables=allow_unprefixed_tables,
2796 )
2797 for ddr in dd.rows:
2798 table_comment_working.maybe_add_table_comment(ddr)
2799 if ddr.from_file and not alter_loaded_rows:
2800 # Skip rows that were loaded from disk.
2801 continue
2802 if ddr.is_table_comment:
2803 continue # those were handled above
2804 annotate_systmone_dd_row(
2805 ddr=ddr,
2806 context=context,
2807 specifications=specs,
2808 append_comments=append_comments,
2809 include_generic=include_generic,
2810 allow_unprefixed_tables=allow_unprefixed_tables,
2811 table_info_in_comments=table_info_in_comments,
2812 )
2813 dd.rows += table_comment_working.extra_table_comment_rows
2814 log.info("... done")
2815 # We may have added table comments, so sort.
2816 dd.sort()