Coverage for preprocess/systmone_ddgen.py: 49%

685 statements  

« prev     ^ index     » next       coverage.py v7.8.0, created at 2026-02-05 06:46 -0600

1r""" 

2crate_anon/preprocess/systmone_ddgen.py 

3 

4=============================================================================== 

5 

6 Copyright (C) 2015, University of Cambridge, Department of Psychiatry. 

7 Created by Rudolf Cardinal (rnc1001@cam.ac.uk). 

8 

9 This file is part of CRATE. 

10 

11 CRATE is free software: you can redistribute it and/or modify 

12 it under the terms of the GNU General Public License as published by 

13 the Free Software Foundation, either version 3 of the License, or 

14 (at your option) any later version. 

15 

16 CRATE is distributed in the hope that it will be useful, 

17 but WITHOUT ANY WARRANTY; without even the implied warranty of 

18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 

19 GNU General Public License for more details. 

20 

21 You should have received a copy of the GNU General Public License 

22 along with CRATE. If not, see <https://www.gnu.org/licenses/>. 

23 

24=============================================================================== 

25 

26**Generate a CRATE data dictionary for SystmOne data.** 

27 

28 

29Notes 

30----- 

31 

32- SystmOne is a general-purpose electronic health record (EHR) system from TPP 

33 (The Phoenix Partnership): https://tpp-uk.com/products/. 

34 

35- It's widely used in general practice (GP), and in 

36 Cambridgeshire/Peterborough, ~80% of GP surgeries use it (2018 data, 

37 https://pubmed.ncbi.nlm.nih.gov/29490968/, Figure 2). 

38 

39- Cambridgeshire & Peterborough NHS Foundation Trust (CPFT) used to use 

40 SystmOne for community services, and then moved nearly all the rest of its 

41 services to SystmOne (from RiO, in the case of mental health services): 

42 Children's Directorate (12 Oct 2020), Community Hospital wards (30 Nov 2020), 

43 the rest of the Older People, Adults, and Community Directorate (7 Dec 2020), 

44 and finally the Adult and Specialist Directorate (14 Jun 2021). 

45 

46- SystmOne is centrally hosted by TPP. 

47 

48- TPP provide a nightly "Strategic Reporting extract" (SRE) of SystmOne data. 

49 

50- Its primary coding mechanisms are (1) CTV3 (Read) codes, and (2) SNOMED 

51 codes (see e.g. https://termbrowser.nhs.uk/) -- the latter are gradually 

52 taking over (as of 2021). Coded values can be numeric. For example, one entry 

53 might include: 

54 

55 - SNOMED code 718087004 

56 - SNOMED text "QRISK2 cardiovascular disease 10 year risk score" 

57 - CTV3 code "XaQVY" 

58 - CTV3 text "QRISK2 cardiovascular disease 10 year risk score" 

59 - Numeric unit "%" 

60 - Numeric value 10.4 

61 

62- SystmOne collects data mostly via "templates" and "questionnaires". Templates 

63 are perhaps closer to the heart of SystmOne (e.g. better presented in the 

64 long-form journal view) and values entered into templates are (always?) 

65 coded. Questionnaires are more free-form. Both can have free text attached to 

66 coded values. 

67 

68 

69Strategic Reporting extract 

70--------------------------- 

71 

72``SpecificationDirectory.zip`` (e.g. 2021-02-18) contains e.g. ``Specification 

73v123.csv``, which is a full description of the SRE. Principles: 

74 

75- All these tables start ``SR``, e.g. ``SR18WeekWait``, ``SRAAndEAttendance``. 

76 

77- Columns in that spreadsheet are: 

78 

79 .. code-block:: none 

80 

81 TableName 

82 TableDescription 

83 ColumnName 

84 ColumnDescription 

85 ColumnDataType -- possible values include: 

86 Boolean 

87 Date 

88 Date and Time 

89 Numeric - Integer 

90 Numeric - Real 

91 Text - Fixed 

92 Text - Variable 

93 ColumnLength -- possible values include: 

94 empty (e.g. boolean, date, date/time) 

95 8 for integer 

96 4 for real 

97 the VARCHAR length -- for both "variable" and "fixed" text types 

98 DateDefining 

99 ColumnOrdinal -- sequence number of column within table 

100 LinkedTable } 

101 LinkedColumn1 }-+ 

102 LinkedColumn2 } | 

103 +-- e.g. 

104 SROrganisation, ID 

105 SRStaffMember, RowIdentifier 

106 SRPatient, RowIdentifier, IDOrganisationVisibleTo 

107 

108- To get a table list: 

109 

110 .. code-block:: bash 

111 

112 # Poor for CSVs with newlines within their strings: 

113 tail -n+2 "Specification v123.csv" | cut -d, -f1 | sort | uniq 

114 

115 # Much better: 

116 python3 -c 'import csv; print("\n".join(row[0] for i, row in enumerate(csv.reader(open("Specification v123.csv"))) if i > 0))' | sort | uniq 

117 

118- Tables and their descriptions: 

119 

120 .. code-block:: python 

121 

122 import csv 

123 s = set() 

124 for i, row in enumerate(csv.reader(open("Specification v123.csv"))): 

125 if i > 0: 

126 s.add(f"{row[0]} - {row[1]}") 

127 

128 print("\n".join((x for x in sorted(s)))) 

129 

130 Translating that to a single line: https://www.python.org/dev/peps/pep-0289/ 

131 ... meh, hard. 

132 

133- ``SRPatient`` looks to be the master patient table here -- including names, 

134 dates of birth/death, NHS number. 

135 

136- ``Tpp Strategic Reporting Table Specification v123.rtf`` contains a nicer 

137 version of (exactly?) the same information. 

138 

139- Strategic Reporting downloads can be configured. Options include: 

140 

141 - Whether to include the shared record. (I'm not sure if that means a 

142 national thing or data from SystmOne that each patient may have consented 

143 to sharing "'out' from another organization, then 'in' to mine".) 

144 

145- When a download is set up, the recipient gets one CSV file per table 

146 selected, such as ``SRPatient.csv`` for the ``SRPatient`` table, plus some 

147 ever-present system tables: 

148 

149 - ``SRManifest.csv``, describing what you've received; 

150 - ``SRMapping.csv`` and ``SRMappingGroup.csv``, providing text for built-in 

151 lists. 

152 

153- The date format is e.g. "29 Sep 2011 14:53:28". Unknown times are marked as 

154 "00:00:00". Unknown dates give an empty string. Boolean values are ``TRUE`` 

155 or ``FALSE``. 

156 

157 

158Free-text data 

159-------------- 

160 

161The SRE does not contain free text data or binary documents by default. For 

162some Trusts, an augmented SRE is provided also, with that information. 

163 

164From ``FreeText Model.xlsx``, 2021-04-15, some of this data comes in the 

165following format: 

166 

167.. code-block:: none 

168 

169 Field Name Type Description 

170 RowIdentifier bigint The unique identifier of the 

171 record 

172 IDPatient bigint Links to patient ID in 

173 demographics 

174 IDReferralIn bigint ID of referral 

175 IDEvent bigint Links to activity event ID 

176 Question varchar(MAX) The questionnaire question 

177 [FreeText] varchar(MAX) The answer given to the above 

178 question 

179 EventDate datetime The data/time of the 

180 questionnaire 

181 SRTable varchar(100) Which SR table the record 

182 relates to 

183 IDSRTable bigint The ID of the above table 

184 QuestionnaireName varchar(255) The name of the questionnaire 

185 IDAnsweredQuestionnaire bigint The ID of the above 

186 questionnaire 

187 QuestionnaireVersionNumber int The version number of the above 

188 questionnaire 

189 IDOrganisation bigint Organisation ID of the 

190 questionnaire record 

191 CPFTGroup int Group (directorate) 

192 Directorate varchar(50) Directorate name 

193 TeamName varchar(100) Name of team linked to the 

194 referral 

195 IsMentalHealth int Mental or physical health 

196 Imported date Date imported to the database 

197 

198 (SR = Strategic Reporting.) 

199 

200Specimen values: 

201 

202- SRTable: 'SRAnsweredQuestionnaire' 

203- IDSRTable: this varies for rows with SRTable = 'SRAnsweredQuestionnaire', so 

204 I think it's the PK within the table indicated by SRTable. 

205- QuestionnaireName = 'CPFT Risk Assessment' 

206- IDAnsweredQuestionnaire = this is unique for rows with QuestionnaireName = 

207 'CPFT Risk Assessment', so I think it's the ID of the Questionnaire, and is 

208 probably a typo. 

209 

210(This ends up (in our environment) in the S1_FreeText table, as below, so it 

211likely arrives as SRFreeText.) 

212 

213However, note that ``RowIdentifier`` is **not** unique in this table. Whatever 

214they mean by "record", it isn't that. For example, there are 7 rows with one 

215common value of ``RowIdentifier`` that are clearly the 7 questions (in 

216``Question``) and textually coded answers (in ``FreeText``) to a SWEMWBS 

217questionnaire. That means that to apply a FULLTEXT index, which requires an 

218indexed unique value, we have to add one. 

219 

220 

221Key fields 

222---------- 

223 

224- ``IDPatient`` -- the SystmOne patient number, in all patient tables (PID, 

225 in CRATE terms). 

226- ``SRPatient.NHSNumber`` -- the NHS number (MPID, in CRATE terms). 

227 

228 

229Notable tables in the SRE 

230------------------------- 

231 

232- [SR]Patient, as above 

233 

234- Patient identifiers and relationship/third-party details: 

235 

236 - [SR]PatientAddressHistory 

237 - [SR]PatientContactDetails 

238 - [SR]HospitalAAndENumber 

239 

240- Relationship/third-party details: 

241 

242 - [SR]PatientRelationship 

243 - some of the safeguarding tables 

244 

245- [SR]NDOptOutPreference, re NHS national data opt out (for NHS Act s251 use) 

246 

247 - This has an IDPatient column; presumably presence indicates an active 

248 opt-out. 

249 

250- Full text and binary: 

251 

252 - [SR]Media -- contains filenames and some metadata 

253 - [SR]FreeText -- if supplied 

254 

255 

256Notable additional tables/columns in the CPFT environment 

257--------------------------------------------------------- 

258 

259- S1_FreeText -- this includes all answers to Questionnaires (linked via 

260 ``IDAnsweredQuestionnaire`` etc.). Comes from the "upgraded" SRE. 

261 

262- Several tables have identifiers linked in. For example, try: 

263 

264 .. code-block:: sql 

265 

266 SELECT * FROM information_schema.columns WHERE column_name = 'FirstName' 

267 

268 

269Notable tables omitted from the CPFT environment 

270------------------------------------------------ 

271 

272- Questionnaire -- data is linked into to AnsweredQuestionnaire (which still 

273 contains the column ``IDQuestionnaire``). 

274 

275 

276CPFT copy 

277--------- 

278 

279This broadly follows the SRE, but is expanded. Some notable differences: 

280 

281- Tables named ``SR*`` in the SRE are named ``S1_*`` in the CPFT version (e.g. 

282 ``SRPatient`` becomes ``S1_Patient``). 

283 

284- There is a ``S1_Patient.NationalDataOptOut`` column (0 or 1). 

285 

286- The local opt-out information appears in S1_ClinicalOutcome_ConsentResearch 

287 (as the OptOut field, a text field) but is clearer in 

288 S1_ClinicalOutcome_ConsentResearch_OptOutCheck, which only contains patients 

289 opting out and has: 

290 

291 .. code-block:: none 

292 

293 IDPatient = <ID_of_patient_opting_out> 

294 SNOMEDCode = 1091881000000109 

295 CTV3Code = 'XaaDb' 

296 CTV3Text = 'Declined invitation to participate in research study' 

297 

298 So for CPFT, we will autodetect this table/column 

299 (S1_ClinicalOutcome_ConsentResearch_OptOutCheck.SNOMEDCode) and the config 

300 file should contain: 

301 

302 .. code-block:: ini 

303 

304 optout_col_values = [1091881000000109] 

305 

306- There seem to be quite a few extra tables, such as: 

307 

308 .. code-block:: none 

309 

310 S1_ClinicalMeasure_QRisk 

311 S1_ClinicalMeasure_SWEMWBS 

312 S1_ClinicalMeasure_Section58 

313 

314 These look like CPFT-created tables pulling data from questionnaires or 

315 similar. 

316 

317- There is ``S1_FreeText``, where someone (NP!) has helpfully imported that 

318 additional data. 

319 

320- There is ``S1_ClinicalOutcome_ConsentResearch``, which is the traffic-light 

321 system for the CPFT Research Database. 

322 

323In more detail: 

324 

325- All data is loaded via stored procedures, available via Microsoft SQL Server 

326 Management Studio in :menuselection:`[server] --> [database] --> 

327 Programmability --> Stored Procedures`. Right-click any and choose "Modify" 

328 to view the source. For example, the stored procedure named 

329 ``dbo.load_S1_Patient`` creates the ``S1_Patient`` table. 

330 

331- ``RwNo`` or ``RwNo_Patient`` is frequently used, typically via: 

332 

333 .. code-block:: none 

334 

335 SELECT 

336 -- stuff, 

337 ROW_NUMBER() OVER ( 

338 PARTITION BY IDPatient 

339 ORDER BY DateEventRecorded DESC 

340 ) AS RwNo 

341 FROM 

342 -- somewhere 

343 WHERE 

344 RwNo = 1 

345 ; 

346 

347 SELECT 

348 -- stuff, 

349 ROW_NUMBER() OVER ( 

350 PARTITION BY IDPatient 

351 ORDER BY DateEvent DESC 

352 ) AS RwNo_Patient 

353 FROM 

354 -- somewhere 

355 ; 

356 

357 ... in other words, picking the most recent for each patient (or, without 

358 the WHERE clause, showing its sequencing within each patient). 

359 

360 

361Test patients in the live system? 

362--------------------------------- 

363 

364There are some test patients in our live system. 

365 

366.. code-block:: sql 

367 

368 SELECT COUNT(*) -- or DISTINCT firstname, surname 

369 FROM S1_Patient 

370 WHERE firstname LIKE '%test%' AND surname LIKE '%test%'; 

371 

372 -- Several present. However, in the CPFT copy, column "TestPatient" from 

373 -- this table (BOOLEAN in SRE docs) is missing. How to distinguish? 

374 

375There are several present. They should be distinguished by the ``TestPatient`` 

376column (BOOLEAN, as per the SRE docs). Our code looks for the "TestPatient" 

377column and marks it as an opt-out flag. 

378 

379.. todo:: TestPatient column missing in CPFT copy. [A/w NP 2022-03-21.] 

380 

381 

382Manual review after first draft 

383------------------------------- 

384 

385Reviewing CPFT de-identified output for patient-related content only (not 

386staff-related), per local ethics approvals. 

387 

388.. code-block:: sql 

389 

390 -- Tables in the de-identified database: 

391 SELECT table_name FROM information_schema.tables WHERE table_catalog = 'S1' ORDER BY table_name; 

392 

393All reviewed and this code tweaked accordingly. 

394 

395 

396Related tools 

397------------- 

398 

399- The OpenSAFELY research tool runs on SystmOne data (with other data linked 

400 in); it therefore provides helpful code lists. See 

401 

402 - https://github.com/opensafely 

403 - https://github.com/opensafely-core 

404 

405 This tool is one that separates researchers from data (by allowing queries, 

406 not researchers, access); it made its debut during COVID-19 with Williamson 

407 et al. (2020), https://pubmed.ncbi.nlm.nih.gov/32640463/. 

408 

409- Other such tools like DataSHIELD (https://www.datashield.org/, or e.g. Gaye 

410 et al. 2014, https://pubmed.ncbi.nlm.nih.gov/25261970/) perform similar 

411 researcher/data separation via other methods. 

412 

413""" # noqa: E501 

414 

415# todo: SystmOne (CRATE traffic-light system): implement S1_ClinicalOutcome_ConsentResearch # noqa: E501 

416 

417# ============================================================================= 

418# Imports 

419# ============================================================================= 

420 

421import csv 

422from dataclasses import dataclass, field 

423from enum import Enum 

424import logging 

425import re 

426from typing import Any, Dict, Iterable, List, Optional, Set, Tuple 

427 

428from cardinal_pythonlib.dicts import reversedict 

429from cardinal_pythonlib.enumlike import CaseInsensitiveEnumMeta 

430from cardinal_pythonlib.sql.validation import SQLTYPE_DATE 

431 

432from crate_anon.anonymise.altermethod import AlterMethod 

433from crate_anon.anonymise.constants import ( 

434 Decision, 

435 IndexType, 

436 ScrubMethod, 

437 ScrubSrc, 

438 SrcFlag, 

439) 

440from crate_anon.common.logfunc import warn_once 

441from crate_anon.anonymise.dd import DataDictionary, DataDictionaryRow 

442from crate_anon.preprocess.constants import CRATE_COL_PK 

443 

444log = logging.getLogger(__name__) 

445 

446 

447# ============================================================================= 

448# Constants 

449# ============================================================================= 

450 

451# ----------------------------------------------------------------------------- 

452# Typing 

453# ----------------------------------------------------------------------------- 

454 

455TABLE_TRANSLATION_DICT_TYPE = Dict["SystmOneContext", Dict[str, str]] 

456# ... maps a SystmOneContext to a dictionary mapping one tablename to another 

457 

458COLUMN_TRANSLATION_DICT_TYPE = Dict[ 

459 "SystmOneContext", Dict[Tuple[str, str], str] 

460] 

461# ... maps a SystmOneContext to a dictionary mapping (table, col) to newcol 

462 

463 

464# ----------------------------------------------------------------------------- 

465# Helper functions for constants 

466# ----------------------------------------------------------------------------- 

467 

468 

469def _flip_coldict(d: Dict[Tuple[str, str], str]) -> Dict[Tuple[str, str], str]: 

470 """ 

471 Flips a mapping from (tablename, col1): col2 to (tablename, col2): col1. 

472 """ 

473 return {(table, newcol): srccol for (table, srccol), newcol in d.items()} 

474 

475 

476# ----------------------------------------------------------------------------- 

477# Cosmetic 

478# ----------------------------------------------------------------------------- 

479 

480COMMENT_SEP = " // " # for combining parts of column comments 

481 

482 

483# ----------------------------------------------------------------------------- 

484# Generic regular expression 

485# ----------------------------------------------------------------------------- 

486 

487ANYTHING = ".+" # at least one character 

488 

489 

490def not_just_at_start(x: str) -> str: 

491 """ 

492 Apply a prefix so that a regex string doesn't just work at the start of a 

493 string. 

494 """ 

495 return ".*" + x 

496 

497 

498def terminate(x: str) -> str: 

499 """ 

500 Apply an end-of-string terminator to a regex string. 

501 """ 

502 return x + "$" 

503 

504 

505# ----------------------------------------------------------------------------- 

506# Contexts and table naming 

507# ----------------------------------------------------------------------------- 

508 

509 

510class SystmOneContext(Enum, metaclass=CaseInsensitiveEnumMeta): 

511 """ 

512 Environments in which we might have SystmOne data. 

513 """ 

514 

515 TPP_SRE = "TPP Strategic Reporting Extract" 

516 CPFT_DW = "CPFT Data Warehouse" 

517 

518 

519DEFAULT_SYSTMONE_CONTEXT = SystmOneContext.CPFT_DW 

520TABLE_PREFIXES = { 

521 SystmOneContext.TPP_SRE: "SR", 

522 SystmOneContext.CPFT_DW: "S1_", 

523} 

524 

525 

526# ----------------------------------------------------------------------------- 

527# Table names 

528# ----------------------------------------------------------------------------- 

529# "Core" tablename, without the SR/S1_/... prefix. 

530 

531 

532class S1Table: 

533 """ 

534 SystmOne "core" table names, with no prefix. 

535 """ 

536 

537 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

538 # Tables containing a range of patient identifiers 

539 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

540 

541 PATIENT = "Patient" # e.g. SRPatient (SRE), S1_Patient (CPFT) 

542 ADDRESS_HISTORY = "PatientAddressHistory" 

543 CARE_PLAN_REVIEW = "CarePlanReview" 

544 CONTACT_DETAILS = "PatientContactDetails" 

545 RELATIONSHIPS = "PatientRelationship" 

546 

547 # Other tables starting "Patient": 

548 # - SRPatientContactProperty: seems not relevant; describes visits/diary 

549 # events 

550 # - SRPatientGroups: e.g. to group patients in a residential home; also 

551 

552 # family, but contains no direct identifiers. 

553 # - SRPatientLeave: leave from hospital. 

554 # - SRPatientLocation: location within A&E departments, I think. 

555 # - SRPatientRegistration: registration status (and who did it); also their 

556 # preferred pharmacy; but no direct identifiers. 

557 

558 HOSP_AE_NUMBERS = "HospitalAAndENumber" 

559 SAFEGUARDING_PERSON_AT_RISK = "SafeguardingPersonAtRisk" 

560 

561 # See also OMIT_TABLENAME_COLNAME_PAIRS below. 

562 # 

563 # Other tables whose name might suggest patient identifiers: 

564 # - SRAddressBookEntry: institutional addresses only? (FK to this from 

565 # SRSafeguardingIncidentDetails, for example.) todo: check -- institutional addresses only? # noqa: E501 

566 # - SRHospitalAdmissionAndDischarge, etc. -- no external identifiers linked 

567 # to HospitalAAndENumber, just SystmOne IDs. 

568 # - SROohEmergencyCall -- no contact numbers. 

569 # - SROohTransport -- very structured. 

570 # - SROohVisit -- very structured. 

571 

572 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

573 # Tables containing free text 

574 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

575 

576 FREETEXT = "FreeText" 

577 MEDIA = "Media" # todo: binary documents -- how? 

578 

579 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

580 # Others requiring special treatment 

581 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

582 

583 ACTIVITY_EVENT = "ActivityEvent" 

584 BED_CLOSURE = "BedClosure" 

585 CONTACTS = "Contacts" 

586 DISCHARGE_DELAY = "DischargeDelay" 

587 DOCUMENTS = "Documents" 

588 MENTAL_HEALTH_ACT_APPEAL = "SectionAppeal" 

589 MENTAL_HEALTH_ACT_AWOL = "MHAWOL" 

590 NOMIS_NUMBER = "NomisNumber" # National Offender Management Info. System 

591 OUT_OF_HOURS_ACTION = "OohAction" 

592 OUT_OF_HOURS_THIRD_PARTY_CALL = "OohThirdPartyCall" 

593 SAFEGUARDING_ALLEGATION_DETAILS = "SafeguardingAllegationDetails" 

594 SAFEGUARDING_INCIDENT_DETAILS = "SafeguardingIncidentDetails" 

595 TASK = "Task" 

596 

597 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

598 # Not used here, but used for the consent-for-contact system 

599 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

600 

601 CLINICAL_CODE = "ClinicalCode" 

602 GP_PRACTICE_HISTORY = "GPPracticeHistory" 

603 HOSPITAL_CONSULTANT_EVENT = "HospitalConsultantEvent" 

604 ORGANISATION = "Organisation" 

605 REFERRAL_ALLOCATION = "ReferralAllocation" 

606 REFERRAL_ALLOCATION_STAFF = "ReferralAllocationStaff" 

607 RESPONSIBLE_PARTY = "ResponsibleParty" 

608 TEAM = "Team" 

609 

610 

611class CPFTTable: 

612 """ 

613 Selected tables that CPFT have renamed or created. 

614 """ 

615 

616 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

617 # Tables containing a range of patient identifiers 

618 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

619 

620 ADDRESS = "PatientAddress" 

621 CONTACT_DETAILS = "PatientContact" 

622 DEMOGRAPHICS = "Demographics" 

623 REL_MOTHER = "PatientRelationshipMother" 

624 

625 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

626 # Tables containing free text 

627 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

628 

629 CYP_FRS_TELEPHONE_TRIAGE = "CYPFRS_TelephoneTriage" 

630 

631 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

632 # Others requiring special treatment 

633 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

634 

635 BED_CLOSURE = "InpatientBedClosure" 

636 CONTACTS_ARCHIVE = "ContactsArchive" 

637 CONTACTS_ARCHIVE_CLIENT_SEQUENCE = "ContactsArchive_ClientSequence" 

638 MENTAL_HEALTH_ACT_APPEAL = "MentalHealthAct_SectionAppeal" 

639 MENTAL_HEALTH_ACT_AWOL = "MentalHealthAct_Awol" 

640 

641 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

642 # Not used here, but used for the consent-for-contact system 

643 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

644 

645 DIAGNOSIS = "Diagnosis" 

646 GP_PRACTICE = "PatientGPPractice" 

647 INPATIENT_CONSULTANT_EPISODE = "InpatientSpells_ConsultantEpisode" 

648 CARE_COORDINATOR = "CPACareCoordinator" 

649 

650 # These created as custom tables (quasi-views) by CPFT, with no direct 

651 # equivalent in the original (the data probably coming from SRCode or 

652 # similar): 

653 CPFTRD_PREFS = "ClinicalOutcome_ConsentResearch" 

654 CPFTRD_EMAIL = "ClinicalOutcome_ConsentResearch_EmailCheck" 

655 CPFTRD_OPT_OUT = "ClinicalOutcome_ConsentResearch_OptOutCheck" 

656 

657 

658class CrateView: 

659 """ 

660 Views created by CRATE, which do not have contextual prefixes. 

661 """ 

662 

663 CRATE_VIEW_PREFIX = "vw_crate_" 

664 

665 GEOGRAPHY_VIEW = CRATE_VIEW_PREFIX + "PatientAddressWithResearchGeography" 

666 TESTPATIENT_VIEW = CRATE_VIEW_PREFIX + "FindExtraTestPatients" 

667 

668 

669# ----------------------------------------------------------------------------- 

670# Table collections 

671# ----------------------------------------------------------------------------- 

672# Tables are referred to here by their "core" name, i.e. after removal of 

673# prefixes like "SR" or "S1_", if they have one. 

674 

675# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

676# Tables to include 

677# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

678 

679_INCLUDE_TABLES_REGEX_S1 = ( 

680 # Include even if --systmone_allow_unprefixed_tables is not used. 

681 CrateView.CRATE_VIEW_PREFIX, 

682) 

683_INCLUDE_TABLES_REGEX_CPFT = ("vw",) # some other views 

684INCLUDE_TABLES_REGEX = { 

685 SystmOneContext.TPP_SRE: _INCLUDE_TABLES_REGEX_S1, 

686 SystmOneContext.CPFT_DW: _INCLUDE_TABLES_REGEX_S1 

687 + _INCLUDE_TABLES_REGEX_CPFT, 

688} 

689 

690# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

691# Tables to omit 

692# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

693 

694_OMIT_AND_IGNORE_TABLES_S1 = ( 

695 S1Table.NOMIS_NUMBER, # NOMIS (prison) numbers 

696 S1Table.SAFEGUARDING_ALLEGATION_DETAILS, 

697 # ... sensitive and no patient ID column 

698) 

699_OMIT_AND_IGNORE_TABLES_CPFT = ( 

700 # CPFT extras: 

701 "Deaths", # has NHS number then multi-system ID but not consistent SystmOne patient ID # noqa: E501 

702 "gr_workings", # no idea 

703 "InpatientAvailableBeds", # RowIdentifier very far from unique; ?no PK; no patient info # noqa: E501 

704) 

705OMIT_AND_IGNORE_TABLES = { 

706 SystmOneContext.TPP_SRE: _OMIT_AND_IGNORE_TABLES_S1, 

707 SystmOneContext.CPFT_DW: _OMIT_AND_IGNORE_TABLES_S1 

708 + _OMIT_AND_IGNORE_TABLES_CPFT, 

709} 

710 

711_OMIT_AND_IGNORE_TABLES_REGEX_S1 = () 

712_OMIT_AND_IGNORE_TABLES_REGEX_CPFT = ( 

713 # CPFT extras: 

714 "Accommodation_", 

715 # ... e.g. Accommodation_20210329, Accommodation_Wendy 

716 "AuditLog", # may have gone now! Was there for a while. Not relevant. 

717 # KEPT: ClinicalOutcome_NHS_Staff_LongCovid. 

718 # This one filters for CTV3 codes "Y2c49" and "Y2ca". As d/w NP 2021-12-14: 

719 # - These are "local" codes. 

720 # ("Official" Read codes have been frozen since 2016, according to 

721 # https://datadictionary.nhs.uk/supporting_information/read_coded_clinical_terms.html.) # noqa: E501 

722 # - They indicate whether someone is a healthcare worker (HCW). 

723 # - Introduced for COVID-19, since HCW status was a clear risk factor for 

724 # infection. 

725 # - No detail suggesting identification (and they don't indicate e.g. 

726 # membership of a specific Trust, or a specific job role). 

727 "Inpatients", 

728 # S1_Inpatients, S1_Inpatients_20201020: current inpatients -- but these 

729 # tables have NHSNumber as FLOAT. Exclude them. 

730 "Mortality", # includes S1_Mortality, S1_MortalityAdditionalInfo 

731 # These contain (a) age (rather than DOB) information, and (b) information 

732 # from multiple systems -- some risk of including RiO patients with 

733 # coincidentally the same ClientID as a SystmOne IDPatient, unless 

734 # filtered, and is derived information anyway -- for now, we'll omit. 

735 "ReferralsOpen$", 

736 # This CPFT table is a non-patient table (but with potentially identifiable 

737 # information about referral reason? -- maybe not) -- skip it. 

738 "WaitList_", # S1_Waitlist_* 

739 # Waiting list tables use a confusing blend of SystmOne "IDPatient" and 

740 # RiO "ClientID" columns, and it's not clear they add much. 

741 "UserSmartCard", 

742 # Not relevant clinically. 

743 # I considered excluding "vw.*" (views) and "zzz.*" (scratch tables) here, 

744 # but the user has the option to exclude all such tables via 

745 # --systmone_allow_unprefixed_tables if they desire. Views may be useful; 

746 # see also INCLUDE_TABLES_REGEX above. However, "zz" or "zzz" tables in 

747 # CPFT are scratch tables that should not be used: 

748 "zz", 

749 # Some have suffixes e.g. "S1_ReferralsIn_20200917", i.e. end with an 

750 # underscore then 8 digits. These are temporary copies that we should not 

751 # use. Some have more after that date. 

752 r"\w+_\d{8}", 

753 # If a table has the suffix "_old", we probably don't want it! 

754 r"\w+_old", 

755) 

756OMIT_AND_IGNORE_TABLES_REGEX = { 

757 SystmOneContext.TPP_SRE: _OMIT_AND_IGNORE_TABLES_REGEX_S1, 

758 SystmOneContext.CPFT_DW: _OMIT_AND_IGNORE_TABLES_REGEX_S1 

759 + _OMIT_AND_IGNORE_TABLES_REGEX_CPFT, 

760} 

761 

762# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

763# Tables that have been renamed 

764# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

765 

766_S1_TO_CPFT_TABLE_TRANSLATION = { 

767 # Where CPFT has renamed a S1 SRE table directly. 

768 S1Table.ADDRESS_HISTORY: CPFTTable.ADDRESS, 

769 # ... i.e. CPFT have renamed SRPatientAddressHistory to S1_PatientAddress. 

770 S1Table.BED_CLOSURE: CPFTTable.BED_CLOSURE, 

771 S1Table.CLINICAL_CODE: CPFTTable.DIAGNOSIS, 

772 S1Table.CONTACT_DETAILS: CPFTTable.CONTACT_DETAILS, 

773 S1Table.HOSPITAL_CONSULTANT_EVENT: CPFTTable.INPATIENT_CONSULTANT_EPISODE, 

774 S1Table.MENTAL_HEALTH_ACT_APPEAL: CPFTTable.MENTAL_HEALTH_ACT_APPEAL, 

775 S1Table.MENTAL_HEALTH_ACT_AWOL: CPFTTable.MENTAL_HEALTH_ACT_AWOL, 

776 S1Table.GP_PRACTICE_HISTORY: CPFTTable.GP_PRACTICE, 

777 S1Table.RESPONSIBLE_PARTY: CPFTTable.CARE_COORDINATOR, 

778} 

779CORE_TO_CONTEXT_TABLE_TRANSLATIONS = { 

780 # Key: destination context. 

781 # Value: translation dictionary, mapping "core" tablename to target. 

782 # Absent values lead to no translation. 

783 SystmOneContext.TPP_SRE: {}, 

784 SystmOneContext.CPFT_DW: _S1_TO_CPFT_TABLE_TRANSLATION, 

785} # type: TABLE_TRANSLATION_DICT_TYPE 

786CONTEXT_TO_CORE_TABLE_TRANSLATIONS = { 

787 SystmOneContext.TPP_SRE: {}, 

788 SystmOneContext.CPFT_DW: reversedict(_S1_TO_CPFT_TABLE_TRANSLATION), 

789} # type: TABLE_TRANSLATION_DICT_TYPE 

790 

791# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

792# Tables that look like they have a proper PK, but don't, and we very much want 

793# them to. 

794# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

795 

796TABLES_REQUIRING_CRATE_PK_REGEX = ( 

797 # Tables go here if we have to add a PK-style column/index -- usually 

798 # because we want to apply a FULLTEXT index. 

799 S1Table.FREETEXT, 

800 # ... unterminated, so includes FreeText (S1) and FreeText_* (CPFT) 

801) 

802 

803 

804# ----------------------------------------------------------------------------- 

805# Column names 

806# ----------------------------------------------------------------------------- 

807# We work internally with TPP SRE column names. Any renaming (e.g. in CPFT) is 

808# explicitly noted. 

809 

810 

811class S1GenericCol: 

812 """ 

813 Columns used in many SystmOne tables. 

814 """ 

815 

816 CTV3_CODE = "CTV3Code" # Read code 

817 CTV3_TEXT = "CTV3Text" # ... and corresponding description 

818 EVENT_ID = "IDEvent" # FK to SREvent.RowIdentifier 

819 EVENT_OCCURRED_WHEN = "DateEvent" # when event happened 

820 EVENT_RECORDED_WHEN = "DateEventRecorded" # when event recorded 

821 FREETEXT = "FreeText" 

822 NOTES_SUFFIX = "Notes" 

823 # ... "Notes", but also "ends with 'Notes'", e.g. AdmissionNotes, 

824 # IncidentNotes, LocationNotes 

825 ORG_ID_DONE_AT = "IDOrganisationDoneAt" # FK to SROrganisation.ID 

826 ORG_ID_ENTERED_AT = "IDOrganisation" # org at which the data was entered 

827 ORG_ID_VISIBLE_TO = "IDOrganisationVisibleTo" # FK to SROrganisation.ID 

828 ORG_REGISTERED_AT = "IDOrganisationRegisteredAt" 

829 # ... org where the patient was registered when the data was entered 

830 PATIENT_ID = "IDPatient" # FK to SRPatient.RowIdentifier 

831 ROW_ID = "RowIdentifier" # PK for nearly all SystmOne original tables 

832 QUESTIONNAIRE_ID = "IDAnsweredQuestionnaire" 

833 # ... FK to SRAnsweredQuestionnaire.RowIdentifier 

834 REFERRAL_ID = "IDReferralIn" # FK to SRReferralIn.RowIdentifier 

835 SNOMED_CODE = "SNOMEDCode" # SNOMED-CT code 

836 SNOMED_TEXT = "SNOMEDText" # ... and corresponding description 

837 STAFF_ID_DONE_BY = "IDDoneBy" # FK to SRStaffMember.RowIdentifier 

838 STAFF_PROFILE_ID_RECORDED_BY = "IDProfileEnteredBy" 

839 # ... FK to SRStaffMemberProfile.RowIdentifier 

840 

841 

842class CPFTGenericCol: 

843 """ " 

844 CPFT variants for generic column names. 

845 """ 

846 

847 AGE_YEARS = "AgeInYears" 

848 # ... usually "at the time of calculation, or death", i.e. unhelpful if you 

849 # are unsure when the data was extracted; see stored procedure 

850 # load_S1_Patient. 

851 ASSIGNMENT_NUMBER = "AssignmentNumber" 

852 # ... payroll number of member of staff, I think -- too sensitive, and I am 

853 # surprised it is there in the first place. 

854 HOME_CONTACT_NUMBER = "HomeContactNumber" 

855 MOBILE_CONTACT_NUMBER = "MobileContactNumber" 

856 NHSNUM2 = "NHSNumber2" 

857 NHSNUM_MOTHER = "CYPHS_NHSNumber_Mother" 

858 PATIENT_ADDRESS = "PatientAddress" 

859 PATIENT_NAME = "PatientName" 

860 POSTCODE = "PostCode" 

861 PATIENT_ID_SYNONYM_1 = "Patient_ID" 

862 

863 

864class CrateS1ViewCol: 

865 """ 

866 Additional columns added by CRATE's preprocessor 

867 """ 

868 

869 IS_TEST_PATIENT = "is_test_patient" 

870 

871 

872class S1PatientCol: 

873 """ 

874 Columns in the Patient table. 

875 """ 

876 

877 PK = S1GenericCol.ROW_ID # RowIdentifier 

878 NHSNUM = "NHSNumber" 

879 TITLE = "Title" 

880 FORENAME = "FirstName" 

881 MIDDLE_NAMES = "MiddleNames" 

882 SURNAME = "Surname" 

883 PREV_SURNAME = "PreviousSurname" 

884 EMAIL = "EmailAddress" 

885 DOB = "DateBirth" 

886 DOD = "DateDeath" 

887 BIRTHPLACE = "BirthPlace" 

888 GENDER = "Gender" 

889 SPEAKS_ENGLISH = "SpeaksEnglish" # curious that this is a specific flag 

890 TESTPATIENT = "TestPatient" 

891 SOCIAL_SERVICES_REF = "SSRef" 

892 SPINE_MATCHED = "SpineMatched" 

893 

894 

895class CPFTPatientCol: 

896 """ 

897 CPFT variants for the patient table. 

898 """ 

899 

900 MIDDLE_NAMES = "GivenName2" 

901 DOB = "DOB" 

902 

903 

904class S1AddressCol: 

905 """ 

906 Columns in the PatientAddressHistory table. 

907 """ 

908 

909 ADDRESS_TYPE = "AddressType" 

910 CCG_OF_RESIDENCE = "CcgOfResidence" 

911 DATE_TO = "DateTo" 

912 BUILDING_NAME = "NameOfBuilding" 

913 BUILDING_NUMBER = "NumberOfBuilding" 

914 ROAD = "NameOfRoad" 

915 LOCALITY = "NameOfLocality" 

916 TOWN = "NameOfTown" 

917 COUNTY = "NameOfCounty" 

918 POSTCODE = "FullPostCode" 

919 

920 

921class CPFTAddressCol: 

922 """ 

923 CPFT variants for the address table. 

924 """ 

925 

926 POSTCODE_NOSPACE = "PostCode_NoSpaces" 

927 

928 

929class S1ContactCol: 

930 """ 

931 Columns in the PatientContactDetails table. 

932 """ 

933 

934 NUMBER = "ContactNumber" 

935 

936 

937class S1RelCol: 

938 """ 

939 Columns in the PatientRelationship table. 

940 (This is also one for which we specify everything in detail, since CPFT add 

941 in extra identifiers.) 

942 """ 

943 

944 RELATED_ID_DEPRECATED = "IDRelationshipWithPatient" 

945 # ... replaced by IDPatientRelationshipWith 

946 RELATED_ID = "IDPatientRelationshipWith" 

947 RELATED_STAFFCODE_OR_RELNHSNUM = "CodeRelationshipWithUser" 

948 # ... SRE help says "The ODS code for the staff member the relationship is 

949 # with". However, it seems that it sometimes contains the NHS number of the 

950 # relative (certainly an NHS number that differs from the patient's!). 

951 NAME = "RelationshipWithName" 

952 DOB = "RelationshipWithDateOfBirth" 

953 ADDRESS_HOUSE_NAME = "RelationshipWithHouseName" 

954 ADDRESS_HOUSE_NUMBER = "RelationshipWithHouseNumber" 

955 ADDRESS_ROAD = "RelationshipWithRoad" 

956 ADDRESS_LOCALITY = "RelationshipWithLocality" 

957 ADDRESS_POST_TOWN = "RelationshipWithPostTown" 

958 ADDRESS_COUNTY = "RelationshipWithCounty" 

959 ADDRESS_POSTCODE = "RelationshipWithPostCode" 

960 ADDRESS_TELEPHONE = "RelationshipWithTelephone" 

961 ADDRESS_WORK_TELEPHONE = "RelationshipWithWorkTelephone" 

962 ADDRESS_MOBILE_TELEPHONE = "RelationshipWithMobileTelephone" 

963 ADDRESS_FAX = "RelationshipWithFax" 

964 ADDRESS_EMAIL = "RelationshipWithEmailAddress" 

965 # Fields about the timing/nature of the relationship: 

966 DATE_ENDED = "DateEnded" 

967 REL_TYPE = "RelationshipType" 

968 GUARDIAN_PROXY = "PersonalGuardianOrProxy" 

969 NEXT_OF_KIN = "NextOfKin" 

970 CARER = "CaresForPatient" 

971 PRINCIPAL_CARER = "PrincipalCarerForPatient" 

972 KEYHOLDER = "KeyHolder" 

973 PARENTAL_RESPONSIBILITY = "HasParentalResponsibility" 

974 FINANCIAL_REP = "FinancialRepresentative" 

975 ADVOCATE = "Advocate" 

976 MAIN_VISITOR = "MainVisitor" 

977 CALLBACK_CONSENT = "CallCentreCallBackConsent" 

978 COPY_CORRESPONDENCE = "CopyCorrespondence" 

979 CONTACT_ORDER = "ContactOrder" 

980 CONTACT_METHOD = "ContactMethod" 

981 COMMS_FORMAT = "CommunicationFormat" 

982 INTERPRETER_REQUIRED = "InterpreterRequired" 

983 # and things that are about the relative but not directly identifying: 

984 SEX = "RelationshipWithSex" 

985 LANGUAGE = "RelationshipWithSpokenLanguage" 

986 ORG = "RelationshipWithOrganisation" 

987 

988 

989class CPFTOtherCol: 

990 """ 

991 Other CPFT variants. 

992 """ 

993 

994 REL_MOTHER_COL_NHSNUM = S1PatientCol.NHSNUM 

995 

996 

997class S1HospNumCol: 

998 """ 

999 Columns in the HospitalAAndENumber table. 

1000 """ 

1001 

1002 HOSPNUM = "HospitalNumber" 

1003 COMMENTS = "Comments" 

1004 

1005 

1006# ----------------------------------------------------------------------------- 

1007# Column collections 

1008# ----------------------------------------------------------------------------- 

1009 

1010# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1011# Columns that have been renamed 

1012# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1013 

1014_S1_TO_CPFT_COLUMN_TRANSLATION = { 

1015 # Where CPFT has renamed a column. 

1016 # - Key: (core_tablename, colname) tuple. 

1017 # - Value: new CPFT column name. 

1018 (S1Table.PATIENT, S1PatientCol.MIDDLE_NAMES): CPFTPatientCol.MIDDLE_NAMES, 

1019 (S1Table.PATIENT, S1PatientCol.DOB): CPFTPatientCol.DOB, 

1020 ( 

1021 S1Table.RELATIONSHIPS, 

1022 S1RelCol.RELATED_STAFFCODE_OR_RELNHSNUM, 

1023 ): S1PatientCol.NHSNUM, 

1024} 

1025CORE_TO_CONTEXT_COLUMN_TRANSLATIONS = { 

1026 # Key: destination context. 

1027 # Value: translation dictionary -- see e.g. S1_TO_CPFT_COLUMN_TRANSLATION. 

1028 SystmOneContext.TPP_SRE: {}, 

1029 SystmOneContext.CPFT_DW: _S1_TO_CPFT_COLUMN_TRANSLATION, 

1030} # type: COLUMN_TRANSLATION_DICT_TYPE 

1031CONTEXT_TO_CORE_CONTEXT_COLUMN_TRANSLATIONS = { 

1032 SystmOneContext.TPP_SRE: {}, 

1033 SystmOneContext.CPFT_DW: _flip_coldict(_S1_TO_CPFT_COLUMN_TRANSLATION), 

1034} # type: COLUMN_TRANSLATION_DICT_TYPE 

1035 

1036# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1037# PID column names 

1038# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1039 

1040_PID_SYNONYMS_S1 = (S1GenericCol.PATIENT_ID,) 

1041_PID_SYNONYMS_CPFT = ( 

1042 "PatientID", # e.g. in CPFT: S1_eDSM (a CPFT table) 

1043 "PatID", # e.g. in CPFT: ASCRIBE_Statin 

1044) 

1045PID_SYNONYMS = { 

1046 SystmOneContext.TPP_SRE: _PID_SYNONYMS_S1, 

1047 SystmOneContext.CPFT_DW: _PID_SYNONYMS_S1 + _PID_SYNONYMS_CPFT, 

1048} 

1049 

1050# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1051# MPID column names 

1052# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1053 

1054_MPID_SYNONYMS_S1 = (S1PatientCol.NHSNUM,) 

1055_MPID_SYNONYMS_CPFT = (CPFTGenericCol.NHSNUM2,) 

1056MPID_SYNONYMS = { 

1057 SystmOneContext.TPP_SRE: _MPID_SYNONYMS_S1, 

1058 SystmOneContext.CPFT_DW: _MPID_SYNONYMS_S1 + _MPID_SYNONYMS_CPFT, 

1059} 

1060 

1061# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1062# Other system identifiers 

1063# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1064 

1065_OTHER_SYSTEM_IDS_CPFT = ( 

1066 "ClientID", 

1067 # ... seen in CPFT -- although often these tables should be excluded and 

1068 # this field contains RiO (not SystmOne) IDs. However, some look at least 

1069 # partially valid (e.g. S1_Deaths.ClientID, S1_Mortality.ClientID, 

1070 # S1_MortalityAdditionalInfo.ClientID). For the last of those, there is an 

1071 # explicit "SourceSystem" column; sometimes this is SystmOne, sometimes 

1072 # RiO, etc. Using ClientID as a primary patient ID will mean that only IDs 

1073 # present in the SystmOne master table will be taken. However, there is 

1074 # also potential for confusion, so we exclude these tables above. 

1075 # 

1076 # Empirically: S1_Deaths.ClientID contains IDs that look neither like 

1077 # SystmOne IDs (integer) or RiO ones (integer) or CRS/CDL ("M" prefix) but 

1078 # hav e.g. an "AP<integer>" format -- ah, it's PCMIS. 

1079) 

1080OTHER_SYSTEM_IDS = { 

1081 SystmOneContext.TPP_SRE: (), 

1082 SystmOneContext.CPFT_DW: _OTHER_SYSTEM_IDS_CPFT, 

1083} 

1084 

1085# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1086# Columns to treat as safe 

1087# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1088 

1089COLS_GENERIC_OK_UNMODIFIED_S1 = ( 

1090 S1GenericCol.ORG_ID_DONE_AT, 

1091 S1GenericCol.STAFF_ID_DONE_BY, 

1092 S1GenericCol.EVENT_ID, 

1093 S1GenericCol.EVENT_OCCURRED_WHEN, 

1094 S1GenericCol.EVENT_RECORDED_WHEN, 

1095 S1GenericCol.ORG_ID_ENTERED_AT, 

1096 S1GenericCol.ORG_ID_VISIBLE_TO, 

1097 S1GenericCol.ORG_REGISTERED_AT, 

1098 S1GenericCol.STAFF_PROFILE_ID_RECORDED_BY, 

1099) 

1100_COLS_PATIENT_TABLE_OK_UNMODIFIED_S1 = COLS_GENERIC_OK_UNMODIFIED_S1 + ( 

1101 # This list exists because we don't assume that things in the Patient table 

1102 # are safe -- we assume they are unsafe, and let them through only if we 

1103 # know about them. So we add "safe" things (that are not direct 

1104 # identifiers) here. 

1105 S1PatientCol.GENDER, 

1106 S1PatientCol.SPEAKS_ENGLISH, 

1107 S1PatientCol.SPINE_MATCHED, 

1108) 

1109_COLS_PATIENT_TABLE_OK_UNMODIFIED_CPFT = ( 

1110 # Added by CPFT: 

1111 "DeathIndicator", # binary version (0 alive, 1 dead) 

1112 "NationalDataOptOut", # Added by CPFT (from NDOptOutPreference)?) 

1113 # - CPFT also add "RwNo" (bigint), but it's always 1 here. See above. 

1114 # - We ignore "AgeInYears" (added by CPFT) since that is dangerous and 

1115 # depends on when you ask. 

1116) 

1117COLS_PATIENT_TABLE_OK_UNMODIFIED = { 

1118 SystmOneContext.TPP_SRE: _COLS_PATIENT_TABLE_OK_UNMODIFIED_S1, 

1119 SystmOneContext.CPFT_DW: _COLS_PATIENT_TABLE_OK_UNMODIFIED_S1 

1120 + _COLS_PATIENT_TABLE_OK_UNMODIFIED_CPFT, 

1121} 

1122 

1123_COLS_RELATIONSHIP_OK_UNMODIFIED_S1 = ( 

1124 S1RelCol.DATE_ENDED, 

1125 S1RelCol.REL_TYPE, 

1126 S1RelCol.GUARDIAN_PROXY, 

1127 S1RelCol.NEXT_OF_KIN, 

1128 S1RelCol.CARER, 

1129 S1RelCol.PRINCIPAL_CARER, 

1130 S1RelCol.KEYHOLDER, 

1131 S1RelCol.PARENTAL_RESPONSIBILITY, 

1132 S1RelCol.FINANCIAL_REP, 

1133 S1RelCol.ADVOCATE, 

1134 S1RelCol.MAIN_VISITOR, 

1135 S1RelCol.CALLBACK_CONSENT, 

1136 S1RelCol.COPY_CORRESPONDENCE, 

1137 S1RelCol.CONTACT_ORDER, 

1138 S1RelCol.CONTACT_METHOD, 

1139 S1RelCol.COMMS_FORMAT, 

1140 S1RelCol.INTERPRETER_REQUIRED, 

1141 S1RelCol.SEX, 

1142 S1RelCol.LANGUAGE, 

1143 S1RelCol.ORG, 

1144) 

1145_COLS_RELATIONSHIP_OK_UNMODIFIED_CPFT = ( 

1146 # Added by CPFT: 

1147 "DateDeath", # date of death of relative 

1148) 

1149COLS_RELATIONSHIP_OK_UNMODIFIED = { 

1150 SystmOneContext.TPP_SRE: _COLS_RELATIONSHIP_OK_UNMODIFIED_S1, 

1151 SystmOneContext.CPFT_DW: _COLS_RELATIONSHIP_OK_UNMODIFIED_S1 

1152 + _COLS_RELATIONSHIP_OK_UNMODIFIED_CPFT, 

1153} 

1154 

1155CPFT_REL_MOTHER_OK_UNMODIFIED = () 

1156 

1157# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1158# Columns to exclude from the output 

1159# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1160 

1161_COLS_GENERIC_EXCLUDE_S1 = ( 

1162 # Columns to exclude, regardless of table. 

1163 # 

1164 # This is primarily for when CPFT put identifiers all over the place (e.g. 

1165 # joining them from the patient table). We can simply exclude (rather than 

1166 # using them for scrubbing) since they are duplicates (for convenience) of 

1167 # information in other tables like Patient, PatientContactDetails, 

1168 # PatientAddressHistory. 

1169 S1PatientCol.DOB, 

1170 S1PatientCol.EMAIL, 

1171 S1PatientCol.FORENAME, 

1172 S1PatientCol.MIDDLE_NAMES, 

1173 S1PatientCol.PREV_SURNAME, 

1174 S1PatientCol.SURNAME, 

1175 S1PatientCol.TITLE, 

1176 # ... unnecessary -- and might be rare, e.g. Lord High Admiral 

1177 S1AddressCol.BUILDING_NAME, 

1178 S1AddressCol.BUILDING_NUMBER, 

1179 S1AddressCol.COUNTY, 

1180 S1AddressCol.LOCALITY, 

1181 S1AddressCol.POSTCODE, 

1182 S1AddressCol.ROAD, 

1183 S1AddressCol.TOWN, 

1184 S1RelCol.ADDRESS_COUNTY, 

1185 S1RelCol.ADDRESS_EMAIL, 

1186 S1RelCol.ADDRESS_FAX, 

1187 S1RelCol.ADDRESS_HOUSE_NAME, 

1188 S1RelCol.ADDRESS_HOUSE_NUMBER, 

1189 S1RelCol.ADDRESS_LOCALITY, 

1190 S1RelCol.ADDRESS_MOBILE_TELEPHONE, 

1191 S1RelCol.ADDRESS_POST_TOWN, 

1192 S1RelCol.ADDRESS_POSTCODE, 

1193 S1RelCol.ADDRESS_ROAD, 

1194 S1RelCol.ADDRESS_TELEPHONE, 

1195 S1RelCol.ADDRESS_WORK_TELEPHONE, 

1196 S1RelCol.DOB, 

1197 S1RelCol.NAME, 

1198 S1HospNumCol.HOSPNUM, # just in case 

1199) 

1200_COLS_GENERIC_EXCLUDE_CPFT = ( 

1201 CPFTAddressCol.POSTCODE_NOSPACE, 

1202 CPFTGenericCol.AGE_YEARS, 

1203 # ... age when? Unhelpful. (And also, potentially leads to DOB discovery; 

1204 # a blurred age near a birthday might be un-blurred by this.) 

1205 CPFTGenericCol.ASSIGNMENT_NUMBER, # could hash it, but still, sensitive. 

1206 CPFTGenericCol.HOME_CONTACT_NUMBER, 

1207 CPFTGenericCol.MOBILE_CONTACT_NUMBER, 

1208 CPFTGenericCol.NHSNUM2, 

1209 CPFTGenericCol.NHSNUM_MOTHER, 

1210 CPFTGenericCol.PATIENT_ADDRESS, 

1211 CPFTGenericCol.PATIENT_NAME, 

1212 CPFTGenericCol.POSTCODE, 

1213 CPFTPatientCol.DOB, 

1214 CPFTPatientCol.MIDDLE_NAMES, 

1215) 

1216COLS_GENERIC_EXCLUDE = { 

1217 SystmOneContext.TPP_SRE: _COLS_GENERIC_EXCLUDE_S1, 

1218 SystmOneContext.CPFT_DW: _COLS_GENERIC_EXCLUDE_S1 

1219 + _COLS_GENERIC_EXCLUDE_CPFT, 

1220} 

1221 

1222OMIT_TABLENAME_COLNAME_PAIRS_S1 = ( 

1223 # Other specific fields to omit. 

1224 (S1Table.CONTACTS, "ContactWith"), 

1225 (S1Table.HOSP_AE_NUMBERS, S1HospNumCol.COMMENTS), 

1226 (S1Table.OUT_OF_HOURS_ACTION, "Details"), 

1227 # ... out-of-hours calls; details can sometimes contain phone numbers 

1228 (S1Table.OUT_OF_HOURS_THIRD_PARTY_CALL, "Contact"), # free text 

1229 (S1Table.SAFEGUARDING_INCIDENT_DETAILS, "PoliceReference"), 

1230) 

1231 

1232# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1233# Columns containing scrub-source information 

1234# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1235 

1236COLS_PATIENT_WORDS = ( 

1237 # Scrub as words. 

1238 S1PatientCol.FORENAME, 

1239 S1PatientCol.MIDDLE_NAMES, 

1240 S1PatientCol.SURNAME, 

1241 S1PatientCol.PREV_SURNAME, 

1242 S1PatientCol.EMAIL, 

1243 S1PatientCol.BIRTHPLACE, # Unusual. But: scrub birthplace. 

1244) 

1245 

1246COLS_REQUIRED_SCRUBBERS = ( 

1247 # Information that must be present in the master patient table. 

1248 S1PatientCol.PK, # likely redundant! It's the PID "definer". 

1249 S1PatientCol.FORENAME, 

1250 S1PatientCol.SURNAME, 

1251 S1PatientCol.DOB, 

1252) 

1253 

1254COLS_ADDRESS_PHRASES = ( 

1255 # Scrub as phrases. 

1256 # - For S1AddressCol.BUILDING_NAME, see below. 

1257 # - For S1AddressCol.BUILDING_NUMBER, see below. 

1258 S1AddressCol.ROAD, 

1259 S1AddressCol.LOCALITY, 

1260 S1AddressCol.TOWN, 

1261 S1AddressCol.COUNTY, 

1262) 

1263COLS_ADDRESS_PHRASE_UNLESS_NUMBER = ( 

1264 S1AddressCol.BUILDING_NAME, 

1265 S1AddressCol.BUILDING_NUMBER, 

1266 # S1AddressCol.BUILDING_NUMBER poses a new problem for us: this is meant 

1267 # to be e.g. "5", which is by itself non-identifying and would be a big 

1268 # problem if we scrub (consider e.g. "5 mg"). However, sometimes it is "5 

1269 # Tree Road", because it's not forced to be numeric. So we extend CRATE 

1270 # (2021-12-01) to add ScrubMethod.PHRASE_UNLESS_NUMERIC. 

1271 # (S1AddressCol.BUILDING_NAME occasionally contains numbers only, so we 

1272 # do the same thing.) 

1273) 

1274 

1275COLS_RELATIONSHIP_XREF_ID = ( 

1276 # Scrub (third-party) as full cross-references to another record. 

1277 S1RelCol.RELATED_ID_DEPRECATED, 

1278 S1RelCol.RELATED_ID, 

1279) 

1280 

1281_COLS_RELATIONSHIP_WORDS_S1 = ( 

1282 # Scrub (third-party) as words. 

1283 S1RelCol.NAME, 

1284 S1RelCol.ADDRESS_EMAIL, 

1285) 

1286_COLS_RELATIONSHIP_WORDS_CPFT = ( 

1287 # Added by CPFT: 

1288 "Surname", # surname of relative 

1289 "FirstName", # surname of relative 

1290) 

1291COLS_RELATIONSHIP_WORDS = { 

1292 SystmOneContext.TPP_SRE: _COLS_RELATIONSHIP_WORDS_S1, 

1293 SystmOneContext.CPFT_DW: _COLS_RELATIONSHIP_WORDS_S1 

1294 + _COLS_RELATIONSHIP_WORDS_CPFT, 

1295} 

1296 

1297_COLS_RELATIONSHIP_DATES_S1 = ( 

1298 # Scrub (third-party) as dates. 

1299 S1RelCol.DOB, 

1300) 

1301_COLS_RELATIONSHIP_DATES_CPFT = ( 

1302 # Added by CPFT: 

1303 "DOB", # likely duplicate of S1RelCol.DOB 

1304) 

1305COLS_RELATIONSHIP_DATES = { 

1306 SystmOneContext.TPP_SRE: _COLS_RELATIONSHIP_DATES_S1, 

1307 SystmOneContext.CPFT_DW: _COLS_RELATIONSHIP_DATES_S1 

1308 + _COLS_RELATIONSHIP_WORDS_CPFT, 

1309} 

1310 

1311COLS_RELATIONSHIP_PHRASES = ( 

1312 # Scrub (third-party) as phrases. 

1313 # Same principles as for the patient address, above. 

1314 # - For S1RelCol.ADDRESS_HOUSE_NAME, see below. 

1315 # - For S1RelCol.ADDRESS_HOUSE_NUMBER, see below. 

1316 S1RelCol.ADDRESS_ROAD, 

1317 S1RelCol.ADDRESS_LOCALITY, 

1318 S1RelCol.ADDRESS_POST_TOWN, 

1319 S1RelCol.ADDRESS_COUNTY, 

1320) 

1321COLS_RELATIONSHIP_PHRASE_UNLESS_NUMERIC = ( 

1322 S1RelCol.ADDRESS_HOUSE_NAME, 

1323 S1RelCol.ADDRESS_HOUSE_NUMBER, 

1324) 

1325 

1326COLS_RELATIONSHIP_CODES = ( 

1327 # Scrub (third-party) as codes. 

1328 S1RelCol.ADDRESS_POSTCODE, 

1329) 

1330 

1331COLS_RELATIONSHIP_NUMBERS = ( 

1332 # Scrub (third-party) as numbers. 

1333 S1RelCol.RELATED_STAFFCODE_OR_RELNHSNUM, 

1334 S1RelCol.ADDRESS_TELEPHONE, 

1335 S1RelCol.ADDRESS_WORK_TELEPHONE, 

1336 S1RelCol.ADDRESS_MOBILE_TELEPHONE, 

1337 S1RelCol.ADDRESS_FAX, 

1338) 

1339 

1340_COLS_TRUNCATE_DATE_S1 = (S1PatientCol.DOB,) 

1341_COLS_TRUNCATE_DATE_CPFT = (CPFTPatientCol.DOB,) 

1342COLS_TRUNCATE_DATE = { 

1343 SystmOneContext.TPP_SRE: _COLS_TRUNCATE_DATE_S1, 

1344 SystmOneContext.CPFT_DW: _COLS_TRUNCATE_DATE_S1 + _COLS_TRUNCATE_DATE_CPFT, 

1345} 

1346 

1347# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1348# Columns containing free text, which need to be scrubbed 

1349# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1350# ... assuming they are of string type. 

1351 

1352_FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_S1 = ( 

1353 ( 

1354 terminate(S1Table.FREETEXT), 

1355 terminate(S1GenericCol.FREETEXT), 

1356 ), # the bulk of free text; VARCHAR(MAX) 

1357 (ANYTHING, not_just_at_start(S1GenericCol.NOTES_SUFFIX)), 

1358 ("PersonAtRisk$", "ReasonForPlan$"), # free text re safeguarding 

1359 # ("ReferralIn$", "PrimaryReason$"), # only 200 chars; may be OK -- yes 

1360 ( 

1361 "SafeguardingAllegationDetails$", 

1362 "Outcome$", 

1363 ), # only 100 chars -- but OMIT whole table, as above 

1364 ("SpecialNotes$", "Note$"), # 8000 char free text 

1365) 

1366_FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = ( 

1367 # CPFT: 

1368 # - SRReferralIn renamed to S1_ReferralsIn with extra columns, 

1369 # PrimaryReason, but there are several others, like 

1370 # ReferralReasonDescription1. 

1371 # - Actually, however, on review, e.g. 

1372 # - S1_ReferralInReferralReason.ReferralReason is numeric 

1373 # - S1_ReferralInReferralReason.ReferralReasonDescription is pick-list, 

1374 # not free text 

1375 # - Similarly for referrals out. 

1376 # and "Other" is scrubbed by the generic scrubber, so that messes up 

1377 # useful data in the descriptions. So not this: 

1378 # 

1379 # (".*Referral", ".*Reason"), 

1380 # A bunch of explicitly free-text fields: 

1381 # - any not-otherwise-handled textual field in a table named "FreeText_..." 

1382 ( 

1383 S1Table.FREETEXT, 

1384 ANYTHING, 

1385 ), # table name not terminated so allows anything starting with this 

1386 # - any field named "FreeText..." (e.g. S1_Honos_Scores.FreeText) 

1387 (ANYTHING, S1GenericCol.FREETEXT), 

1388 # - S1_CYPFRS_TelephoneTriage links in a bunch of things from S1_FreeText. 

1389 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Assessment"), 

1390 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Presentation"), 

1391 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Risk"), 

1392 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Strength"), 

1393 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Intervention"), 

1394 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Problem"), 

1395 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Social"), 

1396 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Advice"), 

1397 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Clinical"), 

1398 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*Outcome"), 

1399 (CPFTTable.CYP_FRS_TELEPHONE_TRIAGE, ".*NextOfKin"), 

1400) 

1401FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS = { 

1402 SystmOneContext.TPP_SRE: _FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_S1, 

1403 SystmOneContext.CPFT_DW: _FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_S1 

1404 + _FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS_CPFT, 

1405} 

1406 

1407_EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_S1 = ( 

1408 # Things that look like free text, but aren't. 

1409 ("AnsweredQuestionnaire$", "QuestionnaireName$"), 

1410 (ANYTHING, S1GenericCol.CTV3_CODE), # common clinical coding 

1411 (ANYTHING, S1GenericCol.CTV3_TEXT), # common clinical coding 

1412) 

1413_EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = ( 

1414 # These contain non-patient data -- instead, the stock text of 

1415 # questionnaires: 

1416 ("FreeText_Honos_Scoring_Answers", ANYTHING), 

1417 ("FreeText_Honos_Scoring_Questions", ANYTHING), 

1418 ("FreeText_SWEMWBS", ANYTHING), 

1419 ("FreeText_SWEMWBS_Scores", ANYTHING), 

1420 ("FreeText_WEMWBS", ANYTHING), 

1421 # Should not be identifying: 

1422 (ANYTHING, "Ethnicity$"), 

1423) 

1424EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS = { 

1425 SystmOneContext.TPP_SRE: ( 

1426 _EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_S1 

1427 ), 

1428 SystmOneContext.CPFT_DW: ( 

1429 _EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_S1 

1430 + _EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS_CPFT 

1431 ), 

1432} 

1433 

1434# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1435# Columns to index 

1436# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1437 

1438FULLTEXT_INDEX_TABLENAME_COLNAME_REGEX_PAIRS_S1 = ( 

1439 # Subset of free-text columns where we would want to implement a FULLTEXT 

1440 # index. 

1441 ("FreeText$", "FreeText$"), # the bulk of free text; VARCHAR(MAX) 

1442) 

1443 

1444_NORMAL_INDEX_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = ( 

1445 # S1_Patient.IDPatient: Added by CPFT. Duplicate of RowIdentifier. 

1446 # But likely to be used by researchers, so should be indexed. 

1447 (terminate(S1Table.PATIENT), terminate(S1GenericCol.PATIENT_ID)), 

1448) 

1449NORMAL_INDEX_TABLENAME_COLNAME_REGEX_PAIRS = { 

1450 SystmOneContext.TPP_SRE: (), 

1451 SystmOneContext.CPFT_DW: _NORMAL_INDEX_TABLENAME_COLNAME_REGEX_PAIRS_CPFT, 

1452} 

1453 

1454_GENERIC_COLS_TO_INDEX_S1 = ( 

1455 # Generically sensible things to index. 

1456 S1GenericCol.CTV3_CODE, # common clinical coding 

1457 S1GenericCol.CTV3_TEXT, # common clinical coding 

1458 S1GenericCol.EVENT_ID, # a common FK 

1459 S1GenericCol.EVENT_OCCURRED_WHEN, # when did it happen? 

1460 S1GenericCol.QUESTIONNAIRE_ID, # a common FK 

1461 S1GenericCol.REFERRAL_ID, # a common FK 

1462 S1GenericCol.SNOMED_CODE, # common clinical coding 

1463 S1GenericCol.SNOMED_TEXT, # common clinical coding 

1464) 

1465_GENERIC_COLS_TO_INDEX_CPFT = (CPFTGenericCol.PATIENT_ID_SYNONYM_1,) 

1466GENERIC_COLS_TO_INDEX = { 

1467 SystmOneContext.TPP_SRE: _GENERIC_COLS_TO_INDEX_S1, 

1468 SystmOneContext.CPFT_DW: _GENERIC_COLS_TO_INDEX_S1 

1469 + _GENERIC_COLS_TO_INDEX_CPFT, 

1470} 

1471 

1472# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1473# Columns that are (or are not) PKs 

1474# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1475 

1476_PK_TABLENAME_COLNAME_REGEX_PAIRS_S1 = ( 

1477 # If CRATE inserts its PK field somewhere, it's a PK. 

1478 (ANYTHING, terminate(CRATE_COL_PK)), 

1479) 

1480_PK_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = ( 

1481 # Primary key fields with non-standard names. 

1482 # Note that some are CPFT-created tables, which is why they don't follow 

1483 # standard conventions. 

1484 ("3DayFollowUp", "IDHospitalAdmissionAndDischarge"), 

1485 ("ActivityEvent_EventDuration", S1GenericCol.EVENT_ID), 

1486 # AQ_* -- no obvious PK. 

1487 ("_CGAS_PairedScore_By_Referral", S1GenericCol.REFERRAL_ID), 

1488 ( 

1489 "CarePlan", 

1490 "CarePlanID", 

1491 ), # includes CarePlanDetail, modified from the source 

1492 # ... CarePlanReview has IDCarePlan but that is not unique. 

1493 ("Caseload", "IDReferralIn"), 

1494 # ClinicalDashboard*: a bunch of CPFT things; no obvious PK. 

1495 # ClinicalMeasure*: a bunch of CPFT derived things; no obvious PK. 

1496 # ClinicalOutcome*: a bunch of CPFT derived things; no obvious PK. 

1497 # ContactsArchive_LatestStaffContact: no obvious PK. 

1498 ("ContactsArchive_NonLegalContacts", S1GenericCol.EVENT_ID), 

1499 ("ContactsArchive_TotalsByReferral", S1GenericCol.REFERRAL_ID), 

1500 # Contacts_CYP_PlanMetric_7: no obvious PK. 

1501 ("Contacts_CarerStatus_MH", S1GenericCol.PATIENT_ID), 

1502 ("Contacts_CarerStatus_PH", S1GenericCol.PATIENT_ID), 

1503 # CoronaVirus: no obvious PK (maybe combination of IDPatient, IDReferralIn) 

1504 # CurrentInpatientDashboard_Doctors: *almost* IDPatient, IDReferralIn but 

1505 # one extra row 

1506 # Deaths: no obvious PK, despite "ClientID" (not unique) 

1507 (CPFTTable.DEMOGRAPHICS, S1GenericCol.PATIENT_ID), 

1508 # DischargeDelay_Fact: no obvious PK. 

1509 # EuroQol*: no PKs 

1510 # FACT_Inp_Data: no obvious PK. 

1511 # Falls_AtRiskState*: no PKs 

1512 # FreeText_* [CPFT derived free text tables]: none in the first, not all 

1513 # explored 

1514 # GateKeeping: no obvious PK. 

1515 # Honos_Scores: no PK 

1516 # ICW_PTL: no PK 

1517 # Immunisation: no obvious PK 

1518 # InpatientLeave: no obvious PK 

1519 # Inpatient_NorthwickParkIndex: no obvious PK 

1520 # LADSAdults_Output: no obvious PK 

1521 ("LADSAdultsQuestionnaires", S1GenericCol.QUESTIONNAIRE_ID), 

1522 # LADSCYPHS_Output: no obvious PK 

1523 ("LADSCYPQuestionnaires", S1GenericCol.QUESTIONNAIRE_ID), 

1524 ("MDT_Caseload", S1GenericCol.REFERRAL_ID), 

1525 # OutOfHoursSRCodeInformation: no PK 

1526 # PRISM_ReReferral: no PK 

1527 # PatientAnsweredQuestionnaireInformation: IDPatient currently unique but I 

1528 # strongly suspect only temporarily 

1529 # PatientContact: no PK 

1530 # PatientEthnicity: no PK 

1531 ("PatientGPPractice", S1GenericCol.PATIENT_ID), 

1532 # PatientLanguageDeathOptions: no PK 

1533 # PatientLetterInformation: IDPatient currently unique but I strongly 

1534 # suspect only temporarily 

1535 ("PatientOverview", S1GenericCol.REFERRAL_ID), 

1536 # PatientRelationship: no PK (modified from the source) 

1537 ("PatientRelationshipMother", S1GenericCol.PATIENT_ID), 

1538 # PatientSRCodeInformation: no PK 

1539 # PhysicalHealthChecks*: no PK (IDPatient currently unique in 

1540 # S1_PhysicalHealthChecks_CQUIN but unlikely to remain so) 

1541 # QRisk: no PK 

1542 ( 

1543 "ReferralInIntervention", 

1544 S1GenericCol.REFERRAL_ID, 

1545 ), # modified from the source 

1546 ("Vanguard", "ReferralNumber"), 

1547 # eDSM: no PK 

1548) 

1549PK_TABLENAME_COLNAME_REGEX_PAIRS = { 

1550 SystmOneContext.TPP_SRE: _PK_TABLENAME_COLNAME_REGEX_PAIRS_S1, 

1551 SystmOneContext.CPFT_DW: _PK_TABLENAME_COLNAME_REGEX_PAIRS_S1 

1552 + _PK_TABLENAME_COLNAME_REGEX_PAIRS_CPFT, 

1553} 

1554 

1555_NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_S1 = tuple( 

1556 # Tables in which "RowIdentifier" (S1GenericCol.ROW_ID) is not unique. 

1557 (terminate(t), S1GenericCol.ROW_ID) 

1558 for t in ( 

1559 S1Table.ACTIVITY_EVENT, 

1560 S1Table.CARE_PLAN_REVIEW, 

1561 S1Table.DISCHARGE_DELAY, 

1562 S1Table.FREETEXT, # see also TABLES_REQUIRING_CRATE_PK_REGEX above 

1563 S1Table.BED_CLOSURE, 

1564 S1Table.MENTAL_HEALTH_ACT_APPEAL, 

1565 S1Table.MENTAL_HEALTH_ACT_AWOL, 

1566 S1Table.TASK, 

1567 ) 

1568) + tuple( 

1569 # Also, if we insert a CRATE PK, then the "RowIdentifier" can't be the PK. 

1570 (t_regex, S1GenericCol.ROW_ID) 

1571 for t_regex in TABLES_REQUIRING_CRATE_PK_REGEX 

1572) 

1573_NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_CPFT = tuple( 

1574 # Similarly. These tables have things that look like PKs, but gave rise to 

1575 # a "Violation of PRIMARY KEY constraint" error, so they aren't. This 

1576 # happens when someone in CPFT maps e.g. "RowIdentifier" in an unusual way. 

1577 (t, S1GenericCol.ROW_ID) 

1578 for t in ( 

1579 terminate("Child_At_Risk"), 

1580 "InpatientBedStay", # includes InpatientBedStay_Old, etc. 

1581 terminate(CPFTTable.CONTACTS_ARCHIVE), 

1582 terminate(CPFTTable.CONTACTS_ARCHIVE_CLIENT_SEQUENCE), 

1583 ) 

1584) 

1585NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS = { 

1586 SystmOneContext.TPP_SRE: _NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_S1, 

1587 SystmOneContext.CPFT_DW: _NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_S1 

1588 + _NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS_CPFT, 

1589} 

1590 

1591 

1592# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1593# Columns containing opt-out information 

1594# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

1595 

1596_OPT_OUT_TABLENAME_COLNAME_PAIRS_S1 = ( 

1597 # Test patients -- not an "opt out" but an "invalid patient" marker: 

1598 (CrateView.TESTPATIENT_VIEW, CrateS1ViewCol.IS_TEST_PATIENT), 

1599 # Note that the main patient table is handled separately/explicitly. 

1600) 

1601_OPT_OUT_TABLENAME_COLNAME_PAIRS_CPFT = ( 

1602 # CPFT Research Database local opt-out: 

1603 ("ClinicalOutcome_ConsentResearch_OptOutCheck", "SNOMEDCode"), 

1604) 

1605OPT_OUT_TABLENAME_COLNAME_PAIRS = { 

1606 SystmOneContext.TPP_SRE: _OPT_OUT_TABLENAME_COLNAME_PAIRS_S1, 

1607 SystmOneContext.CPFT_DW: _OPT_OUT_TABLENAME_COLNAME_PAIRS_S1 

1608 + _OPT_OUT_TABLENAME_COLNAME_PAIRS_CPFT, 

1609} 

1610 

1611 

1612# ============================================================================= 

1613# String comparison helper functions 

1614# ============================================================================= 

1615 

1616# ----------------------------------------------------------------------------- 

1617# Plain strings 

1618# ----------------------------------------------------------------------------- 

1619 

1620 

1621def eq(x: str, y: str) -> bool: 

1622 """ 

1623 Case-insensitive string comparison. 

1624 """ 

1625 return x.lower() == y.lower() 

1626 

1627 

1628def tcmatch(table1: str, column1: str, table2: str, column2: str) -> bool: 

1629 """ 

1630 Equal (in case-insensitive fashion) for table and column? 

1631 """ 

1632 return eq(table1, table2) and eq(column1, column2) 

1633 

1634 

1635def is_in(x: str, y: Iterable[str]) -> bool: 

1636 """ 

1637 Case-insensitive version of "in", to replace "if x in y". 

1638 """ 

1639 return any(eq(x, test) for test in y) 

1640 

1641 

1642def is_pair_in(a: str, b: str, y: Iterable[Tuple[str, str]]) -> bool: 

1643 """ 

1644 Case-insensitive version of "in", to replace "if a, b in y". 

1645 """ 

1646 return any(eq(a, test_a) and eq(b, test_b) for test_a, test_b in y) 

1647 

1648 

1649# ----------------------------------------------------------------------------- 

1650# Regular expressions 

1651# ----------------------------------------------------------------------------- 

1652 

1653 

1654def eq_re(x: str, y_regex: str) -> bool: 

1655 """ 

1656 Returns True if the regex matches at the start of the string. 

1657 """ 

1658 return bool(re.match(y_regex, x, flags=re.IGNORECASE)) 

1659 

1660 

1661def is_in_re(x: str, y_regexes: Iterable[str]) -> bool: 

1662 """ 

1663 Case-insensitive regex-based version of "in", to replace "if x in y". 

1664 """ 

1665 return any(eq_re(x, test) for test in y_regexes) 

1666 

1667 

1668def is_pair_in_re( 

1669 a: str, b: str, y_regexes: Iterable[Tuple[str, str]] 

1670) -> bool: 

1671 """ 

1672 Case-insensitive regex-based version of "in", to replace "if a, b in y". 

1673 """ 

1674 return any( 

1675 eq_re(a, test_a) and eq_re(b, test_b) for test_a, test_b in y_regexes 

1676 ) 

1677 

1678 

1679# ============================================================================= 

1680# Table/column name interpretation 

1681# ============================================================================= 

1682 

1683 

1684def tablename_prefix(context: SystmOneContext) -> str: 

1685 """ 

1686 The tablename prefix in the given context. 

1687 """ 

1688 try: 

1689 return TABLE_PREFIXES[context] 

1690 except KeyError: 

1691 raise KeyError(f"Unknown SystmOne context: {context}") 

1692 

1693 

1694def core_tablename( 

1695 tablename: str, 

1696 from_context: SystmOneContext, 

1697 allow_unprefixed: bool = False, 

1698) -> str: 

1699 """ 

1700 Is this a table of an expected format that we will consider? 

1701 - If so, returns the "core" part of the tablename, in the given context. 

1702 - Otherwise, if ``allow_unprefixed`` return the input. 

1703 - Otherwise, return an empty string. 

1704 """ 

1705 prefix = tablename_prefix(from_context) 

1706 if not tablename.startswith(prefix): 

1707 if is_in_re(tablename, INCLUDE_TABLES_REGEX[from_context]): 

1708 return tablename 

1709 else: 

1710 warn_once(f"Unrecognized table name style: {tablename}", log) 

1711 if allow_unprefixed: 

1712 return tablename 

1713 else: 

1714 return "" 

1715 rest = tablename[len(prefix) :] 

1716 if not rest: 

1717 raise ValueError(f"Table name {tablename!r} only contains its prefix") 

1718 xlate = CONTEXT_TO_CORE_TABLE_TRANSLATIONS[from_context] 

1719 return xlate.get(rest) or rest 

1720 

1721 

1722def contextual_tablename( 

1723 tablename_core: str, to_context: SystmOneContext 

1724) -> str: 

1725 """ 

1726 Prefixes the "core" table name for a given context, and sometimes 

1727 translates it too. 

1728 """ 

1729 prefix = tablename_prefix(to_context) 

1730 xlate = CORE_TO_CONTEXT_TABLE_TRANSLATIONS[to_context] 

1731 translated = xlate.get(tablename_core) 

1732 tablename = translated if translated else tablename_core 

1733 return f"{prefix}{tablename}" 

1734 

1735 

1736def translate_tablename( 

1737 from_tablename: str, 

1738 from_context: SystmOneContext, 

1739 to_context: SystmOneContext, 

1740): 

1741 """ 

1742 Translates a table name from one S1 context to another. 

1743 """ 

1744 coretab = core_tablename(from_tablename, from_context=from_context) 

1745 return contextual_tablename(coretab, to_context=to_context) 

1746 

1747 

1748def cpft_s1_tablename(core_tablename: str) -> str: 

1749 """ 

1750 Helper function for the consent-for-contact system, but conceptually it 

1751 sits reasonably well here. 

1752 

1753 Args: 

1754 core_tablename: 

1755 Table name in S1 "core" format (devoid of any prefix). 

1756 

1757 Returns: 

1758 Returns the local CPFT table name. 

1759 """ 

1760 return contextual_tablename( 

1761 core_tablename, 

1762 to_context=SystmOneContext.CPFT_DW, 

1763 ) 

1764 

1765 

1766def core_columnname( 

1767 tablename_core: str, columnname_context: str, from_context: SystmOneContext 

1768) -> str: 

1769 """ 

1770 Some contexts rename their column names. This function puts them back into 

1771 the "core" (TPP SRE) name space. 

1772 """ 

1773 xlate = CONTEXT_TO_CORE_CONTEXT_COLUMN_TRANSLATIONS[from_context] 

1774 return ( 

1775 xlate.get((tablename_core, columnname_context)) or columnname_context 

1776 ) 

1777 

1778 

1779def contextual_columnname( 

1780 tablename_core: str, columnname_core: str, to_context: SystmOneContext 

1781) -> str: 

1782 """ 

1783 Translates a "core" column name to its contextual variant, if applicable. 

1784 """ 

1785 xlate = CORE_TO_CONTEXT_COLUMN_TRANSLATIONS[to_context] 

1786 return xlate.get((tablename_core, columnname_core)) or columnname_core 

1787 

1788 

1789# ============================================================================= 

1790# Comments 

1791# ============================================================================= 

1792 

1793 

1794def join_comments(comments: List[str]) -> str: 

1795 """ 

1796 Joins comment elements, skipping any blanks. 

1797 """ 

1798 return COMMENT_SEP.join(filter(None, comments)) 

1799 

1800 

1801# ============================================================================= 

1802# Helper classes 

1803# ============================================================================= 

1804 

1805 

1806class SystmOneSRESpecRow: 

1807 """ 

1808 Represents a row in the SystmOne SRE specification CSV file. 

1809 """ 

1810 

1811 def __init__(self, d: Dict[str, Any]) -> None: 

1812 """ 

1813 Initialize with a row dictionary from a :class:`csv.DictReader`. 

1814 """ 

1815 self.table_name = d["TableName"] # type: str 

1816 self.table_description = d["TableDescription"] # type: str 

1817 self.column_name = d["ColumnName"] # type: str 

1818 self.column_description = d["ColumnDescription"] # type: str 

1819 self.column_data_type = d["ColumnDataType"] # type: str 

1820 _col_length = d["ColumnLength"] 

1821 self.column_length = int(_col_length) if _col_length else None 

1822 _date_def = d["DateDefining"] # type: str 

1823 if _date_def == "Yes": 

1824 self.date_defining = True 

1825 elif _date_def == "No": 

1826 self.date_defining = False 

1827 else: 

1828 raise ValueError(f"Bad DateDefining field: {_date_def!r}") 

1829 self.column_ordinal = int(d["ColumnOrdinal"]) 

1830 self.linked_table = d["LinkedTable"] # type: str 

1831 self.linked_column_1 = d["LinkedColumn1"] # type: str 

1832 self.linked_column_2 = d["LinkedColumn2"] # type: str 

1833 

1834 @property 

1835 def tablename_core(self) -> str: 

1836 """ 

1837 Core part of the tablename. 

1838 """ 

1839 return core_tablename( 

1840 self.table_name, from_context=SystmOneContext.TPP_SRE 

1841 ) 

1842 

1843 @property 

1844 def linked_table_core(self) -> str: 

1845 """ 

1846 Core part of the linked table name. 

1847 """ 

1848 return core_tablename( 

1849 self.linked_table, from_context=SystmOneContext.TPP_SRE 

1850 ) 

1851 

1852 def comment( 

1853 self, context: SystmOneContext, with_table: bool = True 

1854 ) -> str: 

1855 """ 

1856 Used to generate a comment for the CRATE data dictionary. 

1857 

1858 Args: 

1859 context: 

1860 The SystmOneContext in which data is being processed. 

1861 with_table: 

1862 Include information about the table. 

1863 """ 

1864 if with_table: 

1865 elements = [ 

1866 f"TABLE: {self.table_description}", 

1867 f"COLUMN: {self.column_description}", 

1868 ] 

1869 else: 

1870 elements = [f"{self.column_description}"] 

1871 if self.linked_table: 

1872 context_prefix = tablename_prefix(context) 

1873 linked_table_ctx = f"{context_prefix}{self.linked_table_core}" 

1874 links = [] # type: List[str] 

1875 if self.linked_column_1: 

1876 links.append( 

1877 f"FOREIGN KEY TO " 

1878 f"{linked_table_ctx}.{self.linked_column_1}" 

1879 ) 

1880 if self.linked_column_2: 

1881 links.append(f"WITH {linked_table_ctx}.{self.linked_column_2}") 

1882 elements.append(" ".join(links)) 

1883 return join_comments(elements) 

1884 

1885 def description( 

1886 self, context: SystmOneContext, with_table: bool = True 

1887 ) -> str: 

1888 """ 

1889 Full description line. 

1890 

1891 Args: 

1892 context: 

1893 The SystmOneContext in which data is being processed. 

1894 with_table: 

1895 Include information about the table. 

1896 """ 

1897 tname = contextual_tablename(self.tablename_core, context) 

1898 cname = contextual_columnname( 

1899 self.tablename_core, self.column_name, context 

1900 ) 

1901 elements = [ 

1902 f"{tname}.{cname}", 

1903 self.comment(context, with_table=with_table), 

1904 ] 

1905 return join_comments(elements) 

1906 

1907 

1908class SystmOneSRESpecs: 

1909 """ 

1910 Loads and represents the SystmOne SRE specifications. 

1911 """ 

1912 

1913 def __init__(self, context: SystmOneContext, filename: str) -> None: 

1914 """ 

1915 Initialize by reading a SystmOne SRE specification CSV file. 

1916 

1917 context: 

1918 The context from which SystmOne data is being extracted (e.g. the 

1919 raw TPP Strategic Reporting Extract (SRE), or a local version 

1920 processed into CPFT's Data Warehouse). 

1921 filename: 

1922 Optional filename for the TPP SRE specification file, in 

1923 comma-separated value (CSV) format. 

1924 """ 

1925 self.context = context 

1926 self.filename = filename 

1927 # All specifications: 

1928 self.specs = [] # type: List[SystmOneSRESpecRow] 

1929 # Map from (tablename_core, columnname) to SystmOneSRESpecRow: 

1930 self.tabcolmap = {} # type: Dict[Tuple[str, str], SystmOneSRESpecRow] 

1931 # Map from tablename_core to table comment: 

1932 self.tabcomments = {} # type: Dict[str, str] 

1933 if filename: 

1934 log.info(f"Reading SystmOne SRE specs from: {filename}") 

1935 with open(filename, "r") as f: 

1936 reader = csv.DictReader(f) 

1937 for rowdict in reader: 

1938 s = SystmOneSRESpecRow(rowdict) 

1939 self.specs.append(s) 

1940 self.tabcolmap[s.tablename_core, s.column_name] = s 

1941 self.tabcomments.setdefault( 

1942 s.tablename_core, s.table_description 

1943 ) 

1944 

1945 def debug_specs(self) -> None: 

1946 """ 

1947 Print the specs to the debugging log. 

1948 """ 

1949 specs_str = "\n".join( 

1950 spec.description(self.context) for spec in self.specs 

1951 ) 

1952 log.debug(f"SystmOne specs:\n{specs_str}") 

1953 

1954 def get_spec_row( 

1955 self, tablename_core: str, columnname: str 

1956 ) -> SystmOneSRESpecRow: 

1957 """ 

1958 Look up a row specification. 

1959 """ 

1960 return self.tabcolmap.get((tablename_core, columnname)) 

1961 

1962 def table_comment(self, tablename_core: str) -> str: 

1963 """ 

1964 Returns the table description/comment for a given table, if known, or 

1965 a blank string. 

1966 """ 

1967 return self.tabcomments.get(tablename_core) or "" 

1968 

1969 

1970@dataclass 

1971class ScrubSrcAlterMethodInfo: 

1972 """ 

1973 For describing scrub-source and alter-method information. 

1974 """ 

1975 

1976 change_comment_and_indexing_only: bool = False 

1977 src_flags: str = "" 

1978 scrub_src: Optional[ScrubSrc] = None 

1979 scrub_method: Optional[ScrubMethod] = None 

1980 decision: Decision = Decision.OMIT 

1981 alter_methods: List[AlterMethod] = field(default_factory=list) 

1982 dest_datatype: str = None 

1983 dest_field: str = None 

1984 

1985 def __post_init__(self) -> None: 

1986 """ 

1987 Validation. 

1988 """ 

1989 for char in self.src_flags: 

1990 assert any(char == enum.value for enum in SrcFlag) 

1991 

1992 def add_src_flag(self, flag: SrcFlag) -> None: 

1993 """ 

1994 Add a flag, if it doesn't exist already. 

1995 """ 

1996 flagchar = flag.value 

1997 if flagchar not in self.src_flags: 

1998 self.src_flags += flagchar 

1999 

2000 def add_alter_method(self, alter_method: AlterMethod) -> None: 

2001 """ 

2002 Adds an alteration method. 

2003 """ 

2004 self.alter_methods.append(alter_method) 

2005 

2006 def include(self) -> None: 

2007 """ 

2008 Sets the decision to "include". 

2009 """ 

2010 self.decision = Decision.INCLUDE 

2011 

2012 def omit(self) -> None: 

2013 """ 

2014 Sets the decision to "omit". 

2015 """ 

2016 self.decision = Decision.OMIT 

2017 

2018 

2019# ============================================================================= 

2020# Feature detection 

2021# ============================================================================= 

2022 

2023 

2024def is_master_patient_table(tablename: str) -> bool: 

2025 """ 

2026 Is this the master patient table? 

2027 """ 

2028 return eq(tablename, S1Table.PATIENT) 

2029 

2030 

2031def is_pid(colname: str, context: SystmOneContext) -> bool: 

2032 """ 

2033 Is this column the SystmOne primary patient identifier (PID)? 

2034 

2035 It's nearly always S1GenericCol.PID. But occasionally something else 

2036 (e.g. in CPFT-created tables). 

2037 

2038 This works for all tables EXCEPT the main "Patient" table, where the PK 

2039 takes its place. 

2040 

2041 Occasionally, CPFT tables blend SystmOne patients with other patients using 

2042 IDs from other EHR systems. However, those patients won't be in our master 

2043 patient index, so their data won't be brought through. 

2044 """ 

2045 return is_in(colname, PID_SYNONYMS[context]) 

2046 

2047 

2048def is_mpid(colname: str, context: SystmOneContext) -> bool: 

2049 """ 

2050 Is this column the master patient identifier (MPID), i.e. the NHS number? 

2051 """ 

2052 return is_in(colname, MPID_SYNONYMS[context]) 

2053 

2054 

2055def is_other_system_id(colname: str, context: SystmOneContext) -> bool: 

2056 """ 

2057 Is this column an ID from another system (e.g. RiO, PCMIS)? 

2058 """ 

2059 return is_in(colname, OTHER_SYSTEM_IDS[context]) 

2060 

2061 

2062def is_pk( 

2063 tablename: str, 

2064 colname: str, 

2065 context: SystmOneContext, 

2066 ddr: DataDictionaryRow = None, 

2067) -> bool: 

2068 """ 

2069 Is this a primary key (PK) column within its table? 

2070 """ 

2071 # This check is debatable. It's possible that the source database has 

2072 # columns that are NULLable but are in fact never null and are PKs. Indeed, 

2073 # that is the case; e.g. S1_FreeText.RowIdentifier is shown as "bigint, 

2074 # null" in SQL Server Manager, but "SELECT COUNT(*) FROM S1_FreeText WHERE 

2075 # RowIdentifier IS NULL" gives 0 rows, out of ~1M rows in total. 

2076 # 

2077 # if ddr.src_reflected_nullable: 

2078 # return False # can't be a PK if it can be NULL 

2079 

2080 # 1. If it's explicitly ruled out as a PK (e.g. it has the name that should 

2081 # mean it's a PK, but it's been messed with locally, or the TPP design 

2082 # team were having an off day), then it's not a PK. 

2083 if is_pair_in_re( 

2084 tablename, colname, NOT_PK_TABLENAME_COLNAME_REGEX_PAIRS[context] 

2085 ): 

2086 return False 

2087 # 2. If the source database says so (ours never does). 

2088 if ddr and ddr.pk: 

2089 return True 

2090 # 3. If it has the standard column name, i.e. RowIdentifier, then it's 

2091 # a PK. 

2092 if eq(colname, S1GenericCol.ROW_ID): 

2093 return True 

2094 # 4. If it's a specifically noted PK. 

2095 return is_pair_in_re( 

2096 tablename, colname, PK_TABLENAME_COLNAME_REGEX_PAIRS[context] 

2097 ) 

2098 

2099 

2100def is_free_text( 

2101 tablename: str, 

2102 colname: str, 

2103 context: SystmOneContext, 

2104 ddr: DataDictionaryRow = None, 

2105) -> bool: 

2106 """ 

2107 Is this a free-text field requiring scrubbing? 

2108 

2109 Unusually, there is not very much free text, and it is mostly collated. 

2110 (We haven't added binary support yet. Do we have the binary documents?) 

2111 """ 

2112 if ddr and not ddr.src_is_textual: 

2113 return False 

2114 return is_pair_in_re( 

2115 tablename, colname, FREETEXT_TABLENAME_COLNAME_REGEX_PAIRS[context] 

2116 ) and not is_pair_in_re( 

2117 tablename, 

2118 colname, 

2119 EXEMPT_FROM_SCRUBBING_TABLENAME_COLNAME_REGEX_PAIRS[context], 

2120 ) 

2121 

2122 

2123def should_be_fulltext_indexed(tablename: str, colname: str) -> bool: 

2124 """ 

2125 Is this a field that should get a FULLTEXT index? That's not just "a column 

2126 that contains free text and should be scrubbed", that is "a column with a 

2127 lot of interesting free text that should get a special index". 

2128 """ 

2129 return is_pair_in_re( 

2130 tablename, colname, FULLTEXT_INDEX_TABLENAME_COLNAME_REGEX_PAIRS_S1 

2131 ) 

2132 

2133 

2134# ============================================================================= 

2135# Deciding about columns 

2136# ============================================================================= 

2137 

2138 

2139def process_generic_table_column( 

2140 tablename: str, 

2141 colname: str, 

2142 ddr: DataDictionaryRow, 

2143 ssi: ScrubSrcAlterMethodInfo, 

2144 context: SystmOneContext, 

2145) -> bool: 

2146 """ 

2147 Performs operations applicable to columns any SystmOne table, except a few 

2148 very special ones like Patient. Modifies ``ssi`` in place. 

2149 

2150 Returns: recognized and dealt with? 

2151 """ 

2152 # ------------------------------------------------------------------------- 

2153 # Generic table 

2154 # ------------------------------------------------------------------------- 

2155 

2156 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

2157 # PKs, PIDs, MPIDs (which can overlap, e.g. a PK that is a PID) 

2158 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

2159 handled = False 

2160 if is_pk(tablename, colname, context, ddr): 

2161 # PK for all tables. 

2162 ssi.add_src_flag(SrcFlag.PK) 

2163 ssi.add_src_flag(SrcFlag.NOT_NULL) 

2164 ssi.add_src_flag(SrcFlag.ADD_SRC_HASH) 

2165 ssi.include() 

2166 handled = True 

2167 

2168 # PKs can also be other things: 

2169 

2170 if is_pair_in( 

2171 tablename, colname, OPT_OUT_TABLENAME_COLNAME_PAIRS[context] 

2172 ): 

2173 ssi.add_src_flag(SrcFlag.OPT_OUT) 

2174 handled = True 

2175 

2176 if is_pid(colname, context): 

2177 # FK to Patient.RowIdentifier for all other patient-related tables. 

2178 ssi.add_src_flag(SrcFlag.PRIMARY_PID) 

2179 ssi.dest_field = ddr.config.research_id_fieldname 

2180 ssi.include() 

2181 handled = True 

2182 

2183 elif is_mpid(colname, context): 

2184 # An NHS number in a random table. OK, as long as we hash it. 

2185 ssi.add_src_flag(SrcFlag.MASTER_PID) 

2186 ssi.dest_field = ddr.config.master_research_id_fieldname 

2187 ssi.include() 

2188 handled = True 

2189 

2190 elif is_other_system_id(colname, context): 

2191 # Something like a RiO or PCMIS ID. Use it to scrub, but it's not a 

2192 # PID or MPID. 

2193 ssi.scrub_src = ScrubSrc.PATIENT 

2194 ssi.scrub_method = ScrubMethod.CODE 

2195 ssi.omit() 

2196 handled = True 

2197 

2198 if handled: 

2199 return True 

2200 

2201 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

2202 # Other things 

2203 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

2204 if is_in(colname, COLS_GENERIC_EXCLUDE[context]): 

2205 # Columns that are never OK in a generic table, such as duplicated 

2206 # direct identifiers (handled specially in the master patient ID table 

2207 # etc.). 

2208 ssi.omit() 

2209 # ... likely redundant but that's not obvious within this function 

2210 

2211 elif is_in(colname, COLS_GENERIC_OK_UNMODIFIED_S1): 

2212 # Generic columns that are always OK (e.g. organization ID). 

2213 ssi.include() 

2214 

2215 elif is_free_text(tablename, colname, context, ddr): 

2216 # Free text to be scrubbed. 

2217 ssi.add_alter_method(AlterMethod(config=ddr.config, scrub=True)) 

2218 ssi.include() 

2219 

2220 elif is_in(colname, COLS_TRUNCATE_DATE[context]): 

2221 # Truncate date? 

2222 # We don't do the scrub_src and scrub_method here; we already know the 

2223 # patient's DOB from the master patient table. This code is about 

2224 # making sure that DOBs (for example) elsewhere are truncated, and 

2225 # time information doesn't leak through. 

2226 ssi.add_alter_method( 

2227 AlterMethod(config=ddr.config, truncate_date=True) 

2228 ) 

2229 ssi.dest_datatype = SQLTYPE_DATE 

2230 ssi.include() 

2231 

2232 else: 

2233 # Unrecognized. 

2234 return False 

2235 

2236 return True 

2237 

2238 

2239def get_scrub_alter_details( 

2240 tablename: str, 

2241 colname: str, 

2242 ddr: DataDictionaryRow, 

2243 context: SystmOneContext, 

2244 include_generic: bool = False, 

2245) -> ScrubSrcAlterMethodInfo: 

2246 """ 

2247 The main "thinking" function. 

2248 

2249 Is this a sensitive field that should be used for scrubbing? 

2250 Should it be modified in transit? 

2251 

2252 Args: 

2253 tablename: 

2254 The "core" tablename being considered, without any prefix (e.g. 

2255 "Patient", not "SRPatient" or "S1_Patient"). 

2256 colname: 

2257 The database column name. 

2258 ddr: 

2259 Data dictionary row. 

2260 context: 

2261 The context from which SystmOne data is being extracted (e.g. the 

2262 raw TPP Strategic Reporting Extract (SRE), or a local version 

2263 processed into CPFT's Data Warehouse). 

2264 include_generic: 

2265 Include all fields that are not known about by this code and 

2266 treated specially? If False, the config file settings are used 

2267 (which may omit or include). If True, all such fields are included. 

2268 """ 

2269 ssi = ScrubSrcAlterMethodInfo(decision=Decision.OMIT) # omit by default 

2270 

2271 # ------------------------------------------------------------------------- 

2272 # Omit table entirely (and ignore its contents for scrubbing)? 

2273 # ------------------------------------------------------------------------- 

2274 if is_in(tablename, OMIT_AND_IGNORE_TABLES[context]) or is_in_re( 

2275 tablename, OMIT_AND_IGNORE_TABLES_REGEX[context] 

2276 ): 

2277 return ssi 

2278 

2279 # ------------------------------------------------------------------------- 

2280 # Deal with the core patient table. Many details here. 

2281 # ------------------------------------------------------------------------- 

2282 if eq(tablename, S1Table.PATIENT): 

2283 if eq(colname, S1GenericCol.ROW_ID): 

2284 # RowIdentifier: SystmOne patient ID in the master patient table. 

2285 # Hash and scrub SystmOne IDs. 

2286 ssi.add_src_flag(SrcFlag.PK) 

2287 ssi.add_src_flag(SrcFlag.NOT_NULL) 

2288 ssi.add_src_flag(SrcFlag.PRIMARY_PID) # automatically hashed 

2289 ssi.add_src_flag(SrcFlag.DEFINES_PRIMARY_PIDS) 

2290 ssi.scrub_src = ScrubSrc.PATIENT 

2291 ssi.scrub_method = ScrubMethod.NUMERIC 

2292 ssi.include() 

2293 

2294 elif eq(colname, S1GenericCol.PATIENT_ID): 

2295 # IDPatient: Added by CPFT to the master patient table. 

2296 # Needs to be hashed. Is a duplicate of RowIdentifier. 

2297 ssi.add_src_flag(SrcFlag.PRIMARY_PID) # automatically hashed 

2298 ssi.scrub_src = ScrubSrc.PATIENT 

2299 ssi.scrub_method = ScrubMethod.NUMERIC 

2300 ssi.include() 

2301 

2302 elif is_mpid(colname, context): # NHS number 

2303 # Hash and scrub NHS numbers. (Present as a text field, but is 

2304 # numeric.) 

2305 ssi.add_src_flag(SrcFlag.MASTER_PID) # automatically hashed 

2306 ssi.scrub_src = ScrubSrc.PATIENT 

2307 ssi.scrub_method = ScrubMethod.NUMERIC 

2308 ssi.dest_field = ddr.config.master_research_id_fieldname 

2309 ssi.include() 

2310 

2311 elif is_in(colname, COLS_PATIENT_WORDS): 

2312 # Scrub and omit all names. 

2313 ssi.scrub_src = ScrubSrc.PATIENT 

2314 ssi.scrub_method = ScrubMethod.WORDS 

2315 

2316 elif eq(colname, S1PatientCol.DOB): 

2317 # Truncate and scrub dates of birth. 

2318 ssi.scrub_src = ScrubSrc.PATIENT 

2319 ssi.scrub_method = ScrubMethod.DATE 

2320 ssi.add_alter_method( 

2321 AlterMethod(config=ddr.config, truncate_date=True) 

2322 ) 

2323 ssi.dest_datatype = SQLTYPE_DATE 

2324 ssi.include() 

2325 

2326 elif eq(colname, S1PatientCol.DOD): 

2327 # Include dates of death. 

2328 ssi.include() 

2329 

2330 elif eq(colname, S1PatientCol.BIRTHPLACE): 

2331 # Unusual. But: scrub birthplace. 

2332 ssi.scrub_src = ScrubSrc.PATIENT 

2333 ssi.scrub_method = ScrubMethod.WORDS 

2334 

2335 elif eq(colname, S1PatientCol.TESTPATIENT): 

2336 # Exclude test patients. 

2337 ssi.add_src_flag(SrcFlag.OPT_OUT) 

2338 ssi.include() 

2339 

2340 elif eq(colname, S1PatientCol.SOCIAL_SERVICES_REF): 

2341 # Scrub and omit Social Services ID (text). 

2342 ssi.scrub_src = ScrubSrc.PATIENT 

2343 ssi.scrub_method = ScrubMethod.CODE 

2344 ssi.omit() # just to be explicit 

2345 

2346 elif is_in(colname, COLS_PATIENT_TABLE_OK_UNMODIFIED[context]): 

2347 # These are OK. 

2348 ssi.include() 

2349 

2350 else: 

2351 # If anything else is put into this table, it may be sensitive. 

2352 pass # omit anything else in the master patient table 

2353 

2354 # Via a separate "if" statement: 

2355 if is_in(colname, COLS_REQUIRED_SCRUBBERS): 

2356 ssi.add_src_flag(SrcFlag.REQUIRED_SCRUBBER) 

2357 

2358 return ssi 

2359 

2360 # ------------------------------------------------------------------------- 

2361 # Proceed for all other tables. 

2362 # ------------------------------------------------------------------------- 

2363 handled = process_generic_table_column( 

2364 tablename=tablename, colname=colname, ddr=ddr, ssi=ssi, context=context 

2365 ) 

2366 if handled: 

2367 # Recognized and handled as a generic column. 

2368 return ssi 

2369 

2370 if eq(tablename, S1Table.ADDRESS_HISTORY): 

2371 # --------------------------------------------------------------------- 

2372 # Address table. 

2373 # --------------------------------------------------------------------- 

2374 if is_in(colname, COLS_ADDRESS_PHRASES): 

2375 ssi.scrub_src = ScrubSrc.PATIENT 

2376 ssi.scrub_method = ScrubMethod.PHRASE 

2377 

2378 elif is_in(colname, COLS_ADDRESS_PHRASE_UNLESS_NUMBER): 

2379 ssi.scrub_src = ScrubSrc.PATIENT 

2380 ssi.scrub_method = ScrubMethod.PHRASE_UNLESS_NUMERIC 

2381 

2382 elif eq(colname, S1AddressCol.POSTCODE): 

2383 ssi.scrub_src = ScrubSrc.PATIENT 

2384 ssi.scrub_method = ScrubMethod.CODE 

2385 

2386 else: 

2387 # omit anything else in the address table, e.g. 

2388 # CPFTAddressCol.POSTCODE_NOSPACE 

2389 pass 

2390 

2391 elif eq(tablename, S1Table.CONTACT_DETAILS): 

2392 # --------------------------------------------------------------------- 

2393 # Contact details table. 

2394 # --------------------------------------------------------------------- 

2395 if eq(colname, S1ContactCol.NUMBER): 

2396 # Could be patient; ?could be third party; mostly patient? 

2397 ssi.scrub_src = ScrubSrc.PATIENT 

2398 ssi.scrub_method = ScrubMethod.NUMERIC 

2399 

2400 else: 

2401 pass # omit anything else in the contact details table 

2402 

2403 elif eq(tablename, S1Table.RELATIONSHIPS): 

2404 # --------------------------------------------------------------------- 

2405 # Third-party (relationships) table. 

2406 # --------------------------------------------------------------------- 

2407 if is_in(colname, COLS_RELATIONSHIP_XREF_ID): 

2408 # "Go fetch that linked patient, and use their identity information 

2409 # as a third-party scrubber for our index patient." 

2410 ssi.scrub_src = ScrubSrc.THIRDPARTY_XREF_PID 

2411 ssi.scrub_method = ScrubMethod.NUMERIC 

2412 ssi.include() 

2413 

2414 elif is_in(colname, COLS_RELATIONSHIP_WORDS[context]): 

2415 ssi.scrub_src = ScrubSrc.THIRDPARTY 

2416 ssi.scrub_method = ScrubMethod.WORDS 

2417 

2418 elif is_in(colname, COLS_RELATIONSHIP_DATES[context]): 

2419 ssi.scrub_src = ScrubSrc.THIRDPARTY 

2420 ssi.scrub_method = ScrubMethod.DATE 

2421 

2422 elif is_in(colname, COLS_RELATIONSHIP_PHRASES): 

2423 ssi.scrub_src = ScrubSrc.THIRDPARTY 

2424 ssi.scrub_method = ScrubMethod.PHRASE 

2425 

2426 elif is_in(colname, COLS_RELATIONSHIP_PHRASE_UNLESS_NUMERIC): 

2427 ssi.scrub_src = ScrubSrc.THIRDPARTY 

2428 ssi.scrub_method = ScrubMethod.PHRASE_UNLESS_NUMERIC 

2429 

2430 elif is_in(colname, COLS_RELATIONSHIP_CODES): 

2431 ssi.scrub_src = ScrubSrc.THIRDPARTY 

2432 ssi.scrub_method = ScrubMethod.CODE 

2433 

2434 elif is_in(colname, COLS_RELATIONSHIP_NUMBERS): 

2435 ssi.scrub_src = ScrubSrc.THIRDPARTY 

2436 ssi.scrub_method = ScrubMethod.NUMERIC 

2437 

2438 elif is_in(colname, COLS_RELATIONSHIP_OK_UNMODIFIED[context]): 

2439 ssi.include() 

2440 

2441 else: 

2442 pass # omit anything unknown in the relationship table 

2443 

2444 elif context == SystmOneContext.CPFT_DW and eq( 

2445 tablename, CPFTTable.REL_MOTHER 

2446 ): 

2447 # --------------------------------------------------------------------- 

2448 # A CPFT partial duplicate table: from the relationship table where 

2449 # that relationship is "Mother". 

2450 # --------------------------------------------------------------------- 

2451 if is_in(colname, COLS_RELATIONSHIP_XREF_ID): 

2452 ssi.scrub_src = ScrubSrc.THIRDPARTY_XREF_PID 

2453 ssi.scrub_method = ScrubMethod.NUMERIC 

2454 ssi.include() 

2455 

2456 elif eq(colname, CPFTOtherCol.REL_MOTHER_COL_NHSNUM): 

2457 # Likely a duplicate as a scrubber. But that's not a problem for 

2458 # CRATE and this also marks it as something to remove. 

2459 ssi.scrub_src = ScrubSrc.THIRDPARTY 

2460 ssi.scrub_method = ScrubMethod.NUMERIC 

2461 

2462 elif is_in(colname, CPFT_REL_MOTHER_OK_UNMODIFIED): 

2463 ssi.include() 

2464 

2465 else: 

2466 pass # omit anything unown 

2467 

2468 elif tcmatch( 

2469 tablename, colname, S1Table.HOSP_AE_NUMBERS, S1HospNumCol.HOSPNUM 

2470 ): 

2471 # --------------------------------------------------------------------- 

2472 # A hospital number. 

2473 # --------------------------------------------------------------------- 

2474 ssi.scrub_src = ScrubSrc.PATIENT 

2475 ssi.scrub_method = ScrubMethod.CODE # can contain text 

2476 

2477 elif tcmatch( 

2478 tablename, 

2479 colname, 

2480 S1Table.SAFEGUARDING_PERSON_AT_RISK, 

2481 S1PatientCol.NHSNUM, 

2482 ): 

2483 # --------------------------------------------------------------------- 

2484 # Another person's NHS number. 

2485 # --------------------------------------------------------------------- 

2486 ssi.scrub_src = ScrubSrc.THIRDPARTY 

2487 ssi.scrub_method = ScrubMethod.NUMERIC 

2488 

2489 elif is_pair_in(tablename, colname, OMIT_TABLENAME_COLNAME_PAIRS_S1): 

2490 # --------------------------------------------------------------------- 

2491 # A column to omit specifically. 

2492 # --------------------------------------------------------------------- 

2493 pass # omit 

2494 

2495 else: 

2496 # --------------------------------------------------------------------- 

2497 # A generic field in a generic table. 

2498 # --------------------------------------------------------------------- 

2499 if include_generic: 

2500 ssi.include() 

2501 else: 

2502 # Don't change anything except the comment: 

2503 ssi.change_comment_and_indexing_only = True 

2504 

2505 return ssi 

2506 

2507 

2508def get_index_flag( 

2509 tablename: str, 

2510 colname: str, 

2511 ddr: DataDictionaryRow, 

2512 context: SystmOneContext, 

2513) -> Optional[IndexType]: 

2514 """ 

2515 Should this be indexed? Returns an indexing flag, or ``None`` if it should 

2516 not be indexed. 

2517 """ 

2518 if is_pk(tablename, colname, context, ddr): 

2519 # PKs should have a unique index. 

2520 return IndexType.UNIQUE 

2521 pid = is_pid(colname, context) 

2522 if is_master_patient_table(tablename) and pid: 

2523 # In the master patient table, PIDs are unique. 

2524 # (MPIDs aren't -- they can be NULL.) 

2525 return IndexType.UNIQUE 

2526 if pid or is_mpid(colname, context): 

2527 # We index all patient IDs. 

2528 return IndexType.NORMAL 

2529 if is_pair_in_re( 

2530 tablename, colname, NORMAL_INDEX_TABLENAME_COLNAME_REGEX_PAIRS[context] 

2531 ): 

2532 # Additional columns to index 

2533 return IndexType.NORMAL 

2534 if colname in GENERIC_COLS_TO_INDEX[context]: 

2535 return IndexType.NORMAL 

2536 if should_be_fulltext_indexed(tablename, colname): 

2537 # Full-text indexes 

2538 return IndexType.FULLTEXT 

2539 return IndexType.NONE 

2540 

2541 

2542# ============================================================================= 

2543# Modify a data dictionary row according to detected features 

2544# ============================================================================= 

2545 

2546 

2547class TableCommentWorking: 

2548 """ 

2549 Class used to store data temporarily about table comments, during SystmOne 

2550 data dictionary annotation. Slightly complex because 

2551 """ 

2552 

2553 def __init__( 

2554 self, 

2555 dd: DataDictionary, 

2556 specifications: SystmOneSRESpecs, 

2557 append_comments: bool = False, 

2558 allow_unprefixed_tables: bool = False, 

2559 ) -> None: 

2560 """ 

2561 Args: 

2562 dd: 

2563 The data dictionary. 

2564 specifications: 

2565 Details of the TPP SRE specifications. 

2566 append_comments: 

2567 Append comments to any that were autogenerated, rather than 

2568 replacing them. (If you use the SRE specifications, you may as 

2569 well set this to False as the SRE specification comments are 

2570 much better.) 

2571 allow_unprefixed_tables: 

2572 Permit tables that don't start with the expected contextual 

2573 prefix? Discouraged; you may get odd tables and views. 

2574 """ 

2575 self.dd = dd 

2576 self.specifications = specifications 

2577 self.append_comments = append_comments 

2578 self.allow_unprefixed_tables = allow_unprefixed_tables 

2579 

2580 # "Core" tablenames for which the DD has a table comment already: 

2581 self.coretable_comments_already_exist = set() # type: Set[str] 

2582 for ddr in dd.rows: 

2583 if not ddr.is_table_comment: 

2584 continue 

2585 tablename_core = self._get_core_tablename(ddr.src_table) 

2586 if not tablename_core: 

2587 continue 

2588 self.coretable_comments_already_exist.add(tablename_core) 

2589 

2590 # We maintain a list of extra DDRs to add: 

2591 self.extra_table_comment_rows = [] # type: List[DataDictionaryRow] 

2592 

2593 def _get_core_tablename(self, dd_tablename: str) -> str: 

2594 """ 

2595 Returns the equivalent "core" (unprefixed) table name, which we use 

2596 with our SystmOne specifications file. 

2597 

2598 If the input doesn't have the right prefix and allow_unprefixed_tables 

2599 is False, this will return a blank string. 

2600 """ 

2601 return core_tablename( 

2602 dd_tablename, 

2603 from_context=self.specifications.context, 

2604 allow_unprefixed=self.allow_unprefixed_tables, 

2605 ) 

2606 

2607 def maybe_add_table_comment(self, ddr: DataDictionaryRow): 

2608 """ 

2609 We scan each data dictionary row via this function. 

2610 

2611 - If we already have seen a comment for this table in the data 

2612 dictionary, we don't do anything, UNLESS this row is itself that 

2613 comment, and then if 

2614 

2615 - Otherwise, we add the SystmOne comment, if found, as an extra DDR, 

2616 storing it in our "extra_table_comment_rows" list. 

2617 """ 

2618 tablename_core = self._get_core_tablename(ddr.src_table) 

2619 if not tablename_core: 

2620 return 

2621 s1_table_comment = self.specifications.table_comment(tablename_core) 

2622 if not s1_table_comment: 

2623 return 

2624 if tablename_core in self.coretable_comments_already_exist: 

2625 if ddr.is_table_comment: 

2626 # This DDR is itself the table comment. Modify it. 

2627 if self.append_comments: 

2628 ddr.comment = join_comments( 

2629 [ddr.comment, s1_table_comment] 

2630 ) 

2631 else: 

2632 ddr.comment = s1_table_comment 

2633 return 

2634 # Add a new DDR for the table comment: 

2635 tc_ddr = DataDictionaryRow(self.dd.config) 

2636 tc_ddr.set_as_table_comment( 

2637 ddr.src_db, ddr.src_table, s1_table_comment 

2638 ) 

2639 self.extra_table_comment_rows.append(tc_ddr) 

2640 self.coretable_comments_already_exist.add(tablename_core) 

2641 

2642 

2643def annotate_systmone_dd_row( 

2644 ddr: DataDictionaryRow, 

2645 context: SystmOneContext, 

2646 specifications: SystmOneSRESpecs, 

2647 append_comments: bool = False, 

2648 include_generic: bool = False, 

2649 allow_unprefixed_tables: bool = False, 

2650 table_info_in_comments: bool = True, 

2651) -> None: 

2652 """ 

2653 Modifies (in place) a data dictionary row for SystmOne. 

2654 

2655 Args: 

2656 ddr: 

2657 The data dictionary row to amend. 

2658 context: 

2659 The context from which SystmOne data is being extracted (e.g. the 

2660 raw TPP Strategic Reporting Extract (SRE), or a local version 

2661 processed into CPFT's Data Warehouse). 

2662 specifications: 

2663 Details of the TPP SRE specifications. 

2664 append_comments: 

2665 Append comments to any that were autogenerated, rather than 

2666 replacing them. (If you use the SRE specifications, you may as well 

2667 set this to False as the SRE specification comments are much 

2668 better.) 

2669 include_generic: 

2670 Include all fields that are not known about by this code and 

2671 treated specially? If False, the config file settings are used 

2672 (which may omit or include). If True, all such fields are included. 

2673 allow_unprefixed_tables: 

2674 Permit tables that don't start with the expected contextual prefix? 

2675 Discouraged; you may get odd tables and views. 

2676 A few (see INCLUDE_TABLES_REGEX) are explicitly included anyway. 

2677 table_info_in_comments: 

2678 Include table descriptions in column comments? 

2679 """ 

2680 tablename = core_tablename( 

2681 ddr.src_table, 

2682 from_context=context, 

2683 allow_unprefixed=allow_unprefixed_tables, 

2684 ) 

2685 if not tablename: 

2686 # It didn't have the right prefix and allow_unprefixed_tables is False. 

2687 ddr.decision = Decision.OMIT 

2688 return 

2689 colname = core_columnname(tablename, ddr.src_field, from_context=context) 

2690 

2691 debugmsg = f"Considering: {ddr.src_table}.{ddr.src_field}" 

2692 if tablename != ddr.src_table or colname != ddr.src_field: 

2693 debugmsg += f" [translated to 'core' version: {tablename}.{colname}]" 

2694 log.debug(debugmsg) 

2695 

2696 # Do our thinking 

2697 ssi = get_scrub_alter_details( 

2698 tablename=tablename, 

2699 colname=colname, 

2700 ddr=ddr, 

2701 include_generic=include_generic, 

2702 context=context, 

2703 ) 

2704 

2705 if not ssi.change_comment_and_indexing_only: 

2706 # Source information 

2707 ddr.src_flags = ssi.src_flags 

2708 ddr.scrub_src = ssi.scrub_src 

2709 ddr.scrub_method = ssi.scrub_method 

2710 

2711 # Output decision 

2712 ddr.decision = ssi.decision 

2713 

2714 # Alterations 

2715 ddr.set_alter_methods_directly(ssi.alter_methods) 

2716 

2717 # Destination -- mostly automatic 

2718 ddr.dest_field = ssi.dest_field or ddr.dest_field 

2719 ddr.dest_datatype = ssi.dest_datatype or ddr.dest_datatype 

2720 

2721 # Indexing 

2722 ddr.index = get_index_flag(tablename, colname, ddr, context) 

2723 

2724 # Improve comment 

2725 spec = specifications.get_spec_row(tablename, colname) 

2726 if spec: 

2727 spec_comment = spec.comment(context, with_table=table_info_in_comments) 

2728 # If we have no new comment, leave the old one alone. 

2729 if spec_comment: 

2730 if append_comments: 

2731 ddr.comment = join_comments([ddr.comment, spec_comment]) 

2732 else: 

2733 ddr.comment = spec_comment 

2734 

2735 

2736# ============================================================================= 

2737# Modify a data dictionary 

2738# ============================================================================= 

2739 

2740 

2741def modify_dd_for_systmone( 

2742 dd: DataDictionary, 

2743 context: SystmOneContext, 

2744 sre_spec_csv_filename: str = "", 

2745 debug_specs: bool = False, 

2746 append_comments: bool = False, 

2747 include_generic: bool = False, 

2748 allow_unprefixed_tables: bool = False, 

2749 alter_loaded_rows: bool = False, 

2750 table_info_in_comments: bool = True, 

2751) -> None: 

2752 """ 

2753 Modifies a data dictionary in place. 

2754 

2755 Args: 

2756 dd 

2757 The data dictionary to amend. 

2758 context: 

2759 The context from which SystmOne data is being extracted (e.g. the 

2760 raw TPP Strategic Reporting Extract (SRE), or a local version 

2761 processed into CPFT's Data Warehouse). 

2762 sre_spec_csv_filename: 

2763 Optional filename for the TPP SRE specification file, in 

2764 comma-separated value (CSV) format. If present, this will be used 

2765 to add proper descriptive comments to all known fields. Highly 

2766 recommended. 

2767 debug_specs: 

2768 Report the SRE specifications to the log. 

2769 append_comments: 

2770 Append comments to any that were autogenerated, rather than 

2771 replacing them. (If you use the SRE specifications, you may as well 

2772 set this to False as the SRE specification comments are much 

2773 better.) 

2774 include_generic: 

2775 Include all fields that are not known about by this code and 

2776 treated specially? If False, the config file settings are used 

2777 (which may omit or include). If True, all such fields are included. 

2778 allow_unprefixed_tables: 

2779 Permit tables that don't start with the expected contextual prefix? 

2780 Discouraged; you may get odd tables and views. 

2781 alter_loaded_rows: 

2782 Alter rows that were loaded from disk (not read from a database)? 

2783 The default is to leave such rows untouched. 

2784 table_info_in_comments: 

2785 Include table descriptions in column comments? 

2786 """ 

2787 log.info(f"Modifying data dictionary for SystmOne. Context = {context}") 

2788 specs = SystmOneSRESpecs(context, sre_spec_csv_filename) 

2789 if debug_specs: 

2790 specs.debug_specs() 

2791 table_comment_working = TableCommentWorking( 

2792 dd=dd, 

2793 specifications=specs, 

2794 append_comments=append_comments, 

2795 allow_unprefixed_tables=allow_unprefixed_tables, 

2796 ) 

2797 for ddr in dd.rows: 

2798 table_comment_working.maybe_add_table_comment(ddr) 

2799 if ddr.from_file and not alter_loaded_rows: 

2800 # Skip rows that were loaded from disk. 

2801 continue 

2802 if ddr.is_table_comment: 

2803 continue # those were handled above 

2804 annotate_systmone_dd_row( 

2805 ddr=ddr, 

2806 context=context, 

2807 specifications=specs, 

2808 append_comments=append_comments, 

2809 include_generic=include_generic, 

2810 allow_unprefixed_tables=allow_unprefixed_tables, 

2811 table_info_in_comments=table_info_in_comments, 

2812 ) 

2813 dd.rows += table_comment_working.extra_table_comment_rows 

2814 log.info("... done") 

2815 # We may have added table comments, so sort. 

2816 dd.sort()