Metadata-Version: 2.1
Name: ukbiobank-loaders
Version: 1.0.0
Summary: Utility package for handling UK Biobank data
Home-page: https://github.com/BenevolentAI/ukbiobank-loaders
Author: BenevolentAI
Author-email: ukbiobank.loaders@benevolent.ai
License: MIT
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: boto3 (==1.17.106)
Requires-Dist: numpy (>=1.21.6)
Requires-Dist: pandas (>=1.3.5)
Requires-Dist: pyarrow
Requires-Dist: s3fs (==2021.11.0)

# ukbiobank-loaders

This repository provides an easy way to load UK Biobank data. It consists of a pre-processing script, which converts the raw UK Biobank files into Parquet files that are easier to read,
and a library that provides methods for accessing the data.

## Installation
To install this package, run:
```bash
pip install ukbiobank-loaders
```
Please note that Python 3.7 or newer is needed.

## Usage

This section describes how to use the library. Data can be read from both local directories and AWS S3 directories.

### Pre-processing
The following UK Biobank files are needed to run the pre-processing, all saved in the same directory `<DATA_FOLDER>`:
```
death.txt
death_cause.txt
gp_clinical.txt
gp_scripts.txt
hesin.txt
hesin_diag.txt
hesin_oper.txt
```

The withdrawn consent file is also needed:
```
withdrawn_consent.txt
```
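
Before starting the pre-processing, it can help to confirm that all of the inputs listed above are in place. The following is a minimal sketch for local directories only (it does not handle S3 locations); `missing_inputs` is a hypothetical helper and is not part of this package.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = [
    "death.txt",
    "death_cause.txt",
    "gp_clinical.txt",
    "gp_scripts.txt",
    "hesin.txt",
    "hesin_diag.txt",
    "hesin_oper.txt",
]


def missing_inputs(raw_dir, withdrawn_file):
    """Return the paths of required input files that do not exist locally."""
    raw_dir = Path(raw_dir)
    candidates = [raw_dir / name for name in REQUIRED_FILES]
    candidates.append(Path(withdrawn_file))
    return [str(path) for path in candidates if not path.is_file()]
```

An empty list means every expected file was found; otherwise the returned paths show what still needs to be downloaded or moved.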

From the terminal, run
```bash
update_data.py --raw_dir <DATA_FOLDER> --withdrawn_file <WITHDRAWN_CONSENT_FILE_PATH> --out_dir <OUTPUT_DIR_FOLDER>
```

The processed data will be saved in a folder named `<OUTPUT_DIR_FOLDER>/final`.

We found this process to take about 14 minutes on a pod with 4 CPUs and 32GB of RAM. If the process is `Killed`, it might be
because there is not enough RAM available.
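
To script the pre-processing step from Python, the documented command can be assembled as an argument list. This is only a sketch: the paths are placeholders, `build_update_command`, `processed_dir` and `run_update` are hypothetical helpers, and running the command assumes that `update_data.py` is on the `PATH` after installation.

```python
import shlex
import subprocess


def build_update_command(raw_dir, withdrawn_file, out_dir):
    """Assemble the documented update_data.py invocation as an argument list."""
    return [
        "update_data.py",
        "--raw_dir", str(raw_dir),
        "--withdrawn_file", str(withdrawn_file),
        "--out_dir", str(out_dir),
    ]


def processed_dir(out_dir):
    """Location of the processed data as described above: <out_dir>/final."""
    return str(out_dir).rstrip("/") + "/final"


def run_update(raw_dir, withdrawn_file, out_dir):
    """Run the pre-processing and return the directory to give to DataLoader."""
    subprocess.run(build_update_command(raw_dir, withdrawn_file, out_dir), check=True)
    return processed_dir(out_dir)


# Placeholder paths, for illustration only.
cmd = build_update_command(
    "/data/ukbb/raw", "/data/ukbb/withdrawn_consent.txt", "/data/ukbb/out"
)
command_line = " ".join(shlex.quote(part) for part in cmd)
print(command_line)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.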

### Accessing the data

This is a simple example of how to use the library. Specific documentation about the methods is given below.
```python
>>> from ukbb_loaders.loaders import load
>>> dl = load.DataLoader(data_dir = "<OUTPUT_DIR_FOLDER>/final")
>>> dl.get_hospital_data("icd10")
    date_of_visit source feature  value
eid
68     1986-04-22  icd10    N181      1
68     1945-05-03  icd10    N181      1
68     1950-04-03  icd10    N181      1
68     1966-08-07  icd10    N181      1
67     1991-03-12  icd10    N181      1
..            ...    ...     ...    ...
73            NaT  icd10    N181      1
48     1997-06-20  icd10    N181      1
48     1945-03-05  icd10    N181      1
48     1956-02-25  icd10    N181      1
48     1981-04-08  icd10    N181      1
```
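
Because the result is a long dataframe with one row per patient, date and code, ordinary pandas operations apply to it. The sketch below builds a small synthetic frame with the same columns as the output above (the values are invented, not UK Biobank data) and finds the earliest recorded date for each patient and code; `first_occurrence` is a hypothetical helper, not part of the library.

```python
import pandas as pd

# Synthetic rows shaped like the output above; the values are invented.
df = pd.DataFrame(
    {
        "eid": [1, 1, 2, 3],
        "date_of_visit": pd.to_datetime(
            ["2001-05-01", "1999-02-10", "2010-07-30", None]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181", "N181", "N181", "I10"],
        "value": [1, 1, 1, 1],
    }
).set_index("eid")


def first_occurrence(long_df):
    """Earliest visit date per (patient, code); rows without a date are dropped."""
    dated = long_df.dropna(subset=["date_of_visit"])
    return (
        dated.reset_index()
        .groupby(["eid", "feature"], as_index=False)["date_of_visit"]
        .min()
    )


first = first_occurrence(df)
print(first)
```

Rows whose `date_of_visit` is `NaT`, like the one for patient 73 in the output above, are excluded here; whether that is appropriate depends on the analysis.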

### Documentation for ukbb\_loaders.loaders.load

#### Table of Contents

* [DataLoader](#ukbb_loaders.loaders.load.DataLoader)
  * [\_\_init\_\_](#ukbb_loaders.loaders.load.DataLoader.__init__)
  * [get\_hospital\_data](#ukbb_loaders.loaders.load.DataLoader.get_hospital_data)
  * [get\_death\_data](#ukbb_loaders.loaders.load.DataLoader.get_death_data)
  * [get\_gp\_clinical\_data](#ukbb_loaders.loaders.load.DataLoader.get_gp_clinical_data)
  * [get\_gp\_medication\_data](#ukbb_loaders.loaders.load.DataLoader.get_gp_medication_data)


Loaders for versioned UKBB data.

<a id="ukbb_loaders.loaders.load.DataLoader"></a>

#### DataLoader Objects

```python
class DataLoader()
```

<a id="ukbb_loaders.loaders.load.DataLoader.__init__"></a>

#### \_\_init\_\_

```python
def __init__(data_dir: str)
```

Class for loading UKBB data.

**Arguments**:

- `data_dir` _str_ - The path to the directory containing the processed data. Note that on Windows the path must
  have forward slashes, e.g.  "C:/Users/john/Documents/data_dir"
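
For local Windows paths written with backslashes, the standard library can produce the forward-slash form. This is a small sketch; `to_forward_slashes` is a hypothetical helper and should not be applied to S3 locations.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path):
    """Rewrite a local Windows path so that it uses forward slashes only."""
    return PureWindowsPath(windows_path).as_posix()


data_dir = to_forward_slashes(r"C:\Users\john\Documents\data_dir")
print(data_dir)  # C:/Users/john/Documents/data_dir
```

The resulting string can then be passed as `data_dir`.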

<a id="ukbb_loaders.loaders.load.DataLoader.get_hospital_data"></a>

#### get\_hospital\_data

```python
def get_hospital_data(source: Union[str, List[str]],
                      level=None,
                      patient_list: np.ndarray = None) -> pd.DataFrame
```

**Arguments**:

- `source` _str or list_ - The coding/representation/source we would like to fetch.
  It needs to be one or more of:
  - `icd10` - for fetching all icd10 related diagnoses.
  - `icd9` - for fetching all icd9 related diagnoses.
  - `opcs3` - for fetching all opcs3 related operational codes.
  - `opcs4` - for fetching all opcs4 related operational codes.
- `level` _list or string_ - The level/significance of diagnoses we would like to fetch.
  It needs to be one or more of:
  - `primary` - for fetching only the primary code related to one diagnosis.
  - `secondary` - for fetching all the secondary (complementary) codes for one
    diagnosis.
  - `external` - for fetching diagnosis codes from external sources.

  Defaults to all of them.
- `patient_list` _np.ndarray_ - The patients to fetch characteristics for. If this is empty,
  all UKBB patients will be used.

**Returns**:

- `df` _pandas dataframe_ - A long canonical dataframe with patients as the index and the
  following columns:
  - date_of_visit: pandas datetime for each hospital visit
  - feature: the different codes used (e.g. the different icd10 codes)
  - source: this is relevant to the source the feature is referring to (e.g. icd10)
  - value: the occurrence value for each row combination (initially 1.)
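
Since `source` and `level` each accept either a string or a list, a caller may want to normalise and check them before requesting data. The sketch below does this against the values listed above; `as_checked_list` is a hypothetical helper, and it does not describe how the library itself validates its arguments.

```python
# Allowed values copied from the argument descriptions above.
HOSPITAL_SOURCES = {"icd10", "icd9", "opcs3", "opcs4"}
HOSPITAL_LEVELS = {"primary", "secondary", "external"}


def as_checked_list(value, allowed, name):
    """Normalise a str-or-list argument to a list and reject unknown entries."""
    items = [value] if isinstance(value, str) else list(value)
    unknown = [item for item in items if item not in allowed]
    if unknown:
        raise ValueError(
            f"Unsupported {name}: {unknown}; expected a subset of {sorted(allowed)}"
        )
    return items


sources = as_checked_list(["icd10", "opcs4"], HOSPITAL_SOURCES, "source")
levels = as_checked_list("primary", HOSPITAL_LEVELS, "level")

# With a DataLoader instance `dl`, the checked values could then be passed on:
# df = dl.get_hospital_data(source=sources, level=levels)
```

A misspelt code system such as `"icd-10"` then fails early with a readable message rather than at load time.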

<a id="ukbb_loaders.loaders.load.DataLoader.get_death_data"></a>

#### get\_death\_data

```python
def get_death_data(level=None,
                   patient_list: np.ndarray = None) -> pd.DataFrame
```

Method that fetches death information for the UKBB population.

**Arguments**:

- `level` _list or string_ - The level/significance of deaths we would like to fetch.
  It needs to be one or both of: primary (main cause of death), secondary. Defaults to both.
- `patient_list` _np.ndarray_ - The patients to fetch characteristics for.
  If this is empty, all UKBB patients will be used.

**Returns**:

- `df` _pandas dataframe_ - A long canonical dataframe with patients as the index and all
  recorded death information including death date in the right format.

<a id="ukbb_loaders.loaders.load.DataLoader.get_gp_clinical_data"></a>

#### get\_gp\_clinical\_data

```python
def get_gp_clinical_data(source=None, patient_list: np.ndarray = None)
```

Method that fetches GP diagnosis information for the UKBB population.

**Arguments**:

- `source` _str or list_ - Whether to load read_2, read_3 or both. Defaults to both.
- `patient_list` _np.ndarray_ - The patients to fetch characteristics for.
  If this is empty, all UKBB patients will be used.

**Returns**:

- `df` _pandas dataframe_ - A long canonical dataframe with patients as the index and all
  recorded GP information including date in the right format.

<a id="ukbb_loaders.loaders.load.DataLoader.get_gp_medication_data"></a>

#### get\_gp\_medication\_data

```python
def get_gp_medication_data(patient_list: np.ndarray = None) -> pd.DataFrame
```

**Arguments**:

- `patient_list` _np.ndarray_ - The patients to fetch medication data for.
  If this is empty, all UKBB patients will be used.

**Returns**:

- `df` _pandas dataframe_ - A canonical long dataframe with patients as the index and
  features as columns.
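
All four methods accept the same `patient_list` argument, so the tables for one cohort can be gathered in a single step. The following is a sketch under assumptions: `load_cohort_tables` is a hypothetical helper, and the argument values `"primary"` and `"read_2"` are taken from the descriptions above rather than verified against the implementation.

```python
def load_cohort_tables(loader, patient_ids=None):
    """Call each documented DataLoader method with the same patient_list.

    `loader` is expected to be a DataLoader instance. `patient_ids` is either
    None (the documented default) or a NumPy array of patient identifiers.
    """
    return {
        "hospital_icd10": loader.get_hospital_data(
            "icd10", level="primary", patient_list=patient_ids
        ),
        "death": loader.get_death_data(level="primary", patient_list=patient_ids),
        "gp_clinical": loader.get_gp_clinical_data(
            source="read_2", patient_list=patient_ids
        ),
        "gp_medication": loader.get_gp_medication_data(patient_list=patient_ids),
    }


# Usage (needs processed data, so it is left commented out):
# import numpy as np
# from ukbb_loaders.loaders import load
# dl = load.DataLoader(data_dir="<OUTPUT_DIR_FOLDER>/final")
# tables = load_cohort_tables(dl, patient_ids=np.array([48, 68]))
```

Keeping the results in a dictionary keyed by table name makes it easy to add or drop sources without changing the calling code.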

## Acknowledgments
This package is developed using the UK Biobank Resource under Application Number 43138.
