Metadata-Version: 2.1
Name: sickness-screening
Version: 1.0.0
Summary: Module for sepsis predictions
Home-page: https://github.com/sslavian812/sepsis-predictions.git
Author: @Margo78, @akp1n
Author-email: timtimk30@yandex.ru
License: UNKNOWN
Keywords: sepsis,predictions,python,disease,screening
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: requests (>=2.25.1)
Requires-Dist: pandas (>=1.3.3)
Requires-Dist: tqdm (>=4.62.3)
Requires-Dist: numpy (>=1.21.2)
Requires-Dist: scikit-learn (>=0.24.2)
Requires-Dist: imbalanced-learn (>=0.8.0)
Requires-Dist: pytorch-tabnet (>=3.1.1)
Requires-Dist: torch (>=1.9.0)
Requires-Dist: nona (>=0.0.2)

# Predictions sepsis

## Introduction

Predictions sepsis is a module built on pandas, torch, and scikit-learn for working with the MIMIC dataset.
With just a few function calls, you can aggregate patient data and train a model that predicts whether a patient has a given disease.
By default, the module is set up to train on and predict sepsis.
You can also change the names of the tables that data is aggregated from.

### Installation

To install the module, use the following command:

```bash
pip install predictions-sepsis
```
or
```bash
pip3 install predictions-sepsis
```

### Usage

You can import functions from the module into your Python file to aggregate data from MIMIC,
fill in missing values, compress the data per patient, and train your model.

### Examples

#### Aggregate patient diagnoses Data
```python
import predictions_sepsis as ps

ps.get_diagnoses(patient_diagnoses_csv='path_to_patient_diagnoses.csv', 
                 all_diagnoses_csv='path_to_all_diagnoses.csv',
                 output_file_csv='gottenDiagnoses.csv')
```

#### Aggregate patient SSIR data
```python
import predictions_sepsis as ps

ps.get_ssir(chartevents_csv='chartevents.csv', subject_id_col='subject_id', itemid_col='itemid',
            charttime_col='charttime', value_col='value', valuenum_col='valuenum', valueuom_col='valueuom',
            itemids=None, rest_columns=None, output_csv='ssir.csv')
```

#### Combine Diagnoses and SSIR Data
```python
import predictions_sepsis as ps

ps.combine_diagnoses_and_ssir(gotten_diagnoses_csv='gottenDiagnoses.csv', 
                              ssir_csv='path_to_ssir.csv',
                              output_file='diagnoses_and_ssir.csv')
```

#### Aggregate patient blood analysis data from chartevents.csv and labevents.csv and combine it with diagnoses and SSIR Data
```python
import predictions_sepsis as ps

ps.merge_diagnoses_and_ssir_with_blood(diagnoses_and_ssir_csv='diagnoses_and_ssir.csv', 
                                       blood_csv='path_to_blood.csv',
                                       chartevents_csv='path_to_chartevents.csv',
                                       output_csv='merged_data.csv')
```

#### Compress Data by patient
```python
import predictions_sepsis as ps

ps.compress(df_to_compress='balanced_data.csv', 
            output_csv='compressed_data.csv')
```

#### Choose top non-sepsis patients to balance
```python
import predictions_sepsis as ps

ps.choose(compressed_df_csv='compressed_data.csv', 
          output_file='final_balanced_data.csv')
```

#### Fill missing values with mode
```python
import predictions_sepsis as ps

ps.fill_values(balanced_csv='final_balanced_data.csv', 
               strategy='most_frequent', 
               output_csv='filled_data.csv')
```
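
The strategy name `most_frequent` suggests mode imputation: each gap in a column is replaced by that column's most common observed value. The following is a minimal standard-library sketch of that idea, not the module's actual implementation; the helper `fill_with_mode` and the toy rows are hypothetical and exist only for illustration.

```python
from collections import Counter


def fill_with_mode(rows, missing=None):
    """Return a copy of rows where missing entries take the column's most frequent value."""
    filled = [dict(row) for row in rows]  # copy so the input is left untouched
    for col in rows[0].keys():
        observed = [row[col] for row in rows if row[col] is not missing]
        if not observed:
            continue  # nothing to impute from; leave the column as it is
        mode_value = Counter(observed).most_common(1)[0][0]
        for row in filled:
            if row[col] is missing:
                row[col] = mode_value
    return filled


# Hypothetical toy data: gaps are marked with None.
toy_rows = [
    {"subject_id": 1, "wbc": 7.0, "platelets": "normal"},
    {"subject_id": 2, "wbc": None, "platelets": "normal"},
    {"subject_id": 3, "wbc": 7.0, "platelets": None},
    {"subject_id": 4, "wbc": 9.5, "platelets": "large"},
]
filled_rows = fill_with_mode(toy_rows)
```

Mode imputation works for both numeric and categorical columns, which is probably why it is a convenient default for mixed lab data.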

#### Train model
```python
import predictions_sepsis as ps
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
model = ps.train_model(df_to_train_csv='filled_data.csv', 
                       categorical_col=['Large Platelets'], 
                       columns_to_train_on=['Amylase'], 
                       model=RandomForestClassifier(), 
                       single_cat_column='White Blood Cells', 
                       has_disease_col='has_sepsis', 
                       subject_id_col='subject_id', 
                       valueuom_col='valueuom', 
                       scaler=MinMaxScaler(), 
                       random_state=42, 
                       test_size=0.2)
```
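
The `scaler=MinMaxScaler()` argument rescales each feature before training. With its default settings, scikit-learn's `MinMaxScaler` maps a column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below shows that arithmetic in plain Python; the helper `min_max_scale` and the sample values are hypothetical, and how `train_model` applies the scaler internally is not shown here.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 in this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical lab values for a single feature column.
amylase_toy = [30.0, 60.0, 90.0, 120.0]
scaled = min_max_scale(amylase_toy)
```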

## Second way
#### Collecting features of the dataset
```python
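import pandas as pd
from tqdm import tqdm

# file_path, output_path, and item_ids_set (a set of item ID strings)
# are assumed to be defined by the caller.
result = {}
with open(file_path) as f:
    headers = f.readline().replace('\n', '').split(',')  # consume the header row
    for line in tqdm(f):
        values = line.replace('\n', '').split(',')
        subject_id = values[0]
        item_id = values[6]
        valuenum = values[8]
        if item_id in item_ids_set:
            if subject_id not in result:
                result[subject_id] = {}
            # One value per (patient, item); later rows overwrite earlier ones.
            result[subject_id][item_id] = valuenum

# One row per patient, one column per collected item ID.
table = pd.DataFrame.from_dict(result, orient='index')
table['subject_id'] = table.index

table.to_csv(output_path, index=False)
```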
```
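
Splitting each line on commas and indexing by position breaks if a field contains a quoted comma or if the column order changes. As a hedged alternative, the same collection step can be sketched with the standard `csv` module and column names. The function `collect_features`, the column names (borrowed from the `get_ssir` defaults above), and the sample rows with their item IDs are assumptions for illustration only.

```python
import csv
import io


def collect_features(lines, item_ids, subject_col="subject_id",
                     item_col="itemid", value_col="valuenum"):
    """Build {subject_id: {item_id: value}} for the requested item IDs."""
    result = {}
    for row in csv.DictReader(lines):
        if row[item_col] in item_ids:
            # Later rows overwrite earlier ones for the same patient and item.
            result.setdefault(row[subject_col], {})[row[item_col]] = row[value_col]
    return result


# Hypothetical sample data; the second line has a quoted comma in "value".
sample_csv = io.StringIO(
    'subject_id,itemid,value,valuenum\n'
    '10,220045,"80, regular",80\n'
    '10,220210,18,18\n'
    '11,220045,95,95\n'
    '11,999999,1,1\n'
)
features = collect_features(sample_csv, item_ids={"220045", "220210"})
```

The resulting dictionary has the same shape as `result` in the block above, so it can be passed to `pd.DataFrame.from_dict(..., orient='index')` in the same way.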

#### Add a target to the dataset
```python
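# DRG codes 870-872 are the septicemia / severe sepsis groups.
target_subjects = drgcodes.loc[drgcodes['drg_code'].isin([870, 871, 872]), 'subject_id']
# Mark every patient with one of those codes as a positive case.
merged_data.loc[merged_data['subject_id'].isin(target_subjects), 'diagnosis'] = 1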
```

#### Filling in the blanks using the NoNa library
```python
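from nona.nona import nona
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X is the feature table with missing values.
nona(
    data=X,
    algreg=make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=0.1)),
    algclass=RandomForestClassifier(max_depth=2, random_state=0)
)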
```

#### Removing class imbalance using SMOTE
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=random_state)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```
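
SMOTE balances the classes by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point somewhere on the line segment between them. The sketch below illustrates only that interpolation step in plain Python. It is a simplified illustration, not the imbalanced-learn implementation, and the helper `smote_like_sample` and the toy points are hypothetical.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment from base to neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class points with two features each.
minority_toy = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [8.0, 9.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_toy, k=2, rng=rng) for _ in range(5)]
```

Because every synthetic point is an interpolation between two real minority samples, oversampling should be applied to the training split only, as in the `fit_resample(X_train, y_train)` call above.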

#### Train model TabNet
```python
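import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier

# Hyperparameters such as pretraining_lr, training_lr, mask_type,
# scheduler_params, max_epochs, and patience are assumed to be defined earlier.

# Step 1: self-supervised pretraining on the features only.
unsupervised_model = TabNetPretrainer(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=pretraining_lr),
    mask_type=mask_type
)

unsupervised_model.fit(
    X_train=X_train.values,
    eval_set=[X_val.values],
    pretraining_ratio=pretraining_ratio,
)

# Step 2: supervised classifier initialised from the pretrained weights.
clf = TabNetClassifier(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=training_lr),
    scheduler_params=scheduler_params,
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    mask_type=mask_type
)

clf.fit(
    X_train=X_train.values, y_train=y_train.values,
    eval_set=[(X_val.values, y_val.values)],
    eval_metric=['auc'],
    max_epochs=max_epochs,
    patience=patience,
    from_unsupervised=unsupervised_model
)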
```

#### Looking at the metrics
```python
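from sklearn.metrics import f1_score, precision_score, recall_score

# loaded_clf is a trained classifier, for example one restored from disk.
result = loaded_clf.predict(X_test.values)
accuracy = (result == y_test.values).mean()
precision = precision_score(y_test.values, result)
recall = recall_score(y_test.values, result)
f1 = f1_score(y_test.values, result)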
```
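
For reference, all four metrics derive from the confusion-matrix counts of a binary classifier. The plain-Python sketch below shows how; the helper `binary_metrics` and the toy labels are hypothetical, and in practice the scikit-learn functions above are the ones to use.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sepsis correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed sepsis case
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = sepsis, 0 = no sepsis.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true_toy, y_pred_toy)
```

On imbalanced data, accuracy alone can look good even when most sepsis cases are missed, so recall and F1 are usually the more informative numbers for a screening task.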


