Metadata-Version: 2.4
Name: pyGSLModel
Version: 1.0.0
Summary: A python package for modeling GSL metabolism and performing transcriptomic integration
Home-page: https://github.com/JackWJW/pyGSLModel
Author: Jack Welland
License: GNU
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: cobra
Requires-Dist: pyfastcore
Requires-Dist: mygene
Requires-Dist: pandas
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: imatpy
Requires-Dist: numpy
Requires-Dist: huggingface_hub
Requires-Dist: joblib
Requires-Dist: tqdm
Requires-Dist: scikit-learn
Requires-Dist: skorch
Requires-Dist: scipy
Requires-Dist: torch
Requires-Dist: pyvis
Dynamic: author
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# pyGSLModel

pyGSLModel is a python package designed to facilitate performing metabolic simulations of human GlycoSphingoLipid (GSL) metabolism. It also provides an easy to use strategy for performing transcriptomic integration with iMAT on TCGA and GTEX expression data (accessed from the Xena database).

## Quick Start Guide

### Model Preparation
pyGSLModel includes functions to download the Human-GEM () genome scale metabolic model of human metabolism and to subsequently convert gene names (with mygene API) and prune the model with fastcore (with pyfastcore) to create a smaller GSL specific model if desired. This has notable performance benefits.

```python
# Imports
from pyGSLModel import download_model, convert_genes, prune_model

# Downloading the HUMAN_GEM
model = download_model()

# Converting genes from ensembl IDs to gene names
model = convert_genes(model)

# Pruning the model with fastcore to  focus on GSL metabolism
model = prune_model(model)

# Removing unwanted transport reactions
model = remove_GSL_transport(model)
```

The pruning step takes longer as the function runs a number of test simulations on the full-scale model with different obejctives to ensure it doesn't miss any core reactions, it then subsequently performs fastcore pruning with the pyfastcore package. If you know you want to use the pruned GSL focused model, we made a dedicated function that downloads a pre-pruned and tidied model.

```python
# Imports
from pyGSLModel import download_GSL_model

# Downloading the pruned GSL model
model = download_GSL_model()
```

### Standard Simulations of Metabolism
With a prepared model, you are ready to run some metabolic simulations!
To do this you can make use of the `run_metabolic_model` function. This function allows you to simulate your model given a particular method and objective function.
The methods include standard FBA which would be enforced via `method = "FBA"`, a multiple-FBA approach in which multiple solutions are found for different lienar paths within GSL metabolism based on the selected objective and recombined into a distribution representative ensemble model: `method = "mFBA"`. To specify the GSL objective function for simulation you select one of `D14_Neuron` for a Day 14 I3Neuron, `D28_Neuron` for a Day 28 I3Neuron, `AC` for an IAstrocyte or `MG` for an IMicroglia in the `objective_choice=` argument. These define the set of reactions in the objective function that the solution will optimise for.

```python
# Imports
from pyGSLModel import run_metabolic_model

# Running a simulation
solution = run_metabolic_model(model, method="mFBA", objective_choice="D14_Neuron")
```

The `run_metabolic_model` function also allows for easy analysis of gene knockouts with the `knockout` argument. This is by default set to `"WT"` which won't change the model, however if you input a gene symbol, this will be knocked out of the model before the simulation is performed e.g., `knockout = "B4GALNT1"`.

```python
# Running a simulation with a gene knocked out.
solution_ko = run_metabolic_model(model, method="mFBA", objective_choice="D14_Neuron", knockout="B4GALNT1")
```

### Analysing and Plotting Results
As pyGSLModels focus is to simplify analysis of GSL metabolism, there are also included functions to analyse the metabolic simulation outputs.
These include a tabulation function, a standard bar chart plotting function, and also a network visualisation function that draws the graph of GSL metabolism, with edge widths weighted by reaction flux from the simulation. This network graph is saved to a target folder as an interactive .html file which can be opened in a browser. This allows the user to customise the network to their liking.

```python
# Imports
from pyGSLModel import tabulate_model_results, plot_model_results, visualise_flux_network

# First you can tabule the model results for key GSLs
results_df = tabulate_model_results(model, solution)

# You can then use that data frame as input to the default plotting function
results_fig = plot_model_results(results_df)

# You can also produce an interactive .html representation of the GSL metabolic network with edges weighted by flux
visualise_flux_network(model,solution,file_path="C:/path/to/folder/my_network.html",height="1080px",width="100%",rxn_col="#1f77b4",met_col="#abdcff")
```

### Transcriptomic Integration
pyGSLModel includes a number of functions to integrate transcriptomic data with your metabolic model to generate fluxes.

There are four functions to integrate transcriptomic data with the iMAT method (implemented here with imatpy). This method doesn't require a conventional objective function, instead it uses the gene expression data to drive the simulation. This is useful for complex multicellular organisms and tissues, where a clear biological objective is difficult to discern.

Two of the functions utilise TCGA and GTEX data from the TCGA_TARGET_GTEX dataset (Xena).

The first of these functions: `TCGA_iMAT_Integrate()` takes TCGA data averaged over multiple samples for each cancer/normal tissue and performs iMAT based on genes of GSL metabolism. You can tune the function by adjusting the upper/lower quantiles for how the data is processed. An `upper_quantile = 0.25` and a `lower_quantile = 0.75` means the top 25% of expression values will be assigned a +1, the bottom 25% will be assigned -1 and the remaining values assigned a 0. This is the format needed for iMAT to optimise the model to prioritise high expression gene associated reactions. This function gives two outputs, a dataframe with the flux results in each tissue for each key lipid, but also a dictionary containing the solution objects for further analysis.

The second of these functions: `TCGA_iMAT_sample_integrate()` performs simarly to `TCGA_iMAT_Integrate()`, however it doesn't average the data for a particular cancer/tissue, instead it calculates fluxes for each individual sample in the data set for the given `tissue` argument. You can also decide if you want to run simulations for both TCGA and GTEX data or just TCGA data with the dataset argument. This produces a lot more data and needs to run a lot of simulations so can take some time depending on the tissue you choose. Due to large numbers of samples this function only gives one output, a dataframe with flux results for each sample.

```python
# Imports
from pyGSLModel import TCGA_iMAT_integrate, TCGA_iMAT_sample_integrate

# Performing iMAT on TCGA and GTEX data where expression has been averaged across samples of the same disease/tissue
results_avg_df, sol_dict_TCGA = TCGA_iMAT_integrate(model, upper_quantile = 0.25, lower_quantile = 0.75, epsilon=1, threshold=0.01)

# Performing iMAT on TCGA and GTEX data for each individual sample for a given tissue
results_sample_df = TCGA_iMAT_sample_integrate(model, tissue="Brain", dataset="TCGA", upper_quantile = 0.25, lower_quantile=0.75, epsilon=1, threshold=0.01)
```

The third and fourth iMAT integration methods allows the user to input their own data. The required format is a dataframe where each column represents a different sample and the index contains the gene symbols. The first of these methods (`iMAT_integrate`) take a dataframe with only a single column for the sample and returns a cobrapy solution object, the second of these methods (`iMAT_multi_integrate`) can take multiple sample columns but returns a dataframe with the fluxes for each sample stored.

```python
# Imports
import pandas as pd
from pyGSLModel import iMAT_integrate

### Single Integration ###

import pandas as pd
# Making a dummy dataframe
d = {
    "Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1","A4GALT"],
    "Sample_1" : [8,6,4,2]
}

# Converting the dictionary to a pandas dataframe and setting the index to Gene
iMAT_df_1 = pd.DataFrame(d)
iMAT_df_1 = iMAT_df_1.set_index("Gene").copy()

sol_4 = iMAT_integrate(model_3,iMAT_df_1)


### Multi Integration ###
from pyGSLModel import iMAT_multi_integrate

# Making a mock dataframe
d = {
    "Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1"],
    "Sample_1" : [10,5,1],
    "Sample_2" : [1,5,10],
    "Sample_3" : [5,1,10],
}

mock_df = pd.DataFrame(d)
mock_df = mock_df.set_index("Gene")

# Perfoming iMAT
results_custom_df = iMAT_multi_integrate(model, mock_df, upper_quantile = 0.25, lower_quantile = 0.75, epsilon=1, threshold=0.01)
```

### Calculating GSL risk scores for prognosis prediction in lower grade glioma based on RNA-Seq data
pyGSLModel enables the calculation of GSl risk scores for lower grade glioma RNA-Seq data. The risk scores are calculated through an ensemble machine learning approach. For more details see the github repository: [JACKWJW Github LGG Prognosis Prediction](https://github.com/JackWJW/LGG_Prognosis_Prediction). Check the repository for updates as to when the paper is published!

The Ensemble machine learning approach was validated by training on TCGA data, before validation against CGGA data. Final models were then trained using both TCGA and CGGA data. Models can be accessed via huggingface: [JACKWJW Hugging Face](https://huggingface.co/JackWJW/LGG_Prognosis_Ensemble)

To calculate the GSL risk score your your data, you can use the calculate_GSL_score function. Data should be provided in the form of pandas dataframe with gene symbols as index and samples as columns. It is essential that expression values are recorded as TPMs (the function will log2 normalise the TPM values internally, so raw TPMs must be provided).

```python
# Imports
from pyGSLModel import calculate_GSL_score

# Data should be in the format with gene symbols as index and samples as columns.
# You can provide your full RNA-seq datasets in this format, the function will trim the data to include only the relevant columns internally.
# Expression data should be in the TPM format
d = {
    "Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1","A4GALT"],
    "Sample_1" : [8,6,4,2],
    "Sample_2" : [2,4,6,8]
}

# Converting the dictionary to a pandas dataframe and setting the index to Gene
test_df = pd.DataFrame(d)
test_df = test_df.set_index("Gene").copy()

# Perfoming GSL risk score calculation
results_df = calculate_GSL_score(data=test_df)
```



## Dependencies and License
pyGSLModel is under a GNU license.
Dependecies include:
- requests (Apache)
- cobra (GNU)
- pyfastcore (MIT)
- mygene (BSD)
- pandas (BSD)
- matplotlib (PSF)
- seaborn (BSD)
- imatpy (MIT)
- numpy (BSD)
- networkx (BSD)
- pyvis (BSD)
- pytorch (BSD)
- sklearn (BSD)
