Metadata-Version: 2.4
Name: gaia-soil
Version: 0.1.0
Summary: A foundation model for soil microbiome understanding
Author: Gaia Project Contributors
License: Apache-2.0
Project-URL: Homepage, https://github.com/Kimchikilla/ProjectGaia
Project-URL: Repository, https://github.com/Kimchikilla/ProjectGaia
Project-URL: Issues, https://github.com/Kimchikilla/ProjectGaia/issues
Keywords: soil,microbiome,foundation-model,metagenomics,deep-learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.30
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.11
Requires-Dist: anndata>=0.10
Requires-Dist: biom-format>=2.1
Requires-Dist: requests>=2.31
Requires-Dist: pyyaml>=6.0
Requires-Dist: tqdm>=4.65
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: black>=23.7; extra == "dev"
Requires-Dist: isort>=5.12; extra == "dev"
Requires-Dist: flake8>=6.1; extra == "dev"
Requires-Dist: mypy>=1.5; extra == "dev"
Provides-Extra: viz
Requires-Dist: matplotlib>=3.7; extra == "viz"
Requires-Dist: seaborn>=0.12; extra == "viz"
Requires-Dist: umap-learn>=0.5; extra == "viz"
Provides-Extra: tracking
Requires-Dist: wandb>=0.15; extra == "tracking"
Provides-Extra: all
Requires-Dist: gaia-soil[dev,tracking,viz]; extra == "all"
Dynamic: license-file

# Gaia: Soil Microbiome Foundation Model

> *Gaia — the Greek goddess of Earth. Decoding the hidden language of soil microbiomes.*

**"The AlphaFold of Soil Microbiomes, built open-source."**

**English** | [한국어](README_KO.md)

Gaia is a foundation model that understands the "language" of soil microbial communities. Pre-trained on public metagenomic data, it enables soil health diagnosis, yield prediction, and microbial consortium design.

---

## Key Features

- **Pre-trained Foundation Model**: Transformer-based model pre-trained on 10,000+ soil microbiome samples from MGnify, NEON, and EMP
- **Soil Health Diagnosis**: Predict soil chemical properties (pH, organic carbon, total nitrogen) from microbial profiles
- **Biome Classification**: Identify soil biome types (agricultural, forest, grassland, desert, wetland)
- **Drought Stress Detection**: Binary classification of drought stress from microbial signatures
- **Interpretability Tools**: Attention-based keystone genera identification
- **Synthetic Data Generation**: Generate realistic microbial abundance profiles for target soil conditions

## Quick Start

### Installation

```bash
pip install gaia-soil
```

Or install from source:

```bash
git clone https://github.com/Kimchikilla/ProjectGaia.git
cd ProjectGaia
pip install -e ".[dev]"
```

### Basic Usage

```python
from gaia.inference import GaiaPredictor

# Load pre-trained model
predictor = GaiaPredictor.from_pretrained("gaia-v0.1")

# Predict soil properties from microbial profile
result = predictor.diagnose("path/to/abundance_profile.csv")
print(result.soil_health_report)
```

## Project Structure

```
gaia/
├── README.md
├── LICENSE                    # Apache 2.0
├── CONTRIBUTING.md
├── docs/
│   ├── roadmap.md
│   ├── data_standard.md       # Data standardization guide
│   └── tutorials/
├── data/
│   ├── scripts/               # Data collection & preprocessing scripts
│   ├── configs/               # Data source configurations
│   └── README.md              # Data catalog
├── gaia/
│   ├── preprocessing/         # Preprocessing modules
│   ├── models/                # Model architectures
│   ├── training/              # Training scripts
│   ├── evaluation/            # Evaluation modules
│   └── inference/             # Inference modules
├── benchmarks/                # Benchmark datasets & evaluation criteria
├── notebooks/                 # Tutorial Jupyter notebooks
└── tests/
```

## Data Sources

| Source | Description | Samples |
|--------|------------|---------|
| [MGnify](https://www.ebi.ac.uk/metagenomics/) | Taxonomic abundance tables from soil biomes | 5,000-15,000 |
| [NEON](https://www.neonscience.org/) | Paired microbiome + environmental data | ~2,000 |
| [Earth Microbiome Project](https://earthmicrobiome.org/) | Standardized global soil samples | ~5,000 |
| [SMAG](https://genome.jgi.doe.gov/) | 40,039 soil MAGs from 3,304 metagenomes | Reference DB |

## Benchmarks

| Task | Metric | Description |
|------|--------|-------------|
| Biome Classification | ROC-AUC, F1 | Classify soil biome type from microbial profile |
| Soil Chemistry Prediction | R², RMSE | Predict pH, organic C, total N |
| Tillage Classification | Accuracy, Kappa | Classify tillage practice |
| Drought Stress Detection | Accuracy, F1 | Detect drought stress (binary) |
| Abundance Reconstruction | Cosine Similarity | Reconstruct masked microbial profiles |

## Model Architecture

- **Base**: Multi-layer Transformer Decoder
- **Layers**: 6-12 (adjustable)
- **Attention Heads**: 8-16
- **Embedding Dim**: 256-512
- **Vocabulary**: ~5,000 soil-associated genera
- **Pre-training**: Continual pre-training from [MGM](https://github.com/HUST-NingKang-Lab/MGM) weights

## Tech Stack

| Area | Tool |
|------|------|
| Language | Python 3.10+ |
| Deep Learning | PyTorch 2.x |
| Transformers | Hugging Face Transformers |
| Data | Pandas, AnnData, Biom-format |
| Bioinformatics | QIIME2, Kraken2, MetaPhlAn |
| Visualization | Matplotlib, Seaborn, UMAP |
| Experiment Tracking | Weights & Biases |
| Model Hosting | Hugging Face Hub |

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### Ways to Contribute

- **Code**: Bug fixes, new features, pipeline improvements
- **Data**: Standardized soil microbiome datasets
- **Science**: New benchmark tasks, ecological validation, domain expertise

## Community

- **GitHub Discussions**: Technical discussions and Q&A
- **Discord**: Real-time community chat
- **Monthly Meetings**: Online direction-setting meetings (1st Thursday of each month)

## Citation

```bibtex
@software{gaia2026,
  title={Gaia: A Foundation Model for Soil Microbiome Understanding},
  year={2026},
  url={https://github.com/Kimchikilla/ProjectGaia}
}
```

## License

This project is licensed under the Apache License 2.0 - see [LICENSE](LICENSE) for details.

---

*This project is under active development. Star this repo to stay updated!*
