Metadata-Version: 2.4
Name: synthedu
Version: 1.7.0
Summary: Agent-Based Synthetic Educational Data Generation for ODL Research
Author-email: Halis Aykut Cosgun <h.aykut.cosgun@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/theaiagent/SynthEd
Project-URL: Repository, https://github.com/theaiagent/SynthEd
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: uuid_utils<1.0.0,>=0.9.0
Requires-Dist: SALib>=1.4.0
Requires-Dist: optuna>=3.0.0
Provides-Extra: llm
Requires-Dist: openai>=1.0.0; extra == "llm"
Provides-Extra: dashboard
Requires-Dist: shiny>=1.0; extra == "dashboard"
Requires-Dist: plotly>=5.0; extra == "dashboard"
Requires-Dist: htmltools; extra == "dashboard"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# SynthEd: From synthetic data to simulated learners

[![GitHub release](https://img.shields.io/github/v/release/theaiagent/SynthEd)](https://github.com/theaiagent/SynthEd/releases/latest)
[![CI](https://github.com/theaiagent/SynthEd/actions/workflows/ci.yml/badge.svg)](https://github.com/theaiagent/SynthEd/actions/workflows/ci.yml)
[![pytest](https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/theaiagent/cbf1abd6cdc2134e7e26374de286f2c9/raw/synthed-test-badge.json)](#test-suite)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)
[![codecov](https://codecov.io/gh/theaiagent/SynthEd/graph/badge.svg)](https://codecov.io/gh/theaiagent/SynthEd)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.19334118.svg)](https://doi.org/10.5281/zenodo.19334118)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Agent-based simulation environment for Open & Distance Learning (ODL) research.** SynthEd generates behaviorally grounded and temporally coherent learning trajectories by combining persona-driven agent modeling with 10 established theoretical frameworks. Built for researchers in learning analytics, educational data mining, and dropout prediction.

```bash
pip install -e ".[dev]"             # or: pip install synthedu
python run_pipeline.py --n 200
```

> **From statistical similarity to behavioral fidelity.** Traditional synthetic data methods optimize for distributional match. SynthEd optimizes for *behavioral coherence* -- each data point emerges from a simulated student's evolving motivations, decisions, and life context.

---

## Why SynthEd?

| Challenge | Traditional Approach | SynthEd Approach |
|-----------|---------------------|-----------------|
| **Privacy regulations** (GDPR/KVKK) | Anonymization (re-identification risk) | Agents are fictional -- no real individuals |
| **Class imbalance** in dropout data | Oversampling (SMOTE) -- loses context | Parameter-level control of dropout rates |
| **Temporal incoherence** | GAN/VAE post-hoc smoothing | Persona + memory produces coherent trajectories |

---

## Key Features

### Simulation Engine
- **10 Theory Modules** -- Tinto, Bean & Metzner, Kember, SDT, Garrison CoI, Moore, Rovai, Baulke, Epstein & Axtell, Gonzalez (+ unavoidable withdrawal mechanism)
- **TheoryModule Protocol** -- 4-phase dispatch (individual, network, post-peer, engagement) with auto-discovery and `_ENGAGEMENT_ORDER` composition. New theories can be added with zero engine changes
- **Continuous Persona Spectrum** -- Employment intensity, family responsibility, and internet reliability are floats in [0, 1] drawn from Beta distributions. No binary gates -- all theory effects scale continuously
- **Multi-Semester Simulation** -- Carry-over mechanics for engagement, GPA, coping, dropout phases
- **GPA Feedback Loop** -- Cumulative GPA anchors cost-benefit, non-fit perception, and competence beliefs
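
The continuous persona spectrum can be illustrated with a minimal sketch. It is not SynthEd's implementation: the trait names mirror the bullet above, but the Beta shape parameters, the `sample_persona` and `weekly_study_hours` helpers, and the penalty weights are all assumptions made for illustration.

```python
import random

# Hypothetical Beta shape parameters (alpha, beta) per trait; not SynthEd's values.
TRAIT_SHAPES = {
    "employment_intensity": (2.0, 2.0),
    "family_responsibility": (2.0, 3.0),
    "internet_reliability": (5.0, 2.0),
}


def sample_persona(rng: random.Random) -> dict[str, float]:
    """Draw each trait as a float in [0, 1] from its Beta distribution."""
    return {name: rng.betavariate(a, b) for name, (a, b) in TRAIT_SHAPES.items()}


def weekly_study_hours(persona: dict[str, float], base_hours: float = 12.0) -> float:
    """Scale available study time continuously instead of gating on a flag.

    The weights (0.5 and 0.3) are illustrative assumptions.
    """
    load = 0.5 * persona["employment_intensity"] + 0.3 * persona["family_responsibility"]
    return base_hours * (1.0 - load)


rng = random.Random(42)  # fixed seed so the draw is repeatable
personas = [sample_persona(rng) for _ in range(5)]
for p in personas:
    print({k: round(v, 2) for k, v in p.items()}, round(weekly_study_hours(p), 1))
```

Because every trait is a float rather than a boolean, a student who works slightly more loses slightly more study time; there is no threshold at which behavior jumps.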

### Calibration & Validation
- **Sobol Sensitivity** -- 68-parameter sensitivity analysis identifying dominant dropout/engagement drivers
- **NSGA-II Calibration** -- Multi-objective optimization with Pareto front, parallel `--workers N` support, adaptive parameter bounds
- **5-Level Validation Suite** -- 21 statistical tests (distributions, correlations, temporal coherence, privacy, backstory)
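
To make the "Pareto front" concrete: in multi-objective calibration, a candidate is kept only if no other candidate is at least as good on every objective and strictly better on at least one. The sketch below shows only that non-dominated filter in plain Python. It is not NSGA-II itself (there is no rank sorting, crowding distance, or genetic operators), and the two objective names and the trial values are made up for illustration.

```python
from typing import Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` everywhere and better somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: list[tuple[float, ...]]) -> list[tuple[float, ...]]:
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical calibration trials: (dropout_rate_error, engagement_error), both minimized.
trials = [(0.10, 0.30), (0.20, 0.10), (0.15, 0.35), (0.05, 0.40), (0.20, 0.20)]
front = pareto_front(trials)
print(front)
```

With these values, three trials survive: `(0.15, 0.35)` is dominated by `(0.10, 0.30)`, and `(0.20, 0.20)` is dominated by `(0.20, 0.10)`. The remaining points represent genuine trade-offs between the two objectives.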

### Configuration
- **InstitutionalConfig** -- 5 institution-level quality parameters that modulate theory constants. `support_services_quality` scales 13 Baulke dropout phase thresholds
- **GradingConfig** -- Beta/Normal/Uniform grade distributions, dual-hurdle pass requirements, exam-only and continuous assessment modes, relative grading with t-score cohort normalization
- **EngineConfig** -- 70 frozen engine constants with validation, overridable via `dataclasses.replace()`
- **PipelineConfig** -- Frozen dataclass grouping 16 pipeline params with JSON serialization for reproducibility
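
The frozen-config pattern described above can be sketched with the standard library alone. `DemoConfig` and its fields are hypothetical stand-ins, not SynthEd's `EngineConfig` or `PipelineConfig`; only the use of `dataclasses.replace()` is taken from the text, and the JSON round-trip is one plausible way to get reproducibility, not necessarily the one the package uses.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for a frozen config; field names are illustrative."""

    n_students: int = 200
    n_weeks: int = 14
    seed: int = 42

    def __post_init__(self) -> None:
        # Validation runs on construction and again on every replace().
        if self.n_students <= 0:
            raise ValueError("n_students must be positive")
        if self.n_weeks <= 0:
            raise ValueError("n_weeks must be positive")


base = DemoConfig()
larger = replace(base, n_students=500)  # new object; `base` is unchanged

# Round-trip through JSON so a run can be reproduced from a saved file.
payload = json.dumps(asdict(larger), sort_keys=True)
restored = DemoConfig(**json.loads(payload))

print(payload)
print(restored == larger)
```

Since `replace()` builds a new instance through `__init__`, an invalid override fails at the point it is made rather than midway through a simulation run.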

### Data & Integration
- **OULAD-Compatible Export** -- 7-table CSV matching the Open University Learning Analytics Dataset schema
- **Optional LLM Enrichment** -- Persona-grounded narrative backstories via OpenAI, Ollama, or any compatible provider
- **Benchmark Reports** -- Customizable default profile with CLI report generation (`--benchmark`)

---

## Quick Start

```bash
git clone https://github.com/theaiagent/SynthEd.git
cd SynthEd
pip install -e ".[dev]"              # Dev install (no LLM)
pip install -e ".[dev,llm]"          # Dev install with LLM support
python run_pipeline.py              # 200 students, 14 weeks
python run_pipeline.py --n 500      # Custom population
python run_pipeline.py --oulad      # OULAD-compatible export
python run_pipeline.py --benchmark  # Run default benchmark profile
python run_calibration.py --workers 4  # Parallel NSGA-II calibration
```

```python
from synthed.pipeline import SynthEdPipeline
from synthed.pipeline_config import PipelineConfig

config = PipelineConfig(output_dir="./output", seed=42)
pipeline = SynthEdPipeline(config=config)
report = pipeline.run(n_students=300)
print(f"Dropout: {report['simulation_summary']['dropout_rate']:.1%}")
```

---

## Use Cases

1. **Dropout Prediction** -- Generate labeled training data with known ground-truth trajectories
2. **Intervention Simulation** -- Model "what-if" scenarios by adjusting population parameters
3. **Privacy-Safe Benchmarking** -- Share synthetic datasets publicly for reproducible research

---

## Documentation

| Document | Content |
|----------|---------|
| **[User Guide](docs/GUIDE.md)** | Installation, configuration, calibration pipeline, OULAD export, LLM enrichment, troubleshooting |
| **[Theory & Architecture](docs/THEORY.md)** | 10 theoretical anchors, factor clusters, architecture diagram, project structure, validation suite, test inventory |

---

## Roadmap

- [x] Multi-semester simulation with carry-over
- [x] 10 theory modules (Tinto, Bean & Metzner, Kember, SDT, Garrison, Moore, Rovai, Baulke, Epstein & Axtell, Gonzalez)
- [x] Trait-based calibration (Sobol + Optuna + OULAD validation)
- [x] Benchmark reports with CLI (`--benchmark`)
- [x] OULAD-compatible 7-table export
- [x] LLM enrichment with cost control and streaming
- [x] Disability severity (Beta distribution)
- [x] InstitutionalConfig (5 quality parameters modulating theory constants)
- [x] NSGA-II multi-objective calibration with Pareto front
- [x] GradingConfig (configurable grading policy: Beta/Normal/Uniform, dual-hurdle, exam-only)
- [x] EngineConfig (70 frozen engine constants with validation)
- [x] Relative grading (t-score cohort normalization)
- [x] PipelineConfig (frozen pipeline configuration with JSON serialization)
- [x] TheoryModule Protocol (phase-based dispatch with auto-discovery)
- [x] Engine modularization (state.py, grading.py, statistics.py -- engine.py 834→590 lines)
- [x] Engagement protocol unification (4th phase: `contribute_engagement_delta`)
- [x] Spectrum refactoring (binary → continuous for employment/family/internet)
- [x] PyPI package publication (`pip install synthedu`)
- [x] Interactive dashboard (Shiny + Plotly, dark/light themes, presets, validation)
- [ ] GraphRAG integration (curriculum modeling)
- [ ] LLM-augmented mode (forum posts, assignment text)
- [ ] Parquet/Arrow export

---

## Legal Disclaimer

> **SynthEd generates entirely fictional synthetic data.** No real individuals are represented or identifiable. Outputs are intended for research, development, and educational purposes. SynthEd is under active development -- APIs and output formats may change between versions.

See full [Legal Disclaimer](docs/GUIDE.md#%EF%B8%8F-legal-disclaimer) and [Responsible Use](docs/GUIDE.md#-responsible-use) guidelines.

---

## Contributing

Contributions welcome! See the [User Guide](docs/GUIDE.md) for development setup. The linter and test suite can be run with:

```bash
ruff check synthed/ tests/
python -m pytest tests/ -v --tb=short
```

---

## License

MIT License. See [LICENSE](LICENSE).

## Citation

If you use SynthEd in your research, please cite it using the [CITATION.cff](CITATION.cff) file or the Zenodo DOI [10.5281/zenodo.19334118](https://doi.org/10.5281/zenodo.19334118).

## Contributors

| Contributor | Role |
|-------------|------|
| [Halis Aykut Cosgun](https://orcid.org/0000-0003-0166-6237) | Lead Developer, Data Scientist & AI Engineer, Researcher -- Yozgat Bozok University |
| [Evrim Genc Kumtepe](https://orcid.org/0000-0002-2568-8054) | Research Advisor -- Anadolu University |
| [Claude](https://claude.ai) (Anthropic) | AI pair programmer -- implementation, testing, code review |

## Acknowledgments

Conceptually inspired by [TinyTroupe](https://github.com/microsoft/tinytroupe) (Microsoft), [MiroFish](https://github.com/666ghj/MiroFish), and [Agent Lightning](https://github.com/microsoft/agent-lightning). OULAD reference data: [Kuzilek et al. (2017)](https://doi.org/10.1038/sdata.2017.171).
