Metadata-Version: 2.4
Name: atlas-gmp-engine
Version: 1.0.0
Summary: Bayesian inference engine for geographic place guessing
License: MIT
Project-URL: Homepage, https://guessmyplace.vercel.app
Project-URL: Repository, https://github.com/GuessMyPlace/atlas-gmp-engine
Project-URL: Documentation, https://guessmyplace.vercel.app/docs
Project-URL: Bug Tracker, https://github.com/GuessMyPlace/atlas-gmp-engine/issues
Keywords: geography,game,bayesian,inference,ai,nlp
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Games/Entertainment
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Requires-Dist: scipy>=1.13
Provides-Extra: ml
Requires-Dist: scikit-learn>=1.5; extra == "ml"
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=3.0; extra == "embeddings"
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.8; extra == "faiss"
Provides-Extra: cpp
Requires-Dist: pybind11>=2.13; extra == "cpp"
Provides-Extra: all
Requires-Dist: scikit-learn>=1.5; extra == "all"
Requires-Dist: sentence-transformers>=3.0; extra == "all"
Requires-Dist: faiss-cpu>=1.8; extra == "all"
Requires-Dist: pybind11>=2.13; extra == "all"
Dynamic: license-file

<div align="center">

<br />

# ⚡ Atlas GMP Engine

**Bayesian inference engine for geographic place guessing**

The AI brain powering [GuessMyPlace](https://guessmyplace.vercel.app) —  
identifies any place on Earth through intelligent yes/no questions.

<br />

[![PyPI version](https://img.shields.io/pypi/v/atlas-gmp-engine?color=00C2FF&labelColor=0F1623&style=for-the-badge)](https://pypi.org/project/atlas-gmp-engine)
[![Python 3.11+](https://img.shields.io/badge/Python-3.11+-3776AB?style=for-the-badge&logo=python&logoColor=white&labelColor=0F1623)](https://python.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-00E5A0?style=for-the-badge&labelColor=0F1623)](LICENSE)
[![Tests](https://img.shields.io/github/actions/workflow/status/GuessMyPlace/atlas-gmp-engine/tests.yml?label=Tests&style=for-the-badge&labelColor=0F1623)](https://github.com/GuessMyPlace/atlas-gmp-engine/actions)

<br />

</div>

---

## What is Atlas GMP Engine?

Atlas GMP Engine is a standalone Python package implementing a Bayesian inference system for geographic place identification. Given a dataset of places (countries, cities, landmarks, etc.) and a bank of yes/no questions, the engine:

1. Maintains a probability distribution across all places
2. Selects the most informative next question using information gain + Bayesian scoring
3. Updates probabilities after each answer using likelihood multipliers
4. Eliminates low-probability candidates through soft filtering
5. Returns a confident prediction with accuracy metrics

**Live performance:** ~94% accuracy on 115 world countries, averaging 10 questions per game.

---

## How It Works

```
                    ┌──────────────────────────────────────────┐
                    │            Atlas GMP Engine               │
                    │                                          │
  User Answer ──→  │  ProbabilityManager                      │
                    │    ↓  Bayesian likelihood updates         │
                    │  BayesianNetwork                         │
                    │    ↓  Belief propagation across attrs     │
                    │  InformationGain  ←── FeatureImportance  │
                    │    ↓  Shannon entropy (NumPy + C++)       │
                    │  QuestionSelector                        │
                    │    ↓  5-factor weighted scoring           │
                    │  ConfidenceCalculator                    │
                    │    ↓  4-signal composite score (0–100%)   │
                    │  Prediction / Next Question              │
                    └──────────────────────────────────────────┘
```

### Core Components

| Component | File | Purpose |
|-----------|------|---------|
| `InferenceEngine` | `inference_engine.py` | Main coordinator — manages game sessions |
| `ProbabilityManager` | `probability_manager.py` | Bayesian likelihood updates + soft filtering |
| `BayesianNetwork` | `bayesian_network.py` | Belief propagation across related attributes |
| `InformationGain` | `information_gain.py` | Shannon entropy calculation (NumPy + C++) |
| `QuestionSelector` | `question_selector.py` | 5-factor question scoring + disambiguation |
| `ConfidenceCalculator` | `confidence_calculator.py` | 4-signal composite confidence score |
| `FeatureImportance` | `feature_importance.py` | ML-learned attribute weights |
| `Embeddings` | `embeddings.py` | MiniLM-L6-v2 semantic similarity |
| `FAISSIndex` | `faiss_index.py` | Fast last-mile disambiguation |

---

### Question Selection Algorithm

Every candidate question is scored with a weighted formula:

```python
score = (
    information_gain     * 0.40   # how much entropy does this question remove?
    + stage_bonus        * 0.35   # continent→region→culture→specific
    + answer_balance     * 0.10   # prefer questions that split ~50/50
    + bayesian_belief    * 0.10   # prior probability of this attribute value
    + feature_importance * 0.05   # weight learned from real game data (ML)
)
```

**Stage ordering** ensures the engine always asks broad questions first:

```
Stage 0 → continent, type
Stage 1 → region, sub-region
Stage 2 → coast, landlocked, island, climate, mountains
Stage 3 → population, size, GDP level
Stage 4 → government, religion, drive side
Stage 5 → language, flag, colonial history, UNESCO
Stage 6 → exports, famous for, neighbors
Stage 7 → capital, currency (very specific — asked last)
```
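
To make the weighted formula concrete, here is a minimal sketch of how two of the signals could be computed and then combined. It is an illustration under stated assumptions, not the engine's actual code: the function names (`binary_information_gain`, `answer_balance`, `combined_score`) are hypothetical, answers are treated as deterministic yes/no, and every signal is assumed to be scaled to the 0–1 range before weighting.

```python
import math

# Weights copied from the scoring formula above.
WEIGHTS = {
    "information_gain": 0.40,
    "stage_bonus": 0.35,
    "answer_balance": 0.10,
    "bayesian_belief": 0.10,
    "feature_importance": 0.05,
}


def shannon_entropy(probs):
    """Entropy in bits of a distribution; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def binary_information_gain(probs, matches):
    """Expected entropy reduction from a yes/no question.

    probs   -- normalized probabilities, one per place
    matches -- booleans, True where the place would answer "yes"
    """
    p_yes = sum(p for p, m in zip(probs, matches) if m)
    p_no = 1.0 - p_yes
    gain = shannon_entropy(probs)
    for branch_mass, keep in ((p_yes, True), (p_no, False)):
        if branch_mass > 0:
            # Renormalize the places that fall into this branch.
            branch = [p / branch_mass for p, m in zip(probs, matches) if m == keep]
            gain -= branch_mass * shannon_entropy(branch)
    return gain


def answer_balance(probs, matches):
    """1.0 for a perfect 50/50 split, 0.0 when every place answers the same."""
    p_yes = sum(p for p, m in zip(probs, matches) if m)
    return 1.0 - abs(2.0 * p_yes - 1.0)


def combined_score(signals):
    """Weighted sum of the five signals (assumed to be scaled to 0-1)."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
```

With four equally likely places and a question that matches two of them, `binary_information_gain` returns one bit, the most a single yes/no answer can deliver.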

---

### Probability Updates

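Each answer multiplies every place's probability by a likelihood multiplier. Places whose attribute matches the question's value get the match multiplier; all other places get the mismatch multiplier: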

| Answer | Match multiplier | Mismatch multiplier |
|--------|----------------:|--------------------:|
| **Yes** | ×10.0 | ×0.001 |
| **Probably** | ×3.5 | ×0.15 |
| **Don't Know** | ×1.0 | ×1.0 |
| **Probably Not** | ×0.15 | ×3.5 |
| **No** | ×0.001 | ×10.0 |

After each update, probabilities are renormalized to sum to 1, and a soft filter drops candidates whose probability is below 0.5% of the top candidate's, while always keeping at least 5.
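
The update and filter steps can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the package's implementation: `update_probabilities` and `soft_filter` are hypothetical names, the answer keys are the ones shown in the Quick Start prompt, and returning a boolean mask from the filter is an assumption.

```python
import numpy as np

# (match multiplier, mismatch multiplier) per answer, from the table above.
MULTIPLIERS = {
    "yes": (10.0, 0.001),
    "probably": (3.5, 0.15),
    "dontknow": (1.0, 1.0),
    "probablynot": (0.15, 3.5),
    "no": (0.001, 10.0),
}


def update_probabilities(probs, matches, answer):
    """Multiply each probability by its multiplier, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    matches = np.asarray(matches, dtype=bool)
    match_mult, mismatch_mult = MULTIPLIERS[answer]
    updated = probs * np.where(matches, match_mult, mismatch_mult)
    return updated / updated.sum()


def soft_filter(probs, ratio=0.005, min_keep=5):
    """Return a boolean mask of the candidates that stay active."""
    probs = np.asarray(probs, dtype=float)
    keep = probs >= ratio * probs.max()
    if keep.sum() < min_keep:
        # Too few survivors: keep the min_keep most probable candidates instead.
        keep = np.zeros(len(probs), dtype=bool)
        keep[np.argsort(probs)[::-1][:min_keep]] = True
    return keep
```

Because the multipliers never reach zero, a wrong answer lowers a place's probability without removing it permanently; a later contradicting answer can bring it back.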

---

### Confidence Score

The confidence signal is a weighted combination of 4 measurements:

```python
confidence = (
    probability_gap    * 0.40   # gap between top-1 and top-2 probability
    + normalized_prob  * 0.30   # top probability / total
    + item_count_score * 0.20   # fewer remaining = more confident
    + entropy_score    * 0.10   # inverse of distribution entropy
)
```

The engine triggers a guess when confidence crosses a threshold that relaxes as more questions are asked:
- Questions 1–10: requires **99%**
- Questions 11–25: requires **95%**
- Questions 26–50: requires **88%**
- Questions 51+: requires **78%**
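
The weighting and the threshold lookup can be sketched as follows. The function names are hypothetical, the four signals are assumed to arrive already scaled to 0–1, and question 50 is assumed to belong to the 26–50 band; how the engine derives each individual signal is not shown here.

```python
# Weights for (probability_gap, normalized_prob, item_count_score, entropy_score).
WEIGHTS = (0.40, 0.30, 0.20, 0.10)


def composite_confidence(probability_gap, normalized_prob,
                         item_count_score, entropy_score):
    """Weighted sum of four 0-1 signals, returned as a percentage."""
    signals = (probability_gap, normalized_prob, item_count_score, entropy_score)
    return 100.0 * sum(w * s for w, s in zip(WEIGHTS, signals))


def required_confidence(questions_asked):
    """Confidence (in percent) needed before guessing."""
    if questions_asked <= 10:
        return 99.0
    if questions_asked <= 25:
        return 95.0
    if questions_asked <= 50:
        return 88.0
    return 78.0


def should_guess(confidence, questions_asked):
    """True once confidence reaches the threshold for this point in the game."""
    return confidence >= required_confidence(questions_asked)
```

A confidence of 96% is therefore not enough to guess at question 5, but it is at question 12.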

---

## Installation

```bash
pip install atlas-gmp-engine
```

**With C++ extensions** (recommended — 8× faster probability operations):
```bash
pip install "atlas-gmp-engine[cpp]"
```

**With semantic embeddings** (FAISS disambiguation additionally needs the `faiss` extra):
```bash
pip install "atlas-gmp-engine[embeddings]"
```

**Full installation:**
```bash
pip install "atlas-gmp-engine[all]"
```

---

## Quick Start

```python
from atlas_engine import InferenceEngine

# Define your places
places = [
    {
        "id": "bd",
        "name": "Bangladesh",
        "type": "country",
        "emoji": "🇧🇩",
        "description": "A South Asian nation known for the Sundarbans and the Padma River.",
        "fun_fact": "Bangladesh is home to the world's largest river delta.",
        "attributes": {
            "continent":    "asia",
            "subRegion":    "south asia",
            "landlocked":   False,
            "hasCoast":     True,
            "hasDelta":     True,
            "climate":      "tropical",
            "mainReligion": "islam",
            "language":     "Bengali",
            "population":   "verylarge",
            "driveSide":    "left",
            "famousFor":    ["Sundarbans", "Padma River", "garments industry", "rickshaws"],
        },
    },
    # ... more places
]

# Define your questions
questions = [
    {
        "id": "q1",
        "question_text": "🌏 Is it located in Asia?",
        "attribute": "continent",
        "value": "asia",
        "stage": 0,
        "base_weight": 1.0,
    },
    {
        "id": "q2",
        "question_text": "🌊 Does it have a coastline?",
        "attribute": "hasCoast",
        "value": True,
        "stage": 2,
        "base_weight": 1.2,
    },
    # ... more questions
]

# Initialize engine
engine = InferenceEngine()

# Optionally load ML-learned feature importance
engine.load_feature_importance({
    "continent":    0.95,
    "subRegion":    0.90,
    "mainReligion": 0.88,
    "famousFor":    0.85,
    "language":     0.90,
})

# Start a game session
session = engine.start_game(places, questions)

# Game loop
while True:
    question = engine.get_next_question(session)

    if question is None:
        break  # Engine is ready to guess

    print(f"\n{question['question_text']}")
    answer = input("(yes / probably / dontknow / probablynot / no): ").strip()

    result = engine.process_answer(session, answer)
    print(f"  Confidence: {result['confidence']:.1f}%")
    print(f"  Remaining:  {result['active_places_count']} places")

    if result["should_stop"]:
        break

# Get prediction
pred = engine.get_prediction(session)

if pred["prediction"]:
    p = pred["prediction"]
    print(f"\n🎯 Atlas guesses: {p['emoji']} {p['name']}")
    print(f"   Confidence: {pred['confidence']}%")
    print(f"   Questions asked: {pred['questions_asked']}")
```

---

## Data Format

### Place object

```python
{
    "id":          str,              # unique identifier
    "name":        str,              # display name
    "type":        str,              # "country" | "city" | "landmark" | ...
    "emoji":       str | None,       # optional emoji flag or symbol
    "description": str | None,       # 2-3 sentence description
    "fun_fact":    str | None,       # surprising fact
    "attributes": {                  # key-value pairs matching your questions
        "continent":    str,         # "asia" | "europe" | "africa" | ...
        "subRegion":    str,         # "south asia" | "western europe" | ...
        "landlocked":   bool,
        "hasCoast":     bool,
        "hasMountains": bool,
        "climate":      str,         # "tropical" | "desert" | "temperate" | ...
        "population":   str,         # "small" | "medium" | "large" | "verylarge"
        "mainReligion": str,
        "language":     str,
        "famousFor":    list[str],   # list values supported
        "neighbors":    list[str],
        # ... any attributes your questions reference
    }
}
```

### Question object

```python
{
    "id":            str,     # unique identifier
    "question_text": str,     # "🌏 Is it in Asia?" — emoji prefix recommended
    "attribute":     str,     # "continent" — must match place attributes key
    "value":         any,     # "asia" — the value for which answer is YES
    "stage":         int,     # 0–7 (see stage ordering above)
    "base_weight":   float,   # 1.0 default, higher = preferred
}
```
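
Since every question must reference an attribute key that places actually carry, a small pre-flight check can catch typos before a game starts. The helpers below are hypothetical and not part of the package; in particular, treating a list attribute as a match when it contains the question's value is an assumption about how list values are compared.

```python
def place_matches(place, question):
    """Return True when the place satisfies the question's attribute/value pair."""
    actual = place.get("attributes", {}).get(question["attribute"])
    if isinstance(actual, list):
        # Assumption: a list attribute matches when it contains the value.
        return question["value"] in actual
    return actual == question["value"]


def unreferenced_questions(places, questions):
    """Return ids of questions whose attribute appears on no place."""
    known = set()
    for place in places:
        known.update(place.get("attributes", {}))
    return [q["id"] for q in questions if q["attribute"] not in known]


# Tiny sample data in the formats described above.
places = [
    {"id": "bd", "name": "Bangladesh", "type": "country",
     "attributes": {"continent": "asia", "famousFor": ["Sundarbans"]}},
]
questions = [
    {"id": "q1", "attribute": "continent", "value": "asia"},
    {"id": "q2", "attribute": "famousFor", "value": "Sundarbans"},
    {"id": "q3", "attribute": "currency", "value": "taka"},
]

print([place_matches(places[0], q) for q in questions])
print(unreferenced_questions(places, questions))
```

Here `q3` is reported because no place defines a `currency` attribute, so the question could never separate candidates.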

---

## Advanced Usage

### Load ML-learned feature importance

```python
engine = InferenceEngine()

# Scores from 0.0 to 1.0 — higher = more important for discrimination
engine.load_feature_importance({
    "continent":    0.95,
    "type":         0.98,
    "subRegion":    0.90,
    "mainReligion": 0.88,
    "language":     0.90,
    "famousFor":    0.85,
    "capital":      0.95,
    "landlocked":   0.80,
})
```

### Handle user correction (feedback)

```python
# When Atlas guesses wrong and user corrects it:
engine.apply_feedback(session, correct_place_id="bd")
# Boosts Bangladesh ×25, reduces all others ×0.04
# Engine can then continue asking and make a new prediction
```
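
The effect of a correction on the distribution can be sketched with plain NumPy. This is an illustration of the ×25 / ×0.04 multipliers quoted above, not the body of `apply_feedback`; the function name `apply_correction` and the final renormalization step are assumptions.

```python
import numpy as np


def apply_correction(probs, place_ids, correct_place_id,
                     boost=25.0, penalty=0.04):
    """Boost the corrected place, damp all others, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    is_correct = np.array([pid == correct_place_id for pid in place_ids])
    if not is_correct.any():
        raise KeyError(correct_place_id)
    updated = probs * np.where(is_correct, boost, penalty)
    return updated / updated.sum()


# A wrong leader ("in") is overtaken by the corrected place ("bd").
ids = ["in", "bd", "np"]
corrected = apply_correction([0.7, 0.2, 0.1], ids, "bd")
print(ids[int(np.argmax(corrected))])
```

The combined ratio of 25 / 0.04 = 625 is large enough to flip the ranking even when the corrected place started far behind.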

### Use semantic embeddings for disambiguation

```python
from atlas_engine.embeddings import embed_place

# Generate embedding for a place
place_data = {"name": "Bangladesh", "description": "...", "attributes": {...}}
embedding = embed_place(place_data)   # returns numpy array (384-dim)
# Store in your vector DB (e.g. Supabase pgvector)
```
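
Once places have embeddings, disambiguating between near-identical candidates comes down to comparing vectors. The sketch below ranks candidates by cosine similarity using only NumPy; the helper names are hypothetical, and the 3-dimensional vectors stand in for the 384-dimensional ones that `embed_place` returns.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def rank_by_similarity(query, candidates):
    """Return candidate ids sorted from most to least similar to the query.

    candidates -- dict mapping place id to embedding vector
    """
    return sorted(
        candidates,
        key=lambda pid: cosine_similarity(query, candidates[pid]),
        reverse=True,
    )


# Toy vectors for illustration only.
toy = {"a": [1.0, 0.0, 0.0], "b": [0.6, 0.8, 0.0], "c": [0.0, 0.0, 1.0]}
print(rank_by_similarity([1.0, 0.1, 0.0], toy))
```

This brute-force comparison is fine for a handful of remaining candidates; an index such as FAISS becomes worthwhile when searching the full dataset.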

### Build FAISS index for fast similarity search

```python
from atlas_engine.faiss_index import build_index, load_index

# Build from places with embeddings
places_with_embeddings = [
    {"id": "bd", "name": "Bangladesh", "type": "country", "embedding": [...]},
    # ...
]
build_index(places_with_embeddings)

# Load into memory (call once at startup)
load_index()
```

---

## C++ Extensions

For large datasets (10,000+ places), the hot-path operations are implemented in C++ via pybind11:

```
atlas_engine/cpp/probability_ops.cpp
  ├── normalize_probabilities()   ← called after every answer
  ├── soft_filter()               ← eliminates near-zero candidates
  ├── shannon_entropy()           ← information gain inner loop
  └── information_gain_binary()   ← runs for every candidate question
```

**Performance comparison:**

| Dataset | Python (NumPy) | C++ (pybind11) |
|---------|:--------------:|:--------------:|
| 100 places | ~3ms | ~1ms |
| 1,000 places | ~25ms | ~5ms |
| 10,000 places | ~600ms | ~70ms |
| 50,000 places | ~8s | ~400ms |

The engine automatically falls back to NumPy if C++ is not compiled.

**Build C++ extensions manually:**

```bash
cd atlas_engine/cpp
pip install pybind11
python setup.py build_ext --inplace
```

---

## Performance

| Dataset Size | Avg Response | Memory | C++ Required |
|-------------|:------------:|:------:|:------------:|
| ≤ 1,000 | < 20ms | ~150MB | No |
| ≤ 10,000 | < 80ms | ~800MB | Recommended |
| ≤ 50,000 | < 400ms | ~5GB | Yes |

---

## Requirements

**Core (always required):**
```
numpy >= 1.26
scipy >= 1.13
```

`structlog >= 24.2` is optional and is not installed by any extra; without it the engine falls back to stdlib logging.

**Optional extras:**
```
scikit-learn >= 1.5       (ML feature importance training)
sentence-transformers >= 3.0  (semantic embeddings)
faiss-cpu >= 1.8          (fast vector similarity search)
pybind11 >= 2.13          (C++ hot-path extensions)
```

---

## Changelog

### [1.0.0] — 2026

Initial release as a standalone package.

**Features:**
- Bayesian inference engine with 5-factor question selection
- Probability Manager with likelihood multipliers
- Bayesian Network for belief propagation across attributes
- Information Gain Calculator (NumPy + C++ pybind11)
- Confidence Calculator (4-signal composite score)
- FAISS semantic index for last-mile disambiguation
- MiniLM-L6-v2 embeddings (384-dim)
- Soft filtering with configurable thresholds
- Stage-ordered question selection
- Feature importance (both static and ML-learned)
- C++ extensions for hot-path operations (8× speedup)
- Graceful fallback to NumPy when the C++ extensions are unavailable

---

## Used By

- **[GuessMyPlace](https://guessmyplace.vercel.app)** — the geography guessing game this engine was built for

---

## License

MIT License — see [LICENSE](LICENSE) for details.

---

<div align="center">

Part of the [GuessMyPlace](https://github.com/GuessMyPlace) project

**[PyPI](https://pypi.org/project/atlas-gmp-engine) · [GuessMyPlace](https://guessmyplace.vercel.app) · [Docs](https://guessmyplace.vercel.app/docs)**

</div>
