Metadata-Version: 2.4
Name: jdatamunch-mcp
Version: 0.8.3
Summary: Token-efficient MCP server for tabular data retrieval via CSV/Excel/Parquet/JSONL indexing
Project-URL: Homepage, https://github.com/jgravelle/jdatamunch-mcp
Project-URL: Repository, https://github.com/jgravelle/jdatamunch-mcp
Project-URL: Bug Tracker, https://github.com/jgravelle/jdatamunch-mcp/issues
Author-email: "J. Gravelle" <j@gravelle.us>
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: charset-normalizer>=3.0.0
Requires-Dist: mcp<2.0.0,>=1.10.0
Provides-Extra: all
Requires-Dist: anthropic>=0.40.0; extra == 'all'
Requires-Dist: google-generativeai>=0.8.0; extra == 'all'
Requires-Dist: openpyxl>=3.1.0; extra == 'all'
Requires-Dist: pyarrow>=14.0.0; extra == 'all'
Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
Requires-Dist: xlrd>=2.0.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40.0; extra == 'anthropic'
Provides-Extra: excel
Requires-Dist: openpyxl>=3.1.0; extra == 'excel'
Requires-Dist: xlrd>=2.0.0; extra == 'excel'
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.8.0; extra == 'gemini'
Provides-Extra: parquet
Requires-Dist: pyarrow>=14.0.0; extra == 'parquet'
Provides-Extra: semantic
Requires-Dist: sentence-transformers>=2.2.0; extra == 'semantic'
Description-Content-Type: text/markdown

Quickstart - https://github.com/jgravelle/jdatamunch-mcp/blob/main/QUICKSTART.md

<!-- mcp-name: io.github.jgravelle/jdatamunch-mcp -->

## FREE FOR PERSONAL USE
**Use it to make money, and Uncle J. gets a taste. Fair enough?** [details](#commercial-licenses)

---

## Documentation

| Doc | What it covers |
|-----|----------------|
| [QUICKSTART.md](QUICKSTART.md) | Zero-to-indexed in three steps |
| [USER-MANUAL.md](USER-MANUAL.md) | Full guide for analysts, ops, and non-developers |

---

## Cut spreadsheet token usage by **99.997%**

Most AI agents explore tabular data the expensive way:

dump the whole file into the prompt → skim a million irrelevant rows → repeat.

That is not "a little inefficient."
That is a **token incinerator**.

A 255 MB CSV file with 1 million rows costs **111 million tokens** if you paste it raw.
A single `describe_dataset` call answers the same orientation question in **3,849 tokens**.

That is a **25,333× reduction** — measured, not estimated, on a real 1M-row public dataset.

**jDataMunch indexes the file once and lets agents retrieve only the exact data they need**: column profiles, filtered rows, server-side aggregations, cross-dataset joins, and semantic search — with SQL precision.

> **Benchmark:** LAPD crime records — 1,004,894 rows, 28 columns, 255 MB
> Baseline (raw file): 111,028,360 tokens &nbsp;|&nbsp; jDataMunch: ~3,849 tokens &nbsp;|&nbsp; **25,333× reduction**
> [Methodology & harness](benchmarks/METHODOLOGY.md) · [Full results](benchmarks/results.md)

| Task | Traditional approach | With jDataMunch |
|------|----------------------|-----------------|
| Understand a dataset | Paste entire CSV | `describe_dataset` → column names, types, cardinality, samples |
| Find relevant columns | Read every row | `search_data` → column-level results with IDs |
| Answer a filtered question | Load millions of rows | `get_rows` with structured filters → only matching rows |
| Compute a group-by | Return all data | `aggregate` → server-side SQL, one result set |
| Compare two datasets | Load both entirely | `join_datasets` → SQL JOIN across indexed stores |
| Find column relationships | Export to spreadsheet | `get_correlations` → pairwise Pearson correlations |

Index once. Query cheaply. Keep moving.
**Precision retrieval beats brute-force context.**

---

# jDataMunch MCP

### Structured tabular data retrieval for AI agents

![License](https://img.shields.io/badge/license-dual--use-blue)
![MCP](https://img.shields.io/badge/MCP-compatible-purple)
![Local-first](https://img.shields.io/badge/local--first-yes-brightgreen)
![SQLite](https://img.shields.io/badge/storage-SQLite-9cf)
![jMRI](https://img.shields.io/badge/jMRI-Full-blueviolet)
[![PyPI version](https://img.shields.io/pypi/v/jdatamunch-mcp)](https://pypi.org/project/jdatamunch-mcp/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/jdatamunch-mcp)](https://pypi.org/project/jdatamunch-mcp/)

> ## Commercial licenses
>
> jDataMunch-MCP is **free for non-commercial use**.
>
> **Commercial use requires a paid license.**
>
> **jDataMunch-only licenses**
>
> * [Builder — $39](https://j.gravelle.us/jCodeMunch/descriptions.php#builder) — 1 developer
> * [Studio — $149](https://j.gravelle.us/jCodeMunch/descriptions.php#studio) — up to 5 developers
> * [Platform — $799](https://j.gravelle.us/jCodeMunch/descriptions.php#platform) — org-wide internal deployment
>
> **Want the full jMunch suite?**
>
> * [Munch Trio Builder Bundle — $99](https://j.gravelle.us/jCodeMunch/descriptions.php#builder)
> * [Munch Trio Studio Bundle — $449](https://j.gravelle.us/jCodeMunch/descriptions.php#studio)
> * [Munch Trio Platform Bundle — $2,499](https://j.gravelle.us/jCodeMunch/descriptions.php#platform)

**Stop paying your model to read the whole damn spreadsheet.**

jDataMunch turns tabular data exploration into **structured retrieval**.

Instead of forcing an agent to load an entire CSV, scan millions of rows, and burn through context just to find the right column name, jDataMunch lets it navigate by **what the data is** and retrieve **only what matters**.

That means:

* **25,333× lower data-reading token usage** on a 1M-row CSV (measured)
* **less irrelevant context** polluting the prompt
* **faster dataset orientation** — one call tells you everything about the schema
* **accurate filtered queries** — the agent asks for Hollywood assaults, it gets Hollywood assaults
* **server-side aggregations** — GROUP BY runs in SQLite, not inside the context window
* **cross-dataset joins** — combine two indexed files in a single SQL query
* **semantic search** — find columns by meaning, not just keyword match
* **natural-language summaries** — auto-generated descriptions of every column and dataset

It indexes your files once using a streaming parser and SQLite, stores column profiles and row data with proper type affinity, and retrieves exactly what the agent asked for instead of re-loading the entire file on every question.

---

## Supported file formats

| Format | Extensions | Install extra |
|--------|-----------|---------------|
| CSV / TSV | `.csv`, `.tsv` | — (built-in) |
| Excel | `.xlsx`, `.xls` | `pip install "jdatamunch-mcp[excel]"` |
| Parquet | `.parquet` | `pip install "jdatamunch-mcp[parquet]"` |
| JSONL / NDJSON | `.jsonl`, `.ndjson` | — (built-in) |

---

## Why agents need this

Most agents still handle spreadsheets like someone who prints the entire internet before reading one article:

* paste the whole CSV to answer a narrow question
* re-load the same file repeatedly across tool calls
* consume column headers, empty cells, malformed rows, and irrelevant records
* burn context window on data that was never part of the question

jDataMunch fixes that by giving them a structured way to:

* describe a dataset's schema before touching any row data
* search for the specific column that holds the answer — by keyword or meaning
* retrieve only the rows that match the filter
* run aggregations server-side and get back a single result set
* join two datasets without loading either into the prompt
* orient themselves with samples before committing to a full query
* detect data-quality issues and column correlations automatically

Agents do not need bigger context windows.

They need **better aim**.

---

## What you get

### Column-level retrieval

Understand a dataset's full schema — types, cardinality, null rates, value distributions, samples, and natural-language summaries — in a single sub-10ms call. No rows loaded.

### Filtered row retrieval

Structured filters with 10 operators (`eq`, `neq`, `gt`, `gte`, `lt`, `lte`, `contains`, `in`, `is_null`, `between`). All parameterized SQL — no injection surface. Hard cap of 500 rows per call to protect context budgets.
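
For example, pulling Hollywood assault rows from the benchmark dataset (same call shape as the quickstart section below):

```
get_rows(
  dataset="lapd-crime",
  filters=[
    {"column": "AREA NAME", "op": "eq", "value": "Hollywood"},
    {"column": "Crm Cd Desc", "op": "contains", "value": "ASSAULT"}
  ],
  limit=50
)
```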

### Server-side aggregations

GROUP BY with `count`, `sum`, `avg`, `min`, `max`, `count_distinct`, `median`. The computation stays in SQLite. One compact result set comes back instead of the raw rows the model would otherwise have to aggregate itself.
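
For example, crime counts by area. Per the savings table later in this README, this returns 21 groups instead of a million rows (any additional metric arguments follow the tool's schema):

```
aggregate(dataset="lapd-crime", group_by=["AREA NAME"])
```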

### Smart column search

`search_data` searches column names, value indexes, and AI summaries simultaneously. Ask for "weapon type" and get `Weapon Used Cd` back. Ask for "Hollywood" and get the column whose values contain it.

**Semantic search** (v0.8+): Enable `semantic=true` for embedding-based search. Queries like "where did the crime happen" match `AREA NAME` even without keyword overlap. Supports local embeddings (sentence-transformers), Gemini, or OpenAI as providers.
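
For example (call shape mirrors the quickstart examples):

```
search_data(dataset="lapd-crime", query="where did the crime happen", semantic=true)
```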

### Cross-dataset joins

`join_datasets` combines two indexed datasets via SQL `ATTACH DATABASE` — inner, left, right, or cross joins. Column projection, per-side filters, ordering, and pagination. No data leaves SQLite.
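
A sketch of the call (parameter names here are illustrative; the tool's schema defines the exact shape):

```
join_datasets(
  left="lapd-crime", right="area-demographics",
  on={"left": "AREA", "right": "area_code"},
  join_type="left",
  limit=100
)
```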

### Correlation discovery

`get_correlations` computes pairwise Pearson correlations between all numeric columns. Discover hidden relationships without manual exploration.

### Natural-language summaries

Every indexed dataset gets auto-generated summaries describing data shape, column types, ranges, cardinality, quality issues, and temporal spans — no external API calls needed.

### Data quality triage

`get_data_hotspots` ranks columns by composite risk: null rate, cardinality anomalies, and numeric outlier spread. `get_schema_drift` compares schema between two dataset versions and classifies changes as identical, additive, or breaking.
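
Example calls (dataset names are illustrative; `get_schema_drift(a, b)` matches the scenario table below):

```
get_data_hotspots(dataset="lapd-crime")
get_schema_drift(a="lapd-crime-2023", b="lapd-crime-2024")
```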

### Token savings telemetry

Every call reports `tokens_saved` and `cost_avoided` estimates. `get_session_stats` shows your cumulative savings across the session, with per-model cost breakdowns. Lifetime stats persist across sessions.

### GitHub repository indexing

`index_repo` discovers and indexes data files directly from a GitHub repository — CSV, Excel, Parquet, and JSONL. Incremental by HEAD SHA. Supports private repos via `GITHUB_TOKEN`.
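
Example (the parameter name is illustrative):

```
index_repo(repo="jgravelle/jdatamunch-mcp")
```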

### Local-first speed

Indexes are stored at `~/.data-index/` by default. No cloud. No API keys required for core functionality.

### Built-in guardrails

* **Token budget enforcement** — every response is capped at a configurable token limit (default 8,000)
* **Anti-loop detection** — warns when an agent is paginating row-by-row in a tight loop
* **Wide-table pagination** — `describe_dataset` auto-paginates at 60 columns
* **Hard caps** on all parameters to prevent runaway queries

---

## How it works

jDataMunch parses local CSV, Excel, Parquet, and JSONL files using a **streaming, single-pass pipeline**:

```
CSV/Excel/Parquet/JSONL file
  → Streaming parser (never loads full file into memory)
  → Column profiler (type inference, cardinality, min/max/mean/median, value indexes)
  → Natural-language summary generator (dataset + per-column descriptions)
  → SQLite writer (10,000-row batches, WAL mode, indexes on low-cardinality columns)
  → index.json (column profiles, stats, summaries, file hash for incremental detection)
```
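
For intuition, here is a minimal Python sketch of the single-pass idea (simplified, not the actual implementation): stream rows, update per-column stats as they pass, and batch-insert into SQLite. It assumes a well-formed CSV where every row matches the header width; the table name and value-index cap are illustrative.

```python
# Minimal sketch of the single-pass pipeline (illustrative, not the real implementation).
import csv
import sqlite3
from collections import Counter

def index_csv(path: str, db_path: str, batch_size: int = 10_000) -> None:
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA journal_mode=WAL")          # WAL mode, per the pipeline above
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)                      # streams; never loads the whole file
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        con.execute(f"CREATE TABLE IF NOT EXISTS rows ({cols})")
        values = [Counter() for _ in header]        # per-column value index / cardinality
        nulls = [0] * len(header)
        batch = []
        for row in reader:
            for i, v in enumerate(row):
                if v == "":
                    nulls[i] += 1
                elif len(values[i]) < 1_000:        # cap the value index (cap is illustrative)
                    values[i][v] += 1
            batch.append(row)
            if len(batch) >= batch_size:            # 10,000-row batches, per the pipeline
                con.executemany(f"INSERT INTO rows VALUES ({marks})", batch)
                batch.clear()
        if batch:
            con.executemany(f"INSERT INTO rows VALUES ({marks})", batch)
    con.commit()
    con.close()
```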

When an agent queries:

```
describe_dataset  →  reads index.json in memory (< 10ms)
get_rows          →  parameterized SQL on data.sqlite (< 100ms on indexed columns)
aggregate         →  GROUP BY SQL on data.sqlite (< 200ms for simple group-by)
search_data       →  scans column profiles in memory (< 50ms)
join_datasets     →  ATTACH DATABASE + cross-store SQL (< 300ms)
```

**No raw file is ever re-read after the initial index.** The SQLite database serves all row-level queries.

For a 255 MB, 1,004,894-row CSV (measured on real data):
* Index time: ~43 seconds (one-time)
* `describe_dataset`: 35 ms, **3,849 tokens** vs 111,028,360 tokens raw — **25,333×**
* `describe_column` (single column deep-dive): 22–33 ms, ~600 tokens
* `get_rows` (indexed filter): < 100 ms
* Peak indexing memory: < 500 MB

---

## Start fast

### 1. Install it

```bash
pip install jdatamunch-mcp
```

For additional format support:

```bash
pip install "jdatamunch-mcp[excel]"       # Excel (.xlsx, .xls)
pip install "jdatamunch-mcp[parquet]"     # Parquet
pip install "jdatamunch-mcp[semantic]"    # Semantic search (local embeddings)
pip install "jdatamunch-mcp[all]"         # Everything
```

### 2. Add it to your MCP client

#### Claude Code (one command)

```bash
claude mcp add jdatamunch uvx jdatamunch-mcp
```

Restart Claude Code. Confirm with `/mcp`.

#### Claude Desktop

Add to your config file (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):

```json
{
  "mcpServers": {
    "jdatamunch": {
      "command": "uvx",
      "args": ["jdatamunch-mcp"]
    }
  }
}
```

#### OpenClaw

**Option A — CLI:**

```bash
openclaw mcp set jdatamunch '{"command":"uvx","args":["jdatamunch-mcp"]}'
```

**Option B — Edit `~/.openclaw/openclaw.json`:**

```json
{
  "mcpServers": {
    "jdatamunch": {
      "command": "uvx",
      "args": ["jdatamunch-mcp"],
      "transport": "stdio"
    }
  }
}
```

Restart the gateway: `openclaw gateway restart`. Verify: `openclaw mcp list`.

#### Other clients (Cursor, Windsurf, Roo, etc.)

Any MCP-compatible client accepts the same JSON block in its MCP config file.

### 3. Index a file and start querying

```
index_local(path="/path/to/data.csv", name="my-dataset")
describe_dataset(dataset="my-dataset")
get_rows(dataset="my-dataset", filters=[{"column": "City", "op": "eq", "value": "Los Angeles"}], limit=10)
```

### 4. Tell your agent to actually use it

Installing jDataMunch makes the tools available. It does **not** guarantee the agent will stop pasting entire CSVs into prompts unless you tell it to use structured retrieval first.

#### Claude Code / Claude Desktop

Add this to your `CLAUDE.md` (global or project-level):

```markdown
## Data Exploration Policy
Use jdatamunch-mcp for tabular data whenever available.
Always call describe_dataset first to understand the schema.
Use get_rows with filters rather than loading raw files.
Use aggregate for any group-by or summary questions.
```

#### OpenClaw

Add the same policy to your agent's system prompt file (e.g. `~/.openclaw/agents/analyst.md`), then reference it in `~/.openclaw/openclaw.json`:

```json
{
  "agents": {
    "named": {
      "analyst": {
        "systemPromptFile": "~/.openclaw/agents/analyst.md"
      }
    }
  }
}
```

### Check your token savings

Ask your agent: *"How many tokens has jDataMunch saved me?"*

The agent will call `get_session_stats`, which returns session and lifetime token savings with per-model cost breakdowns. Lifetime stats persist to `~/.data-index/session_stats.json` across sessions.

---

## Tools

### Indexing

| Tool | What it does |
|------|-------------|
| `index_local` | Index a local CSV, Excel, Parquet, or JSONL file. Profiles columns, generates NL summaries, loads rows into SQLite. Incremental by default (skips if file unchanged). |
| `index_repo` | Index data files from a GitHub repository. Discovers CSV, Excel, Parquet, and JSONL files via the Trees API and indexes each. Incremental by HEAD SHA. Max 50 MB/file, 20 files/repo. |

### Exploration

| Tool | What it does |
|------|-------------|
| `list_datasets` | List all indexed datasets with row counts, column counts, and file sizes. |
| `list_repos` | List GitHub repositories indexed via `index_repo`. Shows repo name, HEAD SHA, dataset count, total rows. |
| `describe_dataset` | Full schema profile: every column's name, type, cardinality, null%, sample values, and NL summary. Primary orientation tool. Auto-paginates at 60 columns. |
| `describe_column` | Deep profile of one column: full value distribution, histogram bins, temporal range, NL summary. |
| `search_data` | Search column names and values by keyword or semantically. Returns column IDs — tells the agent where to look, not the data. Supports hybrid keyword + embedding search. |
| `sample_rows` | Head, tail, or random sample. Good for a first look at an unfamiliar dataset. |

### Querying

| Tool | What it does |
|------|-------------|
| `get_rows` | Filtered row retrieval with 10 operators. Parameterized SQL. 500-row hard cap. Column projection to reduce tokens. |
| `aggregate` | Server-side GROUP BY: count, sum, avg, min, max, count_distinct, median. Pre-filter support. 1,000-group cap. |
| `join_datasets` | SQL JOIN across two indexed datasets. Supports inner, left, right, cross. Per-side filters and column projection. |

### Analysis

| Tool | What it does |
|------|-------------|
| `get_correlations` | Pairwise Pearson correlations between numeric columns. Sorted by strength, with labels and pair counts. |
| `get_schema_drift` | Compare schema between two datasets. Detects added/removed columns, type changes, null-rate shifts. |
| `get_data_hotspots` | Rank columns by data-quality risk: null rate, cardinality anomalies, numeric outlier spread. |

### Management

| Tool | What it does |
|------|-------------|
| `summarize_dataset` | Regenerate NL summaries for an already-indexed dataset without re-parsing the source file. |
| `embed_dataset` | Precompute column embeddings for semantic search. Optional warm-up to eliminate first-query latency. |
| `delete_dataset` | Remove an indexed dataset and its SQLite store. Irreversible. |
| `get_session_stats` | Cumulative token savings and cost avoided across the session. Lifetime stats persist across sessions. |

---

## Filter operators

`get_rows`, `aggregate`, and `join_datasets` accept structured filters:

```json
{"column": "AREA NAME",    "op": "eq",      "value": "Hollywood"}
{"column": "Vict Age",     "op": "between", "value": [25, 35]}
{"column": "Crm Cd Desc",  "op": "contains","value": "ASSAULT"}
{"column": "Weapon Used Cd","op": "is_null","value": true}
{"column": "AREA",         "op": "in",      "value": [1, 2, 7]}
```

| Operator | Meaning |
|----------|---------|
| `eq` | equals |
| `neq` | not equals |
| `gt`, `gte` | greater than (or equal) |
| `lt`, `lte` | less than (or equal) |
| `contains` | case-insensitive substring |
| `in` | value in list |
| `is_null` | null / not null check |
| `between` | inclusive range `[min, max]` |

Multiple filters are ANDed. No raw SQL accepted — injection surface is zero.
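
Because filters arrive as structured objects, they compile directly to parameterized SQL. A simplified Python sketch of that idea (illustrative; the server's actual operator handling may differ):

```python
# Sketch of structured-filter-to-parameterized-SQL compilation (illustrative).
# Values never touch the SQL string; they travel as bind parameters.
SIMPLE_OPS = {"eq": "= ?", "neq": "!= ?", "gt": "> ?",
              "gte": ">= ?", "lt": "< ?", "lte": "<= ?"}

def compile_filters(filters: list[dict]) -> tuple[str, list]:
    clauses, params = [], []
    for f in filters:
        col, op, val = f'"{f["column"]}"', f["op"], f.get("value")
        if op in SIMPLE_OPS:
            clauses.append(f"{col} {SIMPLE_OPS[op]}")
            params.append(val)
        elif op == "contains":                 # SQLite LIKE is case-insensitive for ASCII
            clauses.append(f"{col} LIKE ?")
            params.append(f"%{val}%")
        elif op == "in":
            clauses.append(f"{col} IN ({', '.join('?' for _ in val)})")
            params.extend(val)
        elif op == "between":                  # inclusive [min, max]
            clauses.append(f"{col} BETWEEN ? AND ?")
            params.extend(val)
        elif op == "is_null":
            clauses.append(f"{col} IS {'NULL' if val else 'NOT NULL'}")
        else:
            raise ValueError(f"unknown operator: {op}")
    return " AND ".join(clauses), params       # multiple filters are ANDed
```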

---

## Configuration

| Variable | Default | Purpose |
|----------|---------|---------|
| `DATA_INDEX_PATH` | `~/.data-index/` | Index storage location |
| `JDATAMUNCH_MAX_ROWS` | `5,000,000` | Row cap for indexing |
| `JDATAMUNCH_MAX_RESPONSE_TOKENS` | `8,000` | Token budget cap per response |
| `JDATAMUNCH_SHARE_SAVINGS` | `1` | Set `0` to disable anonymous token savings telemetry |
| `ANTHROPIC_API_KEY` | — | AI column summaries via Claude |
| `GOOGLE_API_KEY` | — | AI column summaries via Gemini |
| `GITHUB_TOKEN` | — | Private repo access for `index_repo` |
| `JDATAMUNCH_EMBED_MODEL` | — | Local sentence-transformers model for semantic search |
| `GOOGLE_EMBED_MODEL` | — | Gemini embedding model for semantic search |
| `OPENAI_API_KEY` | — | OpenAI embeddings for semantic search |
| `OPENAI_EMBED_MODEL` | — | OpenAI embedding model for semantic search |

---

## When does it help?

| Scenario | Without jDataMunch | With jDataMunch | Measured savings |
|----------|--------------------|-----------------|---------|
| Orient on a 255 MB CSV | Paste raw file → **111M tokens** | `describe_dataset` → **3,849 tokens** | **25,333×** |
| Schema + column deep-dive | Same 111M tokens | `describe_dataset` + `describe_column` → **~4,400 tokens** | **~25,000×** |
| Find the crime-type column | Scan headers manually | `search_data("crime type")` → column ID | structural |
| Find column by meaning | No way to search semantically | `search_data("where did it happen", semantic=true)` → `AREA NAME` | structural |
| Get Hollywood assault rows | Load all 1M rows | `get_rows` with 2 filters → matching rows only | ~99%+ |
| Crime count by area | Return all rows, aggregate in LLM | `aggregate(group_by=["AREA NAME"])` → 21 rows | ~99.9% |
| Understand weapon nulls | Load column, count manually | `describe_column("Weapon Used Cd")` → `null_pct: 64.2%` | ~99.9% |
| Compare two dataset versions | Load both files | `get_schema_drift(a, b)` → breaking/additive assessment | structural |
| Find correlated columns | Export, pivot, eyeball | `get_correlations` → ranked pairs with strength labels | structural |
| Combine two datasets | Load both into prompt | `join_datasets` → SQL JOIN, only matching rows | ~99%+ |
| Re-query an unchanged file | Re-load file every time | Hash check → instant skip if unchanged | 100% of re-read cost |

The case where it doesn't help: you genuinely need every row for ML training or full exports. For that, read the file directly. For everything else — exploration, filtering, aggregation, orientation — structured retrieval wins every time.

---

## ID scheme

Every column and row gets a stable ID:

```
{dataset}::{column_name}#column     →  "lapd-crime::AREA NAME#column"
{dataset}::row_{rowid}#row          →  "lapd-crime::row_4421#row"
{dataset}::{pk_col}={value}#row     →  "lapd-crime::DR_NO=211507896#row"
```

Pass column IDs directly to `describe_column`. Row IDs are returned in `get_rows` results.
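
For example (parameter names are illustrative; pass the ID exactly as returned):

```
describe_column(dataset="lapd-crime", column="lapd-crime::AREA NAME#column")
```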

---

## Part of the jMunch family

| Product | Domain | Unit of retrieval | PyPI |
|---------|--------|-------------------|------|
| [jcodemunch-mcp](https://github.com/jgravelle/jcodemunch-mcp) | Source code | Symbols (functions, classes) | `jcodemunch-mcp` |
| [jdocmunch-mcp](https://github.com/jgravelle/jdocmunch-mcp) | Documentation | Sections (headings) | `jdocmunch-mcp` |
| **jdatamunch-mcp** | **Tabular data** | **Columns, row slices, aggregations** | **`jdatamunch-mcp`** |

All three implement [jMRI](https://github.com/jgravelle/mcp-retrieval-spec) — the open retrieval interface spec. Same response envelope, same token tracking, same telemetry pattern.

---

## Best for

* analysts, finance, ops, and consultants working with large spreadsheets
* AI agents that answer questions about CSV, Excel, Parquet, or JSONL data
* anyone paying token costs to load files they query repeatedly
* teams that want structured, auditable data access instead of raw file dumps
* developers building data-aware agents who need a drop-in retrieval layer

---

## New here?

Start with the [QuickStart guide](QUICKSTART.md) — zero to indexed in three steps.

Or if you prefer learning by doing: index a file, run `describe_dataset`, and look at what comes back.

That single call — 35 milliseconds, 3,849 tokens — tells you everything that would have cost you 111 million tokens to read raw.

That's the whole idea...

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/image?repos=jgravelle/jdatamunch-mcp&type=date&theme=dark&legend=top-left" />
  <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/image?repos=jgravelle/jdatamunch-mcp&type=date&legend=top-left" />
  <img alt="Star History Chart" src="https://api.star-history.com/image?repos=jgravelle/jdatamunch-mcp&type=date&legend=top-left" />
</picture>
