Metadata-Version: 2.4
Name: skylakegrep
Version: 0.2.1
Summary: Fully-offline semantic search over your local files — powered by Ollama
License: # PolyForm Noncommercial License 1.0.0
        
        <https://polyformproject.org/licenses/noncommercial/1.0.0>
        
        ## Acceptance
        
        In order to get any license under these terms, you must agree
        to them as both strict obligations and conditions to all your
        licenses.
        
        ## Copyright License
        
        The licensor grants you a copyright license for the
        software to do everything you might do with the software
        that would otherwise infringe the licensor's copyright in
        it for any permitted purpose. However, you may only
        distribute the software according to [Distribution
        License](#distribution-license) and make changes or new
        works based on the software according to [Changes and New
        Works License](#changes-and-new-works-license).
        
        ## Distribution License
        
        The licensor grants you an additional copyright license to
        distribute copies of the software. Your license to
        distribute covers distributing the software with changes
        and new works permitted by [Changes and New Works
        License](#changes-and-new-works-license).
        
        ## Notices
        
        You must ensure that anyone who gets a copy of any part of
        the software from you also gets a copy of these terms or
        the URL for them above, as well as copies of any plain-text
        lines beginning with `Required Notice:` that the licensor
        provided with the software. For example:
        
        > Required Notice: Copyright Tianchi Chen
        > (https://github.com/danielchen26/local-mgrep)
        
        ## Changes and New Works License
        
        The licensor grants you an additional copyright license to
        make changes and new works based on the software for any
        permitted purpose.
        
        ## Patent License
        
        The licensor grants you a patent license for the software
        that covers patent claims the licensor can license, or
        becomes able to license, that you would infringe by using
        the software.
        
        ## Noncommercial Purposes
        
        Any noncommercial purpose is a permitted purpose.
        
        ## Personal Uses
        
        Personal use for research, experiment, and testing for
        the benefit of public knowledge, personal study, private
        entertainment, hobby projects, amateur pursuits, or
        religious observance, without any anticipated commercial
        application, is use for a permitted purpose.
        
        ## Noncommercial Organizations
        
        Use by any charitable organization, educational
        institution, public research organization, public safety
        or health organization, environmental protection
        organization, or government institution is use for a
        permitted purpose regardless of the source of funding or
        obligations resulting from the funding.
        
        ## Fair Use
        
        You may have "fair use" rights for the software under the
        law. These terms do not limit them.
        
        ## No Other Rights
        
        These terms do not allow you to sublicense or transfer any
        of your licenses to anyone else, or prevent the licensor
        from granting licenses to anyone else.  These terms do not
        imply any other licenses.
        
        ## Patent Defense
        
        If you make any written claim that the software infringes
        or contributes to infringement of any patent, your patent
        license for the software granted under these terms ends
        immediately. If your company makes such a claim, your
        patent license ends immediately for work on behalf of your
        company.
        
        ## Violations
        
        The first time you are notified in writing that you have
        violated any of these terms, or done anything with the
        software not covered by your licenses, your licenses can
        nonetheless continue if you come into full compliance with
        these terms, and take practical steps to correct past
        violations, within 32 days of receiving notice.  Otherwise,
        all your licenses end immediately.
        
        ## No Liability
        
        ***As far as the law allows, the software comes as is,
        without any warranty or condition, and the licensor will
        not be liable to you for any damages arising out of these
        terms or the use or nature of the software, under any kind
        of legal claim.***
        
        ## Definitions
        
        The **licensor** is the individual or entity offering these
        terms, and the **software** is the software the licensor
        makes available under these terms.
        
        **You** refers to the individual or entity agreeing to these
        terms.
        
        **Your company** is any legal entity, sole proprietorship,
        or other kind of organization that you work for, plus all
        organizations that have control over, are under the control
        of, or are under common control with that organization.
        **Control** means ownership of substantially all the assets
        of an entity, or the power to direct its management and
        policies by vote, contract, or otherwise.  Control can be
        direct or indirect.
        
        **Your licenses** are all the licenses granted to you for
        the software under these terms.
        
        **Use** means anything you do with the software requiring
        one of your licenses.
        
        **Trademark** means a feature of an item that distinguishes
        it from other items in commerce.
        
        ---
        
        ## Required Notice
        
        Copyright © 2024–2026 Tianchi Chen
        <https://github.com/danielchen26/local-mgrep>
        
        ## Commercial use
        
        Commercial use of this software is **not permitted** under this
        license. To obtain a commercial license, please open a GitHub
        issue at the repository above titled "Commercial license inquiry"
        or contact the author at <chentianchi@gmail.com>.
        
        Earlier releases of this software (v0.2.0 through v0.15.1) were
        distributed under the MIT License at the time of release; those
        binaries may legally exist in the wild under MIT terms. The
        project's current source tree, all future binary artifacts, and
        the retroactively rewritten git history are governed by the
        PolyForm Noncommercial License 1.0.0 above.
        
Project-URL: Homepage, https://github.com/danielchen26/skylakegrep
Project-URL: Repository, https://github.com/danielchen26/skylakegrep
Project-URL: Issues, https://github.com/danielchen26/skylakegrep/issues
Project-URL: Documentation, https://danielchen26.github.io/skylakegrep/
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0
Requires-Dist: requests>=2.31
Requires-Dist: numpy>=1.26
Requires-Dist: scikit-learn>=1.4
Requires-Dist: tree-sitter>=0.21
Requires-Dist: tree-sitter-python>=0.23
Requires-Dist: tree-sitter-javascript>=0.23
Requires-Dist: tree-sitter-typescript>=0.23
Requires-Dist: pygments>=2.0
Requires-Dist: pypdf>=3.0
Requires-Dist: python-docx>=1.0
Provides-Extra: rerank
Requires-Dist: sentence-transformers>=3.0; extra == "rerank"
Dynamic: license-file

<p align="center">
  <img alt="skylakegrep — fully-offline semantic search over your local files" src="docs/assets/hero-dark.svg" width="100%">
</p>

<p align="center">
  <a href="https://pypi.org/project/skylakegrep/"><img src="https://img.shields.io/pypi/v/skylakegrep?label=pypi&color=22d3ee&labelColor=0a0d12" alt="PyPI"></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.9%2B-22d3ee?labelColor=0a0d12" alt="Python 3.9+"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-PolyForm--NC--1.0.0-f59e0b?labelColor=0a0d12" alt="PolyForm Noncommercial 1.0.0"></a>
  <a href="https://danielchen26.github.io/skylakegrep/"><img src="https://img.shields.io/badge/docs-published-22d3ee?labelColor=0a0d12" alt="Documentation"></a>
  <a href="https://github.com/danielchen26/skylakegrep/releases/latest"><img src="https://img.shields.io/github/v/release/danielchen26/skylakegrep?label=release&color=22d3ee&labelColor=0a0d12" alt="Latest release"></a>
</p>

<p align="center">
  <a href="#quickstart"><b>Quickstart</b></a>
  &nbsp;·&nbsp;
  <a href="#in-30-seconds"><b>30s demo</b></a>
  &nbsp;·&nbsp;
  <a href="#what-you-can-search"><b>What you can search</b></a>
  &nbsp;·&nbsp;
  <a href="#whats-new-in-02x"><b>What's new in 0.2.x</b></a>
  &nbsp;·&nbsp;
  <a href="#performance"><b>Performance</b></a>
  &nbsp;·&nbsp;
  <a href="#how-it-works"><b>How it works</b></a>
  &nbsp;·&nbsp;
  <a href="https://github.com/danielchen26/skylakegrep/releases"><b>Releases</b></a>
  &nbsp;·&nbsp;
  <a href="https://danielchen26.github.io/skylakegrep/"><b>Docs</b></a>
</p>

---

`skygrep` is a fully-offline semantic search CLI for your local files:
**code, markdown, PDFs, Word docs, plain text, anything you index**.
**Ask in plain English, get the right file and line range.** The
retrieval substrate is content-agnostic (a "knowledge graph as prior,"
in Karpathy's phrase) and works across content types via a pluggable
extractor registry, so this is not a code-only tool. Indexing,
retrieval, and optional answer synthesis all run locally against your
own Ollama server. No remote service, no subscription, no data leaves
your machine.

## What you can search

The retrieval substrate is **content-agnostic** by design — the
embedder (`bge-m3`, multilingual XLM-RoBERTa), the cascade, and the
reference graph all abstract over "A references B" rather than over
any specific programming language or document format. New content
types plug in via a one-line `register_extractor()` call.

| Content type | How it's parsed | Reference graph | Since |
| --- | --- | --- | :-: |
| **Code** — Rust · Python · JS · TS | tree-sitter symbol-aware chunking + line-window fallback | imports / `use` / `require` / dynamic `import()` | 0.1.0 |
| **Markdown** | line-window chunks; `[](link)`, `![]()`, `[[wiki]]` link extraction | relative-path resolution + Obsidian-style wiki links | 0.2.0 |
| **PDF** | `pypdf` text extraction; opt-in OCR for scanned pages | — | 0.1.0 |
| **Word docs (`.docx`)** | `python-docx` paragraph extraction | — | 0.1.0 |
| **Plain text · TOML · YAML · CSV · JSON · …** | line-window chunking via the default text path | — | 0.1.0 |
| **Custom (your content type)** | register an extractor returning `(source, target)` edges | your call | 0.2.0 |

Register a new content type in one line:

```python
from skylakegrep.src.reference_graph import register_extractor

def yaml_anchor_extractor(path):
    """Return list of (source, target) reference edges."""
    ...

register_extractor("yaml", [".yaml", ".yml"], yaml_anchor_extractor)
```

## What's new in 0.2.x

| | What | Why it matters |
| --- | --- | --- |
| **Substrate** | `bge-m3` default embedder (1024-d, multilingual, symmetric, 8k context) replaces `mxbai-embed-large` | Single largest accuracy contributor — React bench 8/10 → 10/10. Existing indexes must rebuild (`--reset`). |
| **Reference graph** | `register_extractor()` registry; `code_graph.py` is a back-compat facade | Content-agnostic abstraction ("A references B"). Markdown shipped in 0.2.0. New types in one line, no retrieval-code changes. |
| **Cascade** | σ-adaptive threshold `max(τ_floor, k·σ_topK)` | Derived from cosine evidence (MacKay/Williams Bayesian framing), not a magic number. Auto-recalibrates when embedder swaps. New `tau_mode` telemetry. |
| **Path filter** | 24 universal aux-path conventions (`/fixtures/`, `/vendor/`, `/dist/`, `.development.js`, `.min.js`, …) | Structural prior, not language-specific. Applies to any corpus with similar conventions. |
| **Bench** | 30 / 30 on Django + React + Tokio public OSS bench | Was 28 / 30 in 0.1.0; latency aggregate −19 % (Tokio +57 % is the real trade-off). |
| **Symbol channel** | `multi_channel_search` with RRF k=60 fusion (internal, opt-in) | Tree-sitter symbol-as-retriever experimental primitive. Not in default CLI yet — auto-router is a 0.3.0 follow-up. |

Full release notes: [`docs/skylakegrep-0.2.0.md`](docs/skylakegrep-0.2.0.md) · [`docs/skylakegrep-0.2.1.md`](docs/skylakegrep-0.2.1.md)
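
The RRF fusion named in the symbol-channel row works by summing reciprocal ranks across the per-channel result lists. A minimal sketch of the idea (the function name and the two-channel setup below are illustrative, not skylakegrep's actual API):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over channels of 1 / (k + rank_d).

    `rankings` is a list of ranked result lists, best first; k=60 is the
    conventional constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative: fuse a cosine channel with a symbol-match channel.
cosine_hits = ["storage.py", "cascade.py", "graph.py"]
symbol_hits = ["storage.py", "cli.py", "cascade.py"]
fused = rrf_fuse([cosine_hits, symbol_hits])
```

With k=60, rank differences deep in either list contribute very little, so the fused order is dominated by agreement near the top of the channels.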

## In 30 seconds

```console
$ pip install skylakegrep
$ ollama pull nomic-embed-text   # one-time; one model per pull
$ ollama pull qwen2.5:1.5b
$ ollama pull qwen2.5:3b
$ cd ~/your-project

$ skygrep "where is the cascade tau threshold defined?"
=== skylakegrep/src/storage.py:578-602 (score: 0.781) ===
CASCADE_DEFAULT_TAU = 0.015

def cascade_search(...
[0.51s · cascade=cheap (gap=0.020 τ=0.015) · index 20s ago · 36 files · L2 symbols on · graph prior on]
```

That is the entire happy path. **First query in a fresh project completes
in under 1 s** via a ripgrep fallback while a background process builds
the semantic index. Every query after that uses the full cascade with a
local LLM kept warm in memory.

## Quickstart

```bash
pip install skylakegrep
ollama pull nomic-embed-text   # one model per pull; ~3 GB total
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b

# One-time: register skylakegrep with detected LLM CLIs
# (Claude Code / Codex / OpenCode / Gemini CLI / Cursor)
skygrep setup

cd /your/project
skygrep "<your question>"
skygrep doctor                # verify runtime + models + index + integrations
skygrep stats                 # show current project's index info
```

`skygrep` derives the project root from `git rev-parse --show-toplevel`
(falling back to the working directory) and keeps a per-project index
under `~/.skylakegrep/repos/`. Subcommand names (`index`, `doctor`,
`stats`, `watch`, `serve`, `setup`, `enrich`) take precedence — anything
else is treated as a query, so `skygrep "stats and metrics"` (quoted) is
unambiguous.
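
The root derivation described above can be sketched in a few lines. This is a hypothetical helper, assuming only the resolution order the text states (git toplevel first, working directory as fallback):

```python
import os
import subprocess

def project_root(cwd=None):
    """Resolve the project root: git toplevel if inside a repo, else the cwd.

    Hypothetical helper mirroring the behaviour described above; not
    skylakegrep's actual function name.
    """
    cwd = cwd or os.getcwd()
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            cwd=cwd, capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a git repo (or git missing): fall back to the working directory.
        return cwd
```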

**`skygrep setup`** writes a small markdown snippet to each detected LLM
CLI's user-level instructions file (e.g. `~/.claude/CLAUDE.md`,
`~/.codex/AGENTS.md`, `~/.gemini/GEMINI.md`) telling the agent to
prefer `skygrep` for natural-language search across local files and fall back to `rg`
otherwise. Snippets are delimited by markers; `skygrep setup --uninstall`
removes them cleanly without touching your other instructions.
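
The marker delimiters are what make install and uninstall idempotent: reinstalling replaces the old block, and uninstalling removes only the block. A minimal sketch with hypothetical marker strings (skylakegrep's actual markers may differ):

```python
# Hypothetical marker strings; skylakegrep's actual markers may differ.
BEGIN = "<!-- skygrep:begin -->"
END = "<!-- skygrep:end -->"

def install_snippet(text, snippet):
    """Append (or replace) the marker-delimited snippet in an instructions file."""
    text = uninstall_snippet(text)  # idempotent: drop any previous copy first
    return text.rstrip("\n") + f"\n\n{BEGIN}\n{snippet}\n{END}\n"

def uninstall_snippet(text):
    """Remove the marker-delimited block, leaving the user's own text untouched."""
    start, end = text.find(BEGIN), text.find(END)
    if start == -1 or end == -1:
        return text
    return text[:start].rstrip("\n") + "\n" + text[end + len(END):].lstrip("\n")
```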

## Performance

Public, reproducible benchmark on three popular open-source codebases.
30 hand-labelled questions total (10 per repo). Anyone can clone the
repos and re-run with one command — see
[`benchmarks/public_oss_bench.py`](benchmarks/public_oss_bench.py)
and [`docs/parity-benchmarks.md`](docs/parity-benchmarks.md).

| Repo | Language | LOC ≈ | Tasks | skygrep recall | rg recall | Token reduction |
| --- | --- | :-: | :-: | :-: | :-: | :-: |
| **Django** | Python | 524 K | 10 | **10 / 10** | 10 / 10 | **703 ×** |
| **Tokio** | Rust | 80 K | 10 | **10 / 10** | 10 / 10 | 61 × |
| **React** | JS+TS | 270 K | 10 | **10 / 10** | 10 / 10 | **773 ×** |
| **Aggregate** | | | **30** | **30 / 30 (100 %)** | 30 / 30 | **60×–770×** |

Honest reading:

  - **Hit-rate parity across all three** (10/10 each on Django,
    Tokio, React). The two React misses that previously surfaced
    (test-fixture path bias on `react-007`, devtools-vs-reconciler
    on `react-010`) were resolved by upgrading the embedding
    substrate to `bge-m3` and extending the non-canonical-path
    filter — see
    [`docs/parity-benchmarks.md`](docs/parity-benchmarks.md) for the
    failure analysis and the resolution.
  - **`rg` "100 %" is a recall-ceiling baseline.** It returns
    20 M+ tokens per task (term-OR scan with 2-line context windows).
    Yes, the answer is somewhere in that dump, but the agent still has
    to read 20 M tokens to find it.
  - **skygrep delivers the answer ranked top-10 in 30 / 30 cases**
    while emitting **60×–770× less context** for the agent's LLM
    round-trip downstream. That is the user-facing claim.

Recall counts a query as a hit when at least one returned chunk
matches the canonical `expected` path or any of the question's
`expected_alternatives`.
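
That scoring rule can be sketched directly; the field names (`path`, `results`, `expected`) are illustrative, not the bench harness's actual schema:

```python
def is_hit(results, expected, expected_alternatives=()):
    """A query counts as a hit when any returned chunk's path matches the
    canonical expected path or one of the listed alternatives."""
    targets = {expected, *expected_alternatives}
    return any(r["path"] in targets for r in results)

def recall(tasks):
    """Fraction of tasks whose returned chunks contain an expected path."""
    hits = sum(
        is_hit(t["results"], t["expected"], t.get("expected_alternatives", ()))
        for t in tasks
    )
    return hits / len(tasks)
```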

### Cascade-only ablation (Django, in-bench numbers)

| Tier | What it does | Cold first query | Warm avg s/q |
| --- | --- | :-: | :-: |
| **cascade** ⭐ default | rg prefilter → file-mean cosine → escalate to HyDE only when uncertain | ~10 s (Ollama loads) | **0.5–2 s** (warm) |
| cascade-cheap | early-exit only, no LLM call | <1 s | <0.2 s |
| cascade + small HyDE (`OLLAMA_HYDE_MODEL=qwen2.5:1.5b`) | uses 1.5 B for HyDE — faster, slightly lower recall | ~5 s | **2.0 s** |
| chunk + rerank | classic chunk cosine + cross-encoder rerank | ~10 s + 30 s reranker load | ~10 s |
| ripgrep raw | `rg -il -F` token-OR (file membership only) | <1 s | <1 s |

The cascade is bimodal by design: ~80 % of queries take the cheap path
(file-mean cosine, no LLM call) and complete in under 200 ms warm; the
remaining ~20 % escalate to a HyDE-augmented retrieval and complete in
the 1–2 s band. With Ollama models kept resident in memory
(`OLLAMA_KEEP_ALIVE=-1`, the 0.6.0 default) the second query in a shell
session no longer pays the 5–10 s Ollama cold-load.
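
The σ-adaptive threshold that gates this escalation can be sketched as follows. The `tau_floor` and `k` defaults here are illustrative, not skylakegrep's tuned values:

```python
import statistics

def adaptive_tau(top_scores, tau_floor=0.015, k=1.0):
    """Sigma-adaptive cascade threshold: tau = max(tau_floor, k * sigma(top-K)).

    `top_scores` are the top-K file-mean cosine scores; the spread of that
    evidence sets the escalation bar, with `tau_floor` as a hard minimum.
    """
    sigma = statistics.pstdev(top_scores) if len(top_scores) > 1 else 0.0
    return max(tau_floor, k * sigma)

def take_cheap_path(top_scores, **kw):
    """Early-exit when the top-1/top-2 gap clears the adaptive threshold."""
    if len(top_scores) < 2:
        return True
    return (top_scores[0] - top_scores[1]) >= adaptive_tau(top_scores, **kw)
```

When the embedder is swapped, the score distribution shifts, but because τ is derived from the observed spread rather than hard-coded, the gate recalibrates on its own.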

A second [self-test benchmark](docs/token-benchmarking.md) compares
`skygrep` against a simulated grep agent over 30 navigation tasks against
this repo: 30 / 30 recall at top-k 10 with **2× total-token reduction**
and **2.9× context-token reduction** vs the agent baseline.

## How it works

```
your query
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ 1.  ripgrep prefilter      Fast surface-token narrowing          │
│ 2.  file-mean cosine       Rank files by mean of chunk vectors   │
│ 3.  cascade decision       Confident? return cheap. Else escalate│
│ 4.  HyDE escalation        LLM rewrites query → cosine union     │
│ 5.  symbol + graph         Tree-sitter symbol boost + PageRank   │
│     (L2 + L4)              tiebreaker on near-tied candidates    │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
top-K chunks (path · line range · score · snippet)
```

Each layer is **offline-paid and query-time-free where possible**:
embeddings are precomputed at index time, symbol extraction runs once
per project, the file-export PageRank is one regex pass over the corpus.
Only the cascade's HyDE-escalation path makes a query-time LLM call, and
it only runs on the ~20 % of queries the cheap path is uncertain about.
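
The PageRank tiebreaker idea can be sketched with a plain power iteration over (source, target) reference edges. This is a minimal sketch of the concept, not the project's implementation:

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over (source, target) reference edges.

    Files that many other files reference accumulate mass, and that mass
    breaks near-ties between otherwise equally-scored candidates.
    """
    nodes = sorted({n for e in edges for n in e})
    out = {n: [t for s, t in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for t in out[n]:
                    nxt[t] += share
            else:  # dangling node: spread its mass uniformly
                for t in nodes:
                    nxt[t] += damping * rank[n] / len(nodes)
        rank = nxt
    return rank

# Illustrative: three files import utils.py, so it wins near-ties.
edges = [("a.py", "utils.py"), ("b.py", "utils.py"),
         ("c.py", "utils.py"), ("utils.py", "a.py")]
ranks = pagerank(edges)
```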

The full architecture diagram and module-by-module walk-through is at
[`docs/skylakegrep-0.1.0.md`](docs/skylakegrep-0.1.0.md) and
[`docs/roadmap.md`](docs/roadmap.md).

## When to use what

| You want | Use |
| --- | --- |
| Find code by concept ("how does X work?") | `skygrep "<query>"` |
| Find code with a known token | `rg <token>` (it's faster, no setup) |
| Synthesize an answer with citations | `skygrep "<query>" --answer` |
| Decompose a broad question | `skygrep "<query>" --agentic --max-subqueries 3 --answer` |
| Machine-readable output for an agent | `skygrep "<query>" --json` |
| Re-rank candidates with a cross-encoder | `skygrep "<query>" --no-cascade --rerank` |
| Continuously index a watched dir | `skygrep watch /path` |
| Keep the cross-encoder warm across queries | `skygrep serve &`, then `skygrep "<q>" --daemon-url http://127.0.0.1:7878` |

## Configuration

| Variable | Default | Effect |
| --- | --- | --- |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama server URL. |
| `OLLAMA_EMBED_MODEL` | `nomic-embed-text` | Embedding model. Switching requires `skygrep index --reset`. |
| `OLLAMA_LLM_MODEL` | `qwen2.5:3b` | Used for `--answer` and `--agentic`. |
| `OLLAMA_HYDE_MODEL` | `qwen2.5:3b` | Used for cascade-escalation HyDE. Falls back to `OLLAMA_LLM_MODEL` if not installed. Set to `qwen2.5:1.5b` for ~30 % speedup at the cost of 1 task on 16-task Rust. |
| `OLLAMA_KEEP_ALIVE` | `-1` | Passed to every Ollama call. `-1` keeps models resident indefinitely (recommended). |
| `SKYGREP_DB_PATH` | per-project | When set, skygrep treats the index as curated and disables auto-mutation. |
| `SKYGREP_AUTO_PULL` | unset | Set `yes` to auto-`ollama pull` missing models without prompting. |
| `SKYGREP_AUTO_REFRESH_THROTTLE_SECONDS` | `30` | Skip the mtime scan if the previous refresh ran more recently. |
| `SKYGREP_RERANK_MODEL` | `mixedbread-ai/mxbai-rerank-large-v2` | Cross-encoder for `--rerank`. |
| `SKYGREP_RERANK_POOL` | `50` | Candidate pool before reranking. |

## Releases

The first public release of `skylakegrep` was 0.1.0. See
[`docs/skylakegrep-0.1.0.md`](docs/skylakegrep-0.1.0.md) for the full
description of its capabilities, architecture, CLI flags, and
environment variables.

  - **0.1.0** — first public release. LLM-driven query routing
    (filename / lexical / semantic cascade tiers, intent-aware
    merge), confidence-gated semantic cascade with HyDE escalation +
    cross-encoder rerank + PageRank tiebreaker, lazy PDF / docx
    content extraction (`pdftotext` + `pypdf` fallback, optional
    `--ocr`), framed Pygments-highlighted card rendering, four
    `--detail` levels, `skygrep setup` auto-registration with major
    LLM CLIs (Claude Code, Codex, OpenCode, Gemini CLI, Cursor).
    PolyForm Noncommercial 1.0.0 license.

## CLI reference

```
skygrep setup    [--list|--uninstall|--yes] # register with Claude Code / Codex / OpenCode / Gemini / Cursor
skygrep "<query>" [OPTIONS]                 # bare-form search
skygrep search   "<query>" [OPTIONS]        # explicit search
skygrep doctor                              # health check
skygrep stats                               # project index info
skygrep index    [PATH] [--reset]           # explicit reindex
skygrep watch    [PATH] --interval N        # poll for changes
skygrep serve    [--host H] [--port P]      # warm-reranker daemon
skygrep enrich   [--max N] [--batch B]      # opt-in doc2query enrichment
```

<details>
<summary><b><code>skygrep search</code> options</b></summary>

<br>

| Option | Default | Effect |
| --- | --- | --- |
| `-m`, `-n`, `--top` | 5 | Number of final results. |
| `--json` | off | Emit a JSON array; suppresses human formatting. |
| `--answer` | off | Synthesize an answer from retrieved snippets via Ollama. |
| `--content / --no-content` | on | Show or hide snippet bodies in human output. |
| `--language` | — | Restrict to one or more language keys; repeatable. |
| `--include` / `--exclude` | — | Glob filter (repeatable). |
| `--cascade / --no-cascade` | on | Confidence-gated retrieval. Off = chunk-only legacy path. |
| `--cascade-tau` | 0.015 | Top-1 / top-2 file-mean cosine gap above which to early-exit. |
| `--rerank / --no-rerank` | on | Cross-encoder rerank on the non-cascade path. |
| `--rerank-pool` | 50 | Candidate pool before reranking. |
| `--rerank-model` | env or default | HuggingFace cross-encoder id. |
| `--hyde / --no-hyde` | off | Force HyDE outside the cascade (rare; cascade decides per query). |
| `--multi-resolution / --no-multi-resolution` | on | File-level cosine top-N → chunk-level inside those files. |
| `--file-top` | 30 | Files surfaced by file-level retrieval. |
| `--lexical-prefilter / --no-lexical-prefilter` | on | Use ripgrep to narrow the candidate file set. |
| `--lexical-root` | cwd / git toplevel | Root directory ripgrep scans. |
| `--lexical-min-candidates` | 2 | Fall back to corpus-wide cosine when ripgrep returns fewer files. |
| `--rank-by` | `chunk` | `chunk` (per-file diversity cap) or `file` (one chunk per file). |
| `--auto-index / --no-auto-index` | on | Auto-build the index on first query and refresh on mtime change. |
| `--daemon-url` | — | Send the search to a running `skygrep serve` daemon. |
| `--agentic` | off | Decompose into subqueries via Ollama before search. |
| `--max-subqueries` | 3 | Upper bound on agentic subqueries. |
| `--semantic-only` | off | Skip lexical reranking; rank by cosine alone. |

</details>

<details>
<summary><b>Capability matrix (every feature, when introduced)</b></summary>

<br>

| Capability | Since |
| --- | --- |
| Semantic file search via local Ollama (code + docs + PDFs) | 0.1.0 |
| `bge-m3` multilingual substrate as default embedder | 0.2.0 |
| Content-agnostic reference-graph registry (`register_extractor()`) | 0.2.0 |
| Built-in markdown link extractor (`[](link)`, `![]()`, `[[wiki]]`) | 0.2.0 |
| σ-adaptive cascade threshold (`max(τ_floor, k·σ_topK)`) | 0.2.0 |
| Universal non-canonical-path filter (24 patterns: fixtures/vendor/dist/.min.js/…) | 0.2.0 |
| Symbol-as-retriever channel (internal, opt-in) | 0.2.0 |
| Public-OSS bench: 30 / 30 (Django · React · Tokio) | 0.2.0 |
| Tree-sitter chunking + line-window fallback | 0.2.0 |
| `.gitignore` / `.skygrepignore` hygiene | 0.2.0 |
| Incremental indexing (mtime-based) | 0.2.0 |
| Stale row cleanup | 0.2.0 |
| Watch mode | 0.2.0 |
| Hybrid lexical + semantic ranking | 0.2.0 |
| Stable JSON output | 0.2.0 |
| Local answer mode | 0.2.0 |
| Local agentic decomposition | 0.2.0 |
| Cross-encoder rerank | 0.3.0 |
| Asymmetric query/document embedding prefixes | 0.3.0 |
| HyDE query rewriting | 0.3.0 |
| Multi-resolution retrieval | 0.3.0 |
| Lexical prefilter (ripgrep first stage) | 0.3.0 |
| File-rank (one chunk per file) | 0.3.0 |
| Daemon mode | 0.3.0 |
| Quantisation / device knobs | 0.3.0 |
| Confidence-gated cascade | 0.3.0 (default in 0.4.0) |
| Bare-form invocation `skygrep "<q>"` | 0.4.0 |
| Per-project auto-index | 0.4.0 |
| `skygrep doctor` health check | 0.4.0 |
| Ripgrep fallback for first query | 0.4.1 |
| Symbol-aware indexing (L2) | 0.5.0 |
| doc2query enrichment (L3, opt-in via `skygrep enrich`) | 0.5.0 |
| File-export PageRank tiebreaker (L4) | 0.5.0 |
| Cascade file-mean cosine corpus-wide | 0.5.1 |
| Smaller default HyDE model + `keep_alive=-1` | 0.6.0 |

</details>

## Development

```bash
git clone https://github.com/danielchen26/skylakegrep.git
cd skylakegrep
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[rerank]"

.venv/bin/pytest -q tests/
.venv/bin/python benchmarks/agent_context_benchmark.py --top-k 10 --summary-only
```

To reproduce the public OSS benchmark numbers, follow the instructions
in [`docs/parity-benchmarks.md`](docs/parity-benchmarks.md).

## License

**PolyForm Noncommercial 1.0.0** — see [`LICENSE`](LICENSE).

Personal, academic, research, hobby, and any other non-commercial
use is fully permitted, including modification and redistribution.
**Commercial use is NOT permitted** under this license. To obtain
a commercial license, open a GitHub issue titled "Commercial
license inquiry" or email <chentianchi@gmail.com>.

## Acknowledgments

- [Ollama](https://ollama.com/) for the local embedding and generation runtime.
- [tree-sitter](https://tree-sitter.github.io/tree-sitter/) for syntax-aware parsing.
- [ripgrep](https://github.com/BurntSushi/ripgrep) for the lexical prefilter stage.
- [Mixedbread](https://www.mixedbread.com/) for the open-source `mxbai-rerank-*-v2` cross-encoder family.
- [nomic-embed-text](https://www.nomic.ai/blog/posts/nomic-embed-text-v1) for the embedding model.
- Click, NumPy, and SQLite for the core runtime dependencies.
