Metadata-Version: 2.2
Name: embedded-finder
Version: 0.2.2
Summary: Semantic file search powered by Google Gemini embeddings
Author: Vlad Marian
License: MIT
Project-URL: Homepage, https://github.com/vladmarian20005/EmbeddedFinder
Project-URL: Repository, https://github.com/vladmarian20005/EmbeddedFinder
Project-URL: Issues, https://github.com/vladmarian20005/EmbeddedFinder/issues
Keywords: semantic-search,embeddings,gemini,vector-search,file-search
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: google-genai>=1.0.0
Requires-Dist: chromadb>=0.4.0
Requires-Dist: click>=8.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rich>=13.0
Requires-Dist: watchdog>=3.0
Requires-Dist: pypdf>=4.0
Requires-Dist: python-docx>=1.0
Requires-Dist: Pillow>=10.0
Requires-Dist: mutagen>=1.47
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"

<p align="center">
  <h1 align="center">EmbeddedFinder</h1>
  <p align="center">
    <strong>Semantic file search for your local filesystem.</strong><br>
    Ask questions in plain English — find what you need across code, documents, images, audio, and video.
  </p>
  <p align="center">
    <a href="https://pypi.org/project/embedded-finder/"><img alt="PyPI" src="https://img.shields.io/pypi/v/embedded-finder"></a>
    <a href="https://pypi.org/project/embedded-finder/"><img alt="Python" src="https://img.shields.io/pypi/pyversions/embedded-finder"></a>
    <a href="https://github.com/vladmarian20005/EmbeddedFinder/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/license-MIT-blue"></a>
  </p>
</p>

Powered by [Google Gemini Embedding 2](https://ai.google.dev/gemini-api/docs/embeddings) and [ChromaDB](https://www.trychroma.com/).

```
❯ efind

╭─ ◆ EmbeddedFinder  v0.2.2 ─────────────────────────────────────╮
│   Semantic file search powered by Gemini Embedding 2             │
│   ● 142 files  (387 chunks)  │  .embeddedfinder/db              │
╰──────────────────────────────────────────────────────────────────╯

  Type a query to search, or /help for commands.

❯ functions that validate user authentication tokens

  5 results  (0.3s)  │  "functions that validate user authentication tokens"

   1  95%   PY   auth.py  4K
        src/auth/auth.py
        ▸ def validate_token(token: str) -> bool: ...

   2  87%   PY   middleware.py  2K
        src/middleware/middleware.py
        ▸ class AuthMiddleware: def process_request(self, req)...
```

## Why EmbeddedFinder?

Traditional file search (`grep`, `find`, `ag`) matches literal text or patterns in file names and contents. EmbeddedFinder matches on *meaning*: search for "error handling in payments" and it can surface files about exception catching in billing code, even if those exact words never appear.

It works on everything: source code, config files, PDFs, Word documents, images, audio, and video — all in one index.

## Features

- **Natural language search** — describe what you're looking for, not keywords
- **Multimodal indexing** — code, text, PDFs, DOCX, images, audio, and video files
- **Interactive TUI** — rich terminal UI with slash commands, progress bars, and color-coded results
- **First-run setup wizard** — guided onboarding with API key validation
- **Incremental indexing** — content-hashed, only re-processes changed files
- **Batch embedding** — groups chunks into minimal API calls for fast indexing
- **File watching** — auto-reindex when files change on disk
- **One-shot CLI** — scriptable commands for CI/automation
- **Smart ranking** — filename matching, file type relevance, and content-aware scoring

## Quick start

### Install

```bash
pip install embedded-finder
```

Or from source:

```bash
git clone https://github.com/vladmarian20005/EmbeddedFinder.git
cd EmbeddedFinder
pip install .
```

### Run

```bash
efind
```

On first launch, a setup wizard walks you through:

1. Enter your [Google AI API key](https://aistudio.google.com/apikey) (free tier available)
2. The key is validated and saved to `~/.config/embeddedfinder/config.json` with owner-only permissions
3. Optionally index a directory right away

That's it — start searching.

### Already have a key?

```bash
# Option A: environment variable
export GOOGLE_API_KEY=your-key-here

# Option B: .env file in your project root (>> appends, preserving existing entries)
echo "GOOGLE_API_KEY=your-key-here" >> .env

# Option C: set it interactively
efind
# then type: /key set
```

## Usage

### Interactive mode (default)

```bash
efind
```

Type natural language queries at the `❯` prompt:

```
❯ database migration scripts
❯ files that handle image resizing
❯ error handling in the payment module
❯ screenshots of the dashboard
❯ audio files with speech
```

Results show similarity scores, file types, paths, and content snippets — color-coded by relevance.

### Slash commands

| Command             | Description                                |
| ------------------- | ------------------------------------------ |
| `/index <path>`     | Index a directory                          |
| `/reindex <path>`   | Re-index only changed files                |
| `/status`           | Show index statistics                      |
| `/clear`            | Clear the entire index                     |
| `/watch <path>`     | Watch a directory and auto-reindex         |
| `/key`              | Show current API key info                  |
| `/key set`          | Set or change your API key                 |
| `/key delete`       | Remove saved API key                       |
| `/key show`         | Reveal the full API key                    |
| `/help`             | Show available commands                    |
| `/quit` or `Ctrl+C` | Exit                                       |

### CLI commands

For scripting and one-off use:

```bash
# Index a directory
efind index ./src

# Index specific file types only
efind index ./src -e .py -e .ts

# Search
efind search "authentication middleware"

# Search with options
efind search "config parsing" --top 5 --min-score 0.7

# Plain text output (no colors, good for piping)
efind search "database models" --plain

# Re-index changed files only
efind reindex ./src

# Watch for changes
efind watch ./src

# Show index stats
efind status

# Clear the index
efind clear

# Check version
efind --version
```
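
The one-shot commands can also be driven from a Python script. The following is a minimal sketch, assuming `efind` is on your `PATH` and that `--plain`, `--top`, and `--min-score` can be combined in one call. The helper names `build_search_command` and `run_search` are hypothetical and not part of the package, and because the `--plain` output format is not specified here, the lines are returned unparsed.

```python
import subprocess


def build_search_command(query, top=None, min_score=None):
    """Build the argv for `efind search` using only the flags shown above."""
    cmd = ["efind", "search", query, "--plain"]
    if top is not None:
        cmd += ["--top", str(top)]
    if min_score is not None:
        cmd += ["--min-score", str(min_score)]
    return cmd


def run_search(query, top=None, min_score=None):
    """Run the search and return its raw output lines (not parsed)."""
    result = subprocess.run(
        build_search_command(query, top, min_score),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()
```

Calling `run_search("config parsing", top=5, min_score=0.7)` runs the same command as the "Search with options" example, with `--plain` added. Passing the query as a list element rather than through a shell string avoids quoting problems with queries that contain spaces or quotes.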

## Supported file types

| Category | Extensions |
| -------- | ---------- |
| **Code** | `.py` `.js` `.ts` `.jsx` `.tsx` `.java` `.c` `.cpp` `.h` `.hpp` `.go` `.rs` `.rb` `.php` `.swift` `.kt` `.scala` `.sh` `.bash` `.zsh` `.lua` `.pl` `.ex` `.exs` `.r` `.m` `.sql` |
| **Markup** | `.html` `.css` `.scss` `.less` `.xml` `.svg` |
| **Config** | `.json` `.yaml` `.yml` `.toml` `.ini` `.cfg` `.conf` |
| **Text** | `.txt` `.md` `.rst` `.csv` |
| **Documents** | `.pdf` `.docx` |
| **Images** | `.png` `.jpg` `.jpeg` `.gif` `.webp` `.bmp` |
| **Audio** | `.mp3` `.wav` `.ogg` `.flac` `.m4a` |
| **Video** | `.mp4` `.mov` `.avi` `.mkv` `.webm` |

Images, audio, and video are embedded natively using Gemini's multimodal capabilities — no transcription or OCR needed.

PDFs with 6 or fewer pages are embedded natively; larger PDFs use text extraction for efficiency.
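
The routing rule described above can be sketched as a small pure function. This is an illustration only, not the package's actual code: the extension lists and the 6-page threshold come from this README, while the function name `choose_strategy` and the labels `"native"` and `"text"` are hypothetical.

```python
from pathlib import Path

# Extensions listed above as images, audio, and video.
NATIVE_MEDIA = {
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp",
    ".mp3", ".wav", ".ogg", ".flac", ".m4a",
    ".mp4", ".mov", ".avi", ".mkv", ".webm",
}
PDF_NATIVE_MAX_PAGES = 6  # threshold stated in this README


def choose_strategy(path, pdf_pages=None):
    """Return "native" or "text" (hypothetical labels) for a file path."""
    suffix = Path(path).suffix.lower()
    if suffix in NATIVE_MEDIA:
        return "native"
    if suffix == ".pdf":
        if pdf_pages is None:
            raise ValueError("pdf_pages is required for PDF files")
        return "native" if pdf_pages <= PDF_NATIVE_MAX_PAGES else "text"
    # Code, markup, config, text, and .docx files go through text extraction.
    return "text"
```

With `pypdf` (a declared dependency), the page count could be obtained as `len(pypdf.PdfReader(path).pages)` before calling the function.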

## How it works

```
 Directory          EmbeddedFinder                    ChromaDB
 ─────────     ─────────────────────────     ─────────────────────

  files/ ──→  1. Crawl  (skip .git, etc.)
           ──→  2. Extract  (text / bytes)
           ──→  3. Chunk   (~2000 tokens)
           ──→  4. Hash    (SHA-256 dedup)
           ──→  5. Embed   (Gemini API)   ──→  Store vectors

  query  ──→  6. Embed query              ──→  Nearest-neighbor
           ──→  7. Deduplicate by file             search
           ──→  8. Re-rank & boost        ──→  Results
```
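
The query path (steps 6 and 7) can be illustrated with toy vectors. This is a sketch under stated assumptions: in the real tool the nearest-neighbor search is delegated to ChromaDB, and cosine similarity is assumed here because the distance metric is not stated. The names `cosine_similarity` and `search` are hypothetical, and the boosts from step 8 are omitted because their weights are not documented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, chunks, top=5):
    """Rank files by their best-matching chunk.

    chunks is a list of (file_path, chunk_vector) pairs. A file with
    several chunks appears once, with the score of its closest chunk.
    """
    best = {}
    for path, vec in chunks:
        score = cosine_similarity(query_vec, vec)
        if score > best.get(path, float("-inf")):
            best[path] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Keeping only the best chunk per file is one simple way to deduplicate; it prevents a long file with many moderately similar chunks from crowding out other files in the result list.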

- **Content hashing** — files are fingerprinted with SHA-256; re-indexing skips anything unchanged
- **Batch embedding** — text chunks are grouped into batches (up to 100 per API call) for throughput
- **Rate limiting** — built-in token bucket limiter respects Gemini API quotas
- **Parallel processing** — multi-threaded extraction and embedding with up to 4 workers
- **Smart ranking** — results are boosted by filename match, file type relevance to query, content overlap, and path depth
- **Directory filtering** — hidden directories (starting with `.`) and common non-content directories (`node_modules`, `__pycache__`, `.venv`, `dist`, `build`, etc.) are automatically skipped during crawling
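
Three of the mechanisms in the list above (content hashing, batching, and rate limiting) can be sketched with the standard library alone. This is an illustrative sketch, not the package's implementation: the names `fingerprint`, `changed_files`, `batched`, and `TokenBucket` are hypothetical, and the rate and capacity values passed to the limiter are placeholders rather than real Gemini quotas.

```python
import hashlib
import time


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest used to detect content changes."""
    return hashlib.sha256(data).hexdigest()


def changed_files(files, known_hashes):
    """Return the paths whose content differs from the last indexed digest.

    files maps path -> bytes; known_hashes maps path -> digest from a prior run.
    """
    return [
        path for path, data in files.items()
        if known_hashes.get(path) != fingerprint(data)
    ]


def batched(items, size=100):
    """Yield consecutive slices of at most `size` items (100 per the list above)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self, cost=1):
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In an indexing loop, each batch from `batched(chunks)` would correspond to one API call, guarded by a `try_acquire()` check on a shared limiter.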

## Configuration

| Variable | Default | Description |
| -------- | ------- | ----------- |
| `GOOGLE_API_KEY` | — | Google AI API key (required) |
| `EMBEDDEDFINDER_DB_DIR` | `.embeddedfinder/db` | Path to the ChromaDB database |

The API key can also be stored via the setup wizard or `/key set`, which saves it to `~/.config/embeddedfinder/config.json` with owner-only permissions.
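
As an illustration of the two settings above, here is a minimal sketch of resolving the database path and writing a config file with owner-only permissions. The function names and the `"api_key"` JSON field are assumptions; the actual layout of `config.json` is not documented here.

```python
import json
import os
from pathlib import Path

DEFAULT_DB_DIR = ".embeddedfinder/db"  # default from the table above


def resolve_db_dir(env=None):
    """Return the database path, honoring EMBEDDEDFINDER_DB_DIR if set."""
    env = os.environ if env is None else env
    return Path(env.get("EMBEDDEDFINDER_DB_DIR", DEFAULT_DB_DIR))


def save_api_key(key, config_path):
    """Write the key as JSON, readable and writable by the owner only (POSIX)."""
    config_path = Path(config_path)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0o600 so it is never briefly world-readable.
    fd = os.open(config_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"api_key": key}, handle)  # field name is hypothetical
    os.chmod(config_path, 0o600)  # also tighten a file that already existed
```

Creating the file with restrictive permissions up front, rather than writing it and calling `chmod` afterwards, closes the short window in which other local users could read the key.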

## Project structure

```
embedded_finder/
├── cli.py            # Click CLI — subcommands + TUI launcher
├── tui.py            # Interactive Rich-based REPL
├── config.py         # Settings, supported extensions, env vars
├── config_store.py   # Persistent config file management
├── crawler.py        # Recursive file discovery
├── extractor.py      # Text extraction, chunking, MIME detection
├── embedder.py       # Gemini Embedding API client + batching
├── store.py          # ChromaDB vector store
├── indexer.py        # Orchestrates crawl → extract → embed → store
├── search.py         # Query embedding + nearest-neighbor search
├── ranker.py         # Result ranking, dedup, and formatting
├── rate_limiter.py   # Token bucket rate limiter
└── watcher.py        # Filesystem watcher (watchdog)
```

## Development

```bash
# Clone and install with dev dependencies
git clone https://github.com/vladmarian20005/EmbeddedFinder.git
cd EmbeddedFinder
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=embedded_finder
```

## Contributing

Contributions are welcome! Please open an issue or submit a pull request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/my-feature`)
3. Commit your changes (`git commit -m 'Add my feature'`)
4. Push to the branch (`git push origin feature/my-feature`)
5. Open a pull request

## License

MIT
