Metadata-Version: 2.4
Name: dbless
Version: 0.1.0
Summary: Lightweight, no-database document search engine using quantized numpy vectors with BM25 and definition-aware ranking.
Author-email: Rahul Reddy <Rahulreddy9725@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/rahulreddy9725/Dbless
Project-URL: Repository, https://github.com/rahulreddy9725/Dbless.git
Project-URL: Documentation, https://github.com/rahulreddy9725/Dbless#readme
Project-URL: Bug Tracker, https://github.com/rahulreddy9725/Dbless/issues
Project-URL: Changelog, https://github.com/rahulreddy9725/Dbless/releases
Keywords: search,document-search,pdf,numpy,bm25,no-database,semantic-search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.0
Requires-Dist: pymupdf>=1.23.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Dynamic: license-file

# DBless

> **Lightweight, no-database PDF search engine** — pure Python, no servers, no setup.

[![PyPI version](https://badge.fury.io/py/dbless.svg)](https://pypi.org/project/dbless/)
[![Python](https://img.shields.io/pypi/pyversions/dbless)](https://pypi.org/project/dbless/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/rahulreddy9725/Dbless/blob/main/LICENSE)

---

## 💡 What is DBless?

**DBless** is a lightweight document search engine that works entirely in memory: no database, no server, and no dependencies beyond `numpy` and `pymupdf`. Point it at a PDF and start searching in seconds.

### Why DBless?

- 🚀 **Zero setup** — no database to install or configure
- 🎯 **Definition-aware** — understands "What is X?" style queries
- 🪶 **Lightweight** — only requires `numpy` and `pymupdf`
- 📄 **Multi-domain** — works on legal, medical, corporate, and technical PDFs
- ⚡ **Fast** — sub-100ms search on CPU

---

## 📚 Documentation Contents

- **[Quick Start](#-quick-start)** — Up and running in 2 minutes
- **[Installation](#-installation)** — Install via pip or from source
- **[API Reference](#-api-reference)** — Full API for `DBlessEngine`
- **[CLI Usage](#-cli-usage)** — Command-line interface
- **[How It Works](#-how-it-works)** — Architecture overview
- **[Contributing](#-contributing)** — How to contribute

---

## 🚀 Quick Start

```python
from dbless.engine import DBlessEngine

# 1. Load and index a PDF
engine = DBlessEngine.from_pdf(
    "document.pdf",
    chunk_size=100,   # words per chunk
    overlap=20        # overlap between chunks
)

# 2. Search
results = engine.search("What is machine learning?", k=5)

# 3. Print results
for result in results:
    print(f"Score : {result['score']:.2f}")
    print(f"Snippet: {result['snippet']}")
    print("---")
```

---

## 📦 Installation

```bash
pip install dbless
```

Or install from source:

```bash
git clone https://github.com/rahulreddy9725/Dbless.git
cd Dbless
pip install -e .
```

---

## 🔧 API Reference

### `DBlessEngine.from_pdf(path, chunk_size, overlap, vector_dim, factor_rank)`

Loads a PDF, chunks it, embeds it, and builds the search index.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `path` | `str` or `Path` | required | Path to the PDF file |
| `chunk_size` | `int` | `600` | Number of words per chunk |
| `overlap` | `int` | `150` | Word overlap between consecutive chunks |
| `vector_dim` | `int` | `512` | Hash vector dimensionality |
| `factor_rank` | `int` | `128` | SVD factorization rank |

**Returns:** `DBlessEngine` instance
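
To make the interaction between `chunk_size` and `overlap` concrete, here is a minimal sketch of word-based overlapping chunking. It is an illustration only, not DBless's internal code: the function name `chunk_words` and the validation rules are assumptions.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 600, overlap: int = 150) -> List[str]:
    """Split text into chunks of `chunk_size` words.

    Each chunk repeats the last `overlap` words of the previous chunk.
    Hypothetical helper for illustration; not part of the DBless API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Ten words, windows of four words, one word shared between neighbours.
demo = chunk_words(" ".join(f"w{i}" for i in range(10)), chunk_size=4, overlap=1)
for chunk in demo:
    print(chunk)
```

A larger `overlap` lowers the chance that a sentence is cut in half at a chunk boundary, at the cost of more chunks to index.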

---

### `engine.search(query, k)`

Search the indexed PDF for the most relevant chunks.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | required | Natural language query |
| `k` | `int` | `5` | Number of results to return |

**Returns:** `list[dict]` — each result contains:

| Key | Description |
|---|---|
| `text` | Full chunk text |
| `snippet` | Best-matching sentence(s) from the chunk |
| `score` | Relevance score (0–100) |
| `page` | Page number in the PDF |
| `chunk_id` | Chunk index |

**Example:**
```python
results = engine.search("What are exceptions?", k=3)
for r in results:
    print(r["snippet"])   # best answer sentence
    print(r["score"])     # relevance score 0-100
    print(r["page"])      # page number
```
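
Because each result is a plain `dict`, post-processing needs nothing beyond the standard library. The sketch below filters by score and groups snippets by page. The helper names, the threshold of 50, and the sample data are all illustrative assumptions (the sample is hand-written in the documented result shape, not real DBless output).

```python
from collections import defaultdict
from typing import Dict, List


def filter_results(results: List[Dict], min_score: float = 50.0) -> List[Dict]:
    """Keep results scoring at least `min_score`, highest score first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


def snippets_by_page(results: List[Dict]) -> Dict[int, List[str]]:
    """Group result snippets under the page they came from."""
    pages = defaultdict(list)
    for r in results:
        pages[r["page"]].append(r["snippet"])
    return dict(pages)


# Hand-written sample in the documented result shape (hypothetical values).
sample_results = [
    {"text": "An exception is an error raised at runtime. It can be caught.",
     "snippet": "An exception is an error raised at runtime.",
     "score": 82.5, "page": 12, "chunk_id": 40},
    {"text": "The table of contents lists every chapter.",
     "snippet": "The table of contents lists every chapter.",
     "score": 35.0, "page": 3, "chunk_id": 7},
    {"text": "Use try and except to handle exceptions.",
     "snippet": "Use try and except to handle exceptions.",
     "score": 61.0, "page": 12, "chunk_id": 41},
]

strong = filter_results(sample_results, min_score=50.0)
print(snippets_by_page(strong))
```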

---

## 🖥️ CLI Usage

DBless ships with a command-line tool:

```bash
# Index a PDF and show chunk count
dbless index document.pdf

# Query a PDF and get top results
dbless query document.pdf "What is machine learning?" -k 5
```

---

## ⚙️ How It Works

| Step | Description |
|---|---|
| 1. **Chunk** | PDF is split into overlapping word-based chunks |
| 2. **Embed** | Each chunk is hashed into a sparse numpy vector |
| 3. **IDF Weight** | Inverse document frequency (IDF) weighting applied across all chunks |
| 4. **SVD Compress** | Dimensionality reduced via matrix factorization |
| 5. **Quantize** | Vectors quantized to Int8 for memory efficiency |
| 6. **Search** | Query vector matched via dot product + BM25 boosting |
| 7. **Re-rank** | Definition-style queries get special re-ranking |
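
Steps 2 to 6 of the table can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not the DBless source: the CRC32 hashing, the smoothed IDF formula, the single-scale int8 scheme, and every function name here are choices made for the example. BM25 boosting and definition re-ranking are left out.

```python
import zlib

import numpy as np


def hash_embed(chunks, dim=512):
    """Feature hashing: count each token in a bucket picked by a stable hash."""
    mat = np.zeros((len(chunks), dim), dtype=np.float32)
    for i, chunk in enumerate(chunks):
        for tok in chunk.lower().split():
            mat[i, zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return mat


def idf_weights(mat):
    """Smoothed inverse document frequency per hash bucket (assumed formula)."""
    n_docs = mat.shape[0]
    df = np.count_nonzero(mat, axis=0)  # number of chunks touching each bucket
    return (np.log((1.0 + n_docs) / (1.0 + df)) + 1.0).astype(np.float32)


def svd_compress(mat, rank):
    """Truncated SVD: reduced chunk vectors plus the projection for queries."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    rank = min(rank, len(s))
    return u[:, :rank] * s[:rank], vt[:rank].T  # shapes (n, rank), (dim, rank)


def quantize_int8(mat):
    """Symmetric int8 quantization with one scale for the whole matrix."""
    max_abs = float(np.abs(mat).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return np.round(mat / scale).astype(np.int8), scale


def build_index(chunks, dim=512, rank=128):
    counts = hash_embed(chunks, dim)
    idf = idf_weights(counts)
    reduced, proj = svd_compress(counts * idf, rank)
    q, scale = quantize_int8(reduced)
    return {"q": q, "scale": scale, "idf": idf, "proj": proj, "dim": dim}


def search(index, query, k=5):
    """Project the query the same way, then rank chunks by dot product."""
    qvec = hash_embed([query], index["dim"])[0] * index["idf"]
    qred = qvec @ index["proj"]
    scores = (index["q"].astype(np.float32) * index["scale"]) @ qred
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


chunks = ["cats chase mice", "dogs chase cats", "stocks fell sharply today"]
index = build_index(chunks)
print(search(index, "stocks today", k=2))
```

Storing the reduced vectors as int8 takes a quarter of the memory of float32, and the single `scale` factor is all that is needed to recover approximate values at query time.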

---

## 🧪 Testing

```bash
pytest tests/
```

Test coverage includes:
- ✅ PDF ingestion and chunking
- ✅ Vector quantization accuracy
- ✅ Phrase and keyword search
- ✅ End-to-end engine from PDF to results
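
As an illustration of what a quantization accuracy check can look like, the pytest-style sketch below verifies that an int8 round trip stays within half a quantization step. The `quantize_int8` and `dequantize` helpers are stand-ins written for this example and are assumptions, not the functions the DBless test suite exercises.

```python
import numpy as np


def quantize_int8(vectors):
    """Symmetric int8 quantization with one scale (illustrative stand-in)."""
    max_abs = float(np.abs(vectors).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return np.round(vectors / scale).astype(np.int8), scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale


def test_quantization_round_trip_error():
    rng = np.random.default_rng(0)  # fixed seed keeps the test deterministic
    vectors = rng.standard_normal((50, 128)).astype(np.float32)

    q, scale = quantize_int8(vectors)
    restored = dequantize(q, scale)

    assert q.dtype == np.int8
    # Rounding to the nearest code is off by at most half a step;
    # the small tolerance absorbs float32 rounding.
    assert np.max(np.abs(restored - vectors)) <= scale / 2 + 1e-5
```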

---

## 📊 Performance

| Metric | Value |
|---|---|
| Memory per 100-page PDF | ~10–50 MB |
| Search speed | < 100ms on CPU |
| Top-3 accuracy (definitions) | 85–95% |
| Python support | 3.8 – 3.12 |
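
These figures depend on hardware and document size, so it is worth measuring on your own PDFs. A minimal timing sketch using only the standard library follows; `median_latency_ms` is a hypothetical helper, not part of DBless.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], repeats: int = 20) -> float:
    """Run `fn` several times and return the median wall-clock time in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload so the sketch runs on its own.
print(f"{median_latency_ms(lambda: sum(range(100_000))):.3f} ms")

# With DBless, assuming an `engine` built as in Quick Start:
# median_latency_ms(lambda: engine.search("What is machine learning?", k=5))
```

The median is used rather than the mean so that a single slow first call (cold caches, lazy imports) does not skew the result.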

---

## 🤝 Contributing

Contributions are welcome!

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/my-feature`
3. Commit your changes: `git commit -m "Add my feature"`
4. Push and open a Pull Request

Please open an [issue](https://github.com/rahulreddy9725/Dbless/issues) first for major changes.

---

## 📄 License

MIT License — see [LICENSE](LICENSE) for details.

---

## 🔗 Quick Links

- [PyPI Package](https://pypi.org/project/dbless/)
- [GitHub Repository](https://github.com/rahulreddy9725/Dbless)
- [Bug Tracker](https://github.com/rahulreddy9725/Dbless/issues)
- [Changelog](https://github.com/rahulreddy9725/Dbless/releases)

---

**Built with ❤️ by [Rahul Reddy](https://github.com/rahulreddy9725)**
