Metadata-Version: 2.4
Name: justembed
Version: 0.1.0a2
Summary: A semantic engine that just works - offline-first semantic search for everyday laptops
Author-email: Krishnamoorthy Sankaran <your.email@example.com>
License: MIT
Project-URL: Homepage, https://github.com/sekarkrishna/justembed
Project-URL: Documentation, https://github.com/sekarkrishna/justembed/tree/main/docs
Project-URL: Repository, https://github.com/sekarkrishna/justembed
Project-URL: Issues, https://github.com/sekarkrishna/justembed/issues
Keywords: semantic-search,embeddings,offline,onnx,nlp,justembed
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: onnxruntime>=1.15.0
Requires-Dist: tokenizers>=0.13.0
Requires-Dist: numpy<2.0.0,>=1.20.0
Requires-Dist: polars>=0.19.0
Requires-Dist: pyarrow>=10.0.0
Requires-Dist: psutil>=5.9.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: hypothesis>=6.82.0; extra == "dev"
Requires-Dist: black>=23.7.0; extra == "dev"
Requires-Dist: ruff>=0.0.285; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Dynamic: license-file

# JustEmbed

**A semantic engine that just works.**

Offline-first semantic search for everyday laptops.

---

## ⚠️ Alpha Release

**This is v0.1.0a2 - Working Implementation!**

Core functionality is complete and ready for testing. The full v0.1.0 release is planned for February 2026 (see the Roadmap below).

---

## What is JustEmbed?

JustEmbed is an offline-first semantic search library designed for everyday laptops. No cloud. No API keys. No telemetry. Just embed your documents and search.

### Philosophy

- **One model only**: e5-small (English, fast and efficient)
- **Offline-first**: Zero network dependencies
- **Just works**: No configuration, no choices, no surprises
- **Hardware-aware**: Automatic limits based on your laptop
- **Privacy-first**: Everything stays on your machine

---

## Quick Start

```python
import justembed as je

# Load documents from a folder
result = je.load("./documents")
print(f"Found {result['files_total']} files")

# Generate embeddings (first time only)
if not result['indexed']:
    stats = je.embed()
    print(f"Embedded {stats['files_embedded']} files in {stats['time_taken']:.2f}s")

# Search semantically
results = je.search("fruits that are red in color")
for r in results:
    print(f"Score: {r['score']:.3f} | {r['file']}")
    print(f"  {r['text'][:100]}...")

# Check status
status = je.status()
print(f"Loaded: {status['loaded']}")
print(f"Chunks: {status['chunks_used']}/{status['chunks_limit']}")

# Clear query cache
je.clear_cache()

# Unload when done
je.unload()
```

### Core Features

- ✅ Single model (e5-small.onnx - English)
- ✅ Offline-first (zero network dependencies)
- ✅ Python 3.8+ support
- ✅ Polars-based storage (Parquet files)
- ✅ Hardware-aware time limits (2-3 s soft, 10 s hard)
- ✅ Query caching for fast repeated searches
- ✅ Simple API (5 functions + 1 utility)
- ✅ Comprehensive error handling

---

## Installation

```bash
pip install justembed
```

**Current version: v0.1.0a2** - Core functionality implemented and ready for testing!

---

## API Reference

### Main Functions

#### `load(path: str) -> dict`
Load documents from a folder or file.

```python
result = je.load("./documents")
# Returns: {"status": "loaded"|"not_indexed", "files_total": int, "indexed": bool}
```

#### `embed() -> dict`
Generate embeddings for loaded documents.

```python
stats = je.embed()
# Returns: {"files_embedded": int, "chunks_created": int, "time_taken": float}
```

#### `search(query: str, top_k: int = 5) -> list`
Search indexed documents semantically.

```python
results = je.search("red fruits", top_k=10)
# Returns: [{"score": float, "file": str, "text": str}, ...]
```
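
The Status section lists cosine similarity as the ranking measure. As a rough illustration of what a `top_k` search over stored vectors involves, here is a minimal pure-Python sketch. The helper names `cosine` and `top_k_search` and the toy 3-dimensional vectors are assumptions made for this example. They are not part of JustEmbed's API, and the library may well implement this differently (for instance with NumPy).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_search(
    query_vec: Sequence[float],
    chunks: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (chunk_id, score) pairs, highest score first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real 384-dimensional embeddings.
toy_chunks = {
    "apples.txt": [1.0, 0.0, 0.0],
    "cars.txt": [0.0, 1.0, 0.0],
    "cherries.txt": [0.9, 0.1, 0.0],
}
hits = top_k_search([1.0, 0.0, 0.0], toy_chunks, top_k=2)
for chunk_id, score in hits:
    print(f"{score:.3f} | {chunk_id}")
```

A brute-force scan like this is linear in the number of chunks, which is usually acceptable for the laptop-sized collections this library targets.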

#### `status() -> dict`
Get current index status.

```python
status = je.status()
# Returns: {"loaded": bool, "path": str, "files_indexed": int, 
#           "chunks_used": int, "chunks_limit": int, "query_cache_size": int}
```
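
Because `status()` reports both `chunks_used` and `chunks_limit`, a caller can estimate how close an index is to its limit. The sketch below works on a plain dict shaped like the documented return value, so it runs without the package installed. The helper `chunk_headroom` and the sample numbers are invented for illustration.

```python
def chunk_headroom(status: dict) -> float:
    """Fraction of the chunk budget still free, given a status()-shaped dict."""
    limit = status["chunks_limit"]
    if limit <= 0:
        return 0.0
    return max(0.0, 1.0 - status["chunks_used"] / limit)


# Sample dict with the keys documented above; the values are invented.
sample_status = {
    "loaded": True,
    "path": "./documents",
    "files_indexed": 12,
    "chunks_used": 250,
    "chunks_limit": 1000,
    "query_cache_size": 3,
}
print(f"{chunk_headroom(sample_status):.0%} of the chunk budget is free")
```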

#### `unload() -> None`
Unload the current index and free its memory.

```python
je.unload()
```

### Utility Functions

#### `clear_cache() -> None`
Clear the query cache to free disk space.

```python
je.clear_cache()
```
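
How the cache works internally is not documented here. To make the idea concrete, the following is a minimal sketch of a disk-backed query cache, assuming results are stored as one JSON file per query, keyed by a hash of the query text and `top_k`. The class `QueryCache`, its methods, and the file layout are all hypothetical and should not be read as JustEmbed's actual internals.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import List, Optional


class QueryCache:
    """Hypothetical disk-backed cache: one JSON file per (query, top_k) pair."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path_for(self, query: str, top_k: int) -> Path:
        # Normalize the query so whitespace and case changes still hit the cache.
        key = f"{query.strip().lower()}|{top_k}".encode("utf-8")
        return self.directory / (hashlib.sha256(key).hexdigest() + ".json")

    def get(self, query: str, top_k: int) -> Optional[List[dict]]:
        path = self._path_for(query, top_k)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def put(self, query: str, top_k: int, results: List[dict]) -> None:
        self._path_for(query, top_k).write_text(json.dumps(results), encoding="utf-8")

    def clear(self) -> int:
        """Delete every cached entry and return how many files were removed."""
        removed = 0
        for path in self.directory.glob("*.json"):
            path.unlink()
            removed += 1
        return removed


cache = QueryCache(Path(tempfile.mkdtemp()) / "query_cache")
cache.put("Red fruits", 5, [{"score": 0.91, "file": "apples.txt", "text": "Apples are red."}])
print(cache.get("  red fruits ", 5))  # hit: same query after normalization
print(cache.clear())                  # number of cache files removed
print(cache.get("red fruits", 5))     # miss after clearing
```

A cache like this must also be invalidated whenever the index changes, otherwise repeated queries would return stale results.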

### Exception Classes

- `JustEmbedError` - Base exception
- `NotLoadedError` - No folder loaded
- `InvalidInputError` - Invalid path or input
- `ChunkLimitError` - Too many chunks for system
- `TimeoutError` - Operation exceeded time limit (shares its name with Python's built-in `TimeoutError`)
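
Assuming the listed exceptions all derive from `JustEmbedError` (the list calls it the base exception), callers can catch specific errors first and fall back to the base class. The sketch below uses local stand-in classes with the same names so that it runs without the package installed. The exact hierarchy inside JustEmbed, the `fake_search` function, and the error messages are assumptions.

```python
class JustEmbedError(Exception):
    """Stand-in for the documented base exception."""


class NotLoadedError(JustEmbedError):
    """Stand-in: no folder has been loaded."""


class InvalidInputError(JustEmbedError):
    """Stand-in: invalid path or input."""


def fake_search(query: str, loaded: bool) -> list:
    """Hypothetical search, used only to trigger the stand-in errors."""
    if not query.strip():
        raise InvalidInputError("query must not be empty")
    if not loaded:
        raise NotLoadedError("call load() before search()")
    return [{"score": 1.0, "file": "demo.txt", "text": query}]


def search_or_explain(query: str, loaded: bool) -> str:
    """Catch specific errors first, then the base class as a catch-all."""
    try:
        results = fake_search(query, loaded)
    except NotLoadedError:
        return "Nothing loaded yet: load a folder first."
    except JustEmbedError as exc:
        return f"Search failed: {exc}"
    return f"{len(results)} result(s)"


print(search_or_explain("red fruits", loaded=False))
print(search_or_explain("", loaded=True))
print(search_or_explain("red fruits", loaded=True))
```

The order of the `except` clauses matters: placing the base class first would swallow the more specific errors.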

---

## Requirements

- Python 3.8+
- ~100MB disk space (model + dependencies)
- 4GB+ RAM recommended

---

## Dependencies

- `onnxruntime` - ONNX inference
- `tokenizers` - Tokenization (standalone, not transformers!)
- `numpy` - Array operations
- `polars` - DataFrame operations
- `pyarrow` - Parquet I/O
- `psutil` - Hardware detection

**No pandas. No transformers. No network dependencies.**

---

## Roadmap

### v0.1.0a1 (December 2025) - Name Reservation
- ✅ Package name locked on PyPI
- ✅ Basic structure
- ✅ Placeholder functions

### v0.1.0a2 (January 2026) - Working Implementation
- ✅ Full implementation complete
- ✅ All core functions working
- ✅ Property-based tests
- ✅ Hardware-aware limits
- ✅ Query caching
- ✅ Comprehensive error handling

### v0.1.0 (February 2026) - First Stable Release
- ⏳ Production testing
- ⏳ Performance optimization
- ⏳ Complete documentation
- ⏳ Example projects

### v0.2.0 (Future)
- ⏳ Multilingual model support (100+ languages)
- ⏳ Advanced search filters
- ⏳ Batch operations API
- ⏳ Progress callbacks

---

## Why "JustEmbed"?

Because that's all you need to do:

1. **Just embed** your documents
2. **Just search** with natural language
3. **Just works** - no configuration needed

---

## Design Decisions

### One Model Only
We use **e5-small.onnx** (384 dimensions, English). It is fast, efficient, and small enough to fit within PyPI's 100 MB limit. Multilingual support is planned for v0.2.0.

### Offline-First
Zero network dependencies. Everything runs locally. No telemetry. No surprises.

### Hardware-Aware
Limits are set automatically based on your laptop's capabilities. Soft time limit: 2-3 s. Hard time limit: 10 s.
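
This README does not spell out which operations these limits govern or how they are enforced. One common way to apply a soft and a hard time budget is sketched below: warn once the soft limit has passed and abort at the hard limit. The function `run_with_budget`, the `BudgetExceeded` exception, and the defaults of 3 s and 10 s (taken from the figures above) are illustrative assumptions, not JustEmbed's implementation.

```python
import time
import warnings
from typing import Callable, Iterable, List


class BudgetExceeded(RuntimeError):
    """Hypothetical error raised when the hard limit is reached."""


def run_with_budget(
    items: Iterable,
    work: Callable,
    soft_s: float = 3.0,
    hard_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
) -> List:
    """Apply `work` to each item, warning after soft_s and aborting after hard_s."""
    start = clock()
    warned = False
    results = []
    for item in items:
        elapsed = clock() - start
        if elapsed >= hard_s:
            raise BudgetExceeded(f"stopped after {elapsed:.1f}s (hard limit {hard_s}s)")
        if elapsed >= soft_s and not warned:
            warnings.warn(f"slow operation: {elapsed:.1f}s elapsed (soft limit {soft_s}s)")
            warned = True
        results.append(work(item))
    return results


# Fast work finishes well inside both limits.
print(run_with_budget(range(3), lambda n: n * n))
```

The budget is checked only between items, so a single long-running call to `work` cannot be interrupted this way. The `clock` parameter exists so the behavior can be tested with a fake timer.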

### Polars, Not Pandas
We use Polars for speed and efficiency. No pandas dependency.

### Tokenizers, Not Transformers
We use the standalone `tokenizers` library (about 3 MB) instead of `transformers` (about 40 MB), which makes this dependency roughly 93% smaller.

---

## Target Users

- Non-ML engineers learning AI for the first time
- Business users in security-sensitive or restricted environments
- Developers who need offline semantic search
- Anyone who wants a safe sandbox to experiment

---

## License

MIT License - see LICENSE file for details.

---

## Author

Krishnamoorthy Sankaran

---

## Links

- **GitHub**: https://github.com/sekarkrishna/justembed
- **PyPI**: https://pypi.org/project/justembed/
- **Issues**: https://github.com/sekarkrishna/justembed/issues

---

## Status

✅ **Core Functionality Complete!** ✅

v0.1.0a2 includes:
- ✅ Document loading and scanning
- ✅ Embedding generation with ONNX
- ✅ Semantic search with cosine similarity
- ✅ Query caching for performance
- ✅ Status monitoring and management
- ✅ Hardware-aware resource limits
- ✅ Comprehensive error handling
- ✅ Property-based testing

Ready for testing and feedback! The full v0.1.0 release is planned for February 2026.

---

**JustEmbed - A semantic engine that just works.**
