Metadata-Version: 2.4
Name: drift-spark
Version: 0.3.0
Summary: Spark-native embedding lifecycle — produce, CDC refresh, migrate, audit.
Project-URL: Homepage, https://github.com/aayush4vedi/drift-spark
Project-URL: Repository, https://github.com/aayush4vedi/drift-spark
Author-email: Aayush Chaturvedi <4vedi.aayush@gmail.com>
License: MIT
Keywords: cdc,delta-lake,embeddings,qdrant,rag,spark,vector-database
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: typer>=0.12.0
Provides-Extra: all
Requires-Dist: delta-spark>=3.0.0; extra == 'all'
Requires-Dist: pandas>=2.0; extra == 'all'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'all'
Requires-Dist: pyspark>=3.4.0; extra == 'all'
Requires-Dist: qdrant-client>=1.9.0; extra == 'all'
Provides-Extra: delta
Requires-Dist: delta-spark>=3.0.0; extra == 'delta'
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: pgvector
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'pgvector'
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.9.0; extra == 'qdrant'
Provides-Extra: spark
Requires-Dist: pandas>=2.0; extra == 'spark'
Requires-Dist: pyspark>=3.4.0; extra == 'spark'
Description-Content-Type: text/markdown

# drift-spark

> **Spark-native embedding lifecycle** — produce, CDC refresh, model-migrate, audit.

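`pip install drift-spark` · `import drift` · MIT

The base install pulls in only the CLI dependency. Spark and the sinks are optional extras: `spark`, `delta`, `qdrant`, `pgvector`, or `all` (for example, `pip install "drift-spark[spark,qdrant]"`).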

---

**Status: pre-alpha (v0.3.0).**

Drift is a Python library that condenses a typical PySpark embedding pipeline (around 300 lines) into three declarative commands:

```bash
drift embed --table my_catalog.docs --text-col body --sink qdrant://localhost:6333/docs
drift watch --table my_catalog.docs --text-col body --sink qdrant://localhost:6333/docs
drift status --sink qdrant://localhost:6333/docs
```
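Each command above takes its destination as a sink URI such as `qdrant://localhost:6333/docs`. As a rough illustration of how such a URI breaks down into backend, host, port, and collection, here is a minimal sketch using only the standard library. The `Sink` type and `parse_sink` function are hypothetical names chosen for this example, not part of the drift API, and reading the first path segment as the collection name is an assumption.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class Sink(NamedTuple):
    """Hypothetical container for the parts of a sink URI (not drift's API)."""
    backend: str      # URI scheme, e.g. "qdrant"
    host: str
    port: int
    collection: str   # assumed to be the URI path without slashes


def parse_sink(uri: str) -> Sink:
    """Split a sink URI such as qdrant://localhost:6333/docs into its parts."""
    parts = urlparse(uri)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a sink URI: {uri!r}")
    if parts.port is None:
        raise ValueError(f"sink URI needs an explicit port: {uri!r}")
    collection = parts.path.strip("/")
    if not collection:
        raise ValueError(f"sink URI needs a collection path: {uri!r}")
    return Sink(parts.scheme, parts.hostname, parts.port, collection)


sink = parse_sink("qdrant://localhost:6333/docs")
```

Here `sink.backend` is `"qdrant"`, `sink.port` is `6333`, and `sink.collection` is `"docs"`. Failing early on a malformed URI keeps a typo from surfacing later as a connection error in the middle of a Spark job.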

**What it does:**
- `embed()`: Spark-native embedding with deduplication, batching, multi-model support, and Qdrant and pgvector sinks
- `watch()`: incremental CDC refresh via Delta Change Data Feed, so only changed rows are re-embedded
- `migrate()`: dual-write model migration with lineage tracking (v1.0); adapter projection (v2.0)
- Lineage ledger: per-embedding cost, source tracing, GDPR-delete proof (SQLite, queryable)
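The deduplication and batching mentioned for `embed()` can be illustrated in plain Python. This is a sketch of the general technique, not drift's implementation: every name here (`content_key`, `batched`, `embed_deduplicated`, `fake_embed_batch`) is hypothetical, the whitespace normalization is an assumption, and the stand-in embedder just returns the text length so the example runs without a model or a Spark session.

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Sequence

Vector = List[float]


def content_key(text: str) -> str:
    """Stable key for a text: SHA-256 of its whitespace-normalized content."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_deduplicated(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 2,
) -> List[Vector]:
    """Embed each distinct text once, in batches, and return one vector per input."""
    # Keep the first text seen for each content key, in first-seen order.
    unique: Dict[str, str] = {}
    for text in texts:
        unique.setdefault(content_key(text), text)

    vectors: Dict[str, Vector] = {}
    for key_batch in batched(list(unique), batch_size):
        outputs = embed_batch([unique[key] for key in key_batch])
        vectors.update(zip(key_batch, outputs))

    # Fan the vectors back out so duplicates share the same embedding.
    return [vectors[content_key(text)] for text in texts]


calls: List[int] = []  # batch sizes seen by the stand-in embedder


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    """Stand-in for a real model: one-dimensional 'embedding' = text length."""
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


result = embed_deduplicated(
    ["hello world", "hello   world", "spark"], fake_embed_batch, batch_size=2
)
```

The three inputs collapse to two distinct texts, so the stand-in embedder is called once with a batch of two. Keying on a content hash rather than a row ID means identical text in different rows is embedded only once.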

- **GitHub:** https://github.com/aayush4vedi/drift-spark
- **PyPI:** https://pypi.org/project/drift-spark/