Metadata-Version: 2.4
Name: grepctl
Version: 0.3.4
Summary: One-command orchestration for multimodal semantic search in BigQuery
Author-email: Gregory Mulla <gregory.cr.mulla@gmail.com>
Maintainer-email: Gregory Mulla <gregory.cr.mulla@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/gregorymulla/grepctl
Project-URL: Documentation, https://github.com/gregorymulla/grepctl#readme
Project-URL: Repository, https://github.com/gregorymulla/grepctl.git
Project-URL: Issues, https://github.com/gregorymulla/grepctl/issues
Keywords: bigquery,semantic-search,vector-search,multimodal,google-cloud,machine-learning,embeddings,gcs,vertex-ai,document-search,rag,vector-database,retrieval-augmented-generation,llm-orchestration,multimodal-search,document-ai,vision-api,speech-to-text,video-intelligence,semantic-similarity,text-embeddings,hybrid-search,knowledge-base,data-lake
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.2.1
Requires-Dist: google-cloud-aiplatform>=1.113.0
Requires-Dist: google-cloud-bigquery>=3.37.0
Requires-Dist: google-cloud-documentai>=3.6.0
Requires-Dist: google-cloud-speech>=2.33.0
Requires-Dist: google-cloud-videointelligence>=2.16.2
Requires-Dist: google-cloud-vision>=3.10.2
Requires-Dist: google-cloud-storage>=2.10.0
Requires-Dist: pypdf2>=3.0.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: rich>=14.1.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: fastapi>=0.104.0
Requires-Dist: uvicorn[standard]>=0.24.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: python-multipart>=0.0.6
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: diagrams>=0.24.4; extra == "dev"
Provides-Extra: multimedia
Requires-Dist: imageio>=2.37.0; extra == "multimedia"
Requires-Dist: imageio-ffmpeg>=0.6.0; extra == "multimedia"
Requires-Dist: matplotlib>=3.10.6; extra == "multimedia"
Provides-Extra: research
Requires-Dist: arxiv>=2.2.0; extra == "research"
Requires-Dist: datasets>=4.0.0; extra == "research"
Requires-Dist: tensorflow-datasets>=4.9.9; extra == "research"
Requires-Dist: yt-dlp>=2025.9.5; extra == "research"
Requires-Dist: graphviz>=0.20.3; extra == "research"
Provides-Extra: server
Requires-Dist: fastapi>=0.104.0; extra == "server"
Requires-Dist: uvicorn[standard]>=0.24.0; extra == "server"
Requires-Dist: pydantic>=2.5.0; extra == "server"
Requires-Dist: python-multipart>=0.0.6; extra == "server"
Requires-Dist: aiofiles>=23.2.1; extra == "server"
Dynamic: license-file

<div align="center">
  <img src="https://raw.githubusercontent.com/gregorymulla/grepctl/master/images/grepctl_logo.png" alt="grepctl logo" width="200">

  # grepctl - Semantic Search For Your Data Lake

  [![PyPI version](https://badge.fury.io/py/grepctl.svg)](https://pypi.org/project/grepctl/)
  [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
</div>


**grepctl** is a command-line and programmatic utility for semantic search across heterogeneous data lakes. It uses Google Cloud AI services and BigQuery vector search to turn unstructured data into a semantically searchable index. This README covers the data ingestion pipeline, the multimodal processing architecture, and the four interfaces (CLI, web, Python, and SQL) for querying that index.

<br>

<a href="https://youtu.be/KiJm0RMkHG0">
  <img src="images/sample_screen.png" alt="grepctl demo" width="800px">
</a>




## Data Modalities & Processing

grepctl processes **9 different data types** automatically:

| Modality | Processing Method |
|----------|-------------------|
| **Text/Markdown** | Direct content extraction, preserving structure |
| **PDF** | OCR via Google Document AI for text extraction |
| **Office Documents** | Document AI extracts content from .docx, .xlsx, .pptx |
| **Images** | Vision API extracts labels, text, objects, and faces |
| **Audio** | Speech-to-Text API transcribes to searchable text |
| **Video** | Video Intelligence API analyzes frames and transcribes speech |
| **JSON/CSV** | Structured data parsing with field preservation |

Each modality gets its own extraction step, such as OCR for scanned documents and transcription for audio and video. All processed content is then chunked and embedded into a 768-dimensional vector space.


## Getting Started

```bash
grepctl ingest -b <bucket>
```


## Four Search Interfaces

Access your indexed data through multiple interfaces:

1. **CLI** - Command-line search:
   ```bash
   grepctl search "your query"
   ```

2. **Web Interface** - Interactive UI:
   ```bash
   grepctl serve
   ```

3. **Python Interface** - Programmatic access:
   ```python
   from grepctl import search

   # Convenience function documented under "Python API Functions"
   results = search("query", top_k=10)
   ```

4. **SQL Interface** - Direct BigQuery queries:
   ```sql
   WITH query_embedding AS (
     SELECT ml_generate_embedding_result AS embedding
     FROM ML.GENERATE_EMBEDDING(
       MODEL `your-project.grepmm.text_embedding_model`,
       (SELECT 'your search string' AS content)
     )
   )
   -- VECTOR_SEARCH returns matched rows in the `base` struct;
   -- a smaller distance means a closer match
   SELECT base.doc_id, base.text_content, distance
   FROM VECTOR_SEARCH(
     TABLE `your-project.grepmm.search_corpus`,
     'embedding',
     (SELECT embedding FROM query_embedding),
     top_k => 10
   )
   ORDER BY distance;
   ```

All interfaces leverage BigQuery's VECTOR_SEARCH for sub-second semantic search across your entire data lake.
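
The SQL interface can also be driven from Python with the `google-cloud-bigquery` client, which is already a grepctl dependency. The sketch below is an illustration under assumptions, not grepctl's own code: `build_search_sql` and `run_search` are hypothetical helper names, the table and model names follow the SQL example, and the search text is passed as the `@query` parameter rather than being spliced into the statement. Result columns are read through the `base` struct that `VECTOR_SEARCH` returns.

```python
def build_search_sql(project: str, dataset: str, top_k: int = 10) -> str:
    """Return the VECTOR_SEARCH statement with the search text as @query."""
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    # Identifiers cannot be query parameters, so project and dataset are
    # interpolated here; they should come from trusted configuration only.
    return f"""
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `{project}.{dataset}.text_embedding_model`,
    (SELECT @query AS content)
  )
)
SELECT base.doc_id, base.text_content, distance
FROM VECTOR_SEARCH(
  TABLE `{project}.{dataset}.search_corpus`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => {top_k}
)
ORDER BY distance
"""


def run_search(query: str, project: str, dataset: str, top_k: int = 10) -> list:
    """Execute the search; needs Google Cloud credentials and an indexed dataset."""
    from google.cloud import bigquery  # imported lazily so the builder works offline

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("query", "STRING", query)]
    )
    sql = build_search_sql(project, dataset, top_k)
    rows = client.query(sql, job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

A call such as `run_search("quarterly revenue", "your-project", "grepmm")` would return one dictionary per matching chunk, closest match first.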

## Detailed Processing Pipeline

### 1. Text Files (.txt, .log, .md)
- **Direct extraction** from Google Cloud Storage through BigQuery's external data access (external or object tables)
- No intermediate processing needed - text content read directly into BigQuery tables
- Content is chunked into 1000-character segments with 100-character overlap
- Markdown structure preserved with heading hierarchy
- Each chunk maintains context through overlapping windows
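
The fixed-window chunking described above can be sketched as follows. This is a minimal illustration rather than grepctl's actual implementation: the function name `chunk_text` is hypothetical, and only the 1000-character window and 100-character overlap come from the description.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list:
    """Split text into fixed-size windows that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # with the defaults, a new window starts every 900 characters
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    parts = chunk_text(sample)
    print([len(p) for p in parts])             # [1000, 1000, 700]
    print(parts[0][-100:] == parts[1][:100])   # True: neighbours share 100 characters
```

Because each window begins 900 characters after the previous one, a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.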

### 2. PDF Documents (.pdf)
- **Google Document AI** performs OCR on all pages
- Handles both text-based and scanned PDFs
- Extracted text is chunked semantically by paragraphs
- Page numbers and document structure preserved in metadata

### 3. Office Documents (.docx, .xlsx, .pptx)
- **Document AI** extracts text content
- Preserves document structure (headings, tables, slides)
- Excel sheets converted to structured text representation
- PowerPoint slides maintain slide order and notes

### 4. Audio Files (.mp3, .wav, .m4a, .flac)
- **Speech-to-Text API v2** provides accurate transcription
- Automatic punctuation and speaker diarization
- Supports long-form audio (up to 480 minutes)
- Transcripts chunked by natural speech boundaries
- Timestamps preserved for temporal search

### 5. Video Files (.mp4, .avi, .mov, .mkv)
- **Video Intelligence API** analyzes visual content:
  - Shot detection and scene changes
  - Object tracking and label detection
  - OCR on text appearing in frames
  - Face and logo detection
- **Speech-to-Text** transcribes audio track separately
- Frame descriptions combined with transcripts
- Temporal alignment between visual and audio elements

Each processing pipeline outputs structured text that is then embedded using Vertex AI's text-embedding-004 model, creating 768-dimensional vectors optimized for semantic similarity search.
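
To make "semantic similarity search" concrete, the sketch below ranks documents by cosine similarity in plain Python. It is only a brute-force illustration on toy 3-dimensional vectors: in grepctl the ranking is done by BigQuery's `VECTOR_SEARCH` over 768-dimensional embeddings, and both the choice of cosine similarity and the names `cosine_similarity` and `top_k` are assumptions made for this example.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def top_k(query: list, corpus: dict, k: int = 10) -> list:
    """Rank corpus entries by similarity to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    corpus = {
        "doc-a": [1.0, 0.0, 0.0],
        "doc-b": [0.0, 1.0, 0.0],
        "doc-c": [1.0, 1.0, 0.0],
    }
    for doc_id, score in top_k([1.0, 0.0, 0.0], corpus, k=2):
        print(doc_id, round(score, 3))
```

Here `doc-a` ranks first because it points in the same direction as the query, followed by `doc-c`, which shares one of its two components.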

## SQL Interface Functions

### Setup

The search functions are automatically created when you run:

```bash
grepctl setup
```

This creates the following routines in your BigQuery dataset (each is invoked with `CALL`, as shown in the reference below):

1. **`search(query)`** - Simple search with defaults
2. **`semantic_search(query, top_k, min_relevance)`** - Full control search
3. **`search_by_source(query, sources, top_k)`** - Filter by file types
4. **`search_by_date(query, start_date, end_date, top_k)`** - Date range search
5. **`search_content(query, limit)`** - Just return content strings

### Function Reference

#### Core Search Functions

```sql
-- Simple search (defaults: top_k=10, min_relevance=0.0)
CALL `your-project.grepmm.search`("your query");

-- Full semantic search
CALL `your-project.grepmm.semantic_search`(
    "query text",           -- Search query
    20,                     -- Number of results
    0.7                     -- Minimum relevance (0-1)
);

-- Returns:
-- doc_id, uri, source, modality, text_content,
-- relevance_score, created_at, metadata
```

#### Filtered Search Functions

```sql
-- Search by source types
CALL `your-project.grepmm.search_by_source`(
    "query",
    ["pdf", "markdown"],    -- Array of sources
    10                      -- Top K results
);

-- Search by date range
CALL `your-project.grepmm.search_by_date`(
    "query",
    DATE('2024-01-01'),     -- Start date
    CURRENT_DATE(),         -- End date
    15                      -- Top K results
);

-- Get just content
CALL `your-project.grepmm.search_content`(
    "query",
    5                       -- Limit
);
```

The functions handle all the complexity of embeddings and vector search - you just write simple SQL queries!

## Python API Functions

### Installation

```bash
# Using uv (recommended)
uv add grepctl

# Using pip
pip install grepctl

# For development
git clone https://github.com/gregorymulla/grepctl.git
cd grepctl
uv sync
```

### Configuration

`SearchClient` automatically uses your existing grepctl configuration from `~/.grepctl/config.yaml`:

```yaml
project_id: your-project
dataset_name: grepmm
location: us-central1
```

Or you can specify a custom config path:

```python
client = SearchClient(config_path="/path/to/config.yaml")
```

Or override the project ID:

```python
client = SearchClient(project_id="my-project-id")
```
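
The three configuration options above imply a precedence order: an explicit `project_id` wins over the value in the config file, and a custom `config_path` replaces the default location. The sketch below shows one way that resolution could work. It is an assumption about the behaviour, not `SearchClient`'s actual code, and `resolve_settings` and `load_settings` are hypothetical names.

```python
from pathlib import Path
from typing import Optional

DEFAULT_CONFIG_PATH = Path.home() / ".grepctl" / "config.yaml"


def resolve_settings(file_settings: dict, project_id: Optional[str] = None) -> dict:
    """Merge file settings with an optional explicit project ID override."""
    merged = dict(file_settings)  # copy so the caller's dict is left untouched
    if project_id is not None:
        merged["project_id"] = project_id
    return merged


def load_settings(config_path: Optional[str] = None,
                  project_id: Optional[str] = None) -> dict:
    """Read the YAML config (custom path or default) and apply the override."""
    import yaml  # PyYAML, already a grepctl dependency

    path = Path(config_path) if config_path else DEFAULT_CONFIG_PATH
    with path.open() as handle:
        file_settings = yaml.safe_load(handle) or {}
    return resolve_settings(file_settings, project_id)


if __name__ == "__main__":
    from_file = {"project_id": "your-project", "dataset_name": "grepmm",
                 "location": "us-central1"}
    print(resolve_settings(from_file, project_id="my-project-id"))
```

With the override applied, `project_id` becomes `my-project-id` while `dataset_name` and `location` keep the values from the file.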

### API Reference

#### SearchClient Methods

```python
# Full search with all options
results = client.search(
    query="search text",           # Search query
    top_k=10,                      # Number of results
    sources=['pdf', 'text'],       # Filter by source types
    rerank=False,                  # Use LLM reranking
    regex_filter=r"pattern",       # Regex filter
    start_date="2023-01-01",       # Date range start
    end_date="2024-12-31"          # Date range end
)

# Simple search - just returns content strings
contents = client.search_simple("query", limit=5)

# Get system statistics
stats = client.get_stats()
```
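
To show what the `sources`, `regex_filter`, `start_date`, and `end_date` options narrow down, the sketch below applies equivalent filters to a list of result dictionaries on the client side. It is purely illustrative: the real filtering may happen inside BigQuery, `filter_results` is a hypothetical helper, and the row fields (`source`, `text_content`, `created_at`) are borrowed from the columns listed for the SQL functions.

```python
import re
from datetime import date


def filter_results(results, sources=None, regex_filter=None,
                   start_date=None, end_date=None):
    """Keep rows that match every filter that was supplied."""
    pattern = re.compile(regex_filter) if regex_filter else None
    start = date.fromisoformat(start_date) if start_date else None
    end = date.fromisoformat(end_date) if end_date else None
    kept = []
    for row in results:
        if sources and row["source"] not in sources:
            continue  # wrong source type
        if pattern and not pattern.search(row["text_content"]):
            continue  # text does not match the regex
        created = row["created_at"]  # a datetime.date in this sketch
        if start and created < start:
            continue
        if end and created > end:
            continue
        kept.append(row)
    return kept


if __name__ == "__main__":
    rows = [
        {"source": "pdf", "text_content": "invoice total", "created_at": date(2024, 3, 1)},
        {"source": "text", "text_content": "meeting notes", "created_at": date(2023, 6, 1)},
        {"source": "image", "text_content": "invoice scan", "created_at": date(2024, 5, 1)},
    ]
    hits = filter_results(rows, sources=["pdf", "text"], regex_filter=r"invoice",
                          start_date="2023-01-01", end_date="2024-12-31")
    print([row["source"] for row in hits])
```

Only the PDF row survives: the text row fails the regex and the image row is excluded by the source filter.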

#### Convenience Function

```python
from grepctl import search

# Quick search without client
results = search("query", top_k=10, rerank=True)
```

You now have a simple Python API for semantic search across all your data. `SearchClient` handles the BigQuery connection, the embedding model, and vector search, so you can focus on building your application.

## Documentation

- [Python Interface Guide](PYTHON_INTERFACE.md) - Complete examples and API reference for Python integration
- [SQL Interface Guide](SQL_INTERFACE.md) - BigQuery SQL functions and advanced query examples
