Metadata-Version: 2.4
Name: similarity_tool
Version: 0.1.2
Summary: A modular, hybrid, and customizable document similarity framework.
Author-email: Stephen Meisenbacher <sjmeis@gtgd.com>
Maintainer-email: Stephen Meisenbacher <sjmeis@gtgd.com>
License: MIT License
        
        Copyright (c) 2026 Stephen Meisenbacher
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: faiss-cpu
Requires-Dist: torch
Requires-Dist: sentence-transformers
Requires-Dist: pyyaml
Requires-Dist: rank_bm25
Requires-Dist: scikit-learn
Requires-Dist: tqdm
Dynamic: license-file

# SimilarityTool

`SimilarityTool` is a high-performance, asynchronous Information Retrieval (IR) and re-ranking pipeline designed for accurate matching across large-scale, long-text professional corpora (e.g., curricula, job descriptions, CVs, and project portfolios). It follows an SSS approach, combining **semantic**, **syntactic**, and **structured** features to match documents on their core meaning, regardless of domain.

The framework implements a highly optimized **Waterfall Architecture**:
1. **Abstractive Ingestion Pass**: A local small language model processes long text chunks concurrently to strip fluff and isolate core meaning.
2. **Semantic Encoding**: Blends multilingual, structural, and domain-focused transformers into a highly descriptive, high-dimensional embedding.
3. **Syntactic Encoding**: Complements the semantic encoding with n-gram and keyword features for a more syntactic view of each document.
4. **Structured Encoding**: Incorporates domain- and use-case-specific structured features, adding a structural perspective to document matching.
5. **Stage-1 Recall**: Lightning-fast retrieval of candidates using a vectorized FAISS index.
6. **Stage-2 Re-ranking**: Evaluates retrieved candidates via multi-channel linear fusion containing point-to-point token syntactic analysis, attribute-level Tversky set overlaps, and deep token-interaction cross-encoding.
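
The recall-then-re-rank flow in steps 5 and 6 can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `cosine`, `two_stage_search`, and the toy `corpus` are hypothetical names, brute-force cosine similarity stands in for the FAISS index, and a lookup table stands in for the fused Stage-2 score.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(
    query_vec: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    rerank_fn: Callable[[str], float],
    limit: int,
    top_k: int,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap vector recall keeps only the `limit` nearest documents.
    recalled = sorted(
        corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True
    )[:limit]
    # Stage 2: the expensive scorer runs only on the recalled candidates.
    reranked = sorted(
        ((doc_id, rerank_fn(doc_id)) for doc_id in recalled),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:top_k]


# Toy data: "c" has the best Stage-2 score but is far from the query vector.
corpus = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
rerank_scores = {"a": 0.2, "b": 0.9, "c": 1.0}
hits = two_stage_search([1.0, 0.0], corpus, rerank_scores.__getitem__, limit=2, top_k=1)
print(hits)
```

With `limit=2`, document `c` never reaches Stage 2, so `b` wins. This is the trade-off behind the recall boundary: a smaller candidate set is faster but can exclude documents the re-ranker would have preferred.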

## Configuration Setup

The framework is configured through two YAML files. Set your parameters in the configuration files inside your project directory:

### 1. Main Pipeline Configuration (`configs/main_config.yaml`)
```yaml
semantic_engine:
  models:
    - name: "sentence-transformers/all-mpnet-base-v2"
      weight: 1.0
      device: "cuda"
    - name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
      weight: 0.6
      device: "cuda"
    - name: "shawhin/distilroberta-ai-job-embeddings"
      weight: 1.5
      device: "cuda"

storage:
  db_path: "data/corpus.db"
  index_path: "data/corpus.index"
  vector_dimension: 1920  # Must equal the concatenated model vector size (768 + 384 + 768)

orchestrator:
  strategy: "concatenate"
  weights:
    semantic: 0.5   # Cross-Encoder definitive strength
    syntactic: 0.2  # Min-Max Normalized pool token matching 
    structured: 0.3 # Tversky criteria matching
```
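
To see why `vector_dimension` is 1920 under the `concatenate` strategy, the following sketch joins three per-model vectors of sizes 768, 384, and 768 into one. It is an illustration under assumptions: `l2_normalize` and `weighted_concat` are hypothetical helpers, and normalizing each block before scaling it by its model weight is one plausible way to apply the weights, not necessarily what the library does.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def weighted_concat(
    vectors: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Normalize each model's vector, scale it by its weight, then concatenate."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per model vector")
    fused: List[float] = []
    for vec, weight in zip(vectors, weights):
        fused.extend(weight * x for x in l2_normalize(vec))
    return fused


# Dimensions and weights taken from the configuration above; values are dummies.
model_dims = [768, 384, 768]
model_weights = [1.0, 0.6, 1.5]
dummy_vectors = [[1.0] * dim for dim in model_dims]

fused = weighted_concat(dummy_vectors, model_weights)
print(len(fused))  # total dimension is the sum of the per-model dimensions
```

If a model is added or removed under `semantic_engine.models`, `vector_dimension` has to change to the new sum, and any existing index built with the old dimension would need to be rebuilt.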

### 2. Domain-specific Schema Rules (`configs/schema_config.yaml`)

```yaml
text_fields:
  - name: "text"
    semantic_weight: 0.7
    syntactic_weight: 0.3
  - name: "title"
    semantic_weight: 1.0
    syntactic_weight: 0.0

structured_collections:
  - name: "tasks"
    alpha: 1.0
    beta: 1.0
    weight: 0.5
  - name: "skills"
    alpha: 0.2
    beta: 2.5   # Heavy penalty for candidates missing requested skills
    weight: 0.3
  - name: "ai"
    alpha: 1.0
    beta: 1.0
    weight: 0.2
```
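
The `alpha` and `beta` values parameterize a Tversky set overlap. The standard Tversky index for a query set Q and candidate set C is |Q ∩ C| / (|Q ∩ C| + α·|C − Q| + β·|Q − C|). The sketch below assumes that `alpha` weights items the candidate has beyond the query and `beta` weights requested items the candidate lacks; this orientation is inferred from the comment on `skills` and is not confirmed by the library. The function name `tversky` is hypothetical.

```python
from typing import Iterable


def tversky(
    query: Iterable[str], candidate: Iterable[str], alpha: float, beta: float
) -> float:
    """Tversky index between a query set and a candidate set."""
    q, c = set(query), set(candidate)
    common = len(q & c)
    extra = len(c - q)    # candidate items the query did not ask for
    missing = len(q - c)  # requested items the candidate lacks
    denom = common + alpha * extra + beta * missing
    return common / denom if denom else 0.0


requested = ["Python", "PyTorch", "CUDA"]

# Candidate covers everything and adds one extra skill: mild alpha penalty.
over = tversky(requested, ["Python", "PyTorch", "CUDA", "Docker"], alpha=0.2, beta=2.5)

# Candidate lacks one requested skill: heavy beta penalty.
under = tversky(requested, ["Python", "PyTorch"], alpha=0.2, beta=2.5)

print(round(over, 4), round(under, 4))
```

With `alpha = beta = 1.0`, as configured for `tasks` and `ai`, the index reduces to the Jaccard similarity, so both kinds of mismatch count equally.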

## Pipeline Usage Guide
### Batch Ingestion
Ingest a large corpus in batches from a pandas DataFrame.

```python
import pandas as pd
from similarity_tool import SimilarityTool
from similarity_tool.utils import DataMapper

# 1. Initialize the tool system layers (boots LLM and embedding models)
tool = SimilarityTool(
    main_config="configs/main_config.yaml", 
    schema_config="configs/schema_config.yaml",
    use_llm_distillation=True
)

# 2. Ingestion example
raw_data = {
    "doc_id": ["id_843125", "id_941012"],
    "title": ["Senior Deep Learning Architect", "Full-Stack Dev"],
    "description": [
        "Massive long 3000-word corporate description containing boilerplate benefits...",
        "Looking for a web application developer specializing in React and Python..."
    ],
    "skills": ["Python,PyTorch,CUDA,Docker", "JavaScript,React,Postgres"],
    "tasks": ["architecture,deployment", "frontend,api"],
    "ai": ["LLMs", ""]  # one value per row; pandas requires equal-length columns
}
df = pd.DataFrame(raw_data)

# 3. Trigger optimized transactional batch ingestion
DataMapper.batch_ingest_dataframe(
    tool=tool,
    df=df,
    text_columns={"description": "full_text", "title": "job_title"},
    collection_columns={"skills": "skills", "tasks": "tasks", "ai": "ai"},
    id_column="doc_id",
    delimiter=",",
    batch_size=16 
)
```
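
The `delimiter` and `batch_size` arguments suggest two preprocessing steps: splitting each delimited cell into a list of items, and grouping rows into fixed-size batches. The sketch below shows one way these steps could work. Both helpers (`split_collection_cell`, `iter_batches`) are hypothetical and written for illustration; how `DataMapper` handles whitespace and empty cells internally is an assumption.

```python
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def split_collection_cell(cell: Optional[str], delimiter: str = ",") -> List[str]:
    """Turn a delimited cell such as "Python,PyTorch" into a clean list of items."""
    if not cell:
        return []  # missing or empty cells become an empty collection
    return [item.strip() for item in cell.split(delimiter) if item.strip()]


def iter_batches(rows: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


skills = split_collection_cell("Python,PyTorch, CUDA ,Docker")
batch_sizes = [len(batch) for batch in iter_batches(list(range(40)), 16)]
print(skills, batch_sizes)
```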

### Query Search (1:N)
Search the corpus for the documents that best match a single query document.

```python
# Build a query document with the same text fields and collections used at ingestion
query = {
    "text_fields": {
        "job_title": "AI Infrastructure Engineer",
        "full_text": "Deploying deep learning models at scale using PyTorch and tuning custom CUDA kernels."
    },
    "collections": {
        "skills": ["Python", "PyTorch", "CUDA"],
        "tasks": ["architecture", "deployment"]
    }
}

# Run the query
# limit: number of candidates retrieved from FAISS in Stage 1 (lower is faster but searches less broadly)
# top_k: number of re-ranked results returned
results = tool.search(query, limit=50, top_k=3)

# Display results
for rank, match in enumerate(results, 1):
    print(f"Rank {rank}: Doc ID = {match['id']} | Total Score = {match['total_score']}")
    print(f"  └─ Sem Cross: {match['breakdown']['semantic_cross']} | Syn: {match['breakdown']['syntactic']} | Str: {match['breakdown']['structured']}\n")
```

### N:N Composite Document Search
Find documents that match the combined profile of multiple query documents simultaneously. 

```python
queries = [
    {
        "text_fields": {"job_title": "AI Architect", "full_text": "Expertise optimizing distributed CUDA clusters."},
        "collections": {"skills": ["CUDA", "C++"], "tasks": ["infrastructure"]}
    },
    {
        "text_fields": {"job_title": "ML DevOps Engineer", "full_text": "Building orchestration templates via Docker and PyTorch."},
        "collections": {"skills": ["PyTorch", "Docker"], "tasks": ["deployment"]}
    }
]

# Find the corpus documents that best fit the combined query documents
fused_results = tool.search_composite(queries, limit=50, top_k=5)

for rank, match in enumerate(fused_results, 1):
    print(f"Composite Rank {rank}: Doc ID = {match['id']} | Unified Score = {match['total_score']}")
```
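
One plausible reading of a "combined profile" is that the query documents are merged into a single query before searching. The sketch below illustrates that reading only: `merge_queries` is a hypothetical helper, and `search_composite` may instead fuse embeddings or per-query rankings. Text fields are joined with a space, and collections are unioned while preserving first-seen order.

```python
from typing import Dict, List


def merge_queries(queries: List[dict]) -> dict:
    """Merge several query documents into one combined profile."""
    text_parts: Dict[str, List[str]] = {}
    collections: Dict[str, List[str]] = {}
    for query in queries:
        for field, text in query.get("text_fields", {}).items():
            text_parts.setdefault(field, []).append(text)
        for name, items in query.get("collections", {}).items():
            bucket = collections.setdefault(name, [])
            for item in items:
                if item not in bucket:  # de-duplicate, keep first-seen order
                    bucket.append(item)
    return {
        "text_fields": {field: " ".join(parts) for field, parts in text_parts.items()},
        "collections": collections,
    }


example_queries = [
    {
        "text_fields": {"job_title": "AI Architect"},
        "collections": {"skills": ["CUDA", "C++"], "tasks": ["infrastructure"]},
    },
    {
        "text_fields": {"job_title": "ML DevOps Engineer"},
        "collections": {"skills": ["PyTorch", "Docker", "CUDA"], "tasks": ["deployment"]},
    },
]

combined = merge_queries(example_queries)
print(combined["collections"]["skills"])
```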

### 1:1 Document Comparison

```python
doc_a = {
    "text_fields": {"job_title": "Data Scientist", "full_text": "Focusing on pandas and scikit-learn models."},
    "collections": {"skills": ["Python", "Scikit-Learn"], "tasks": ["modeling"]}
}

doc_b = {
    "text_fields": {"job_title": "ML Engineer", "full_text": "Building predictive scikit-learn setups in python."},
    "collections": {"skills": ["Python", "Scikit-Learn", "Docker"], "tasks": ["modeling", "devops"]}
}

comparison = tool.compare(doc_a, doc_b)
```

### Hyperparameter Tuning and Hot-Swapping Configuration (Advanced)
Adjust channel weights, Tversky penalties, and other parameters on the fly without re-instantiating the tool.

```python
tool.update_config('orchestrator', 'weights', {'semantic': 0.8, 'syntactic': 0.1, 'structured': 0.1})
run_a = tool.search(query, limit=50, top_k=1)

tool.update_config(
    category='schema', 
    key='structured_collections', 
    value={'alpha': 0.2, 'beta': 3.5, 'weight': 0.9}, 
    target_name='skills'
)

tool.update_config('orchestrator', 'weights', {'semantic': 0.2, 'syntactic': 0.1, 'structured': 0.7})

run_b = tool.search(query, limit=50, top_k=1)
```
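
To illustrate why changing the orchestrator weights can change the top result, the sketch below applies the two weight sets from the example to a pair of made-up candidates. It assumes the final score is a weighted sum of per-channel scores already scaled to [0, 1]. The `fuse` and `best_match` helpers and the candidate scores are hypothetical, not output from the tool.

```python
from typing import Dict


def fuse(channel_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted linear fusion of per-channel scores."""
    return sum(weight * channel_scores[name] for name, weight in weights.items())


def best_match(candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Return the candidate id with the highest fused score."""
    return max(candidates, key=lambda doc_id: fuse(candidates[doc_id], weights))


# Made-up channel scores: doc_a reads similarly, doc_b matches the structured criteria.
candidates = {
    "doc_a": {"semantic": 0.9, "syntactic": 0.5, "structured": 0.2},
    "doc_b": {"semantic": 0.6, "syntactic": 0.5, "structured": 0.9},
}

semantic_heavy = {"semantic": 0.8, "syntactic": 0.1, "structured": 0.1}
structure_heavy = {"semantic": 0.2, "syntactic": 0.1, "structured": 0.7}

winner_a = best_match(candidates, semantic_heavy)
winner_b = best_match(candidates, structure_heavy)
print(winner_a, winner_b)
```

Comparing `run_a` and `run_b` in the same way is a quick check on how sensitive a given query is to the weighting.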
