Metadata-Version: 2.4
Name: semshift
Version: 0.2.0
Summary: Local-first semantic review assistant that flags likely risky meaning changes in edited text.
Project-URL: Homepage, https://semshift-landing.vercel.app/
Project-URL: Repository, https://github.com/VeerajSai/SemShift
Project-URL: Issues, https://github.com/VeerajSai/SemShift/issues
Project-URL: Documentation, https://github.com/VeerajSai/SemShift/tree/main/docs
Project-URL: Changelog, https://github.com/VeerajSai/SemShift/blob/main/CHANGELOG.md
Project-URL: Security, https://github.com/VeerajSai/SemShift/security/policy
Author: Veeraj Sai
Maintainer: Veeraj Sai
License: MIT
License-File: LICENSE
Keywords: cli,documentation,embeddings,github-actions,local-first,nlp,policy,prompt-engineering,semantic-diff,semantic-shift
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Version Control
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: numpy<3,>=1.24
Requires-Dist: rich<15,>=13.7
Requires-Dist: scikit-learn<2,>=1.3
Requires-Dist: typer<1,>=0.12
Provides-Extra: dev
Requires-Dist: build<2,>=1.2; extra == 'dev'
Requires-Dist: pre-commit<5,>=3.7; extra == 'dev'
Requires-Dist: pytest-cov<7,>=5; extra == 'dev'
Requires-Dist: pytest<10,>=8.0; extra == 'dev'
Requires-Dist: ruff<1,>=0.5; extra == 'dev'
Requires-Dist: twine<7,>=5.0; extra == 'dev'
Provides-Extra: dev-models
Requires-Dist: build<2,>=1.2; extra == 'dev-models'
Requires-Dist: pre-commit<5,>=3.7; extra == 'dev-models'
Requires-Dist: pytest-cov<7,>=5; extra == 'dev-models'
Requires-Dist: pytest<10,>=8.0; extra == 'dev-models'
Requires-Dist: ruff<1,>=0.5; extra == 'dev-models'
Requires-Dist: sentence-transformers<4,>=2.7; extra == 'dev-models'
Requires-Dist: twine<7,>=5.0; extra == 'dev-models'
Provides-Extra: models
Requires-Dist: sentence-transformers<4,>=2.7; extra == 'models'
Description-Content-Type: text/markdown

# SemShift

[![PyPI](https://img.shields.io/pypi/v/semshift.svg)](https://pypi.org/project/semshift/)
[![Python](https://img.shields.io/pypi/pyversions/semshift.svg)](https://pypi.org/project/semshift/)
[![CI](https://github.com/VeerajSai/SemShift/actions/workflows/ci.yml/badge.svg)](https://github.com/VeerajSai/SemShift/actions/workflows/ci.yml)
[![Security](https://github.com/VeerajSai/SemShift/actions/workflows/security.yml/badge.svg)](https://github.com/VeerajSai/SemShift/actions/workflows/security.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

Catch risky meaning changes that Git diff misses.

SemShift is a local-first review assistant for AI-rewritten and human-edited docs, prompts, policies, resumes, and research drafts. It flags likely semantic drift before you merge, publish, or submit text.

Current release line: `v0.2.x` (alpha). The default backend, `tfidf`, is lexical and heuristic. The optional SentenceTransformers backend computes semantic embeddings locally; its output is a similarity signal and carries no legal, factual, or scientific authority.

## 5-Second Demo

Before:

```text
We do not share personal data with third parties.
```

After:

```text
We may share personal data with trusted partners.
```

SemShift:

```text
CRITICAL: privacy commitment weakened.
Risk flag: third-party sharing.
Recommendation: hold approval until a human reviews the change.
```

## Install

```bash
pip install semshift
```

Optional local embedding backend:

```bash
pip install "semshift[models]"
```

Development:

```bash
pip install -e ".[dev]"
```

## Quick Start

```bash
semshift compare examples/old_policy.md examples/new_policy.md --mode policy
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --json
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --report semshift-report.md
```

Use limits for large or generated files:

```bash
semshift compare old.md new.md --max-file-size 5242880 --max-chunks 2000
```

## GitHub Action

```yaml
name: SemShift Check

on:
  pull_request:
    paths:
      - "**/*.md"
      - "**/*.txt"
      - "**/*.yml"

permissions:
  contents: read
  pull-requests: write

jobs:
  semshift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: VeerajSai/SemShift@v0.2.0
        with:
          mode: policy
          fail_on: high
          pr_comment: "true"
          model: tfidf
          report: semshift-report.md
```

Inputs include `files`, `mode`, `fail_on`, `model`, `report`, `base_ref`, `pr_comment`, `github_token`, `max_file_size`, and `max_chunks`.

> **Note:** `fail_on` defaults to `high`. The action exits with code 1 when any file reaches high or critical drift.
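
The threshold behavior described in the note can be pictured with a small sketch. This is an illustration only, not the action's actual code: the `SEVERITY_ORDER` list and the `exit_code` helper are hypothetical, and the `low` and `medium` labels are assumptions (only `high` and `critical` are named above).

```python
# Assumed severity ordering; only "high" and "critical" are named in this README.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def exit_code(drift_labels, fail_on="high"):
    """Return 1 if any label is at or above the fail_on threshold, else 0."""
    threshold = SEVERITY_ORDER.index(fail_on.lower())
    for label in drift_labels:
        if SEVERITY_ORDER.index(label.lower()) >= threshold:
            return 1
    return 0


print(exit_code(["low", "medium"]))    # no file reaches "high"
print(exit_code(["low", "critical"]))  # one file is above the threshold
```

Under these assumptions, lowering `fail_on` makes the check stricter, because more labels sit at or above the threshold.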

## Python API

```python
from semshift import compare_files, compare_text

result = compare_text(
    old="We do not share personal data.",
    new="We may share personal data with partners.",
    mode="policy",
)

print(result.drift_label)
print(result.summary)
print(result.risk_flags)
print(result.to_markdown())

file_result = compare_files("old_policy.md", "new_policy.md", mode="policy")
report = file_result.to_markdown()
```

Canonical result fields include `drift_label`, `overall_score`, `drift_score`, `summary`, `matched_chunks`, `chunk_matches`, `claim_changes`, `tone_shift`, `risk_flags`, `warnings`, and `metadata`. Results also expose the methods `to_dict()`, `to_json()`, and `to_markdown()`.

## Modes

| Mode | Maturity | Best for | Main signals |
| --- | --- | --- | --- |
| `policy` | stable | privacy policies, terms, consent language | sharing, retention, rights, obligations |
| `prompt` | stable | system prompts and instruction files | safety rules, hidden instructions, scope |
| `research` | experimental | research drafts and reports | metrics, datasets, baselines, limitations |
| `resume` | experimental | resumes and bios | titles, metrics, company/project names |
| `readme` | experimental | README and support docs | install requirements, guarantees, scope |
| `default` | stable | general text review | drift score, claims, tone, generic risk |

## How It Works

SemShift combines transparent signals:

1. Chunk alignment by headings and text structure.
2. Lexical TF-IDF similarity by default, or optional local SentenceTransformers embeddings.
3. Claim extraction, tone signals, and mode-specific risk rules.
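
Step 1 can be pictured with a minimal sketch that splits Markdown into chunks by heading and pairs chunks sharing a title. It is a simplified illustration, not SemShift's implementation: `split_by_headings` and `align_chunks` are hypothetical names, and the sketch ignores fenced code blocks and duplicate headings.

```python
import re

# ATX-style Markdown heading: one to six "#" characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown_text):
    """Map each heading title to the body text under it."""
    chunks = {"": []}  # "" collects any text before the first heading
    current = ""
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(2)
            chunks.setdefault(current, [])
        else:
            chunks[current].append(line)
    result = {title: "\n".join(lines).strip() for title, lines in chunks.items()}
    if not result[""]:
        del result[""]  # drop an empty preamble
    return result


def align_chunks(old_text, new_text):
    """Pair chunks that share a heading; a missing side is None."""
    old_chunks = split_by_headings(old_text)
    new_chunks = split_by_headings(new_text)
    titles = list(old_chunks) + [t for t in new_chunks if t not in old_chunks]
    return [(t, old_chunks.get(t), new_chunks.get(t)) for t in titles]


old_doc = "# Privacy\nWe do not share data.\n\n# Retention\nWe keep logs for 30 days.\n"
new_doc = "# Privacy\nWe may share data.\n\n# Contact\nEmail us.\n"
for title, before, after in align_chunks(old_doc, new_doc):
    print(title, "|", before, "|", after)
```

Pairs with `None` on one side correspond to removed or added sections; pairs with text on both sides are the ones a similarity backend would then score.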

TF-IDF is a lexical backend, not a true semantic model. Optional embedding models may download weights on first use; document text is processed locally unless you explicitly integrate external services.
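
To make the "lexical, not semantic" distinction concrete, here is a standard-library sketch of TF-IDF cosine similarity applied to the two demo sentences. It is not SemShift's backend; the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are hypothetical, and the smoothed IDF formula is one common choice among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using a smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


old = "We do not share personal data with third parties."
new = "We may share personal data with trusted partners."
old_vec, new_vec = tfidf_vectors([old, new])
similarity = cosine(old_vec, new_vec)
print(f"lexical similarity: {similarity:.2f}")
```

The score reflects only word overlap. Nothing in it registers that "do not share" became "may share", which is why a lexical score is combined with claim extraction and mode-specific risk rules rather than used alone.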

## Benchmarks

SemShift includes a starter self-evaluation benchmark for regression tracking. See [docs/benchmarks.md](docs/benchmarks.md).

Do not treat starter benchmark numbers as external validation. Human-labeled external evaluation is still needed.

## Compared To

| Tool | What it catches | What it misses |
| --- | --- | --- |
| Git diff | exact text edits | risk, claims, weakened obligations |
| diff-match-patch | text similarity | domain-specific meaning changes |
| LLM judge | broad qualitative review | local determinism, reproducibility, privacy by default |
| Grammar checker | style and grammar | policy, prompt, research, and factual drift |
| SemShift | likely risky semantic drift | subtle context, truth verification, legal authority |

## Limitations

SemShift is:

- not legal advice
- not a fact-checker
- not scientific authority
- not a replacement for human review
- likely to miss subtle context-dependent changes
- likely to false-positive on harmless paraphrases
- lexical + heuristic by default

## Troubleshooting

`semshift: command not found`: Confirm the active environment is the one where you installed `semshift`.

Model import error: Install optional dependencies with `pip install "semshift[models]"`, or use `--model tfidf`.

Slow first model run: SentenceTransformers may download weights and initialize on first use.

Windows path issues: Quote paths with spaces and prefer PowerShell-compatible quoting.

GitHub Action fork PRs: PR comments may be unavailable on pull requests from forks, where permissions are restricted; the report artifact is still written.

No files matched: Pass `files`, use `actions/checkout` with `fetch-depth: 0`, or check supported extensions.

Report too long: Long reports are truncated in the GitHub comment; the full report is uploaded as an artifact.

## Roadmap

- stronger external benchmark
- NLI-based deep mode for contradiction/entailment checks
- VS Code extension
- web demo
- docs site
- more file formats

## Author

Built by Veeraj Sai.

## Citation

Please cite SemShift using [CITATION.cff](CITATION.cff).

## License

MIT. See [LICENSE](LICENSE).

## Security

Report vulnerabilities through [GitHub Security Advisories](https://github.com/VeerajSai/SemShift/security/advisories/new). SemShift is local-first by default, but optional model downloads and external CI integrations should be reviewed in your environment.
