Metadata-Version: 2.4
Name: filewise-ai
Version: 0.1.0
Summary: AI-powered intelligent file organizer — find duplicates, track versions, identify the real final draft
Project-URL: repository, https://github.com/Maobuchiyugutou/FileWise
Author: Jamie
License: MIT
License-File: LICENSE
Keywords: ai,deduplication,embedding,file-management,rag
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.9
Provides-Extra: all
Requires-Dist: filewise-ai[embeddings,parsing,vector]; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.2; extra == 'embeddings'
Provides-Extra: parsing
Requires-Dist: pypdf2>=3.0; extra == 'parsing'
Requires-Dist: python-docx>=1.0; extra == 'parsing'
Requires-Dist: unstructured>=0.10; extra == 'parsing'
Provides-Extra: vector
Requires-Dist: chromadb>=0.4; extra == 'vector'
Description-Content-Type: text/markdown

# FileWise

[English](README.md) | [中文](README_zh.md)

[![Test](https://github.com/Maobuchiyugutou/FileWise/actions/workflows/test.yml/badge.svg)](https://github.com/Maobuchiyugutou/FileWise/actions/workflows/test.yml)
[![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-66%20passed-brightgreen.svg)](.)

AI-powered intelligent file organizer — find semantically similar files, trace
document version evolution, and identify the "real final draft."

Unlike hash-based deduplication tools (czkawka, fdupes, rdfind), FileWise uses
embedding similarity to recognize different versions of the same document, even
when content has been edited, renamed, or scattered across directories.

**Fully local** — the embedding model runs on your machine. No data ever leaves
your disk.

## Quick Start

```bash
git clone https://github.com/Maobuchiyugutou/FileWise.git
cd FileWise
pip install -e ".[all]"

# Scan a directory
filewise scan ~/Documents

# AI-powered analysis — find similar files and version chains
filewise analyze ~/Documents

# Compare two files
filewise diff proposal_v1.md proposal_v2.md

# Smart rename — add version prefixes based on analysis
filewise rename ~/Documents            # dry-run
filewise rename ~/Documents --apply    # apply renames

# Natural language search — find files by describing what you want
filewise search "budget proposal" ~/Documents

# Find files similar to a specific file
filewise find-similar draft.md
```

## Commands

| Command | Description |
|---------|-------------|
| `filewise scan <dir>` | List files by format, show supported/unsupported counts |
| `filewise analyze <dir>` | Full AI pipeline: find similar files and version chains |
| `filewise diff <A> <B>` | Line-level content comparison between two files |
| `filewise rename <dir>` | Rename files to show version order (`--apply` to execute) |
| `filewise search <query> <dir>` | Natural language search with auto mode detection |
| `filewise find-similar <file>` | Find files semantically similar to a given file |
| `filewise evaluate <dir>` | Run algorithm accuracy tests against ground-truth scenarios |
| `filewise info` | System info and supported formats |

## How It Works

```
Scan files → Extract text → Split into chunks → Generate embeddings
    → Cluster by similarity → Infer version chains → Display results
```
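
As a rough illustration of the chunking and similarity steps, the sketch below packs paragraphs into size-limited chunks and compares two vectors with cosine similarity. The names `chunk_paragraphs` and `cosine_similarity`, the `max_chars` limit, and the choice of cosine similarity are assumptions made for this example; the actual FileWise implementation may differ.

```python
import math


def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks of at most max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Hard-split an oversized paragraph so no chunk exceeds max_chars.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

In a real run the vectors would come from the embedding model rather than being written by hand; the toy functions only show the shape of the data flowing between stages.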

### Version Chain Algorithm

Three-stage, multi-signal approach:

1. **Clustering** (DBSCAN + hierarchical refinement) — group files by content
   similarity only, ignoring file names
2. **Ordering** — determine version direction using:
   - Content containment (primary): how much of A appears in B?
   - Filename dates: extract dates such as `2025-04-17` from filenames
   - Version patterns: `v1` → `v2`, `draft` → `final`, `第1版` → `第2版`
   - Modification time (secondary)
3. **Chain construction** — topological sort with confidence tiers
   (HIGH / MEDIUM / LOW)

Special cases handled: very short files (substring matching), heavily
rewritten documents (filename signal boost), format variants (same content,
different extensions).
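
For the format-variant case, a minimal sketch might group files that share a name but differ in extension. The function `group_format_variants` is hypothetical, and it checks names only; confirming that the content actually matches would be a separate step.

```python
from collections import defaultdict
from pathlib import Path


def group_format_variants(paths: list[str]) -> list[list[Path]]:
    """Group paths whose file stem matches (case-insensitive) but whose extensions differ."""
    by_stem: dict[str, list[Path]] = defaultdict(list)
    for p in map(Path, paths):
        by_stem[p.stem.lower()].append(p)
    return [
        sorted(group)
        for group in by_stem.values()
        if len({p.suffix.lower() for p in group}) > 1
    ]
```

For example, `report.md` and `report.pdf` would land in one group, while a lone `notes.txt` would not be reported.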

### Evaluation

21 scenarios test the algorithm across typical edge cases (100% accuracy):

```bash
filewise evaluate tests/eval_scenarios
# 17 version chain scenarios (100%) + 4 search scenarios (R@5=100%)
```
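
Assuming `R@5` denotes recall at 5 (the fraction of relevant files that appear in the top five results), the metric can be computed as in the sketch below. The function name and signature are illustrative and not taken from the FileWise code.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear among the top-k ranked results."""
    if not relevant:
        raise ValueError("relevant must contain at least one item")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)
```

A scenario with two relevant files, one ranked first and one ranked sixth, would score 0.5 at `k=5`.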

## Supported Formats

| Category | Extensions |
|----------|-----------|
| Documents | `.pdf`, `.docx`, `.doc`, `.odt` |
| Text | `.txt`, `.md`, `.markdown`, `.rst`, `.log` |
| Code | `.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.c`, `.cpp`, `.h`, `.sh`, `.sql` |
| Config/Data | `.json`, `.yaml`, `.yml`, `.toml`, `.xml`, `.csv`, `.tsv` |
| Web | `.html`, `.css` |

## Tech Stack

| Layer | Choice |
|-------|--------|
| Embedding | `sentence-transformers` + `BAAI/bge-small-zh-v1.5` (Chinese/English) |
| Vector Store | `ChromaDB` (persistent, incremental) |
| Clustering | `scikit-learn` (DBSCAN + hierarchical refinement) |
| Document Parsing | `python-docx`, `PyPDF2` (with text cache) |
| CLI | `typer` + `rich` |
| CI | GitHub Actions (pytest + ruff on every push) |

## Roadmap

- [x] File scanner
- [x] Multi-format document parser
- [x] Text chunking (paragraph-first)
- [x] Embedding generation
- [x] Vector storage (ChromaDB, persistent)
- [x] Similarity clustering (DBSCAN + hierarchical)
- [x] Version chain inference (multi-signal scoring)
- [x] Content diff
- [x] Format variant detection (same-name, different extension)
- [x] Smart rename (version-aware file renaming)
- [x] Natural language search (semantic + keyword hybrid)
- [x] File-anchored similarity search (`find-similar`)
- [x] Evaluation framework (21 scenarios, 100% accuracy)
- [x] CI/CD pipeline (GitHub Actions)
- [ ] Incremental indexing (watchdog — auto-detect file changes)
- [ ] TUI interface (Textual, Yazi-like)
- [ ] PyPI package (`pip install filewise-ai`)

## Requirements

- Python 3.10+
- ~100 MB of disk space for the embedding model (downloaded on first use, cached locally)
