Metadata-Version: 2.4
Name: article-tagger
Version: 0.4.1
Summary: RAG-based article tag generator using local embeddings and FAISS
License: MIT
Keywords: article-tagger,embeddings,faiss,nlp,rag,tagging
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: faiss-cpu>=1.7.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: typer>=0.9.0
Provides-Extra: all
Requires-Dist: anthropic>=0.30.0; extra == 'all'
Requires-Dist: fastapi>=0.100.0; extra == 'all'
Requires-Dist: gradio>=4.0.0; extra == 'all'
Requires-Dist: mcp>=1.0.0; extra == 'all'
Requires-Dist: pandas>=2.0.0; extra == 'all'
Requires-Dist: scikit-learn>=1.3.0; extra == 'all'
Requires-Dist: starlette>=0.27.0; extra == 'all'
Requires-Dist: umap-learn>=0.5.0; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.20.0; extra == 'all'
Provides-Extra: api
Requires-Dist: fastapi>=0.100.0; extra == 'api'
Requires-Dist: starlette>=0.27.0; extra == 'api'
Requires-Dist: uvicorn[standard]>=0.20.0; extra == 'api'
Provides-Extra: dev
Requires-Dist: anthropic>=0.30.0; extra == 'dev'
Requires-Dist: fastapi>=0.100.0; extra == 'dev'
Requires-Dist: gradio>=4.0.0; extra == 'dev'
Requires-Dist: httpx>=0.24.0; extra == 'dev'
Requires-Dist: mcp>=1.0.0; extra == 'dev'
Requires-Dist: pandas>=2.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.2.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: scikit-learn>=1.3.0; extra == 'dev'
Requires-Dist: starlette>=0.27.0; extra == 'dev'
Requires-Dist: umap-learn>=0.5.0; extra == 'dev'
Requires-Dist: uvicorn[standard]>=0.20.0; extra == 'dev'
Provides-Extra: export
Requires-Dist: pandas>=2.0.0; extra == 'export'
Provides-Extra: llm
Requires-Dist: anthropic>=0.30.0; extra == 'llm'
Provides-Extra: mcp
Requires-Dist: mcp>=1.0.0; extra == 'mcp'
Provides-Extra: ui
Requires-Dist: gradio>=4.0.0; extra == 'ui'
Provides-Extra: viz
Requires-Dist: scikit-learn>=1.3.0; extra == 'viz'
Requires-Dist: umap-learn>=0.5.0; extra == 'viz'
Description-Content-Type: text/markdown

# Article Tagger

RAG-based article tag generator with hybrid scoring, active learning, and full agent integration.

Input an article, get the most relevant tags from your tag library — with confidence scores. Supports Chinese (Traditional/Simplified) natively.

**100% accuracy** on 30 test articles across 611 tags.

## Install

```bash
# Core (CLI only — lightweight, ~500MB with model)
pip install article-tagger

# Full (API server + Web UI + export + visualization)
pip install article-tagger[all]

# Pick what you need
pip install article-tagger[api]       # REST API + Web UI server
pip install article-tagger[llm]       # Claude API reranker
pip install article-tagger[mcp]       # MCP server
pip install article-tagger[viz]       # t-SNE/UMAP visualization

# npm (agent integration)
npm install article-tagger-sdk        # TypeScript SDK
npx article-tagger-mcp                # MCP server for Claude Code / Cursor
```

## Quick Start

```bash
# Zero-config demo (creates sample tags, builds index, tags an article)
bash quickstart.sh
```

Or step by step:

```bash
# 1. Build index from tag library
article-tagger build-index --tags 51標籤庫.json

# 2. Tag an article
article-tagger tag --text "她頭上有角，背後長著翅膀，尾巴尖尖的..."

# 3. Interactive mode
article-tagger interactive

# 4. Pre-load model for faster queries
article-tagger warmup
```

## How It Works

Three-layer hybrid scoring:

1. **Embedding search** — sentence-transformers (multilingual) + FAISS vector similarity
2. **Keyword boost** — exact/partial match across Simplified/Traditional Chinese variants
3. **Composite patterns** — multi-feature detection (e.g., 角+翅膀 → 惡魔娘)

Plus:
- **Active Learning** — EMA-weighted feedback loop with dedup, converges over time
- **Tag hierarchy** — child tag matched → ancestors auto-boosted (with cycle detection)
- **LLM reranker** — optional Claude API second pass for precision
- **Text augmentation** — S/T Chinese variants + concept expansion in embedding space

## Five Interfaces

| Interface | Command | Use Case |
|-----------|---------|----------|
| **CLI** | `article-tagger <cmd>` | Human workflow, scripting |
| **Agent CLI** | `atag <cmd>` | Pure JSON, agent chaining |
| **REST API** | `article-tagger serve` | Web apps, microservices |
| **MCP Server** | `article-tagger mcp-serve` | Claude Code, Cursor |
| **Web UI** | `localhost:8000/ui` | Visual dashboard (Gradio) |

## CLI Commands

| Command | Description |
|---------|-------------|
| `build-index` | Build FAISS index from tag file (CSV/JSON) |
| `tag` | Tag single article (`--rerank` for LLM reranking) |
| `tag-batch` | Batch tag a directory |
| `interactive` | Interactive REPL with live feedback |
| `watch` | Watch directory, auto-tag new files |
| `discover` | **New** — Find new tag candidates from articles |
| `warmup` | **New** — Pre-load model + index into memory |
| `analytics` | Usage stats dashboard |
| `quality` | Tag library quality report |
| `visualize` | t-SNE/UMAP embedding visualization |
| `cooccur` | Tag co-occurrence analysis |
| `weights` | Active Learning weights |
| `enrich` | LLM auto-fill tag descriptions/categories |
| `eval` | Evaluate tagging accuracy (precision, recall, F1, MAP) |
| `export` | Multi-format export (CSV/Markdown/JSONL/Notion) |
| `serve` | Start REST API + Web UI |
| `mcp-serve` | Start MCP Server |
| `state *` | State management (export/import/incremental sync) |
| `model *` | Model management (list/switch/compare) |
| `profile *` | Config profiles (strict/broad/fast/custom) |
| `daemon *` | Daemon mode (start/stop/status/tag) |

## Agent-Native CLI (`atag`)

Pure JSON output, designed for agent chaining:

```bash
# Tag
atag tag "article text..."
atag tag -f article.txt

# Batch
atag batch -d ./articles/

# Feedback loop
atag feedback "標籤名" true

# Pipe-friendly
echo "article" | atag tag | jq '.tags[0].tag_name'
```

## REST API

27 endpoints. Start with `article-tagger serve`.

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/tag` | Tag single article |
| POST | `/api/tag-batch` | Batch tag (up to 100) |
| GET/POST/PUT/DELETE | `/api/tags[/{id}]` | Tag CRUD |
| POST | `/api/build-index` | Upload tag file, rebuild index |
| POST | `/api/feedback` | Submit feedback (with dedup) |
| GET | `/api/feedback` | Get feedback + stats |
| POST | `/api/feedback/undo` | Undo last feedback |
| DELETE | `/api/feedback/{id}` | Delete feedback |
| GET | `/api/history` | Tagging history |
| GET | `/api/history/search` | Search history |
| DELETE | `/api/history/{id}` | Delete history |
| GET | `/api/export` | Export history (CSV/JSON) |
| GET | `/api/analytics` | Analytics dashboard |
| GET | `/api/quality` | Tag quality report |
| GET | `/api/cooccurrence` | Co-occurrence pairs |
| GET | `/api/cooccurrence/suggest` | Tag suggestions |
| GET | `/api/weights` | Active Learning weights |
| GET | `/api/profiles` | Config profiles |
| GET | `/api/health` | Health check |

## MCP Server (24 tools)

For Claude Code, Cursor, or any MCP-compatible agent:

```json
{
  "mcpServers": {
    "article-tagger": {
      "command": "python",
      "args": ["-m", "article_tagger.mcp_server"]
    }
  }
}
```

Or via npm:
```json
{
  "mcpServers": {
    "article-tagger": {
      "command": "npx",
      "args": ["article-tagger-mcp"]
    }
  }
}
```

**Tools:** tag_article, build_index, search_tags, list_tags, create_tag, update_tag, delete_tag, add_feedback, undo_feedback, delete_feedback, get_stats, get_weights, search_history, delete_history, export_history, get_analytics, get_quality_report, suggest_related_tags, discover_tags, export_state, import_state, switch_model, list_models, get_pipeline_info, list_profiles

## TypeScript SDK

```typescript
import { ArticleTaggerClient } from "article-tagger-sdk";

const client = new ArticleTaggerClient({ baseUrl: "http://localhost:8000" });

// Tag
const result = await client.tag("article text...", 5);
console.log(result.tags);

// Feedback
await client.addFeedback("article", "tag_name", true);

// CRUD
await client.createTag({ name: "新標籤", description: "描述", category: "分類" });
await client.updateTag("123", { description: "更新描述" });
await client.deleteTag("123");

// History & Analytics
const history = await client.searchHistory("keyword");
const analytics = await client.getAnalytics(30);

// Co-occurrence
const suggestions = await client.suggestTags(["標籤A", "標籤B"]);
```

## Tag Discovery

Semi-automatic new tag candidate extraction:

```bash
# Scan articles for frequently mentioned concepts not in tag library
article-tagger discover ./articles/ --min-freq 3 --top 10

# JSON output for automation
article-tagger discover ./articles/ --json
```

Reports frequency, confidence score, source files, and similar existing tags.

## Daemon Mode

Keep model in memory for millisecond-level tagging:

```bash
article-tagger daemon start
article-tagger daemon tag --text "..."   # ~10ms vs ~3s cold start
article-tagger daemon stop
```

## State Migration

Transfer everything (index, feedback, history, weights) between machines:

```bash
# Full export/import
article-tagger state export -o bundle.tar.gz
article-tagger state import -f bundle.tar.gz

# Incremental sync
article-tagger state export-incr -o incr.tar.gz --since 2026-03-01T00:00:00
article-tagger state merge -f incr.tar.gz
```

## Evaluation

```bash
# Create benchmark template
article-tagger eval --create-template benchmark.json

# Run evaluation (precision@k, recall@k, F1, MAP, NDCG)
article-tagger eval -b benchmark.json --top-k 5
```

## Configuration

All settings via `TAGGER_` prefixed environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `TAGGER_MODEL_NAME` | `paraphrase-multilingual-MiniLM-L12-v2` | Embedding model |
| `TAGGER_TOP_K_RETURN` | `3` | Default tags returned |
| `TAGGER_SIMILARITY_THRESHOLD` | `0.1` | Minimum score |
| `TAGGER_MAX_TEXT_LENGTH` | `100000` | Max article size (chars) |
| `TAGGER_ENABLE_HIERARCHY` | `true` | Tag hierarchy |
| `TAGGER_ENABLE_SYNONYMS` | `true` | Synonym expansion |
| `TAGGER_ENABLE_RERANKER` | `false` | LLM reranker |
| `TAGGER_ENABLE_CACHE` | `true` | Result + embedding cache |
| `TAGGER_ANTHROPIC_API_KEY` | — | For LLM reranker/enricher |
| `TAGGER_API_KEY` | — | REST API auth |

## Project Structure

```
src/article_tagger/
├── tagger.py           # Core facade — orchestrates all components
├── embedder.py         # sentence-transformers wrapper
├── indexer.py          # FAISS index (Flat/IVF auto-select)
├── booster.py          # Keyword matching + composite patterns
├── text_augmenter.py   # Tag text enrichment (S/T Chinese, concept expansion)
├── tag_discovery.py    # New tag candidate extraction
├── hierarchy.py        # Parent/child resolution + cycle detection
├── active_learning.py  # EMA-based feedback weights + dedup
├── reranker.py         # LLM reranking (Claude API)
├── cooccurrence.py     # Tag co-occurrence graph
├── pipeline.py         # Pre/post processor plugins
├── cache.py            # LRU cache + persistent embedding cache
├── store.py            # Feedback + history (JSON)
├── tag_loader.py       # CSV/JSON tag loader
├── tag_quality.py      # Duplicate/orphan/isolated detection
├── tag_enricher.py     # LLM tag library enrichment
├── evaluation.py       # Benchmark framework
├── analytics.py        # Usage analytics
├── profiles.py         # Config profiles
├── exporter.py         # Multi-format export
├── visualizer.py       # t-SNE/UMAP visualization
├── state_manager.py    # State export/import
├── config.py           # pydantic-settings
├── models.py           # Data models
├── cli.py              # Typer CLI (26 commands)
├── agent_cli.py        # Agent CLI — pure JSON (atag)
├── api.py              # FastAPI (27 endpoints)
├── mcp_server.py       # MCP Server (24 tools)
├── ui.py               # Gradio Web UI (6 tabs)
├── repl.py             # Interactive REPL
├── daemon.py           # Unix socket daemon
├── watcher.py          # Directory watch mode
└── middleware.py        # API middleware
packages/
├── sdk/                # TypeScript SDK (17 methods)
├── mcp-server/         # TypeScript MCP Server (11 proxied tools)
└── tool-schemas/       # OpenAI + Anthropic tool formats
```

## Testing

```bash
# 206 tests covering all modules
pytest tests/ -v

# Skip model-loading tests (faster, avoids torch issues on some platforms)
SKIP_MODEL_TESTS=1 pytest tests/ -q
```

## Docker

```bash
docker compose up --build
# API at localhost:8000, Web UI at localhost:8000/ui
```

## License

MIT
