Metadata-Version: 2.4
Name: cdtm-tstools
Version: 0.1.3
Summary: Citation pipeline for CDTM trend seminars
Project-URL: Repository, https://github.com/krishuagarwal/tstools
Author-email: Krrish Agarwalla <Krrishmof07@gmail.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.10
Requires-Dist: click>=8.1.0
Requires-Dist: crawl4ai>=0.4.0
Requires-Dist: httpx>=0.28.0
Requires-Dist: openai>=1.109.1
Requires-Dist: python-docx>=1.2.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: ipykernel>=6.30.1; extra == 'dev'
Requires-Dist: openpyxl>=3.1.5; extra == 'dev'
Requires-Dist: pandas-stubs>=2.3.2.250926; extra == 'dev'
Requires-Dist: pandas>=2.3.2; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: types-tqdm>=4.67.0.20250809; extra == 'dev'
Description-Content-Type: text/markdown

# Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across `.docx` trend-phase files.

## Pipeline

```mermaid
flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup(check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]
```

## Installation

```bash
pip install cdtm-tstools
```

Or for local development:

```bash
uv venv
uv sync
```

## Usage

Run from a directory containing a `data/` folder with your `.docx` files:

```bash
# Basic run (validate + deduplicate, no replacement)
tstools

# Point to a different data folder
tstools --data-dir data/fall26

# Enable inline citation replacement
tstools --replace

# Force replacement even with unresolved issues
tstools --replace --force

# Custom file processing order
tstools --file-order my_order.json

# All together
tstools --data-dir data/fall26 --replace --force --file-order my_order.json
```

| Flag | Description |
|---|---|
| `--data-dir PATH` | Data directory (default: `data/spring26`) |
| `--replace` | Run inline citation replacement |
| `--force` | Proceed with replacement despite unresolved issues |
| `--file-order PATH` | JSON file with the file processing order |

Equivalent module invocations: `python -m tstools` or `python -m tstools.main`.

## File order

`FILE_ORDER` in `tstools/__init__.py` defines the default processing order. Listed files do not need to exist yet: only files present on disk are processed. Any file on disk that is not listed in `FILE_ORDER` is appended at the end in alphabetical order.

Override it in one of two ways:

**1. CLI flag** — pass `--file-order path/to/file_order.json`

**2. Auto-detected** — place a `file_order.json` in your data directory:

```json
[
    "E-Human-AI Teams-Intro.docx",
    "E-Human-AI Teams-2.docx",
    "T-Society-1.docx"
]
```

If neither is provided, the built-in list from `tstools/__init__.py` is used.

## Incremental deduplication

`dedup_map.json` is a **persistent registry** — unique IDs are permanent once assigned.

On each run:
- Citations already in the registry are carried forward unchanged.
- New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
- Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
- Existing numbers are never reassigned.
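
A rough sketch of the ID-assignment rule, assuming the registry can be reduced to a mapping from a normalised citation key to its unique ID. The function name `assign_ids` and the "highest existing ID plus one" reading of "next available number" are assumptions made for illustration.

```python
def assign_ids(registry: dict[str, int], new_keys: list[str]) -> dict[str, int]:
    """Add unseen keys to the registry without touching existing IDs."""
    # Assumption: "next available number" means one past the highest ID in use.
    next_id = max(registry.values(), default=0) + 1
    for key in new_keys:
        if key in registry:
            continue  # registry match: reuse the existing ID, never reassign
        registry[key] = next_id
        next_id += 1
    return registry
```

Running it twice with the same input leaves the registry unchanged, which is the property that makes re-runs safe.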

### Human review

Fuzzy matches (≥ 0.70 similarity) are flagged `review_needed` in the map. After reviewing:

| Decision | Edit in `dedup_map.json` |
|---|---|
| Confirmed duplicate | Set `duplicate_of`, `match_type: "manual"` |
| Confirmed distinct | Set `review_flag: "confirmed_distinct"` |
| Manual ID assignment | Set `unique_id` to desired number, `match_type: "manual"` |

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same `unique_id`.
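
After hand-editing the map, a quick collision check along these lines can catch mistakes before the next run. It is only a sketch: it assumes `dedup_map.json` is a JSON object keyed by source ID, and that an entry is canonical when its `duplicate_of` field is empty or absent. Neither detail is confirmed here, and `find_id_collisions` is a hypothetical helper.

```python
from collections import defaultdict


def find_id_collisions(registry: dict[str, dict]) -> dict[int, list[str]]:
    """Return unique IDs claimed by more than one canonical entry."""
    owners: dict[int, list[str]] = defaultdict(list)
    for source_id, entry in registry.items():
        if entry.get("duplicate_of"):
            continue  # assumed: duplicates point at a canonical entry
        owners[entry["unique_id"]].append(source_id)
    return {uid: ids for uid, ids in owners.items() if len(ids) > 1}
```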

## Outputs

| File | Contents |
|---|---|
| `citations.csv` | All citations + validation issues + dedup metadata |
| `bibliography.csv` | UniqueID → Citation → SourceIDs |
| `dedup_map.json` | Persistent registry (append-only) |
| `output/issues.md` | Human work queue — validation issues by file |
| `output/*.docx` | Inline citations replaced, References section removed |

## Validation (AMA 11th ed.)

Validation runs on every file present on disk, on every run. Issues appear in `output/issues.md` until fixed in the source `.docx`.

| # | Check | Code |
|---|---|---|
| — | Bare URL | `url_only` |
| 1 | Author | `missing_author` |
| 2 | Title | `missing_title` |
| 3 | Source | `missing_source` |
| 4 | Year | `missing_year` |
| 5 | Locator (DOI / URL / vol-page) | `missing_locator` |
| 6 | Accessed date required (URL, no DOI) | `missing_accessed` |
| 7 | Accessed date omitted (DOI present) | `unnecessary_accessed` |
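
Checks 6 and 7 depend on each other, so a small sketch may help. The regular expressions and the function name `check_accessed` below are illustrative assumptions, not the patterns used in `validator.py`.

```python
import re

# Illustrative patterns only; real citations may need stricter matching.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
ACCESSED_RE = re.compile(r"\bAccessed\b", re.IGNORECASE)


def check_accessed(citation: str) -> list[str]:
    """Return issue codes for the accessed-date rules (checks 6 and 7)."""
    has_doi = bool(DOI_RE.search(citation))
    has_url = bool(URL_RE.search(citation))
    has_accessed = bool(ACCESSED_RE.search(citation))
    issues = []
    if has_url and not has_doi and not has_accessed:
        issues.append("missing_accessed")
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")
    return issues
```

A citation that ends in a plain web URL and has no "Accessed" date would be flagged `missing_accessed`, while one that carries both a DOI and an "Accessed" date would be flagged `unnecessary_accessed`.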

## Deduplication phases

1. **URL / DOI** — same locator → definitive duplicate
2. **Exact text** — normalised match → definitive duplicate
3. **Title** — normalised title segment match → definitive duplicate
4. **Fuzzy ≥ 0.70** — flagged `review_needed`, not auto-merged
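
The cascade can be sketched for a single pair of citations as below. Several things here are assumptions: the similarity metric (the standard library's `difflib.SequenceMatcher` stands in for whatever `deduplicator.py` uses), the normalisation rules, and the helper names. Phase 3 is left out because the title segmentation logic is not described here.

```python
import re
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def urls(text: str) -> set[str]:
    """Extract URLs, ignoring case and trailing punctuation or slashes."""
    return {u.rstrip(".,;/").lower() for u in re.findall(r"https?://[^\s\]]+", text)}


def match_type(a: str, b: str, threshold: float = 0.70) -> str:
    if urls(a) & urls(b):
        return "url"            # phase 1: shared locator
    na, nb = normalise(a), normalise(b)
    if na == nb:
        return "exact"          # phase 2: normalised text match
    if SequenceMatcher(None, na, nb).ratio() >= threshold:
        return "review_needed"  # phase 4: flagged, never auto-merged
    return "distinct"
```

Lowercasing the whole URL is a simplification, since paths can be case-sensitive. Ordering the checks from cheapest and most certain to most speculative means a fuzzy score is only computed when no definitive signal exists.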

## Inline citation replacement

Supported input formats (from the source `.docx`):

| Format | Example input | Output |
|---|---|---|
| Brackets — single | `[1]` | `[42]` |
| Brackets — list | `[1,2]` | `[18,27]` |
| Brackets — range | `[1-4]` | `[42-45]` |
| Superscript — single | ⁵ | `[42]` |
| Superscript — list | ²˒³ | `[18,27]` |

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved (`[1].` → `[42].`).
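
For the bracket formats, the rewrite can be approximated with a regular expression and a local-to-unique ID map. This sketch works on plain strings, whereas the real replacer operates on `.docx` runs and also handles superscripts. Collapsing three or more consecutive unique IDs into a range, and leaving markers with unmapped numbers untouched, are assumptions made here.

```python
import re

BRACKET_RE = re.compile(r"\[(\d+(?:\s*[,-]\s*\d+)*)\]")


def expand(spec: str) -> list[int]:
    """'1,3-5' -> [1, 3, 4, 5]"""
    nums: list[int] = []
    for part in spec.split(","):
        bounds = [int(x) for x in part.split("-")]
        nums.extend(range(bounds[0], bounds[-1] + 1))
    return nums


def compress(ids: list[int]) -> str:
    """Sort IDs and write runs of three or more as 'lo-hi' (assumed rule)."""
    ids = sorted(set(ids))
    out: list[str] = []
    i = 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        if j - i >= 2:
            out.append(f"{ids[i]}-{ids[j]}")
        else:
            out.extend(str(n) for n in ids[i:j + 1])
        i = j + 1
    return ",".join(out)


def replace_brackets(text: str, id_map: dict[int, int]) -> str:
    def repl(match: re.Match) -> str:
        local = expand(match.group(1))
        if any(n not in id_map for n in local):
            return match.group(0)  # assumed: leave unmapped markers as-is
        return "[" + compress([id_map[n] for n in local]) + "]"
    return BRACKET_RE.sub(repl, text)
```

Because only the bracketed span is substituted, surrounding punctuation keeps its position.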

## File naming

```
E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx          →  T-Society-1-1, T-Society-1-2, …
```
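
Read this way, each source ID is the file stem plus a 1-based citation index. A hypothetical helper reproducing the pattern (the name `source_ids` is illustrative, not part of the package):

```python
from pathlib import Path


def source_ids(filename: str, count: int) -> list[str]:
    """Build '<stem>-1' .. '<stem>-<count>' for one .docx file."""
    stem = Path(filename).stem  # drops the .docx extension
    return [f"{stem}-{n}" for n in range(1, count + 1)]
```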

## Structure

```
tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
```
