Metadata-Version: 2.4
Name: cdtm-tstools
Version: 0.1.5
Summary: Citation pipeline for CDTM trend seminars
Project-URL: Repository, https://github.com/krishuagarwal/tstools
Author-email: Krrish Agarwalla <Krrishmof07@gmail.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.10
Requires-Dist: click>=8.1.0
Requires-Dist: crawl4ai>=0.4.0
Requires-Dist: httpx>=0.28.0
Requires-Dist: openai>=1.109.1
Requires-Dist: python-docx>=1.2.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: ipykernel>=6.30.1; extra == 'dev'
Requires-Dist: openpyxl>=3.1.5; extra == 'dev'
Requires-Dist: pandas-stubs>=2.3.2.250926; extra == 'dev'
Requires-Dist: pandas>=2.3.2; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: types-tqdm>=4.67.0.20250809; extra == 'dev'
Description-Content-Type: text/markdown

# Spring 26 — Citation Pipeline

Extract, validate, deduplicate, and replace inline citations across `.docx` trend-phase files.

## Pipeline

```mermaid
flowchart TD
    A["FILE_ORDER (*.docx present)"] --> B["Extract + clean"]
    B --> C["Validate (AMA 11th)"]
    C --> D["Incremental dedup (check registry first)"]
    D --> E["Save CSV / JSON / dedup_map"]
    E --> F{"Ready?"}
    F -->|yes / FORCE| G["Replace inline citations"]
    F -->|no| H["Fix issues → re-run"]
    G --> I["output/*.docx"]
```

## Installation

```bash
pip install cdtm-tstools
```

Or for local development:

```bash
uv venv
uv sync
```

## Usage

Run from a directory containing a `data/` folder with your `.docx` files:

```bash
# Basic run (validate + deduplicate, no replacement)
tstools

# Point to a different data folder
tstools --data-dir data/fall26

# Enable inline citation replacement
tstools --replace

# Force replacement even with unresolved issues
tstools --replace --force

# Custom file processing order
tstools --file-order my_order.json

# All together
tstools --data-dir data/fall26 --replace --force --file-order my_order.json
```

| Flag | Description |
|---|---|
| `--data-dir PATH` | Data directory (default: `data/spring26`) |
| `--replace` | Run inline citation replacement |
| `--force` | Proceed with replacement despite unresolved issues |
| `--file-order PATH` | JSON file with the file processing order |

Equivalent module invocations: `python -m tstools` or `python -m tstools.main`.

## File order

`FILE_ORDER` in `tstools/__init__.py` defines the default processing order. Listed files do not need to exist yet; only files present on disk are processed. Any file on disk not listed in `FILE_ORDER` is appended at the end, alphabetically.

Override it from outside in two ways:

**1. CLI flag** — pass `--file-order path/to/file_order.json`

**2. Auto-detected** — place a `file_order.json` in your data directory:

```json
[
    "E-Human-AI Teams-Intro.docx",
    "E-Human-AI Teams-2.docx",
    "T-Society-1.docx"
]
```

If neither is provided, the built-in list from `tstools/__init__.py` is used.
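That resolution order can be sketched as follows (function and parameter names here are illustrative, not the actual `tstools` API):

```python
import json
from pathlib import Path

def resolve_file_order(data_dir, cli_order_path=None, default_order=()):
    """Precedence: --file-order flag > data-dir file_order.json > built-in list."""
    if cli_order_path is not None:
        return json.loads(Path(cli_order_path).read_text())
    auto = Path(data_dir) / "file_order.json"
    if auto.exists():
        return json.loads(auto.read_text())
    return list(default_order)

def files_to_process(data_dir, order):
    """Keep only listed files that exist; append unlisted files alphabetically."""
    present = {p.name for p in Path(data_dir).glob("*.docx")}
    listed = [name for name in order if name in present]
    extras = sorted(present - set(order))
    return listed + extras
```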

## Incremental deduplication

`dedup_map.json` is a **persistent registry** — unique IDs are permanent once assigned.

On each run:
- Citations already in the registry are carried forward unchanged.
- New citations are matched against the full registry (URL/DOI → exact text → title), then against each other.
- Registry matches reuse the existing unique ID; genuinely new citations get the next available number.
- Existing numbers are never reassigned.
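A minimal sketch of that matching order (the field names `url`/`doi`/`text`/`title` are assumptions about the registry schema, not the real `dedup_map.json` layout):

```python
def match_against_registry(citation, registry):
    """Return the unique_id of a matching registry entry, or None.
    Match order mirrors the pipeline: URL/DOI, then exact text, then title."""
    def norm(s):
        return " ".join(s.lower().split())
    for entry in registry:
        if citation.get("doi") and citation["doi"] == entry.get("doi"):
            return entry["unique_id"]
        if citation.get("url") and citation["url"] == entry.get("url"):
            return entry["unique_id"]
    for entry in registry:
        if citation.get("text") and entry.get("text") and norm(citation["text"]) == norm(entry["text"]):
            return entry["unique_id"]
    for entry in registry:
        if citation.get("title") and entry.get("title") and norm(citation["title"]) == norm(entry["title"]):
            return entry["unique_id"]
    return None

def assign_unique_id(citation, registry):
    """Reuse a registry match, or allocate the next number; never reassign existing IDs."""
    uid = match_against_registry(citation, registry)
    if uid is not None:
        return uid
    return max((e["unique_id"] for e in registry), default=0) + 1
```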

### Human review

Fuzzy matches (≥ 0.70 similarity) are flagged `review_needed` in the map. After reviewing:

| Decision | Edit in `dedup_map.json` |
|---|---|
| Confirmed duplicate | Set `duplicate_of`, `match_type: "manual"` |
| Confirmed distinct | Set `review_flag: "confirmed_distinct"` |
| Manual ID assignment | Set `unique_id` to desired number, `match_type: "manual"` |

The pipeline never overwrites entries that are already in the registry. A warning is printed at load time if two canonical entries share the same `unique_id`.
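The load-time warning amounts to a sanity check over canonical entries, roughly like this (entry shape assumed):

```python
from collections import Counter

def check_duplicate_uids(entries):
    """Return unique_ids shared by more than one canonical (non-duplicate) entry."""
    canonical = [e for e in entries if e.get("duplicate_of") is None]
    counts = Counter(e["unique_id"] for e in canonical)
    shared = sorted(uid for uid, n in counts.items() if n > 1)
    for uid in shared:
        print(f"WARNING: unique_id {uid} is shared by multiple canonical entries")
    return shared
```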

## Outputs

| File | Contents |
|---|---|
| `citations.csv` | All citations + validation issues + dedup metadata |
| `bibliography.csv` | UniqueID → Citation → SourceIDs |
| `dedup_map.json` | Persistent registry (append-only) |
| `output/issues.md` | Human work queue — validation issues by file |
| `output/*.docx` | Inline citations replaced, References section removed |

## Validation (AMA 11th ed.)

Validation runs on every file present on disk, on every run. Issues appear in `output/issues.md` until they are fixed in the source `.docx`.

| # | Check | Code |
|---|---|---|
| — | Bare URL | `url_only` |
| 1 | Author | `missing_author` |
| 2 | Title | `missing_title` |
| 3 | Source | `missing_source` |
| 4 | Year | `missing_year` |
| 5 | Locator (DOI / URL / vol-page) | `missing_locator` |
| 6 | Accessed date when URL, no DOI | `missing_accessed` |
| 7 | No accessed date when DOI | `unnecessary_accessed` |
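Checks 6 and 7 can be sketched like this (the regexes are illustrative; the real patterns live in `patterns.py` and may differ):

```python
import re

DOI_RE = re.compile(r"(doi:|doi\.org/)10\.\d{4,}", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")
ACCESSED_RE = re.compile(r"\bAccessed\s+\w+\s+\d{1,2},\s+\d{4}", re.IGNORECASE)

def check_accessed(citation_text):
    """URL-only citations need an accessed date; DOI citations must not have one."""
    issues = []
    has_doi = bool(DOI_RE.search(citation_text))
    has_url = bool(URL_RE.search(citation_text))
    has_accessed = bool(ACCESSED_RE.search(citation_text))
    if has_url and not has_doi and not has_accessed:
        issues.append("missing_accessed")
    if has_doi and has_accessed:
        issues.append("unnecessary_accessed")
    return issues
```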

## Deduplication phases

1. **URL / DOI** — same locator → definitive duplicate
2. **Exact text** — normalised match → definitive duplicate
3. **Title** — normalised title segment match → definitive duplicate
4. **Fuzzy ≥ 0.70** — flagged `review_needed`, not auto-merged
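Phase 4 differs from phases 1–3 in that it only flags. A minimal sketch using `difflib` (the real similarity metric may differ):

```python
from difflib import SequenceMatcher

def fuzzy_phase(a, b, threshold=0.70):
    """Flag near-matches for human review instead of auto-merging."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if ratio >= threshold:
        return {"review_flag": "review_needed", "similarity": round(ratio, 2)}
    return None
```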

## Inline citation replacement

Input formats supported (from source `.docx`):

| Format | Example input | Output |
|---|---|---|
| Brackets — single | `[1]` | `[42]` |
| Brackets — list | `[1,2]` | `[18,27]` |
| Brackets — range | `[1-4]` | `[42-45]` |
| Superscript — single | ⁵ | `[42]` |
| Superscript — list | ²˒³ | `[18,27]` |

All output is plain square brackets. Superscript formatting is removed. Punctuation position is preserved (`[1].` → `[42].`).
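A sketch of the rewrite step on plain strings, using the bracket regex quoted in the bug notes below (the real replacer operates on `python-docx` runs; re-collapsing contiguous output into a range is an assumption based on the `[1-4]` → `[42-45]` row):

```python
import re

BRACKET_RE = re.compile(r"\[(\d[\d,\s\-]*)\]")

def _fmt(nums):
    """Render contiguous UIDs as a range, otherwise as a comma list."""
    if len(nums) > 1 and nums == list(range(nums[0], nums[0] + len(nums))):
        return f"{nums[0]}-{nums[-1]}"
    return ",".join(str(n) for n in nums)

def replace_brackets(text, mapping):
    """Rewrite local citation numbers to unique IDs; punctuation is untouched."""
    def sub(m):
        body = m.group(1).replace(" ", "")
        if "-" in body:
            lo, hi = (int(x) for x in body.split("-"))
            local = list(range(lo, hi + 1))
        else:
            local = [int(x) for x in body.split(",")]
        return "[" + _fmt([mapping[n] for n in local]) + "]"
    return BRACKET_RE.sub(sub, text)
```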

## File naming

```
E-Human-AI Teams-2.docx  →  E-Human-AI Teams-2-1, E-Human-AI Teams-2-2, …
T-Society-1.docx         →  T-Society-1-1, T-Society-1-2, …
```

## Run tracking and freezing

Each pipeline run is logged in `data/<semester>/logs/run_log.json`. After a run, you must verify the outputs and set the `verified` flags to `true` in `run_log.json`. The next run then automatically marks the previous entry as `frozen`.

| Flag | What to check |
|---|---|
| `verified.duplicates` | Review `output/duplicates.md` |
| `verified.not_used` | Review `output/not_used.md` |
| `verified.inline_substitution` | Spot-check `output/*.docx` files |

**Frozen runs** protect UIDs — citations from frozen runs keep their assigned numbers forever, and the pipeline warns if their bibliography entries are modified. Unfrozen UIDs are reassigned in body-text order on every run, so UIDs stay contiguous as long as nothing is frozen.

Use `--skip-gate` to bypass the verification gate during iterative development.

## Known bugs fixed (v0.2)

### 1. UIDs assigned in reference-list order instead of body-text order

**Symptom:** The first citation in the body text (e.g., `[23]`) got a high UID like 75, while `[1]` (which appears late in the text) got UID 61.

**Root cause:** `deduplicate()` assigns temporary UIDs in reference-list order (the order entries appear in the References section). `reorder_unique_ids()` was then pre-seeding from `dedup_results`, which already had every UID filled in — making the body-text reorder a complete no-op.

**Fix:** `reorder_unique_ids()` now only pre-seeds UIDs from **frozen** runs (via `frozen_uids` parameter). All other UIDs are discarded and reassigned in the order citations first appear in body text across files in `FILE_ORDER`.
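The frozen-aware reassignment can be sketched as follows (the signature and data shapes are assumptions, not the real `reorder_unique_ids()` interface):

```python
def reorder_unique_ids(body_order, frozen_uids):
    """Keep frozen UIDs sticky; reassign everything else contiguously
    in body-text order. body_order: citation keys in first-appearance order;
    frozen_uids: {key: uid} carried over from frozen runs."""
    taken = set(frozen_uids.values())
    mapping = dict(frozen_uids)
    next_uid = 1
    for key in body_order:
        if key in mapping:
            continue  # frozen UID, never reassigned
        while next_uid in taken:
            next_uid += 1
        mapping[key] = next_uid
        taken.add(next_uid)
    return mapping
```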

### 2. UID gaps after merging duplicates

**Symptom:** After manually flagging a citation as a duplicate (e.g., merging UID 40 into UID 1), UIDs went 1–39, 41–84 with a gap at 40.

**Root cause:** Same as bug #1 — all previous UIDs were treated as sticky regardless of frozen status. UIDs 41–84 were preserved at their old values instead of shifting down.

**Fix:** Only UIDs from frozen runs are sticky. When nothing is frozen, every re-run produces contiguous UIDs (1, 2, 3, ...) with no gaps.

### 3. Split-run bracket citations not detected or replaced

**Symptom:** Some inline citations like `[5,10]` or `[11,12]` were silently skipped during both body-text scanning and replacement. The output `.docx` still contained unreplaced local numbers.

**Root cause:** Word internally splits text into **runs** (formatting spans). A single `[5,10]` can be stored as three separate runs: `[5,` / `10` / `]`. The pipeline's regex (`\[(\d[\d,\s\-]*)\]`) scans one run at a time and needs the full bracket pattern in a single string. Split brackets never matched.

**Fix:** Added `_collapse_split_brackets()` — a state-machine pre-pass that walks paragraph runs, detects incomplete bracket patterns at run boundaries, and merges them into single runs before replacement. Handles patterns like:
- `[5,` / `10` / `]` → `[5,10]`
- `[11` / `,12].` → `[11,12].`
- `[` / `10` / `]` → `[10]`
- Chained splits where one merge exposes another (e.g., `]...text [4,` / `9` / `,` / `15` / `].`)
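The pre-pass can be sketched on plain strings (real runs are `python-docx` Run objects, and the actual implementation is stricter about what it absorbs):

```python
import re

# A run that ends inside an unclosed bracket citation, e.g. "[5," or "text [4,"
OPEN_RE = re.compile(r"\[[\d,\s\-]*$")

def collapse_split_brackets(runs):
    """Merge bracket citations split across runs into single strings."""
    out = []
    i = 0
    while i < len(runs):
        text = runs[i]
        # keep absorbing following runs while we sit inside an open bracket
        while OPEN_RE.search(text) and i + 1 < len(runs):
            i += 1
            text += runs[i]
        out.append(text)
        i += 1
    return out
```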

### 4. DOIs not extracted when on a separate line

**Symptom:** Citations with a DOI on its own line (after a line break within the same reference entry) were flagged `missing_locator` even though the DOI was present.

**Root cause:** The extraction logic treated each line independently. When a citation's text ended on one line and the DOI started on the next, the DOI line was discarded as a non-citation line.

**Fix:** `extract_citations_from_file()` now merges continuation lines matching `BARE_DOI_RE` (standalone DOI patterns like `doi:10.xxxx/...`) back into the preceding citation entry.
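The merge itself is simple; a sketch (the `BARE_DOI_RE` shown is an illustrative stand-in for the real pattern):

```python
import re

BARE_DOI_RE = re.compile(r"^(?:doi:|https?://doi\.org/)10\.\S+$", re.IGNORECASE)

def merge_doi_continuations(lines):
    """Fold standalone-DOI lines back into the preceding reference entry."""
    merged = []
    for line in lines:
        line = line.strip()
        if merged and BARE_DOI_RE.match(line):
            merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    return merged
```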

### 5. DOIs stripped by accessed-date cleanup

**Symptom:** A citation had both an "Accessed" date and a DOI (e.g., `...Accessed March 6, 2026. https://doi.org/...`). The auto-fix for "unnecessary accessed date when DOI present" stripped the accessed date **and** the DOI/URL that followed it.

**Root cause:** `ACCESSED_TAIL_RE` matched greedily from "Accessed" through the end of the string, removing everything — including the locator.

**Fix:** `fix_citations()` now extracts and re-appends any DOI or URL locator found in the stripped tail before discarding the accessed date portion.
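A sketch of the corrected fix (both regexes are illustrative stand-ins for the real ones):

```python
import re

ACCESSED_TAIL_RE = re.compile(r"\s*Accessed\s+\w+\s+\d{1,2},\s+\d{4}\.?.*$", re.IGNORECASE)
LOCATOR_RE = re.compile(r"(?:https?://\S+|doi:\S+)", re.IGNORECASE)

def strip_unnecessary_accessed(citation):
    """Remove the accessed-date tail but re-append any DOI/URL locator it swallowed."""
    m = ACCESSED_TAIL_RE.search(citation)
    if not m:
        return citation
    tail = m.group(0)
    kept = citation[:m.start()].rstrip()
    locator = LOCATOR_RE.search(tail)
    if locator:
        kept = kept + " " + locator.group(0)
    return kept
```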

### 6. Double periods after inline citations

**Symptom:** Some replaced citations produced `[42]..` (two periods) instead of `[42].`.

**Root cause:** When Word stored the closing `]` and `.` in separate runs, the bracket collapse step merged them in a way that preserved the original period while the run that followed also contributed one.

**Fix:** The collapse logic now correctly transfers trailing punctuation from consumed runs so that no character is duplicated.

### 7. Body-text scan misordered split-run citations

**Symptom:** UIDs for a file were not sequential in body-text order. E.g., in T_Legal_5 the first citation in the text (`[3, 6]`) got UIDs 224/225 while a later citation (`[4]`) got UID 221.

**Root cause:** `_scan_body_order()` had the same split-run problem as bug #3, but in the scanning phase. It processed intact per-run brackets first (finding `[4]`, `[8]`, etc.), then fell back to `para.text` for split brackets (`[3, 6]`). Since `[3, 6]` was split across runs (`[3,` / `6` / `]`), it was added to the order *after* all intact brackets — even though it appears first in the text.

**Fix:** `_scan_body_order()` now calls `_collapse_split_brackets()` on each paragraph's runs before scanning, exactly like `replace_inline_citations()` does. This merges split brackets into single runs so the per-run scan finds them in their correct text position.

## Structure

```
tstools/
├── __init__.py              paths + FILE_ORDER
├── main.py                  orchestrator
├── patterns.py              centralized regex patterns
├── runs.py                  run tracking, verification gates, freeze logic
├── utils/
│   ├── utils.py             extract, clean, save, verify, registry load/save
│   ├── inline_replacer.py   bracket + superscript → bracket replacement
│   └── cite_from_url.py     URL → AMA citation (crawl4ai + OpenAI)
├── validation/
│   └── validator.py         AMA format checks
└── unique/
    └── deduplicator.py      four-phase incremental dedup
```
