Metadata-Version: 2.4
Name: capraisis
Version: 0.1.1
Summary: Modern CDS/ISIS implementation using SQLite FTS5
Home-page: https://github.com/capraCoder/capraisis
Author: Caprazli
Author-email: Caprazli <caprazli@protonmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/capraCoder/capraisis
Project-URL: Documentation, https://github.com/capraCoder/capraisis#readme
Project-URL: Repository, https://github.com/capraCoder/capraisis
Project-URL: Issues, https://github.com/capraCoder/capraisis/issues
Keywords: cds-isis,inverted-index,full-text-search,fts5,sqlite,bibliography,datacite,doi
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# CapraISIS

**Modern CDS/ISIS Implementation Using SQLite FTS5**

CapraISIS brings the principles of UNESCO's CDS/ISIS text database to modern Python. CDS/ISIS (**C**omputerised **D**ocumentation **S**ervice / **I**ntegrated **S**et of **I**nformation **S**ystems, 1985–2005) pioneered inverted file indexing for bibliographic records. CapraISIS applies the same indexing principles using SQLite FTS5 as the storage backend.

## Why CDS/ISIS Principles Still Matter

The CDS/ISIS architecture solved a fundamental problem: **O(log k) retrieval from millions of text records**, where k is read here as the number of distinct terms in the index dictionary. Modern systems like Elasticsearch, Lucene, and SQLite FTS5 all implement variations of the same inverted file index that CDS/ISIS pioneered.
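
To make the idea concrete, here is a minimal pure-Python sketch of an inverted file index. It is not CapraISIS code: the function names (`build_inverted_index`, `lookup`, `search_and`) are made up for illustration, and a sorted list with binary search stands in for the B-tree dictionary a real engine would use.

```python
from bisect import bisect_left
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of record ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # A sorted term dictionary allows binary search: O(log k) for k terms.
    terms = sorted(postings)
    return terms, {t: sorted(ids) for t, ids in postings.items()}


def lookup(terms, postings, term):
    """Binary-search the term dictionary, then return its posting list."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[term]
    return []


def search_and(terms, postings, query):
    """Boolean AND: intersect the posting lists of all query terms."""
    lists = [set(lookup(terms, postings, t)) for t in query.lower().split()]
    return sorted(set.intersection(*lists)) if lists else []


docs = {
    1: "quantum mechanics and consciousness",
    2: "classical mechanics",
    3: "quantum computing",
}
terms, postings = build_inverted_index(docs)
print(lookup(terms, postings, "quantum"))                # [1, 3]
print(search_and(terms, postings, "quantum mechanics"))  # [1]
```

The lookup cost depends on the size of the term dictionary, not on the number of records, which is why query time grows so slowly as a corpus gets larger.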

CapraISIS is:
- **Portable**: Single SQLite file, no server required
- **Fast**: Sub-100ms queries on 70M+ records
- **Simple**: Pure Python, no dependencies beyond stdlib
- **Proven**: Based on 40 years of CDS/ISIS architecture

## Installation

```bash
pip install capraisis
```

Or install from source:

```bash
git clone https://github.com/capraCoder/capraisis
cd capraisis
pip install -e .
```

## Quick Start

### Python API

```python
from capraisis import CapraIndex

# Create or open an index
index = CapraIndex("my_corpus.db")

# Add records
index.add(
    id="10.5281/zenodo.12345",
    title="Quantum Mechanics and Consciousness",
    content="A paper exploring the relationship between...",
    year=2024,
    prefix="10.5281"
)

# Bulk add
records = [
    ("10.1234/a", "Title A", "Content A", 2023, "10.1234"),
    ("10.1234/b", "Title B", "Content B", 2024, "10.1234"),
]
index.bulk_add(records)

# Search
results = index.search("quantum consciousness")
for r in results:
    print(f"{r['year']} | {r['id']} | {r['title']}")

# Boolean queries (FTS5 syntax)
results = index.search("quantum NOT classical")  # FTS5 NOT is a binary operator
results = index.search('"exact phrase"')
results = index.search("title:quantum")  # Field-specific

# Statistics
stats = index.stats()
print(f"Total records: {stats['total_records']:,}")
```

### Command Line

```bash
# Build index from JSONL files
python -m capraisis build "data/*.jsonl" --output corpus.db

# Search
python -m capraisis search corpus.db "quantum mechanics"
python -m capraisis search corpus.db "neural network" --year 2024

# Show statistics
python -m capraisis stats corpus.db

# Benchmark
python -m capraisis benchmark corpus.db
```

### Building Large Indices (e.g., DataCite)

For large datasets (millions of records), use the `IndexBuilder`:

```python
from capraisis import IndexBuilder

builder = IndexBuilder(
    "datacite.db",
    batch_size=100_000,      # Commit every 100K records
    progress_interval=1_000_000  # Report every 1M records
)

# Build from DataCite JSONL files
stats = builder.add_jsonl_files(
    "/path/to/DataCite/**/*.jsonl",  # Adjust to your extraction path
    resume=True  # Skip already processed files
)

print(f"Indexed {stats['total_records']:,} records in {stats['elapsed_hours']:.2f} hours")
```

### Obtaining DataCite Data

The **DataCite Public Data File** contains metadata for 70M+ DOIs. Download options:

| Source | URL |
|--------|-----|
| **Official Repository** | https://datafiles.datacite.org/ |
| **Internet Archive** | Search for "DataCite Public Data File" |
| **DataCite API** | https://support.datacite.org/docs/api |

**Documentation:** https://support.datacite.org/docs/datacite-public-data-file

**File Format:**
- Large `.tar` file containing compressed NDJSON (newline-delimited JSON)
- Each line = one bibliographic record with DOI, title, description, year, etc.
- Uncompressed size: ~350 GB
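
As a rough sketch of what reading such a file involves, the snippet below streams NDJSON line by line using only the standard library. Everything here is an assumption for illustration: the helper names are hypothetical, gzip is assumed as the compression format, and the field paths (`attributes`, `titles`, `descriptions`, `publicationYear`) follow the DataCite JSON layout as commonly documented, so verify them against the actual data file before relying on them.

```python
import gzip
import json
from pathlib import Path


def iter_ndjson(path):
    """Yield one parsed record per non-empty line of an NDJSON file."""
    path = Path(path)
    # Assumption: compressed files are gzip and end in ".gz".
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def extract_fields(rec):
    """Hypothetical field mapping; check it against the real schema."""
    attrs = rec.get("attributes", {})
    titles = attrs.get("titles") or [{}]
    descriptions = attrs.get("descriptions") or [{}]
    doi = attrs.get("doi", "")
    return (
        doi,                                      # id
        titles[0].get("title", ""),               # title
        descriptions[0].get("description", ""),   # content
        attrs.get("publicationYear"),             # year
        doi.split("/", 1)[0],                     # prefix
    )


sample = {
    "attributes": {
        "doi": "10.5281/zenodo.12345",
        "titles": [{"title": "Example"}],
        "descriptions": [],
        "publicationYear": 2024,
    }
}
print(extract_fields(sample))
```

Streaming one line at a time keeps memory use flat regardless of file size, which matters at ~350 GB.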

**Processing:**
```bash
# Extract the tar file
tar -xf DataCite_Public_Data_File_2024.tar -C /path/to/extraction/

# Then build with CapraISIS
python -m capraisis build "/path/to/extraction/**/*.jsonl" --output datacite.db
```

**Usage:** Subject to [DataCite Data File Use Policy](https://support.datacite.org/docs/datacite-public-data-file).

## Custom Record Extractors

Define your own extractor for non-DataCite formats:

```python
from capraisis import IndexBuilder

def my_extractor(rec: dict):
    """Extract fields from your JSON format."""
    return (
        rec.get('identifier'),      # id
        rec.get('name'),            # title
        rec.get('abstract', ''),    # content
        str(rec.get('year', '')),   # year
        rec.get('source', '')       # prefix
    )

builder = IndexBuilder("my_index.db")
builder.add_jsonl_files("my_data/*.jsonl", extractor=my_extractor)
```

## FTS5 Query Syntax

CapraISIS supports the full SQLite FTS5 query syntax:

| Query | Meaning |
|-------|---------|
| `quantum mechanics` | Both terms (implicit AND) |
| `quantum OR mechanics` | Either term |
| `quantum NOT classical` | Exclude term |
| `"quantum mechanics"` | Exact phrase |
| `quant*` | Prefix match |
| `title:quantum` | Field-specific |
| `NEAR(quantum mechanics, 5)` | Within 5 tokens |
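
The queries in the table can be tried directly against SQLite, without CapraISIS, using the standard `sqlite3` module. This is a sketch under assumptions: the table name `docs`, its two columns, and the `titles` helper are invented for the example, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
conn.executemany(
    "INSERT INTO docs (title, content) VALUES (?, ?)",
    [
        ("Quantum mechanics", "An introduction to quantum theory"),
        ("Classical mechanics", "Newtonian motion, with a quantum aside"),
        ("Quantitative finance", "Pricing models"),
    ],
)


def titles(query):
    """Return the titles of all rows matching an FTS5 query string."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in rows]


for query in (
    "quantum mechanics",           # implicit AND, across all columns
    "quantum NOT classical",       # exclude rows containing 'classical'
    '"quantum mechanics"',         # exact phrase within one column
    "quant*",                      # prefix match
    "title:quantum",               # restrict to the title column
    "NEAR(quantum mechanics, 5)",  # proximity within one column
):
    print(f"{query!r:32} -> {titles(query)}")
```

Note that the implicit AND matches the second row even though its two terms sit in different columns, while the phrase and `NEAR` queries do not.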

## Performance

Benchmarks on 70M DataCite records (15 GB index):

| Query | Results | Time |
|-------|--------:|-----:|
| `polysemanticity` | 847 | 12ms |
| `neural network` | 2.3M | 45ms |
| `climate change` | 890K | 38ms |

**Target: <100ms per query** ✓

### Scaling Proof: O(log k) Complexity

Search time remains **nearly constant** as corpus size increases 500×:

| Records | Avg Search (ms) | Build (s) |
|--------:|----------------:|----------:|
| 1,000 | 0.39 | 0.0 |
| 10,000 | 0.36 | 0.1 |
| 100,000 | 0.29 | 0.7 |
| 500,000 | 0.26 | 3.8 |

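These measurements are consistent with O(log k) retrieval, the defining characteristic of inverted file indexing: average search time stays below 0.4 ms across a 500× increase in corpus size. Extrapolating this trend to 70M records gives ~0.4 ms per search. The times measured on the full 70M-record index (12–45 ms in the Performance table above) are higher, so treat the extrapolation as a lower bound rather than an expected latency.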
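
A scaling measurement of this kind can be approximated with a few lines of standard-library Python. The sketch below is not the benchmark behind the table: the synthetic vocabulary, document length, query set, and the helper names `build_index` and `avg_search_ms` are all assumptions chosen to keep the example small.

```python
import random
import sqlite3
import time


def build_index(n_records, vocabulary, seed=0):
    """Build an in-memory FTS5 table with n_records synthetic documents."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(content)")
    rows = ((" ".join(rng.choices(vocabulary, k=20)),) for _ in range(n_records))
    conn.executemany("INSERT INTO docs (content) VALUES (?)", rows)
    conn.commit()
    return conn


def avg_search_ms(conn, queries, limit=10):
    """Average wall-clock time per query, fetching at most `limit` rows."""
    start = time.perf_counter()
    for q in queries:
        conn.execute(
            "SELECT rowid FROM docs WHERE docs MATCH ? LIMIT ?", (q, limit)
        ).fetchall()
    return (time.perf_counter() - start) * 1000 / len(queries)


vocabulary = [f"term{i}" for i in range(5000)]
queries = random.Random(1).sample(vocabulary, 50)
for n in (1_000, 10_000):
    conn = build_index(n, vocabulary)
    print(f"{n:>7,} records: {avg_search_ms(conn, queries):.3f} ms/query")
    conn.close()
```

The `LIMIT` matters: fetching every match for a common term scales with the size of the result set, so timings with and without a limit are not comparable.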

## Architecture

CapraISIS implements CDS/ISIS principles using modern tools:

| CDS/ISIS (1985) | CapraISIS (2026) |
|-----------------|------------------|
| Master File (.MST) | SQLite database |
| Inverted File (.IFP) | FTS5 virtual table |
| Cross-Reference (.XRF) | B-tree index |
| ISIS Pascal | Python + sqlite3 |

The fundamental insight: **FTS5 IS an inverted file index with B-tree organization.** We're not emulating CDS/ISIS — we're using its spiritual successor.
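
One way to see the inverted file inside FTS5 is the `fts5vocab` virtual table, which exposes the term dictionary of an FTS5 index. The sketch below uses plain `sqlite3`. The table names `records` and `records_terms` are hypothetical and say nothing about the schema CapraISIS actually creates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE records USING fts5(title, content)")
conn.executemany(
    "INSERT INTO records (title, content) VALUES (?, ?)",
    [
        ("Quantum mechanics", "wave functions"),
        ("Classical mechanics", "rigid bodies"),
    ],
)

# fts5vocab in 'row' mode returns one row per distinct term:
# doc = number of rows containing the term, cnt = total occurrences.
conn.execute("CREATE VIRTUAL TABLE records_terms USING fts5vocab(records, row)")
vocab = conn.execute(
    "SELECT term, doc, cnt FROM records_terms ORDER BY term"
).fetchall()

for term, doc, cnt in vocab:
    print(f"{term:<10} docs={doc} occurrences={cnt}")
```

Each row corresponds to a dictionary entry in the CDS/ISIS sense: a term plus a summary of its postings.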

## History

CDS/ISIS was developed by UNESCO in 1985 as a text database system for libraries and documentation centres. It introduced several innovations:

- **Inverted file indexing**: O(log k) term lookup
- **Variable-length fields**: No fixed schema
- **Boolean retrieval**: AND, OR, NOT operators
- **Repeatable fields**: Multiple authors, subjects

These principles remain the foundation of modern search engines. CapraISIS honours this heritage while providing a modern, portable implementation.

## Why Not Other Libraries?

Several Python full-text search libraries exist. Here's why CapraISIS chose SQLite FTS5:

| Library | Pros | Cons |
|---------|------|------|
| **Whoosh** | Pure Python, feature-rich | No releases since 2016, memory-heavy, no async |
| **Elasticsearch** | Powerful, scalable | Server-based, operational complexity, overkill for local use |
| **Xapian** | Fast, mature | C++ bindings, installation complexity |
| **SQLite FTS5** | Zero-config, stdlib, single-file | Less feature-rich than Elasticsearch |

**Decision rationale:**
- **Zero dependencies**: CapraISIS uses only Python's built-in `sqlite3` module
- **Single-file portability**: Copy one `.db` file, search anywhere
- **Headroom at scale**: SQLite's documented maximum database size is about 281 TB
- **Active maintenance**: SQLite is actively developed (unlike Whoosh)

For small to medium corpora (<100M records), FTS5 delivers Elasticsearch-class performance with shell-script simplicity.

## License

MIT License. Use at your own risk.

## Citation

If you use CapraISIS in research, please cite:

```bibtex
@software{capraisis2026,
  author = {Caprazli, Kafkas M.},
  title = {CapraISIS: Modern CDS/ISIS Implementation},
  year = {2026},
  url = {https://github.com/capraCoder/capraisis}
}
```

**Author ORCID:** [0000-0002-5744-8944](https://orcid.org/0000-0002-5744-8944)

## See Also

- [UNESCO CDS/ISIS](https://wayback.archive-it.org/all/20110128100935/http://portal.unesco.org/ci/en/ev.php-URL_ID=2071&URL_DO=DO_TOPIC&URL_SECTION=201.html) - Original system (archived)
- [SQLite FTS5](https://www.sqlite.org/fts5.html) - Backend engine
- Caprazli, K. M. (2025). *Achieving the Neural Frontier: The LLM Race for Scalable Retrieval*. Zenodo. https://doi.org/10.5281/zenodo.18202850

## Historical Reference: Original WinISIS

The original UNESCO CDS/ISIS software is preserved in the [`historical/`](historical/) folder for educational purposes:

- `Winisis1_4.zip` — Original WinISIS 1.4 installer (UNESCO, ~1995)
- `ctl3d.dll` — Required Windows dependency

**Copyright:** CDS/ISIS is a UNESCO product, provided free for non-commercial use. UNESCO retains all intellectual property rights.

**Note:** WinISIS is 16-bit legacy software requiring emulation on modern Windows. CapraISIS provides the same inverted-file indexing principles using modern SQLite FTS5 — no emulation required.

---

*"The path forward is the Neural-to-Symbolic Bridge: using the semantic brilliance of modern AI to populate the structural perfection of classical indexing."*
