Metadata-Version: 2.4
Name: cifflow
Version: 0.1.13
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: duckdb
Requires-Dist: pyarrow
Requires-Dist: pytest>=8.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=5.0 ; extra == 'dev'
Requires-Dist: bump-my-version ; extra == 'dev'
Requires-Dist: ruff>=0.4 ; extra == 'dev'
Requires-Dist: pydoclint[flake8]>=0.4 ; extra == 'dev'
Requires-Dist: mkdocs>=1.6 ; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5 ; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25 ; extra == 'docs'
Provides-Extra: dev
Provides-Extra: docs
License-File: LICENSE
Summary: Parse, store, validate, and emit Crystallographic Information Files (CIF)
License-Expression: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# cifflow

Parse, store, validate, and emit Crystallographic Information Files (CIF).

[![PyPI](https://img.shields.io/pypi/v/cifflow)](https://pypi.org/project/cifflow/)
[![CI](https://github.com/rowlesmr/cifflow/actions/workflows/ci.yml/badge.svg)](https://github.com/rowlesmr/cifflow/actions/workflows/ci.yml)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/cifflow)](https://pypi.org/project/cifflow/)
[![License](https://img.shields.io/pypi/l/cifflow)](https://pypi.org/project/cifflow/)

---

## What it does

- Focused on multi-block powder CIF files, but generally applicable
- Constructs CIF files
   - Parses CIF 1.1 and CIF 2.0 files, including all string types (triple-quoted, multiline text fields, embedded quotes) and save frames
   - Constructs `CifFile` objects programmatically from Python values (`CifWriter`), and performs arbitrary edits: add/remove/rename tags, loops, blocks, and save frames
   - Populate default values as defined in the dictionary
- Loads DDLm dictionaries with full `_import.get` resolution, producing a typed schema
- Removes common parse-time artefacts automatically (`clean`): orphan error tags, duplicate blocks/save frames/tags, loop padding; for anything beyond these automatic fixes, use `CifWriter`
- Ingests parsed or constructed CIF data into DuckDB using the dictionary-derived schema: one table per category, foreign keys enforced, unknown tags routed to a fallback tier
- Emits valid CIF from a populated database in four modes: ORIGINAL, GROUPED, ONE_BLOCK, ALL_BLOCKS
- Trusts the user — if you pass in multiple blocks, cifflow assumes they all belong together and, failing key value clashes, can be interpreted as a single database/experiment
- Visualises a schema as a Graphviz DOT string or a self-contained interactive HTML file

---

## Documentation

The full API reference is at [rowlesmr.github.io/cifflow](https://rowlesmr.github.io/cifflow/).


---

## Key properties

**Error-tolerant.** The parser never raises on malformed input. Every structural problem produces an 
explicit error event; parsing continues and all recoverable data is preserved.

**No silent data loss.** Duplicate tag values are preserved. Tags not mapped by the dictionary go to a 
fallback table, not a discard pile.

**Round-trip fidelity.** For well-formed input, emitted CIF re-parses to the same data. All values are 
stored and emitted as raw strings; `ValueType` provenance (placeholder `.` and `?` vs quoted equivalents) 
is preserved throughout.

**Canonical caseless names.** Block names, save frame names, and tag names are stored in Unicode canonical 
caseless form (`NFC(casefold(NFD(x)))`). Lookups are automatically casefolded: `cif["ABC"]` finds a block 
stored as `"abc"`.

**Streaming parser.** The parser is event-driven. CIF source is consumed in a single pass; the IR accumulates 
events incrementally. The Rust extension provides high-throughput Arrow output without any Python file objects.

---

## Installation

Prebuilt wheels are available for Python 3.10 -- 3.14 with MacOS, Windows, and Linux. 

```
pip install cifflow
```

To install from source (includes the Rust extension):

```
git clone https://github.com/rowlesmr/cifflow.git
cd cifflow
pip install -e ".[dev]"
maturin develop
```

---

## Quick start

### Parse a CIF file

```python
from cifflow import build

text = open('structure.cif', encoding='utf-8').read()
cif, errors = build(text)   # never raises; errors is a list[ParseError]

for block_name in cif.blocks:          # block names are always lowercase
    block = cif[block_name]
    print(f'{block_name}: {len(block.tags)} tags, {len(block.loops)} loops')
```

The best way to resolve errors is to inspect the list of errors, edit the
file accordingly, and try again. No assumptions are made about how to correct
errors automatically.


### Full pipeline: dictionary → DuckDB → CIF

To see what is going on in the lexing, parsing, ingestion, and schema generation phases, see the inspect module.

```python
import pathlib
from cifflow import (
    DictionaryLoader, directory_resolver,
    save_dictionary, load_dictionary,
    generate_schema,
    build, ingest, emit, EmitMode,
)
from cifflow.types import CifVersion

# 1. Load dictionary (with JSON cache to avoid re-parsing on every run)
cache = pathlib.Path('cif_pow_cache.json')
resolver = directory_resolver('data/dictionaries')
if cache.exists():
    dictionary = load_dictionary(cache)
else:
    dictionary = DictionaryLoader(resolver=resolver).load(
        open('data/dictionaries/cif_pow.dic', encoding='utf-8').read())
    save_dictionary(dictionary, cache)

# 2. Derive schema
schema = generate_schema(dictionary)

# 3. Parse CIF
cif, errors = build(open('all_the_data.cif', encoding='utf-8').read())

# # 3.5 Edit CIF - to fix errors or alter content
# writer = CifWriter(cif.version, cif)
# for block in writer.blocks:
#    # Do alterations
# cif = writer.build() # raises ValueError if any errors present

# 4. Ingest into an in-memory DuckDB database
#    Pass a file path string to persist: ingest(cif, 'output.db', schema=schema)
conn, warnings = ingest(cif, schema=schema)

# # 4.5 Generate any default values
# generate_defaults(conn, schema)

# 5. Emit CIF
output = emit(conn, schema, mode=EmitMode.ORIGINAL, version=CifVersion.CIF_2_0)
open('output.cif', 'w', encoding='utf-8').write(output)
```

See `example_workflow.py` in the repository root for a fully annotated end-to-end demonstration covering 
all four emission modes, type-cast export, and fidelity checking.

The full API reference is at [rowlesmr.github.io/cifflow](https://rowlesmr.github.io/cifflow/).

---

## Architecture

```
Parser → Event Stream → IR → Dictionary-aware Mapping → DuckDB → Output/API
```

| Layer         | Responsibility |
|---------------|---|
| Lexer         | Tokenisation, `ValueType` assignment |
| Parser        | Token sequence interpretation, error recovery, event emission |
| IR (CIFModel) | Event accumulation, loop validation, multiline text transformation |
| Dictionary    | DDLm parsing, schema derivation |
| DuckDB        | Persistent storage: structured tables when a dictionary is present, fallback tier otherwise |
| Output        | Valid CIF regeneration |

Layer responsibilities are strictly separated. The parser does not know about the dictionary. The dictionary 
does not know about the IR. The output layer only reads from DuckDB.

---

## Status

Everything should work. Issues, suggestions, and requests gladly recieved.

---

## Development

Run the fast test suite (excludes tests that load large real-world CIF files):

```
python -m pytest -m "not slow"
```

Run the full suite including slow tests:

```
python -m pytest
```

After modifying the Rust extension, recompile before running Python tests:

```
maturin develop
```

---

## License

Apache 2.0. See `LICENSE`.

The bundled JavaScript files (`viz.js` 2.1.2 and `svg-pan-zoom` 3.6.1) used by
`visualise_schema_html` are MIT-licensed. Licence notices are in
`src/cifflow/dictionary/js/LICENSES.txt`.

