Metadata-Version: 2.4
Name: pycrowley
Version: 0.1.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing
Classifier: Typing :: Typed
License-File: LICENSE-APACHE
License-File: LICENSE-MIT
Summary: A high-performance streaming JSON query engine for out-of-memory files
Keywords: json,query,streaming,search,big-data
License-Expression: MIT OR Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Documentation, https://codeberg.org/nrposner/crowley
Project-URL: Issues, https://codeberg.org/nrposner/crowley/issues
Project-URL: Repository, https://codeberg.org/nrposner/crowley

# crowley

A high-performance JSON querying engine designed for fast starts, low and flat memory usage, and streaming over files too large to fit in memory.

It is primarily designed to substitute for `ijson`. If you're coming to `crowley` from `ijson`, see the IJSON Migration Guide.

Written in Rust, with a SAX-style JSON event parser adapted from the [`json-event-parser`](https://crates.io/crates/json-event-parser) crate and a regular expression query language adapted from the [`jsongrep`](https://crates.io/crates/jsongrep) crate.

## Use cases

`crowley` is optimized for the following scenarios:

- **Queries over files too large to fit comfortably in memory.** `crowley` streams through JSON files with bounded memory regardless of file size. A 37 GB file uses ~30 MB of RAM.
- **Queries on transient data.** `crowley` quickly queries data that does not merit transformation into a more easily queried structure such as a database or dataframe, whether because of time constraints or because the data is sensitive and cannot be loaded into an external application.
- **Queries over heterogeneous, deeply-nested, and schemaless data** which tools such as `pandas`, `polars`, or `duckdb` cannot ingest and transform. `crowley`'s regular-language queries don't require schema inference.
- **Queries over many files in parallel.** `crowley` natively supports searching over many files with the same query, using either a list of file paths or a pattern match. These files will be searched in parallel more quickly and with less memory overhead than `ijson` with a ProcessPool.

## Usage

### Single-file search
```python
from crowley import Query

names = Query("data.json", "users[*].name")
ages = Query("data.json", "users[*].age")

names.count()       # 4
names.exists()      # True
names.values()      # ['Alice', 'Bob', 'Charlie', 'Diana']
ages.values()       # [30, 25, 35, 28]
names.agg("sum")    # nan
ages.agg("sum")     # 118.0
names.types()       # ['string']
ages.types()        # ['number']
names.mode()        # {'values': ['Alice', 'Bob', 'Charlie', 'Diana'], 'frequency': 1}
```

### Multi-file search
```python
from crowley import Query

repo_names = Query("tests/github_daily_jsonl/2015*", "[*].repo.name")

repo_names.count() # [7702, 7427, 7234, 7387, 8273, 8971, 10307, 11351, 11749, 11961, 12229, 12314, 6743, 12442, 13111, 12473, 11601, 5971, 5869, 5887, 8322, 7105, 6139, 6371]
repo_names.total_count() # 218939
repo_names.total_unique() # 65703
repo_names.mode()[0] # {'values': ['KenanSulayman/heartbeat'], 'frequency': 79}
```

## Query language

The [query language](https://github.com/micahkepe/jsongrep?tab=readme-ov-file#query-syntax) uses a regular-expression-inspired syntax for navigating JSON structure:

| Query | Meaning |
|-------|---------|
| `name` | Field `name` in the root object |
| `address.street` | Field `street` inside `address` |
| `users[*].name` | `name` field of every element in `users` array |
| `*` | Any field in the root object |
| `[*]` | Any element in the root array |
| `users[0]` | First element of `users` |
| `users[1:3]` | Elements at indices 1 and 2 |
| `(name \| age)` | Either `name` or `age` |
| `(* \| [*])*` | Any value at any depth (recursive descent) |
| `a?` | The value of `a`, if it exists |
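
To make the table concrete, here is a plain-Python sketch of what a few of these queries select. It uses the standard `json` module on a small in-memory document whose shape is hypothetical (modeled on the values shown in the Usage section); it illustrates the meaning of the queries only, not how `crowley` evaluates them, since `crowley` streams rather than loading the document.

```python
import json

# Hypothetical document, shaped to match the values in the Usage example.
doc = json.loads("""
{
  "users": [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
    {"name": "Diana", "age": 28}
  ]
}
""")

# users[*].name : the `name` field of every element in `users`
all_names = [user["name"] for user in doc["users"]]

# users[0] : the first element of `users`
first_user = doc["users"][0]

# users[1:3] : elements at indices 1 and 2 (end index excluded, like a Python slice)
middle_users = doc["users"][1:3]

# users[*].(name | age) : either field of every element, when present
names_or_ages = [
    user[key] for user in doc["users"] for key in ("name", "age") if key in user
]

print(all_names)
print([user["name"] for user in middle_users])
print(len(names_or_ages))
```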

## Performance

Benchmarks measured on a Mac M3 Max with 32GB of RAM:

```
File: Flat GitHub log data, 34GB
Query: [*].repo.name

Count matches:
    crowley: 71.6s
    ijson: 128.8s
    Difference: 1.8x

Return matches:
    crowley: 116.0s
    ijson: 126.1s
    Difference: 1.09x

Return unique values:
    crowley: 125.7s
    ijson: 129.5s
    Difference: 1.03x

Return unique count:
    crowley: 122.1s
    ijson: 129.5s
    Difference: 1.06x

File: Nested GeoJSON, 30MB
Query: features[*].properties.name

Count matches:
    crowley: 138.44ms
    ijson: 421.85ms
    Difference: 3.0x

Existence check (true):
    crowley: 16µs
    ijson: 793µs
    Difference: 49x

Query: features[*].properties.scalerank

Sum matches:
    crowley: 184.88ms
    ijson: 425.89ms
    Difference: 2.3x

Query: features[*].properties.nonexistent

Existence check (false):
    crowley: 138.9ms
    ijson: 409.7ms
    Difference: 2.9x
```

On queries where the objective is to return values, `crowley` outperforms `ijson` by 3-10%. Where only a summary such as a count or an aggregate sum is returned, `crowley` often outperforms `ijson` by 2-3x, because it avoids materializing values unnecessarily.
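
The "Difference" figures in the benchmark listing appear to be the `ijson` time divided by the `crowley` time. The short check below recomputes them for the 34GB file, taking all four timings to be in seconds; the dictionary keys are labels chosen here for illustration.

```python
# (crowley seconds, ijson seconds), copied from the 34GB benchmark above
timings = {
    "count matches": (71.6, 128.8),
    "return matches": (116.0, 126.1),
    "return unique values": (125.7, 129.5),
    "return unique count": (122.1, 129.5),
}

# Speedup = how many times longer ijson takes than crowley
speedups = {
    name: ijson_s / crowley_s for name, (crowley_s, ijson_s) in timings.items()
}

for name, ratio in speedups.items():
    extra = (ratio - 1) * 100  # percent longer that ijson takes
    print(f"{name}: {ratio:.2f}x (ijson takes {extra:.0f}% longer)")
```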

But the real benefit comes from `crowley`'s more expressive query language, which can efficiently express what would otherwise require Python loops around `ijson`.

It can extract multiple fields through disjunctions (at one or multiple levels) in a single pass without having to materialize the parent object:

```python
# get the number of matching objects
# 133.6ms
crowley.Query(file_str, "features[*].properties.(name | admin)").count()

# get the unique matching values
# 144.2ms
crowley.Query(file_str, "features[*].properties.(name | admin)").unique_values()

# get the number of matching objects
# 851.6ms
def ijson_two_passes():
    with open(file_str, "rb") as f:
        count1 = sum(1 for _ in ijson.items(f, "features.item.properties.name"))
    with open(file_str, "rb") as f:
        count2 = sum(1 for _ in ijson.items(f, "features.item.properties.admin"))
    return count1 + count2
ijson_two_passes()

# get the unique matching values
# 430ms
def ijson_two_fields():
    names = set()
    with open(file_str, "rb") as f:
        for obj in ijson.items(f, "features.item.properties"):
            if "name" in obj:
                names.add(obj["name"])
            if "admin" in obj:
                names.add(obj["admin"])
    return names
ijson_two_fields()
```

It can extract all property values without internal iteration:

```python
# get the number of all matching property values by query
# 133.9ms
crowley.Query(file_str, "features[*].properties.*").count()

# get the number of all matching properties by internal iteration
# 427.9ms
def ijson_all_props():
    count = 0
    with open(file_str, "rb") as f:
        for obj in ijson.items(f, "features.item.properties"):
            count += len(obj)
    return count
ijson_all_props()
```

It can select ranges of array elements without manual index checking:

- Note: this is one of the few places where `crowley` can be slower. If the array range is not at the root level, `ijson` with Python `break` logic can stop sooner, while `crowley` must continue parsing the outer structure. For root-level array ranges, `crowley` remains faster. Applying the `ijson` approach to `crowley` (querying `[*]`, checking indices manually, and breaking out) does not help: it is slower than a range query and even slower than a full `[*]` scan.

```
Root-level array (github_array.json):
    crowley [0:3]: 22µs (crowley terminates early more quickly)
    ijson [0:3]+break: 234µs
    Difference: 10.6x

    crowley [97:102]: 464µs (crowley terminates early more quickly)
    ijson [97:102]+break: 923µs
    Difference: 1.98x

    crowley [*] (full): 49.4ms
    crowley [*]+break: 60.9ms

Nested array (ne_10m.json):
    crowley [0:3]: 131.4ms
    ijson [0:3]+break: 847µs (ijson is able to short-circuit faster!)
    Difference: 0.006x

    crowley [97:102]: 133.8ms
    ijson [97:102]+break: 11.5ms (ijson is able to short-circuit faster!)
    Difference: 0.086x
```

```python
# start of array
crowley.Query(file_str, "features[0:3].properties.name", no_seek=True).values()

# middle of array
crowley.Query(file_str, "features[97:102].properties.name", no_seek=True).values()

def ijson_range_start():
    result = []
    with open(file_str, "rb") as f:
        for i, name in enumerate(ijson.items(f, "features.item.properties.name")):
            if i < 3:
                result.append(name)
            else:
                break
    return result
ijson_range_start()

def ijson_range_mid():
    result = []
    with open(file_str, "rb") as f:
        for i, name in enumerate(ijson.items(f, "features.item.properties.name")):
            if 97 <= i < 102:
                result.append(name)
            if i >= 101:
                break
    return result
ijson_range_mid()
```

It can even descend recursively, which `ijson` simply cannot do. Without `crowley`, this requires a non-streaming solution such as `json`, which loads the whole file into memory.

```python
# get unique values of 'type' at any depth 
# 221.8ms : ['FeatureCollection', 'name', 'Feature', 'Polygon']
crowley.Query(file_str, "(* | [*])*.type", no_seek=True).unique_values()

# get count of all matching objects at all depths
# 156.7ms : 17090
crowley.Query(file_str, "(* | [*])*.type", no_seek=True).count()

# walk the entire json tree manually looking for matching keys
# 509.8ms
import json
def json_recursive_search(key):
    with open(file_str) as f:
        data = json.load(f)

    results = []
    def walk(obj):
        if isinstance(obj, dict):
            for k, v in obj.items():
                if k == key:
                    results.append(v)
                walk(v)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)
    walk(data)
    return results

values = json_recursive_search("type")
unique = set(str(x) for x in values)
```

### Cold vs Hot Start

On cold starts (first query, no prior loading), `crowley` is **2-3x faster than pandas**, **3-7x faster than DuckDB**, and handles files that make Polars fail entirely due to schema inference errors.

On subsequent calls, methods such as `count()` or `exists()` return their pre-computed answer in **O(1)** with zero file I/O. Other methods like `types()` and `agg()` will determine whether reading only matched byte positions will be faster than a full sequential scan. 
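
The idea of answering repeat calls from cached byte offsets can be illustrated with a toy model. The sketch below is not `crowley`'s implementation: `CachedLineQuery` is a hypothetical class that works on a JSON Lines file using only the standard library. It records where matching lines sit during the first scan, then answers `count()` from the cache and `values()` by seeking straight to the recorded positions.

```python
import json
import os
import tempfile


class CachedLineQuery:
    """Toy model: find a top-level field in a JSONL file and remember where it matched."""

    def __init__(self, path, field):
        self.path = path
        self.field = field
        self._spans = None  # (start, length) byte spans, filled by the first scan

    def _scan(self):
        spans = []
        pos = 0
        with open(self.path, "rb") as f:
            for line in f:
                if line.strip() and self.field in json.loads(line):
                    spans.append((pos, len(line)))
                pos += len(line)
        self._spans = spans

    def count(self):
        if self._spans is None:
            self._scan()  # cold start: one full sequential pass
        return len(self._spans)  # hot: answered from the cache, no file I/O

    def values(self):
        if self._spans is None:
            self._scan()
        out = []
        with open(self.path, "rb") as f:
            for start, length in self._spans:
                f.seek(start)  # jump directly to each cached match
                out.append(json.loads(f.read(length))[self.field])
        return out

    def clear_cache(self):
        self._spans = None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"id": 1, "score": 10}\n{"id": 2}\n{"id": 3, "score": 5}\n')

    query = CachedLineQuery(path, "score")
    first = query.count()    # scans the file
    again = query.count()    # served from the cached spans
    scores = query.values()  # seeks only to the two matching lines

print(first, again, scores)
```

Seeking pays off when matches are sparse; when most of the file matches, a sequential scan reads the same bytes with less overhead, which is the trade-off described above.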

However, on very large files with a large volume of matches, the cached byte offsets for matches can considerably exceed the memory usage from streaming itself, and these offsets remain in the Query object until it is dropped. The query's cache can be manually cleared with `.clear_cache()`, and cache accumulation can be deactivated at query creation with the `no_seek=True` kwarg. This can be configured globally with `crowley.configure(no_seek=True)`.
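
A rough estimate shows when the offset cache starts to matter. The per-match cost below is an assumption made for illustration (two 8-byte integers per match); the actual layout of `crowley`'s cache is not documented here. The match count comes from the multi-file example above, and the streaming footprint is the ~30 MB figure quoted under "Use cases".

```python
BYTES_PER_MATCH = 16            # assumption: start and end offset, 8 bytes each
STREAMING_BYTES = 30 * 10**6    # ~30 MB streaming footprint quoted above


def cache_bytes(matches, bytes_per_match=BYTES_PER_MATCH):
    """Estimated memory held by cached offsets for a given number of matches."""
    return matches * bytes_per_match


github_matches = 218_939  # total_count() from the multi-file example
github_cache_mb = cache_bytes(github_matches) / 10**6

# Number of matches at which the cache equals the streaming footprint
break_even_matches = STREAMING_BYTES // BYTES_PER_MATCH

print(f"{github_cache_mb:.1f} MB cached for {github_matches} matches")
print(f"cache equals streaming memory at {break_even_matches} matches")
```

Under these assumptions, a query with a few million matches would hold more memory in cached offsets than the streaming pass itself uses, which is the situation `no_seek=True` and `.clear_cache()` are meant to address.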

## Acknowledgments

Built on the DFA-based query engine from [jsongrep](https://github.com/micahkepe/jsongrep) by Micah Kepe, and the SAX parser from [json-event-parser](https://github.com/oxigraph/json-event-parser) by the Oxigraph project.

This project benefits not only from the work of other developers, but also from their choice to make their source code public and freely reusable under the MIT and Apache-2.0 licenses.

