Metadata-Version: 2.3
Name: oasnap-reader
Version: 0.1.0
Summary: Read full OpenAlex snapshot and convert to reduced dataset.
Author: Malte Vogl
Author-email: Malte Vogl <vogl@gea.mpg.de>
Requires-Dist: duckdb>=1.0
Requires-Dist: pandas>=2.0
Requires-Dist: tqdm>=4.0
Requires-Python: >=3.12
Description-Content-Type: text/markdown

# oasnap-reader

Read a full [OpenAlex](https://openalex.org) snapshot and convert it to a reduced dataset for analysis.

The reduced dataset keeps only works that are not paratext and that carry subfield information.
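
As an illustration of that selection rule, here is a minimal sketch of a per-record filter. It is not the package's actual implementation: the helper names `has_subfield` and `keep_work` are hypothetical, and the field names (`is_paratext`, `topics`, `primary_topic`, `subfield`) are assumed from the OpenAlex work schema, so check them against the snapshot version you use.

```python
def has_subfield(work: dict) -> bool:
    """Return True if any topic attached to the work has a subfield entry.

    Assumption: topics live under "topics" (a list) and "primary_topic"
    (a single object), each with an optional "subfield" object.
    """
    topics = list(work.get("topics") or [])
    primary = work.get("primary_topic")
    if primary:
        topics.append(primary)
    return any(topic.get("subfield") for topic in topics)


def keep_work(work: dict) -> bool:
    """Keep works that are not flagged as paratext and have a subfield.

    A missing or null "is_paratext" flag is treated as "not paratext" here,
    which is a choice made for this sketch only.
    """
    return not work.get("is_paratext") and has_subfield(work)
```

A record would then pass the filter only if `keep_work(record)` is true; everything else is dropped from the reduced dataset.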

Conversion time depends on the available hardware. As a reference point, a full conversion of more than 2000 files took about 2 hours with 24 workers and sufficient RAM.

## Install

```bash
pip install oasnap-reader
```

## Usage

```python
from pathlib import Path
from oasnap_reader.reader import ReadGZ

reader = ReadGZ(
    in_path=Path("/data/openalex/works"),
    out_path=Path("/data/reduced"),
)
reader.read_all()
```

`in_path` should point to the root of the OpenAlex `works` snapshot directory.
Output is one gzip-compressed JSONL file per input file, written to `out_path`.
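
Because the output is gzip-compressed JSONL, it can be read back with the standard library alone. The following is a sketch under assumptions: the helper `iter_records` is hypothetical and not part of `oasnap_reader`, and it assumes the output files end in `.gz` and sit somewhere below `out_path`.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_records(out_path: Path) -> Iterator[dict]:
    """Yield one dict per JSON line from every .gz file below out_path.

    Assumption: output files use a ".gz" suffix; adjust the pattern if the
    actual file names differ.
    """
    for gz_file in sorted(out_path.rglob("*.gz")):
        # "rt" decompresses and decodes to text, so lines can be parsed directly.
        with gzip.open(gz_file, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)
```

For example, `sum(1 for _ in iter_records(Path("/data/reduced")))` would count the records in the reduced dataset without loading it into memory at once.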

## Documentation

See the full documentation, including usage options and the API reference, at the project docs site.

## Development

```bash
git clone https://gitlab.gwdg.de/mpigea/dt/oasnap-reader
cd oasnap-reader
uv sync --dev
```
