Metadata-Version: 2.4
Name: fast-gram
Version: 0.1.0
Summary: High-performance memory-mapped n-gram engine for large text corpora
Author-email: jaso1024 <jabohwho@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Jaso1024/Fastgram
Project-URL: Repository, https://github.com/Jaso1024/Fastgram
Project-URL: Issues, https://github.com/Jaso1024/Fastgram/issues
Keywords: ngram,suffix-array,nlp,tokenization,infingram
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: transformers>=4.0
Dynamic: license-file
Dynamic: requires-python

# Fastgram

High-performance memory-mapped n-gram engine compatible with InfiniGram-style shard directories (`tokenized.*`, `table.*`, `offset.*`).

Build:

`cmake -S . -B build -DCMAKE_BUILD_TYPE=Release`

`cmake --build build -j`

Test:

`ctest --test-dir build --output-on-failure`

Tools:

- `tg_rpc`: stdin/stdout RPC for benchmarking and integration
- `tg_query`: quick CLI query helper
- `tg_build_unigram_ranges`: build `unigram_ranges.bin` for faster unigram range lookup
- `tools/run_bench.py`: benchmark runner (uses `bench/bench_config.json`)
- `tools/gen_bench_queries.py`: build a deterministic query suite for coverage
- `tools/run_bench_suite.py`: suite runner (uses `bench/bench_suite_config.json`)
- `tg_slice_index`: build deterministic slices for build benchmarks
- `tg_build_index`: build table/full index (benchmark target)
- `tools/run_build_bench.py`: index build benchmark runner (uses `bench/build_bench_config.json`)
- `tools/verify_built_index.py`: correctness check for build outputs

Benchmarking:

Query benchmarks measure `find` and `ntd` operation performance:
- `python tools/run_bench.py` - runs find/ntd benchmarks using `bench/bench_config.json`
- `python tools/run_bench_suite.py` - runs the comprehensive suite using `bench/bench_suite_config.json`

Build benchmarks measure index construction performance:

1. Create test slices from an existing index:
```bash
# Small slice: 2000 docs, token_width=2 (u16)
./build/tg_slice_index <source_index_dir> bench/build_inputs/small 2000 2

# Medium slice: 20000 docs, token_width=2 (u16)
./build/tg_slice_index <source_index_dir> bench/build_inputs/medium 20000 2
```

2. Build reference indices:
```bash
# token_width=2, version=4, mode=table_only, ram_cap=8 GiB (8589934592 bytes)
./build/tg_build_index bench/build_inputs/small bench/build_refs/small 2 4 table_only 8589934592
./build/tg_build_index bench/build_inputs/medium bench/build_refs/medium 2 4 table_only 8589934592
```

3. Run build benchmarks:
```bash
python tools/run_build_bench.py
```
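
The numeric arguments in the commands above can be checked with a short sketch. The helper names here are hypothetical and not part of Fastgram; the arithmetic only assumes that `token_width` is a byte count for unsigned token ids and that the RAM cap is given in bytes.

```python
# Sketch: sanity-check the numeric arguments passed to the build tools.
# Helper names are illustrative only, not part of the Fastgram API.
GIB = 1024 ** 3


def ram_cap_bytes(gib: int) -> int:
    """Convert a cap expressed in GiB to a byte count."""
    return gib * GIB


def max_token_id(token_width: int) -> int:
    """Largest unsigned token id that fits in `token_width` bytes."""
    return (1 << (8 * token_width)) - 1


if __name__ == "__main__":
    print(ram_cap_bytes(8))  # 8589934592
    print(max_token_id(2))   # 65535
```

With `token_width=2`, a tokenizer whose vocabulary exceeds 65536 entries would not fit in a u16 token.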

Notes:
- Build scripts auto-detect the build directory or use the `GRAM_BUILD_DIR` environment variable
- `ram_cap_bytes` in the configs is 8589934592 (8 GiB) to limit memory during benchmarking
- Generate query suites with `python tools/gen_bench_queries.py --index-dir <path> --eos <eos_id> --vocab <vocab_size>`
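
One way a script could resolve the build directory is sketched below. This is an assumption about the behavior, not the code in `tools/`: the function name, the precedence of `GRAM_BUILD_DIR` over auto-detection, and the `build` candidate directory are all hypothetical.

```python
import os
from pathlib import Path


def resolve_build_dir(repo_root: Path, candidates=("build",)) -> Path:
    """Hypothetical helper: honor GRAM_BUILD_DIR, else pick an existing candidate."""
    # Assumption: an explicit environment override wins over auto-detection.
    override = os.environ.get("GRAM_BUILD_DIR")
    if override:
        return Path(override)
    # Assumption: auto-detection means "first candidate directory that exists".
    for name in candidates:
        path = repo_root / name
        if path.is_dir():
            return path
    raise FileNotFoundError("no build directory found; set GRAM_BUILD_DIR")
```

For example, `GRAM_BUILD_DIR=/path/to/build python tools/run_build_bench.py` would point the runner at a non-default build tree.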

Python:

`python -m pip install -e .`

`python -c "from fastgram import GramEngine; print(GramEngine)"`

Download indices:

Requires AWS CLI (`aws`).

`gram`  # interactive

`gram list`

`gram download v4_pileval_gpt2 --to index/v4_pileval_gpt2`
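
Since downloads depend on the AWS CLI, a quick pre-flight check can save a failed run. This is a minimal sketch using only the standard library; `require_tool` is a hypothetical helper, not something `gram` exposes.

```python
import shutil


def require_tool(name: str, which=shutil.which) -> str:
    """Return the full path of `name` on PATH, or raise if it is missing."""
    path = which(name)
    if path is None:
        raise RuntimeError(f"{name!r} not found on PATH; install it first")
    return path


if __name__ == "__main__":
    # Raises RuntimeError when the AWS CLI is not installed.
    print(require_tool("aws"))
```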

Run:

`gram run --index index/v4_pileval_gpt2 --prompt "natural language processing"`

Interactive run:

Start `gram`, then choose `1` (run).

Settings in run mode:

`/settings`

`/set topk 50`

`/set temperature 0.8`

`/gen 20 hello world`

Notes:

- Uses the tokenizer specified for the index in the catalog.
- Some tokenizers require `HF_TOKEN` (for gated models like Llama-2).
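
To find out early whether a token is available for a gated tokenizer, a script can read `HF_TOKEN` from the environment before starting. The helper below is a hypothetical sketch and does not reflect how `gram` itself reads the variable.

```python
import os


def hf_token_from_env(env=os.environ):
    """Return the HF_TOKEN value with whitespace stripped, or None if unset or empty."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if hf_token_from_env() is None:
        print("HF_TOKEN is not set; gated tokenizers may fail to load")
```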
