Metadata-Version: 2.4
Name: pyframe-xpy
Version: 0.1.2
Summary: High-performance parallel dataframe and array processing with Arrow-backed storage
Author: FrameX Contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/aeiwz/FrameX
Project-URL: Repository, https://github.com/aeiwz/FrameX
Project-URL: Issues, https://github.com/aeiwz/FrameX/issues
Project-URL: Documentation, https://github.com/aeiwz/FrameX/tree/main/docs
Keywords: dataframe,array,analytics,arrow,dask,ray,numpy,parallel
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyarrow>=14.0
Requires-Dist: numpy>=1.24
Provides-Extra: pandas-compat
Requires-Dist: pandas>=2.0; extra == "pandas-compat"
Provides-Extra: distributed
Requires-Dist: dask[dataframe,distributed]>=2024.1.0; extra == "distributed"
Requires-Dist: ray[data]>=2.9.0; extra == "distributed"
Provides-Extra: accel
Requires-Dist: numexpr>=2.9; extra == "accel"
Requires-Dist: numba>=0.59; extra == "accel"
Provides-Extra: gpu
Requires-Dist: cupy-cuda12x>=13.0; platform_system != "Windows" and extra == "gpu"
Provides-Extra: ml-accel
Requires-Dist: torch>=2.2; extra == "ml-accel"
Requires-Dist: jax>=0.4.30; extra == "ml-accel"
Provides-Extra: pandas-fast
Requires-Dist: modin[ray]>=0.30; extra == "pandas-fast"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-benchmark; extra == "dev"
Requires-Dist: hypothesis; extra == "dev"
Provides-Extra: bench
Requires-Dist: matplotlib>=3.8; extra == "bench"
Requires-Dist: psutil>=5.9; extra == "bench"
Provides-Extra: release
Requires-Dist: build>=1.2.2; extra == "release"
Requires-Dist: twine>=5.1.1; extra == "release"
Dynamic: license-file

# FrameX

FrameX is an Arrow-backed Python library for parallel dataframe and array processing on a single machine.

It combines:

- Pandas-like tabular APIs (`DataFrame`, `Series`, `GroupBy`)
- NumPy-compatible chunked arrays (`NDArray` with NumPy protocol support)
- Arrow-native storage/interop (`to_arrow`, Parquet/IPC I/O)
- Eager execution with optional lazy pipelines (`.lazy().collect()`)
- Runtime backends for local threads/processes plus optional Ray/Dask executors

## Why FrameX

FrameX is aimed at local analytics workflows that have outgrown single-threaded scripts but do not yet require distributed infrastructure.

Typical fit:

- ETL and analytics pipelines on medium-to-large local datasets
- feature engineering workflows that mix table and array operations
- migration paths from Pandas scripts where API familiarity matters

## Installation

From PyPI:

```bash
pip install pyframe-xpy
```

From source:

```bash
git clone https://github.com/aeiwz/FrameX.git
cd FrameX
pip install -e .
```

Requirements:

- Python `>=3.10`
- Core dependencies: `pyarrow`, `numpy`
- Optional compatibility: `pandas` (`pip install "pyframe-xpy[pandas-compat]"`)

## Quick Start

```python
import framex as fx

df = fx.DataFrame(
    {
        "group": ["a", "a", "b"],
        "value": [10, 20, 30],
        "is_refund": [False, True, False],
    }
)

result = (
    df.filter(~df["is_refund"])
      .groupby("group")
      .agg({"value": ["sum", "mean", "count"]})
      .sort("value_sum", ascending=False)
)

print(result.to_pandas())
```

## Core API

Top-level imports:

```python
import framex as fx
```

Main objects and helpers:

- `fx.DataFrame`, `fx.Series`, `fx.Index`, `fx.LazyFrame`
- `fx.NDArray`, `fx.array(...)`
- `fx.read_parquet`, `fx.write_parquet`, `fx.read_ipc`, `fx.write_ipc`, `fx.read_csv`, `fx.write_csv`
- `fx.read_json`, `fx.write_json`, `fx.read_ndjson`, `fx.write_ndjson`
- `fx.read_file`, `fx.write_file` for format auto-detection
- `fx.from_pandas`, `fx.from_dask`, `fx.from_ray`, `fx.from_dataframe`
- `fx.get_config`, `fx.set_backend`, `fx.set_workers`, `fx.set_serializer`, `fx.set_kernel_backend`
- `fx.set_array_backend` for auto/NumExpr/Numba/JAX/PyTorch/CuPy acceleration modes
- `fx.recommend_best_performance_config()` to inspect hardware-tuned settings
- `fx.auto_configure_hardware()` to apply best-performance config automatically
- `fx.StreamProcessor` for micro-batch streaming pipelines

Compression:

- transparent extension-based compression for `read_file` / `write_file`
- supported wrappers: `.gz`, `.bz2`, `.xz`, `.zip`, and `.zst`/`.zstd` (when `zstandard` is installed)
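
To make the idea of extension-based compression concrete, here is a minimal standard-library sketch of how a file suffix can select a decompressing opener. It is an illustration only, not FrameX's implementation: `open_by_extension`, `inner_format`, and `_OPENERS` are hypothetical names, and only the `.gz`, `.bz2`, and `.xz` wrappers are covered.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Hypothetical suffix-to-opener table (not FrameX internals).
_OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
}


def open_by_extension(path, mode="rb"):
    """Open `path` in binary mode, (de)compressing based on its final suffix."""
    opener = _OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, mode)


def inner_format(path):
    """Return the data-format suffix once a compression wrapper is stripped."""
    p = Path(path)
    if p.suffix.lower() in _OPENERS:
        p = p.with_suffix("")  # "data.csv.gz" -> "data.csv"
    return p.suffix.lower()
```

With a dispatch like this, a path such as `data.csv.gz` would be opened through `gzip` and then parsed as CSV, which is the behaviour the `read_file` / `write_file` description suggests.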

Acceleration extras:

```bash
pip install "pyframe-xpy[accel]"        # numexpr + numba
pip install "pyframe-xpy[gpu]"          # cupy (CUDA)
pip install "pyframe-xpy[ml-accel]"     # jax + pytorch
pip install "pyframe-xpy[pandas-fast]"  # modin backend
pip install "pyframe-xpy[distributed]"  # Dask + Ray distributed/HPC backends
pip install zstandard                   # .zst/.zstd file compression
```

Backend notes:

- Select a backend with `fx.set_backend(name)`, where `name` is one of `"threads"`, `"processes"`, `"ray"`, `"dask"`, or `"hpc"`.
- Ray and Dask execution backends require their respective runtimes to be installed/available.
- HPC mode (`"hpc"`) uses cluster-oriented execution via Dask or Ray:
  - `FRAMEX_HPC_ENGINE=dask|ray`
  - `FRAMEX_DASK_SCHEDULER_ADDRESS=<tcp://...>` to connect existing Dask clusters
  - `FRAMEX_RAY_ADDRESS=<ray://...>` to connect existing Ray clusters
  - optional SLURM bootstrap: `FRAMEX_DASK_SLURM=1` (requires `dask-jobqueue`)
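
As a rough sketch of how the HPC variables above might be wired up from Python: the variable names and `fx.set_backend("hpc")` come from this README, while the helper functions `hpc_env` and `enable_hpc`, the placeholder address, and the assumption that the variables are read when the backend is selected are illustrative only.

```python
import os


def hpc_env(engine, address=None):
    """Build the environment variables for HPC mode (names from the list above)."""
    if engine not in ("dask", "ray"):
        raise ValueError("engine must be 'dask' or 'ray'")
    env = {"FRAMEX_HPC_ENGINE": engine}
    if address is not None:
        key = (
            "FRAMEX_DASK_SCHEDULER_ADDRESS"
            if engine == "dask"
            else "FRAMEX_RAY_ADDRESS"
        )
        env[key] = address
    return env


def enable_hpc(engine, address=None):
    """Export the variables, then select the "hpc" backend."""
    os.environ.update(hpc_env(engine, address))
    # Imported late so the variables are set first (an assumption about
    # when FrameX reads them).
    import framex as fx

    fx.set_backend("hpc")


# Example (placeholder address; needs FrameX and a running Dask scheduler):
# enable_hpc("dask", "tcp://127.0.0.1:8786")
```

Setting the same variables in the shell before launching Python should be equivalent; the helper only keeps the configuration next to the code that depends on it.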

Test support notes:

- Some tests are gated on optional backends and are intentionally skipped when those dependencies are not installed.
- Typical skip reasons: missing `dask.distributed`, `dask.dataframe`, `ray`, or `ray.data`.
- To run the full optional matrix locally:

```bash
pip install "pyframe-xpy[distributed]"
pytest -q
```
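
For readers unfamiliar with optional-backend gating, the pattern generally looks like the sketch below. It uses only the standard library and is not taken from the FrameX test suite: `has_module` and `DaskBackendTest` are hypothetical names, and the real tests may use a different mechanism.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if `name` (dotted names allowed) appears importable."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError covers a missing parent package such as "dask".
        return False


class DaskBackendTest(unittest.TestCase):
    @unittest.skipUnless(
        has_module("dask.distributed"), "dask.distributed not installed"
    )
    def test_client_is_importable(self):
        # Runs only when the optional dependency is present.
        import dask.distributed

        self.assertTrue(hasattr(dask.distributed, "Client"))
```

A test gated this way is reported as skipped rather than failed on machines without the optional dependency, which matches the skip reasons listed above.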

## Documentation

Canonical docs are in [`docs/documents`](docs/documents):

- [Overview](docs/documents/overview.md)
- [Features](docs/documents/features.md)
- [Getting Started](docs/documents/getting_started.md)
- [Installation](docs/documents/installation.md)
- [Tutorial: ETL Pipeline](docs/documents/tutorial_etl_pipeline.md)
- [Tutorial: NumPy NDArray Interop](docs/documents/tutorial_numpy_array.md)
- [Use Cases](docs/documents/use_cases.md)
- [Configuration Guide](docs/documents/configuration_guide.md)
- [Performance Test](docs/documents/performance_test.md)
- [Architecture](docs/documents/architecture.md)
- [API Reference](docs/documents/api_reference.md)
- [Roadmap](docs/documents/roadmap.md)
- [FAQ](docs/documents/faq.md)

## Website (Docs UI)

The docs website lives in [`website`](website) (Next.js App Router).

Main docs routes (available once the dev server is running):

- `http://localhost:3000/docs/features`
- `http://localhost:3000/docs/tutorial_etl_pipeline`
- `http://localhost:3000/docs/use_cases`
- `http://localhost:3000/docs/configuration_guide`
- `http://localhost:3000/docs/performance_test`

Run locally:

```bash
cd website
npm install
npm run dev
```

Production build (from the `website` directory):

```bash
npm run build
npm run start
```

## Development

Install dev dependencies:

```bash
pip install -e ".[dev]"
```

Run tests:

```bash
pytest
```

## Benchmarks

Benchmark code and generated reports are in [`benchmarks`](benchmarks).

Run the full benchmark suite (shows an in-terminal progress bar and generates the reports):

```bash
python3 -m benchmarks.benchmark_suite
```

Run the workload capability matrix checks:

```bash
python3 -m benchmarks.check_framex_workloads
```

Benchmark outputs are written to `benchmarks/results`:

- `benchmark_results.json`
- `benchmark_results.csv`
- `benchmark_report.md`
- `framex_workload_check.json`
- `performance_speedup.png`
- `parallel_processing_scaling.png`
- `multiprocessing_scaling.png`
- `memory_peak_rss.png`

## Project Status

FrameX is pre-1.0 (`0.1.2`) and in active development.

- APIs are usable and documented
- compatibility/performance behavior will continue to evolve
- pin versions for production-critical workloads

## License

[MIT](LICENSE)
