Metadata-Version: 2.3
Name: pyspark-huggingface
Version: 2.1.0
Summary: A DataSource for reading and writing HuggingFace Datasets in Spark
Author: allisonwang-db, lhoestq, wengh
Author-email: allisonwang-db <allison.wang@databricks.com>, lhoestq <quentin@huggingface.co>, wengh <wenghy02@gmail.com>
License: Apache License 2.0
Requires-Dist: datasets>=4.0
Requires-Dist: huggingface-hub>=0.34.4
Requires-Dist: pyarrow>=21.0.0
Requires-Python: >=3.9
Description-Content-Type: text/markdown

<p align="center">
  <img alt="Hugging Face x Spark" src="https://pbs.twimg.com/media/FvN1b_2XwAAWI1H?format=jpg&name=large" width="352" style="max-width: 100%;">
  <br/>
  <br/>
</p>

<p align="center">
    <a href="https://github.com/huggingface/pyspark_huggingface/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg"></a>
    <a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
</p>

# Spark Data Source for Hugging Face Datasets

A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets) and [🤗 Storage Buckets](https://huggingface.co/storage):

- Stream datasets from Hugging Face as Spark DataFrames
- Select subsets and splits, apply projection and predicate filters
- Save Spark DataFrames as Parquet files to Hugging Face
- Fast deduped uploads
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3

## Installation

```
pip install pyspark_huggingface
```

## Usage with dataset repositories

Load a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)):

```python
import pyspark_huggingface
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```

Save to Hugging Face:

```python
# Login with huggingface-cli login
df.write.format("huggingface").mode("overwrite").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").mode("overwrite").save("username/my_dataset")
```
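
To keep tokens out of source code, the writer options can be assembled from the environment. The sketch below is only an illustration: `hf_write_options` is a hypothetical helper, not part of this package, and it assumes the token is stored in the `HF_TOKEN` environment variable.

```python
import os


def hf_write_options(data_dir=None, token_env="HF_TOKEN"):
    """Collect writer options; add a token only if the environment provides one.

    Hypothetical helper for illustration, not part of pyspark_huggingface.
    """
    options = {}
    if data_dir is not None:
        options["data_dir"] = data_dir
    token = os.environ.get(token_env)  # assumption: token is exported as HF_TOKEN
    if token:
        options["token"] = token
    return options


# Usage with an existing DataFrame `df` (requires a Spark session):
# df.write.format("huggingface").options(**hf_write_options()).mode("overwrite").save("username/my_dataset")
```

When no token is found, the returned dict is empty and the write relies on the `huggingface-cli login` credentials instead.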

## Usage with storage buckets

Load data from a [Storage Bucket](https://huggingface.co/storage):

```python
import pyspark_huggingface
df = spark.read.format("huggingface").option("data_dir", "data").load("buckets/username/bucket_name")
```

Save to Hugging Face:

```python
# Login with huggingface-cli login
df.write.format("huggingface").option("data_dir", "data").mode("overwrite").save("buckets/username/bucket_name")
# Or pass a token manually
df.write.format("huggingface").option("data_dir", "data").option("token", "hf_xxx").mode("overwrite").save("buckets/username/bucket_name")
```

Bucket support requires `datasets>=4.8.4` and `huggingface_hub>=1.10.1`.
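
These minimums are higher than the package's base requirements, so it can help to check the installed versions before using buckets. The following is a rough sketch: `version_tuple`, `meets_minimum` and `bucket_support_available` are hypothetical helpers, and the parser deliberately ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted above for bucket support.
BUCKET_MINIMUMS = {"datasets": "4.8.4", "huggingface_hub": "1.10.1"}


def version_tuple(text):
    """Turn '1.10.1' into (1, 10, 1); anything after the third number is ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def meets_minimum(installed, minimum):
    """Compare numerically, so that 1.10.1 counts as newer than 1.9.0."""
    return version_tuple(installed) >= version_tuple(minimum)


def bucket_support_available():
    """Return True only if every required package is installed and recent enough."""
    for package, minimum in BUCKET_MINIMUMS.items():
        try:
            if not meets_minimum(version(package), minimum):
                return False
        except PackageNotFoundError:
            return False
    return True
```

Comparing tuples of integers rather than raw strings matters here, because a plain string comparison would rank `"1.9.0"` above `"1.10.1"`.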

## Advanced

Select a split:

```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```

Select a subset/config:

```python
sample_df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```

Specify `data_files` or `data_dir`:

```python
one_file_df = (
    spark.read.format("huggingface")
    .option("data_files", "sample/10BT/000_00000.parquet")
    .load("HuggingFaceFW/fineweb-edu")
)
multiple_files_df = (
    spark.read.format("huggingface")
    .option("data_files", '["sample/10BT/000_00000.parquet", "sample/10BT/001_00000.parquet"]')
    .load("HuggingFaceFW/fineweb-edu")
)
glob_df = (
    spark.read.format("huggingface")
    .option("data_files", "sample/10BT/*.parquet")
    .load("HuggingFaceFW/fineweb-edu")
)
dir_df = (
    spark.read.format("huggingface")
    .option("data_dir", "sample/10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```
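
Since Spark options are strings, a list of files has to be passed as text, as in the `multiple_files_df` example above. One way to build that string from a Python list is sketched below; `data_files_option` is a hypothetical helper, not part of this package.

```python
import json


def data_files_option(paths):
    """Return a value suitable for the "data_files" option.

    A single path or glob is passed through unchanged; several paths are
    encoded as a JSON list, matching the string form used above.
    Hypothetical helper for illustration.
    """
    if isinstance(paths, str):
        return paths
    return json.dumps(list(paths))


files = ["sample/10BT/000_00000.parquet", "sample/10BT/001_00000.parquet"]
option_value = data_files_option(files)

# Usage (requires a Spark session):
# spark.read.format("huggingface").option("data_files", option_value).load("HuggingFaceFW/fineweb-edu")
```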

Filter columns and rows (especially efficient for Parquet datasets):

```python
filtered_df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```
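
The `filters` and `columns` values are also strings. The sketch below builds them from Python objects, assuming the options are parsed as Python literals, which the tuple syntax in the example suggests but this README does not state. `filters_option` and `columns_option` are hypothetical helpers.

```python
import ast
import json


def filters_option(filters):
    """Encode (column, operator, value) triples as a Python-literal string.

    Assumption: the data source evaluates this option as a Python literal.
    """
    text = repr([tuple(f) for f in filters])
    ast.literal_eval(text)  # raises if the values are not plain literals
    return text


def columns_option(columns):
    """Encode column names as a list string, matching the example above."""
    return json.dumps(list(columns))


filters_value = filters_option([("language_score", ">", 0.99)])
columns_value = columns_option(["text", "language_score"])

# Usage (requires a Spark session):
# spark.read.format("huggingface").option("filters", filters_value).option("columns", columns_value).load("HuggingFaceFW/fineweb-edu")
```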

## Fast deduped uploads

Hugging Face stores data on Xet, a deduplication-based storage backend that enables fast deduplicated uploads.

Unlike traditional remote storage, Xet uploads each piece of duplicate data only once.
For example, if some or all of the data already exists in other files on Xet, it is not uploaded again, which saves bandwidth and speeds up uploads. Deduplication for Parquet is enabled through Content Defined Chunking (CDC).

Thanks to Parquet CDC and Xet deduplication, saving a dataset on Hugging Face is faster than on any traditional remote storage.

For more information, see [https://huggingface.co/blog/parquet-cdc](https://huggingface.co/blog/parquet-cdc).
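
To see why content-defined chunking helps deduplication, consider the toy sketch below. It is not Xet's or Parquet's actual algorithm or parameters: the gear table, the 8-bit mask and the SHA-256 chunk IDs are assumptions chosen for illustration. Chunk boundaries depend only on nearby bytes, so inserting data in the middle of a file changes only the chunks around the edit.

```python
import hashlib
import random

# Toy gear table: one pseudo-random 32-bit value per byte value (seeded for reproducibility).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = (1 << 8) - 1  # cut when the low 8 bits are zero: chunks of ~256 bytes on average


def chunk(data: bytes):
    """Split data at content-defined boundaries using a rolling gear hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & MASK == 0:  # boundary decided by the most recent bytes only
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def bytes_to_upload(new: bytes, known_hashes: set) -> int:
    """Count the bytes in chunks whose hash is not already stored."""
    return sum(
        len(c) for c in chunk(new)
        if hashlib.sha256(c).digest() not in known_hashes
    )


original = bytes(_rng.getrandbits(8) for _ in range(50_000))
edited = original[:25_000] + b"a few inserted bytes" + original[25_000:]

# Pretend the original file is already stored: remember its chunk hashes.
known = {hashlib.sha256(c).digest() for c in chunk(original)}
print(bytes_to_upload(edited, known), "of", len(edited), "bytes are new")
```

With fixed-size chunks, the same insertion would shift every later chunk and force almost the whole second half of the file to be uploaded again.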

## Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

The import is only necessary on Spark 3.x to enable the backport.
Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the "huggingface" data source.


## Development

[Install uv](https://docs.astral.sh/uv/getting-started/installation/) if you have not already.

Then, from the project root directory, sync dependencies and run the tests:

```
uv sync
uv run pytest
```
