Metadata-Version: 2.4
Name: pystarburst
Version: 0.12.0
Summary: PyStarburst DataFrame API allows you to query and transform data in Starburst products in a data pipeline without having to download the data locally.
License: Apache-2.0
License-File: LICENSE.txt
Author: Starburst Data
Author-email: info@starburstdata.com
Requires-Python: >=3.10,<3.15
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: pandas
Provides-Extra: pyarrow
Requires-Dist: pandas (>=2.2,<3.0) ; extra == "pandas"
Requires-Dist: pyarrow (>=23.0.1,<24.0.0) ; extra == "pyarrow"
Requires-Dist: pydantic (>=2.12.5,<3.0.0)
Requires-Dist: python-dateutil (>=2.8.2,<3.0.0)
Requires-Dist: trino (>=0.336.0,<0.337.0)
Requires-Dist: urllib3 (>=2.6.3,<3.0.0)
Requires-Dist: zstandard (>=0.23.0,<0.24.0)
Project-URL: Homepage, https://starburst.io
Project-URL: Repository, https://github.com/starburstdata/pystarburst-examples
Description-Content-Type: text/markdown

# PyStarburst DataFrame API

The PyStarburst DataFrame API lets you query and transform data in Starburst products from within a data pipeline, without downloading the data locally.

## Documentation

See the PyStarburst API [documentation](https://pystarburst.eng.starburstdata.net/) and the examples [repository](https://github.com/starburstdata/pystarburst-examples).

## Getting started

Install `pystarburst` with pip:

```bash
pip install pystarburst
```

### Connect to a Starburst server

The connection parameters are the same as those accepted by the Trino Python client.

```python
from pystarburst import Session

connection_parameters = {
    "host": "localhost",
    "port": 8080,
    "user": "admin",
    "catalog": "tpch",
    "schema": "tiny"
}

session = Session.builder.configs(connection_parameters).create()
```
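
Hard-coded connection details are fine for a local test, but in a pipeline they often come from the environment. The sketch below builds the same dictionary from environment variables. The `STARBURST_*` variable names and the `connection_parameters_from_env` helper are assumptions made for illustration and are not part of PyStarburst; the keys and default values mirror the example above.

```python
import os


def connection_parameters_from_env(env=None):
    """Build a connection dictionary from environment variables.

    The STARBURST_* names are hypothetical; rename them to match your setup.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("STARBURST_HOST", "localhost"),
        # Environment values are strings, so convert the port explicitly.
        "port": int(env.get("STARBURST_PORT", "8080")),
        "user": env.get("STARBURST_USER", "admin"),
        "catalog": env.get("STARBURST_CATALOG", "tpch"),
        "schema": env.get("STARBURST_SCHEMA", "tiny"),
    }


connection_parameters = connection_parameters_from_env()
```

The resulting dictionary can be passed to `Session.builder.configs(...)` in the same way as the literal one.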

### Using SQL

```python
from pystarburst import Session

session = Session.builder.configs({ ... }).create()

session.sql("SELECT 1 as a").show()
```

### Querying a table

```python
from pystarburst import Session

session = Session.builder.configs({ ... }).create()

df = session.table("nation")
print(df.schema)
df.show()
```

### Filtering a data frame

```python
from pystarburst import Session

session = Session.builder.configs({ ... }).create()

df = session.table("nation")
df.filter(df.col("regionkey") == 0).show()
```

### Joining data frames

```python
from pystarburst import Session

session = Session.builder.configs({ ... }).create()

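nation = session.table("nation")
region = session.table("region")
# Join nation to region on their shared regionkey column
nation.join(region, nation.col("regionkey") == region.col("regionkey")).show()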
```

### Aggregation

```python
from pystarburst import Session
from pystarburst.functions import col

session = Session.builder.configs({ ... }).create()
df = session.table("nation")
df.agg((col("regionkey"), "max"), (col("regionkey"), "avg")).show()
```

### Arrow spooling for `to_pandas()`

When configured with an Arrow encoding, `to_pandas()` uses Arrow IPC spooling with parallel segment decoding for significantly faster pandas DataFrame creation.

```bash
pip install pystarburst[pyarrow]
```

```python
from pystarburst import Session

session = Session.builder.configs({
    ...
    "encoding": "arrow-preview+zstd",
}).create()

df = session.sql("SELECT * FROM nation").to_pandas()
```

Arrow encoding is used only for `to_pandas()`. All other operations (`collect()`, `show()`, etc.) use the default encoding.
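
Because the Arrow path depends on the optional `pyarrow` extra, a pipeline may want to request the Arrow encoding only when `pyarrow` is importable. The following is a minimal sketch, assuming a hypothetical helper named `with_arrow_encoding` that is not part of PyStarburst. The `encoding` key and its value are the ones shown above; how PyStarburst behaves when the encoding is set without `pyarrow` installed is not covered here.

```python
from importlib.util import find_spec


def with_arrow_encoding(connection_parameters):
    """Return a copy of the parameters, adding the Arrow encoding if pyarrow is installed.

    Hypothetical helper for illustration; it is not part of PyStarburst.
    """
    params = dict(connection_parameters)  # leave the caller's dict untouched
    if find_spec("pyarrow") is not None:
        params["encoding"] = "arrow-preview+zstd"
    return params


params = with_arrow_encoding({"host": "localhost", "port": 8080, "user": "admin"})
```

Copying the dictionary keeps the base parameters reusable for sessions that should stay on the default encoding.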

