Metadata-Version: 2.4
Name: polars-api
Version: 0.4.0
Summary: Call REST APIs from a Polars DataFrame, one row at a time, using native Polars expressions. Sync and async GET/POST with per-row URLs, params, and bodies.
Author-email: Diego Garcia Lozano <diegoglozano96@gmail.com>
Project-URL: Homepage, https://diegoglozano.github.io/polars-api/
Project-URL: Repository, https://github.com/diegoglozano/polars-api
Project-URL: Documentation, https://diegoglozano.github.io/polars-api/
Keywords: polars,polars-api,rest-api,http,httpx,async,dataframe,etl,data-engineering,python
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <4.0,>=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: aiohttp>=3.9
Requires-Dist: httpx>=0.28.1
Requires-Dist: nest-asyncio>=1.6.0
Requires-Dist: polars>=1.0.0
Dynamic: license-file

<p align="center">
  <img src="https://raw.githubusercontent.com/diegoglozano/polars-api/main/docs/assets/logo.svg" alt="polars-api" width="180" />
</p>

# polars-api

[![PyPI version](https://img.shields.io/pypi/v/polars-api.svg)](https://pypi.org/project/polars-api/)
[![Python versions](https://img.shields.io/pypi/pyversions/polars-api.svg)](https://pypi.org/project/polars-api/)
[![Release](https://img.shields.io/github/v/release/diegoglozano/polars-api)](https://github.com/diegoglozano/polars-api/releases)
[![Build status](https://img.shields.io/github/actions/workflow/status/diegoglozano/polars-api/main.yml?branch=main)](https://github.com/diegoglozano/polars-api/actions/workflows/main.yml?query=branch%3Amain)
[![codecov](https://codecov.io/gh/diegoglozano/polars-api/branch/main/graph/badge.svg)](https://codecov.io/gh/diegoglozano/polars-api)
[![License](https://img.shields.io/github/license/diegoglozano/polars-api)](https://github.com/diegoglozano/polars-api/blob/main/LICENSE)

**Call REST APIs from a [Polars](https://pola.rs) DataFrame, one row at a time, using native Polars expressions.**

`polars-api` registers an `.api` namespace on Polars expressions so you can issue HTTP `GET` and `POST` requests for every row of a DataFrame — synchronously or asynchronously — and pipe the responses straight back into your data pipeline.

```python
import polars as pl
import polars_api  # noqa: F401  — registers the `.api` namespace

post = pl.Struct({"userId": pl.Int64, "id": pl.Int64, "title": pl.Utf8, "body": pl.Utf8})

(
    pl.DataFrame({"url": ["https://jsonplaceholder.typicode.com/posts/1"]})
      .with_columns(
          pl.col("url").api.get().str.json_decode(post).alias("response")
      )
)
```

> In an expression, `str.json_decode()` needs an explicit `dtype` (recent Polars
> made it required). See [Decoding JSON responses](#6-decoding-json-responses)
> for the schema-free, eager alternative.

- **Repository**: <https://github.com/diegoglozano/polars-api>
- **Documentation**: <https://diegoglozano.github.io/polars-api/>
- **PyPI**: <https://pypi.org/project/polars-api/>

---

## Why polars-api?

- **Expression-native** — works inside `with_columns`, `select`, and any other Polars expression context. No `for` loops, no manual `apply`.
- **Sync and async out of the box** — async variants (`aget` / `apost`) fan out requests with `asyncio.gather` for high-throughput enrichment.
- **Per-row URLs, params, and bodies** — every argument can be a Polars expression, so you can build them from other columns.
- **Built on [httpx](https://www.python-httpx.org/) (sync) and [aiohttp](https://docs.aiohttp.org/) (async)** — async fan-out uses aiohttp for ~10× higher throughput at high concurrency.
- **Tiny surface area** — one method per HTTP verb (`get`, `post`, `put`, `patch`, `delete`, `head`), each with an `a`-prefixed async variant such as `aget` or `apost`.

Common use cases:

- Enrich a DataFrame with data from a REST API (geocoding, currency rates, user profiles…).
- Score rows against an ML inference endpoint.
- Hit an internal microservice in batch from a notebook or ETL job.
- Quickly prototype API-driven data pipelines without writing async boilerplate.

## Installation

```sh
# uv
uv add polars-api

# pip
pip install polars-api

# poetry
poetry add polars-api
```

Requires Python 3.9+ and Polars 1.0+.

## Quickstart

### 1. GET request per row

```python
import polars as pl
import polars_api  # noqa: F401

post = pl.Struct({"userId": pl.Int64, "id": pl.Int64, "title": pl.Utf8, "body": pl.Utf8})

df = (
    pl.DataFrame({"id": [1, 2, 3]})
      .with_columns(
          ("https://jsonplaceholder.typicode.com/posts/" + pl.col("id").cast(pl.Utf8)).alias("url")
      )
      .with_columns(
          pl.col("url").api.get().str.json_decode(post).alias("response")
      )
)
```

### 2. GET with query parameters

Pass any Polars expression that resolves to a struct as `params`. Here the
endpoint returns a JSON array, so the decode schema is `pl.List(post)`:

```python
post = pl.Struct({"userId": pl.Int64, "id": pl.Int64, "title": pl.Utf8, "body": pl.Utf8})

df = (
    pl.DataFrame({"url": ["https://jsonplaceholder.typicode.com/posts"] * 3})
      .with_columns(
          pl.struct(userId=pl.Series([1, 2, 3])).alias("params"),
      )
      .with_columns(
          pl.col("url").api.get(params=pl.col("params")).str.json_decode(pl.List(post)).alias("response")
      )
)
```

### 3. POST with a JSON body

```python
post = pl.Struct({"userId": pl.Int64, "id": pl.Int64, "title": pl.Utf8, "body": pl.Utf8})

df = (
    pl.DataFrame({"url": ["https://jsonplaceholder.typicode.com/posts"] * 3})
      .with_columns(
          pl.struct(
              title=pl.lit("foo"),
              body=pl.lit("bar"),
              userId=pl.Series([1, 2, 3]),
          ).alias("body"),
      )
      .with_columns(
          pl.col("url").api.post(body=pl.col("body")).str.json_decode(post).alias("response")
      )
)
```

### 4. Async requests for throughput

`aget` and `apost` fan out with `aiohttp` and `asyncio.gather`, so requests run concurrently per batch. In benchmarks against a local server, this is roughly an order of magnitude faster than the sync path at high concurrency:

```python
post = pl.Struct({"userId": pl.Int64, "id": pl.Int64, "title": pl.Utf8, "body": pl.Utf8})

df = (
    pl.DataFrame({"url": ["https://jsonplaceholder.typicode.com/posts"] * 100})
      .with_columns(
          pl.col("url").api.aget().str.json_decode(pl.List(post)).alias("response")
      )
)
```
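
To make the fan-out pattern concrete, here is a minimal, library-independent sketch of `asyncio.gather` combined with an `asyncio.Semaphore` cap. It is not the code `polars-api` runs internally: `fake_request` and `fan_out` are hypothetical names, and `asyncio.sleep` stands in for a real HTTP call.

```python
import asyncio


async def fake_request(url: str, delay: float = 0.01) -> str:
    """Stand-in for an HTTP call (hypothetical helper, not polars-api code)."""
    await asyncio.sleep(delay)  # simulates network latency
    return f"response for {url}"


async def fan_out(urls: list[str], max_concurrency: int = 10) -> list[str]:
    """Run one coroutine per URL, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_request(url)

    # gather preserves input order, so results line up with the input rows
    return await asyncio.gather(*(bounded(url) for url in urls))


urls = [f"https://example.com/items/{i}" for i in range(25)]
responses = asyncio.run(fan_out(urls, max_concurrency=5))
```

Because `gather` returns results in the order the coroutines were passed in, each response can be written back to the row that produced its URL.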

### 5. Timeouts

Every method accepts a `timeout` (in seconds). Sync verbs forward it to `httpx`; async verbs wrap it in `aiohttp.ClientTimeout(total=...)`.

```python
pl.col("url").api.get(timeout=5.0)
pl.col("url").api.apost(body=pl.col("body"), timeout=10.0)
```

### 6. Decoding JSON responses

Every verb returns a `Utf8` column of raw response bodies. There are two ways to
parse it, depending on the context:

**In an expression (works in both `DataFrame` and `LazyFrame`)** — pass an
explicit `dtype`. Recent versions of Polars made the `dtype` argument of
`Expr.str.json_decode()` required, because the lazy engine needs to know the
output schema up front:

```python
post = pl.Struct({"userId": pl.Int64, "id": pl.Int64, "title": pl.Utf8, "body": pl.Utf8})

df.with_columns(
    pl.col("response").str.json_decode(post)
)
```

If the endpoint returns a JSON array, wrap the element schema in `pl.List(...)`
(e.g. `pl.List(post)`).

**On a materialized `Series` (eager `DataFrame` only)** — `Series.str.json_decode()`
can still infer the schema from the data, so you can skip the explicit dtype.
This is convenient for quick, interactive exploration:

```python
df = df.with_columns(
    df["response"].str.json_decode().alias("response")
)
```

Inference only works on an already-collected `DataFrame`; inside a `LazyFrame`
pipeline you must use the expression form with an explicit dtype.

### 7. Global defaults (set options once)

Talking to an authenticated API means passing the same `client=`, `bearer=`, or
`auth=` to _every_ call. Register them once with `set_defaults(...)` and every
subsequent `.api` call falls back to them for any argument you don't pass
explicitly:

```python
import httpx
import polars as pl
import polars_api

polars_api.set_defaults(
    client=httpx.Client(
        base_url="https://api.example.com",
        headers={"Authorization": "Bearer my-token"},
    ),
    retries=3,
    backoff=0.5,
)

# No need to repeat client=/retries=/backoff= on each call:
df.with_columns(pl.col("path").api.get().alias("res"))
```

Explicit per-call arguments always win over the configured default. Use the
`defaults(...)` context manager to scope overrides to a block, and
`reset_defaults()` to clear them:

```python
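# Scope a client to one block; previous defaults are restored on exit.
# `session` is assumed to be an existing aiohttp.ClientSession (async verbs need the aiohttp client).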
with polars_api.defaults(client=session, max_concurrency=10):
    df.with_columns(pl.col("path").api.aget().alias("res"))

polars_api.reset_defaults()            # clear everything
polars_api.reset_defaults("client")    # clear just one option
polars_api.get_defaults()              # inspect the current config
```

> **Note:** `client` is shared across the sync (`httpx.Client`) and async
> (`aiohttp.ClientSession`) paths, which need different client types. If you mix
> sync and async verbs, set `client` per call (or via `defaults(...)`) so each
> path gets the right client.

## API reference

All methods live under the `.api` namespace on any Polars expression that resolves to a URL string.

| Method               | HTTP verb | Mode         |
| -------------------- | --------- | ------------ |
| `get` / `aget`       | GET       | sync / async |
| `post` / `apost`     | POST      | sync / async |
| `put` / `aput`       | PUT       | sync / async |
| `patch` / `apatch`   | PATCH     | sync / async |
| `delete` / `adelete` | DELETE    | sync / async |
| `head` / `ahead`     | HEAD      | sync / async |

Arguments (all keyword-only after the positional `params` / `body`):

- **`params`** — Polars expression yielding a struct of query-string parameters per row.
- **`body`** _(POST/PUT/PATCH only)_ — Polars expression yielding a struct serialized as a JSON body per row.
- **`data`** — Polars expression yielding a struct serialized as `application/x-www-form-urlencoded`.
- **`headers`** — Polars expression yielding a struct of headers per row (e.g. tenant IDs, custom auth).
- **`client`** — preconfigured `httpx.Client` (sync verbs) or `aiohttp.ClientSession` (async verbs) for connection reuse, custom timeouts, base URLs, cookies, etc.
- **`timeout`** — request timeout in seconds.
- **`retries`** _(int, default 0)_ — retry on connection errors, timeouts, 5xx, and 429.
- **`backoff`** _(float, default 0.0)_ — exponential backoff base (seconds). 429s respect `Retry-After` if present.
- **`max_concurrency`** _(async only)_ — cap on in-flight requests via `asyncio.Semaphore`.
- **`cache`** _(bool, default False)_ — memoize identical `(method, url, params, body, data, headers)` tuples within a batch.
- **`with_metadata`** _(bool, default False)_ — return a struct `{body, status, elapsed_ms, error}` per row instead of just the body.
- **`with_response_headers`** _(bool, default False)_ — when `with_metadata=True`, also include `response_headers: List[Struct{name, value}]` on the struct.
- **`on_error`** _("null" | "raise" | "return")_ — when `with_metadata=False`, what to do on non-2xx / network errors.
- **`on_request`**, **`on_response`** — observability hooks.
  - Sync verbs: receive `httpx.Request` and `httpx.Response`.
  - Async verbs: `on_request(method, url, kwargs)` (the args about to be sent) and `on_response(aiohttp.ClientResponse)`.
- **`auth=("user", "pass")`** — basic auth.
- **`bearer=pl.col("token")`** — per-row bearer token (also accepts a literal string).
- **`api_key=...`**, **`api_key_header="X-API-Key"`** — shorthand for an API-key header.

By default, each method returns a `pl.Expr` of `Utf8`. With `with_metadata=True`, it returns a struct column with the schema:

```python
{"body": Utf8, "status": Int64, "elapsed_ms": Float64, "error": Utf8}
```

See [Decoding JSON responses](#6-decoding-json-responses) for how to parse the
body — pass an explicit `dtype` in an expression, or call `.str.json_decode()`
on the materialized `Series` to infer the schema.
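
As a rough illustration of how `retries` and `backoff` interact, the sketch below computes a wait schedule. The exact formula is not stated in this README, so `backoff * 2**attempt` is an assumption, and `retry_delay` is a hypothetical helper rather than part of the package.

```python
from typing import Optional


def retry_delay(attempt: int, backoff: float, retry_after: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Assumed schedule: backoff * 2**attempt, unless the server sent Retry-After.
    """
    if retry_after is not None:
        return retry_after  # a 429 with Retry-After overrides the schedule
    return backoff * (2 ** attempt)


# With retries=3 and backoff=0.5 the assumed waits are 0.5 s, 1.0 s and 2.0 s
delays = [retry_delay(attempt, backoff=0.5) for attempt in range(3)]
```

Under this assumption, `retries=3, backoff=0.5` adds at most 3.5 seconds of waiting per row before the request is reported as failed.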

### Examples

```python
import aiohttp
import polars as pl
import polars_api  # noqa: F401

# Per-row bearer auth + retries + concurrency cap
pl.col("url").api.aget(
    bearer=pl.col("token"),
    retries=3,
    backoff=0.5,
    max_concurrency=10,
)

# Inspect status, timing, errors and response headers
pl.col("url").api.get(with_metadata=True, with_response_headers=True)

# Bring your own session (connector tuning, base_url, cookies, etc.)
session = aiohttp.ClientSession(base_url="https://api.example.com")
pl.col("path").api.aget(client=session)

# Skip duplicate URLs within a batch (e.g. after a join/explode)
pl.col("url").api.aget(cache=True)

# Follow Link: rel="next" pagination
df.with_columns(
    pl.col("url").api.paginate(max_pages=20).alias("pages")
).explode("pages")
```
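
The idea behind `cache=True` can be sketched without any HTTP at all: identical request keys within a batch are fetched once and the result is reused. This is only an illustration of memoization. `fetch_with_cache` and `fake_fetch` are hypothetical names, and the key is shortened to `(method, url)` for brevity.

```python
from typing import Callable, Hashable


def fetch_with_cache(keys: list[Hashable], fetch: Callable[[Hashable], str]) -> list[str]:
    """Call `fetch` once per distinct key and reuse the result for duplicates."""
    cache: dict[Hashable, str] = {}
    results = []
    for key in keys:
        if key not in cache:
            cache[key] = fetch(key)
        results.append(cache[key])
    return results


calls = []


def fake_fetch(key):
    calls.append(key)  # record each simulated request
    return f"body for {key[1]}"


batch = [
    ("GET", "https://example.com/a"),
    ("GET", "https://example.com/b"),
    ("GET", "https://example.com/a"),  # duplicate row, served from the cache
]
bodies = fetch_with_cache(batch, fake_fetch)
```

Three rows go in and three bodies come out, but only two simulated requests are made.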

### Global configuration

Module-level helpers let you set request options once instead of repeating them
on every call. Anything left unset on a call falls back to the configured
default, then to the built-in default; explicit per-call arguments always win.

| Function                            | Purpose                                                              |
| ----------------------------------- | -------------------------------------------------------------------- |
| `polars_api.set_defaults(**o)`      | Register persistent defaults for any request option.                 |
| `polars_api.get_defaults()`         | Return a copy of the currently configured defaults.                  |
| `polars_api.reset_defaults(*names)` | Clear all defaults, or only the named ones.                          |
| `polars_api.defaults(**o)`          | Context manager that applies defaults within a block, then restores. |

Configurable options mirror the per-call keyword arguments: `client`, `headers`,
`timeout`, `retries`, `backoff`, `max_concurrency`, `cache`, `with_metadata`,
`with_response_headers`, `on_error`, `on_request`, `on_response`, `auth`,
`bearer`, `api_key`, and `api_key_header`.
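
The resolution order described above (explicit argument, then configured default, then built-in default) can be sketched as follows. This is an assumption about the general pattern, not the package's actual implementation: `resolve`, `configured_defaults`, and `BUILTIN_DEFAULTS` are hypothetical names.

```python
_UNSET = object()  # sentinel meaning "argument not passed"

# Built-in defaults as listed in the argument reference
BUILTIN_DEFAULTS = {"retries": 0, "backoff": 0.0, "cache": False}

# What a set_defaults(...)-style call would populate
configured_defaults = {}


def resolve(name, explicit=_UNSET):
    """Return the effective value of one option."""
    if explicit is not _UNSET:
        return explicit  # per-call arguments always win
    if name in configured_defaults:
        return configured_defaults[name]
    return BUILTIN_DEFAULTS[name]


configured_defaults["retries"] = 3
```

A sentinel is used instead of `None` so that an explicit falsy value such as `retries=0` still overrides a configured default.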

## Benchmarks

`benchmarks/bench.py` spins up a local `aiohttp` echo server on `127.0.0.1` and
issues N concurrent GETs against it with each client. Running over the loopback
interface isolates client-side overhead. It is **not** a model of real network
latency, but it is useful for comparing the per-request cost of each path.

### Reproducing

```sh
# default scenarios: 100/50, 500/100, 1000/100, 2000/200, repeats=5
just bench

# or with custom scenarios (N/concurrency pairs) and repeats
uv run python benchmarks/bench.py --scenarios 100/50,1000/100 --repeats 7
```

The script writes `benchmarks/results.json` (raw timings + environment) and
`benchmarks/results.md` (Markdown table) for inspection or sharing. Both files
are gitignored, so re-run the script locally to refresh them. The numbers in the
next section come from a single reference run on the environment described there.

### Latest results

Median of 5 runs, on Linux 6.18 / x86_64 / 4 cores, Python 3.11.15, polars
1.19.0, httpx 0.28.1, aiohttp 3.13.5. Higher rps is better.

| Scenario (N / concurrency) | polars-api default | polars-api shared client | bare httpx (default) | bare httpx (tuned) | bare aiohttp |
| -------------------------- | -----------------: | -----------------------: | -------------------: | -----------------: | -----------: |
| 100 / 50                   |      **3,471** rps |            **4,122** rps |              538 rps |            235 rps |    3,041 rps |
| 500 / 100                  |      **3,728** rps |            **4,284** rps |              401 rps |             88 rps |    4,352 rps |
| 1000 / 100                 |      **4,012** rps |            **3,734** rps |              355 rps |            137 rps |    4,701 rps |
| 2000 / 200                 |      **3,980** rps |            **3,772** rps |              125 rps |            123 rps |    4,477 rps |

Takeaways:

- The async `aget` / `apost` path runs at roughly the same throughput as bare
  `aiohttp`, with a small overhead for the Polars expression plumbing.
- In the table above it is **roughly 6–32× faster than default `httpx`** (15–42×
  against the tuned configuration), which is why the async path was migrated to
  `aiohttp`.
- Bringing your own `aiohttp.ClientSession` via `client=` shaves a little more
  off small batches (no per-call session setup) and is recommended for
  long-running pipelines.

These numbers measure client overhead, not API latency. With a real remote
endpoint, network RTT will dominate and the gap between clients shrinks.
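
The speedup ratios can be recomputed from the table with plain arithmetic. The snippet below copies the published requests-per-second figures and uses the "polars-api default" column as the numerator. The dictionary keys are illustrative names, not output of the benchmark script.

```python
# Requests per second copied from the "Latest results" table
results = {
    "100 / 50": {"polars_api": 3471, "httpx_default": 538, "httpx_tuned": 235, "aiohttp": 3041},
    "500 / 100": {"polars_api": 3728, "httpx_default": 401, "httpx_tuned": 88, "aiohttp": 4352},
    "1000 / 100": {"polars_api": 4012, "httpx_default": 355, "httpx_tuned": 137, "aiohttp": 4701},
    "2000 / 200": {"polars_api": 3980, "httpx_default": 125, "httpx_tuned": 123, "aiohttp": 4477},
}

# Speedup of the polars-api async path over bare httpx with default settings
speedup_vs_httpx = {
    scenario: round(rps["polars_api"] / rps["httpx_default"], 1)
    for scenario, rps in results.items()
}

# Throughput relative to bare aiohttp (1.0 would mean no overhead)
ratio_vs_aiohttp = {
    scenario: round(rps["polars_api"] / rps["aiohttp"], 2)
    for scenario, rps in results.items()
}
```

On these numbers the gap to default `httpx` widens as N and concurrency grow, while the ratio to bare `aiohttp` stays close to 1.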

## Tips and patterns

- **Decode JSON immediately**: chain `.str.json_decode(dtype)` (an explicit schema is required in an expression — see [Decoding JSON responses](#6-decoding-json-responses)) and then `.struct.unnest()` (or `pl.col("response").struct.field("…")`) to flatten the result.
- **Build URLs from columns**: use Polars string concatenation or `pl.format("https://api.example.com/users/{}", pl.col("user_id"))` to build per-row URLs.
- **Prefer `aget` / `apost` for many rows**: async variants run requests concurrently and are typically much faster for I/O-bound workloads.
- **Inspect failures**: the sync helpers return `null` for non-2xx responses; check for nulls in the resulting column before decoding, or pass `with_metadata=True` to get the `status` and `error` of each row.

## FAQ

**Can I make HTTP requests from a Polars DataFrame?**
Yes — that is exactly what `polars-api` is for. Import the package and call `.api.get()` / `.api.post()` on a URL column.

**How do I call a REST API for every row of a Polars DataFrame?**
Place the URLs in a column and use `pl.col("url").api.get()` (or `aget` for async). Optional `params` and `body` arguments accept Polars expressions, so they can vary by row.

**Does it support async / concurrent requests?**
Yes. `aget` and `apost` issue requests concurrently with `asyncio.gather`, which is significantly faster than the sync variants when you have more than a handful of rows.

**Is it lazy-frame compatible?**
Yes — because everything is built on Polars expressions, you can use it in `LazyFrame.with_columns(...)` pipelines.

**What does it return?**
A `Utf8` column with the raw response body. Parse it with `.str.json_decode(dtype)` in an expression, or call `.str.json_decode()` on the materialized `Series` to infer the schema — see [Decoding JSON responses](#6-decoding-json-responses).

## Contributing

Contributions are welcome — see [CONTRIBUTING.md](./CONTRIBUTING.md). Please open an issue before starting on larger changes.

## License

[MIT](./LICENSE) © Diego Garcia Lozano

---

Repository initiated with [fpgmaas/cookiecutter-uv](https://github.com/fpgmaas/cookiecutter-uv).
