Metadata-Version: 2.4
Name: db2pq
Version: 0.3.0
Summary: Convert database tables to Parquet files.
Author-email: Ian Gow <iandgow@gmail.com>
Project-URL: Homepage, https://github.com/iangow/db2pq/
Project-URL: Repository, https://github.com/iangow/db2pq/
Project-URL: Issues, https://github.com/iangow/db2pq/issues
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ibis-framework[duckdb,postgres]
Requires-Dist: pyarrow
Requires-Dist: duckdb
Requires-Dist: psycopg[binary]>=3.3.3; python_version >= "3.10"
Requires-Dist: psycopg<3.3,>=3.1; python_version < "3.10"
Requires-Dist: pgtoolkit>=0.33.0
Requires-Dist: python-dotenv
Requires-Dist: adbc-driver-postgresql>=1.8.0
Provides-Extra: sas
Requires-Dist: paramiko; extra == "sas"
Provides-Extra: pandas
Requires-Dist: pandas; extra == "pandas"
Provides-Extra: adbc
Requires-Dist: adbc-driver-postgresql; extra == "adbc"
Provides-Extra: ibis
Requires-Dist: adbc-driver-postgresql; extra == "ibis"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: quartodoc<0.12,>=0.11; extra == "docs"
Requires-Dist: griffe<2; extra == "docs"
Requires-Dist: jupyter; extra == "docs"
Provides-Extra: release
Requires-Dist: build>=1.1.1; extra == "release"
Requires-Dist: twine>=6.2.0; extra == "release"
Requires-Dist: pkginfo>=1.12.1.2; extra == "release"
Requires-Dist: requests-toolbelt>=1.0.0; extra == "release"
Dynamic: license-file

# db2pq: export PostgreSQL and WRDS data to Parquet

`db2pq` is a Python library for moving data from PostgreSQL into Apache Parquet files.
It is designed for both general PostgreSQL sources and the WRDS PostgreSQL service.

If you are contributing to the project, see `CONTRIBUTING.md` for the
recommended local development workflow.

## What it does

- Export a single PostgreSQL table to Parquet.
- Export an Ibis PostgreSQL table expression to Parquet.
- Export all tables in a PostgreSQL schema to Parquet.
- Export WRDS tables to Parquet.
- Update Parquet files only when the WRDS source table is newer.
- Mirror WRDS tables into a local PostgreSQL database.
- Read `last_modified` metadata embedded in Parquet files.

## Installation

Install from PyPI:

```bash
pip install --upgrade db2pq
```

This installs the bundled `psycopg` PostgreSQL client dependency, so most users
do not need a separate system `libpq` installation.

Install optional SAS support if you use WRDS SAS metadata checks, especially
when you want WRDS refresh logic based on the SAS-side source timestamps used by `wrds_update_pq(..., use_sas=True)`:

```bash
pip install --upgrade "db2pq[sas]"
```

Install optional pandas support if you want metadata queries returned as pandas
DataFrames, for example from `pq_last_modified(...)` in notebooks or analysis
scripts:

```bash
pip install --upgrade "db2pq[pandas]"
```

Install optional Ibis support if your workflow builds PostgreSQL queries with
Ibis expressions and you want to export those query results directly to
Parquet with `ibis_to_pq(...)`:

```bash
pip install --upgrade "db2pq[ibis]"
```

Install optional ADBC support if you want the Arrow-based export path for
PostgreSQL-to-Parquet workflows, including helpers that accept
`engine="adbc"`:

```bash
pip install --upgrade "db2pq[adbc]"
```

Install both optional SAS and pandas support:

```bash
pip install --upgrade "db2pq[sas,pandas]"
```

## Environment variables

`db2pq` supports explicit function arguments and environment-based defaults.
It also loads a local `.env` file automatically (via `python-dotenv`) when resolving defaults.

Connection defaults:

- `PGUSER`: PostgreSQL user (falls back to local OS user)
- `PGHOST`: PostgreSQL host (default: `localhost`)
- `PGDATABASE`: PostgreSQL database (default: `PGUSER`)
- `PGPORT`: PostgreSQL port (default: `5432`)

WRDS + output defaults:

- `WRDS_ID`: WRDS username (required for WRDS helpers unless passed directly)
- `WRDS_USER`: accepted as a synonym for `WRDS_ID` for compatibility with Tidy Finance-style setups
- `WRDS_PASSWORD`: if present and no WRDS `.pgpass` entry exists, `db2pq` can offer to save it to `.pgpass`
- `DATA_DIR`: base directory where Parquet files are written

Example shell setup:

```bash
export WRDS_ID="your_wrds_id"
export DATA_DIR="$HOME/pq_data"
```

If `WRDS_ID` is not set, WRDS helpers such as `wrds_update_pq()` and
`wrds_pg_to_pq()` will prompt for it on first use and suggest adding it to a
local `.env` file in the calling project. If your WRDS PostgreSQL password is
not yet stored in `~/.pgpass` (or `PGPASSFILE`), `db2pq` will prompt for it
securely and save it for future connections. For compatibility with the Tidy
Finance Python setup, `db2pq` also recognizes `WRDS_USER` and can offer to
copy `WRDS_PASSWORD` into `.pgpass`.

## WRDS SSH setup (for SAS-based metadata)

`wrds_update_pq(..., use_sas=True)` uses SSH to execute SAS remotely. Configure
an SSH key for your WRDS account first:

```bash
ssh-keygen -t ed25519 -C "your_wrds_id@wrds"
cat ~/.ssh/id_ed25519.pub | \
ssh your_wrds_id@wrds-cloud-sshkey.wharton.upenn.edu \
"mkdir -p ~/.ssh && chmod 700 ~/.ssh && \
 cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
```

## Workflows

### 1) WRDS to Parquet

Use this path when WRDS is the source of truth and the Parquet repository is
the destination.

Most users should start with `wrds_update_pq()`, which updates only when the
WRDS source appears newer than the existing Parquet file:

```python
from db2pq import wrds_update_pq

wrds_update_pq("dsi", "crsp")
```

Use `wrds_pg_to_pq()` when you want a direct export without the update check:

```python
from db2pq import wrds_pg_to_pq

wrds_pg_to_pq(
    table_name="dsi",
    schema="crsp",
    wrds_id="your_wrds_id",  # or set WRDS_ID in the environment
)
```

Use `wrds_sql_to_pq()` when you want to run a custom SQL query against WRDS
and save the result into the standard Parquet repository layout:

```python
from db2pq import wrds_sql_to_pq

wrds_sql_to_pq(
    sql="""
        SELECT gvkey, datadate, fyear, at
        FROM comp.funda
        WHERE indfmt = 'INDL'
          AND datafmt = 'STD'
          AND consol = 'C'
          AND popsrc = 'D'
          AND fyear BETWEEN 1991 AND 2008
    """,
    table_name="funda",
    schema="comp",
    alt_table_name="funda_filtered",
)
```

Common options on this path:

- `where` to filter rows before export
- `keep` / `drop` to select columns
- `alt_table_name` to rename the Parquet file
- `archive=True` to archive replaced files
- `use_sas=True` when you want SAS-based metadata for the update check

If `WRDS_ID` is missing, `db2pq` will ask for it and suggest adding
`WRDS_ID=...` to your project's `.env` file. If no matching WRDS password is
found in `.pgpass`, `db2pq` will prompt for your WRDS PostgreSQL password and
store it for next time.

### 2) WRDS to PostgreSQL

Use this path when you want to materialize WRDS tables into a local PostgreSQL
database first.

`wrds_update_pg()` is the main helper here:

```python
from db2pq import wrds_update_pg

wrds_update_pg(
    table_name="dsi",
    schema="crsp",
    dbname="research",
)
```

This path is useful when you want to:

- keep a local PostgreSQL mirror of selected WRDS tables
- build indexes or comments locally with `process_sql()` and `set_table_comment()`
- decouple downstream Parquet generation from live WRDS access

Related helpers:

- `wrds_get_tables(schema, wrds_id=None)` to list WRDS tables
- `process_sql(sql, ...)` to run SQL against the destination PostgreSQL database
- `set_table_comment(...)` to manage destination table comments

### 3) Local PostgreSQL to Parquet

Use this path when the source table is already in your local PostgreSQL
database, including tables created by `wrds_update_pg()`.

`pg_update_pq()` mirrors the behavior of `wrds_update_pq()` but reads from the
local PostgreSQL source:

```python
from db2pq import pg_update_pq

pg_update_pq(
    table_name="dsi",
    schema="crsp",
    database="research",
)
```

This path relies on the local PostgreSQL table comment for change detection.
If the table has no parseable last-modified comment, `pg_update_pq()` will not
export by default and will tell you to use `force=True`.

Use `db_to_pq()` when you want a direct local PostgreSQL to Parquet export
without the update check:

```python
from db2pq import db_to_pq

pq_file = db_to_pq(
    table_name="my_table",
    schema="public",
    host="localhost",
    database="mydb",
)

print(pq_file)
```

These examples use the default DuckDB-backed path. `engine="adbc"` is available,
but ADBC-based paths should currently be treated as experimental.

Use `db_schema_to_pq()` when you want to export an entire local PostgreSQL
schema:

```python
from db2pq import db_schema_to_pq

files = db_schema_to_pq(schema="public")
print(files)
```

### Other Export Path

`ibis_to_pq()` remains available when your starting point is an Ibis table
expression rather than a table name:

```python
from db2pq import ibis_to_pq

expr = con.table("my_table").filter(lambda t: t.id > 100)
ibis_to_pq(expr, "my_table.parquet", compression="zstd")
```

## Parquet layout

Files are organized as:

```text
<DATA_DIR>/<schema>/<table>.parquet
```

For example:

```text
/data/crsp/dsi.parquet
```

When `archive=True`, replaced files are moved under:

```text
<DATA_DIR>/<schema>/<archive_dir>/<table>_<timestamp>.parquet
```

## How it works

At a high level, the core PostgreSQL-to-Parquet flow has three stages:

1. Query planning
   `db2pq` inspects PostgreSQL metadata, applies `keep` / `drop`,
   normalizes user-supplied `col_types`, handles timestamp conversion rules,
   and builds a SQL `SELECT`.

2. Query execution
   The planned query is executed through either:
   - `engine="duckdb"`: DuckDB reads PostgreSQL and produces Arrow output
   - `engine="adbc"`: the PostgreSQL ADBC driver streams Arrow record batches directly

3. Parquet writing
   PyArrow writes the resulting Arrow batches/tables to Parquet, normalizing
   timestamps, repairing eligible decimal columns on the ADBC path, and
   buffering row groups with both row-count and byte-size limits.

This means the main export helpers now share the same SQL-planning logic even
when they use different execution engines.

### Engine defaults

`"duckdb"` remains the default engine. You can override it per call:

```python
db_to_pq("dsi", "crsp", engine="duckdb")
```

or, if you specifically want the ADBC path for a session, set a process-wide
default:

```python
from db2pq import set_default_engine

set_default_engine("adbc")
```

You can inspect the current setting with `get_default_engine()`, and the
environment variable `DB2PQ_ENGINE` provides the same kind of session-level
default when you prefer configuration outside Python.

In general, DuckDB-backed paths are the stable default. ADBC-backed paths are
still experimental and may be faster or slower depending on the workload and
driver behavior. If you need the most reliable path, prefer `engine="duckdb"`.

## Public API

From `db2pq`:

- `db_to_pq(table_name, schema, ...)`
- `pg_update_pq(table_name, schema, ...)`
- `ibis_to_pq(table, out_file, engine=None, **writer_kwargs)`
- `wrds_pg_to_pq(table_name, schema, ...)`
- `db_schema_to_pq(schema, ...)`
- `wrds_update_pq(table_name, schema, ...)`
- `wrds_update_schema(schema, ...)`
- `pq_list_files(schema, data_dir=None, archive=False, archive_dir=None)`
- `pq_last_modified(table_name=None, schema=None, data_dir=None, file_name=None, archive=False, archive_dir="archive")`
- `pq_archive(table_name=None, schema=None, data_dir=None, file_name=None, archive_dir=None)`
- `pq_restore(file_basename, schema, data_dir=None, archive=True, archive_dir=None)`
- `pq_remove(table_name=None, schema=None, data_dir=None, file_name=None, archive=False, archive_dir="archive")`
- `db_schema_tables(schema, ...)`
- `wrds_get_tables(schema, wrds_id=None)`
- `wrds_update_pg(table_name, schema, ...)`
- `pq_to_pg(table_name, schema, ...)`
- `pq_update_pg(table_name, schema, ...)`
- `process_sql(sql, ...)`
- `set_table_comment(schema=..., table_name=..., comment=..., ...)`
- `get_wrds_username(wrds_id=None)`
- `get_wrds_conninfo(username=None)`
- `find_pgpass_entry(conninfo, **kwargs)`
- `has_pgpass_password(conninfo, **kwargs)`
- `save_password(conninfo, password=None, **kwargs)`
- `ensure_wrds_credentials(wrds_id=None, interactive=True)`
- `set_default_engine(engine)`
- `get_default_engine()`
- `close_adbc_cached()`

The main workflow entry points are:

- WRDS to Parquet: `wrds_update_pq()`, `wrds_pg_to_pq()`
- WRDS to PostgreSQL: `wrds_update_pg()`
- Local PostgreSQL to Parquet: `pg_update_pq()`, `db_to_pq()`, `db_schema_to_pq()`
- Parquet to PostgreSQL: `pq_update_pg()`, `pq_to_pg()`

`wrds_update_pq()` supports SQL-style filtering via `where`, for example:

`wrds_update_pq("funda", "comp", where="indfmt = 'INDL' AND datafmt = 'STD'")`

`pg_update_pq()` provides the same update-to-parquet pattern for a local
PostgreSQL source table, using the local table comment for change detection.

`pq_update_pg()` provides the reverse update path from a parquet file in the
local repository into PostgreSQL, using parquet `last_modified` metadata for
change detection.

## Notes

- WRDS PostgreSQL access uses host `wrds-pgdata.wharton.upenn.edu` and port `9737`.
- `batched=True` (default) lowers memory usage for large tables.
- `engine="adbc"` streams Arrow record batches directly from PostgreSQL into
  Parquet and may reduce RAM use versus the default DuckDB path.
- `numeric_mode=None` (the default) keeps engine-specific "least
  interference" behavior: DuckDB preserves PostgreSQL `NUMERIC` columns as
  Arrow decimals, while ADBC defaults to text-backed numerics.
- `numeric_mode="text"` casts PostgreSQL `NUMERIC` columns to `TEXT` on both
  engines.
- `numeric_mode="float64"` casts PostgreSQL `NUMERIC` columns to `DOUBLE
  PRECISION` on both engines.
- `numeric_mode="decimal"` preserves native decimal output on DuckDB. On the
  ADBC path it transports numerics as `TEXT` and converts eligible columns
  back to Arrow decimals using PostgreSQL precision/scale metadata. Columns
  without usable metadata remain strings. `col_types` still takes precedence
  over the mode.
- `col_types` can be used to cast selected columns before writing Parquet.
- `keep`/`drop` accept regex pattern(s) in both `wrds_update_pq()` and
  `wrds_update_pg()`. If both are supplied, `drop` is applied before `keep`.
- `tz` defaults to `"UTC"` in both update paths:
  `wrds_update_pq()` uses it to interpret source naive timestamps before writing
  timezone-aware UTC parquet timestamps; `wrds_update_pg()` converts source
  `timestamp without time zone` columns using `AT TIME ZONE '<tz>'`.
- `process_sql()` executes SQL against the PostgreSQL destination resolved from
  the same `PGUSER`/`PGHOST`/`PGDATABASE`/`PGPORT` defaults used by
  `wrds_update_pg()`.
- `set_table_comment()` can be used the same way for destination PostgreSQL
  table comments without passing an engine argument.

## Development

Run editable install in this repository:

```bash
pip install -e .
```

With optional SAS dependency:

```bash
pip install -e ".[sas]"
```

## Project docs

- Docs index: `docs/README.md`
- Contributor guide: `CONTRIBUTING.md`
- Release process: `RELEASING.md`
- Changelog: `CHANGELOG.md`

## License

MIT License. See `LICENSE`.
