Metadata-Version: 2.4
Name: pvduck
Version: 0.1.0
Summary: Load Wikimedia pageview data to a duckdb
Keywords: wikimedia,pageviews,wikipedia,duckdb,analytics,data-pipeline
Author: Vegard Egeland
Author-email: Vegard Egeland <vegardegeland@gmail.com>
License-Expression: MIT
License-File: LICENSE.md
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Database
Classifier: Topic :: Internet :: Log Analysis
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Dist: duckdb>=1.4.1,<2.0.0
Requires-Dist: pvstream==0.1.0
Requires-Dist: pydantic>=2.12.2,<3.0.0
Requires-Dist: pyyaml>=6.0.3,<7.0.0
Requires-Dist: typer>=0.19.2,<1.0.0
Requires-Python: >=3.10
Project-URL: Bug, https://github.com/vegardege/pvduck/issues
Project-URL: Changelog, https://github.com/vegardege/pvduck/blob/master/CHANGELOG.md
Project-URL: Documentation, https://github.com/vegardege/pvduck
Project-URL: Repository, https://github.com/vegardege/pvduck
Description-Content-Type: text/markdown

# pvduck

[![Lint, Test, Build, Deploy](https://github.com/vegardege/pvduck/actions/workflows/lint-test-build-deploy.yml/badge.svg)](https://github.com/vegardege/pvduck/actions/workflows/lint-test-build-deploy.yml)

`pvduck` is a cli tool allowing you to sample, download, filter, parse, and
aggregate Wikimedia pageviews dumps, creating an overview of the most visited
pages on a sampled subset of Wikimedia pages in a `duckdb` database.

This is useful if you want a rough overview of pageviews for a specific
subset of pages, but lack access to the analytics servers and don't want to
host a full HDFS cluster yourself. Instead of storing several GB per day,
you can aggregate years worth of data in a small database holding just the
data you need.

The tool was developed for my own weekend projects, where gauging popularity
without perfect precision or history is very helpful.

## Installation

### Recommended: pipx (installed)

Install `pvduck` in an isolated environment with [pipx](https://pipx.pypa.io/):

```bash
pipx install pvduck
```

Then use it directly:
```bash
pvduck create myproject
pvduck sync myproject
```

### Alternative: uvx (no installation)

Run `pvduck` without installing using [uvx](https://docs.astral.sh/uv/):

```bash
uvx pvduck create myproject
uvx pvduck sync myproject
```

Perfect for trying the tool without commitment!

### Alternative: Docker

For containerized environments:

```bash
docker run --rm -it \
    -v ~/.config:/root/.config \
    -v ~/.local/share:/root/.local/share \
    vegardege/pvduck:latest \
    create myproject
```

Note: Use `-it` for commands that open an editor.

## Usage

Call `pvduck --help` for instructions and `pvduck --install-completion` to
install auto-completion in your shell, both of which will help.

The tool has six commands:

| Command                  | Description                                                    |
| ------------------------ | -------------------------------------------------------------- |
| `create <project_name>`  | Create a new project                                           |
| `edit <project_name>`    | Edit project configuration                                     |
| `rm <project_name>`      | Delete config and database                                     |
| `sync <project_name>`    | Download missing data (if any) and aggregate into the database |
| `open <project_name>`    | Open the project's database in `duckdb`                        |
| `status <project_name>`  | See progress status for the project                            |
| `compact <project_name>` | Reclaim disk space according to `duckdb` best practice         |
| `ls`                     | List all existing projects                                     |

By default, `sync` will run until it exhausted the date range given with the
given sample rate. If you want to run it for a limited time only, apply the
`max_files` option to stop after a specific number of files.

Note that syncing can be memory intensive. It operates in chunks of 1 000 000
rows by default, which can be modified with the `PVDUCK_CHUNK_SIZE` environment
variable. Increase the value for faster syncs, decrease it for more memory
efficient (but slower) execution.

When you create a new project, you can define the configuration:

| Param         | Description                                                                  |
| ------------- | ---------------------------------------------------------------------------- |
| `base_url`    | Which Wikimedia mirror to use (recommended: use mirror close to you)         |
| `sleep_time`  | How many seconds to wait between each file download                          |
| `start_date`  | Date of the first dump file to download                                      |
| `end_date`    | Date of the last dump file to download (or blank for current date expanding) |
| `sample_rate` | Probability of downloading each hourly file in the interval                  |

In addition, the config file contains filters to reduce the size of the
dataset. All filters can be set to blank values, which means no rows are
excluded.

| Filter         | Type        | Description                                                 |
| -------------- | ----------- | ----------------------------------------------------------- |
| `line_regex`   | `regex`     | Regular expression used to filter lines before parsing      |
| `page_title`   | `regex`     | Regular expression used to filter page titles after parsing |
| `domain_codes` | `list[str]` | List of domain codes to accept                              |
| `min_views`    | `int`       | Minimum amount of views needed to be accepted               |
| `max_views`    | `int`       | Maximum amount of views allowed                             |
| `languages`    | `list[str]` | List of languages to accept                                 |
| `domains`      | `list[str]` | List of domains to accept                                   |
| `mobile`       | `bool`      | If set, filter on whether the row belongs to a mobile site  |

> [!IMPORTANT]  
> The `sync` operation is destructive. It keeps track of which files you have
> downloaded, but can not revert any aggregation operations. As a result, only
> some parameters and filters can be changed without putting the database in an
> inconsistent state. Notably, you _can_ expand the date range and increase the
> sample rate without issues.

## Test

```
uv run pytest tests/ --cov=src

Name                       Stmts   Miss  Cover
----------------------------------------------
src/pvduck/__init__.py         0      0   100%
src/pvduck/cli.py            108      0   100%
src/pvduck/config.py          69      0   100%
src/pvduck/db.py              70      0   100%
src/pvduck/project.py         21      0   100%
src/pvduck/stream.py          20      0   100%
src/pvduck/timeseries.py      32      0   100%
src/pvduck/validators.py      22      0   100%
src/pvduck/wikimedia.py        8      0   100%
----------------------------------------------
TOTAL                        350      0   100%
```

> [!CAUTION]
> Tests are separated in unit tests and integration tests. The integration
> tests take several minutes to run and downloads files from the configured
> server. Don't run the full test suite frivolously.
