Metadata-Version: 2.4
Name: s3-log-extraction
Version: 1.9.2
Summary: Fast extraction of access summary data from S3 logs.
Project-URL: Documentation, https://github.com/dandi/s3-log-extraction/blob/master/README.md
Project-URL: Repository, https://github.com/dandi/s3-log-extraction/
Project-URL: Issues, https://github.com/dandi/s3-log-extraction/issues
Project-URL: Changelog, https://github.com/dandi/s3-log-extraction/blob/master/CHANGELOG.md
Author-email: Cody Baker <cody.c.baker.phd@gmail.com>
License: MIT License
        
        Copyright (c) 2025 s3-log-extraction
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE.txt
Keywords: AWS,S3,access stats,data infrastructure,download tracking,geolocation,logging,open data,usage analysis
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Unix
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.14
Requires-Dist: cryptography
Requires-Dist: natsort
Requires-Dist: pandas
Requires-Dist: psutil
Requires-Dist: pydantic
Requires-Dist: pyyaml
Requires-Dist: rich-click
Requires-Dist: tqdm
Provides-Extra: all
Requires-Dist: fsspec; extra == 'all'
Requires-Dist: ipinfo; extra == 'all'
Requires-Dist: opencage; extra == 'all'
Requires-Dist: polars; extra == 'all'
Requires-Dist: s3fs; extra == 'all'
Requires-Dist: s5cmd; extra == 'all'
Provides-Extra: geolocation
Requires-Dist: ipinfo; extra == 'geolocation'
Requires-Dist: opencage; extra == 'geolocation'
Provides-Extra: remote
Requires-Dist: fsspec; extra == 'remote'
Requires-Dist: s3fs; extra == 'remote'
Requires-Dist: s5cmd; extra == 'remote'
Provides-Extra: sharing
Requires-Dist: polars; extra == 'sharing'
Description-Content-Type: text/markdown

<div align="center">
  <picture>
    <img alt="s3-log-extraction logo" src="https://raw.githubusercontent.com/dandi/s3-log-extraction/main/docs/s3-log-extraction-logo.svg" width="200">
  </picture>

<h1>S3 Log Extraction</h1>
<p>
    <a href="https://pypi.org/project/s3-log-extraction/"><img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/s3-log-extraction.svg"></a>
    <a href="https://pypi.org/project/s3-log-extraction/"><img alt="Ubuntu" src="https://img.shields.io/badge/Ubuntu-E95420?style=flat&logo=ubuntu&logoColor=white"></a>
    <a href="https://codecov.io/github/dandi/s3-log-extraction?branch=main"><img alt="codecov" src="https://codecov.io/github/dandi/s3-log-extraction/coverage.svg?branch=main"></a>
    <a href="https://github.com/dandi/s3-log-extraction/actions/workflows/daily_tests.yml/badge.svg"><img alt="Daily tests" src="https://github.com/dandi/s3-log-extraction/actions/workflows/daily_tests.yml/badge.svg"></a>
    <a href="https://github.com/dandi/s3-log-extraction/actions/workflows/daily_remote_tests.yml/badge.svg"><img alt="Daily tests (remote)" src="https://github.com/dandi/s3-log-extraction/actions/workflows/daily_remote_tests.yml/badge.svg"></a>
</p>
<p>
    <a href="https://pypi.org/project/s3-log-extraction/"><img alt="PyPI latest release version" src="https://badge.fury.io/py/s3-log-extraction.svg?id=py&kill_cache=1"></a>
    <a href="https://github.com/dandi/s3-log-extraction/blob/main/LICENSE.txt"><img alt="License: MIT" src="https://img.shields.io/pypi/l/s3-log-extraction.svg"></a>
    <a href="https://zenodo.org/badge/latestdoi/826995164"><img alt="DOI" src="https://img.shields.io/github/v/release/dandi/s3-log-extraction?label=DOI&color=blue"></a>
</p>
<p>
    <a href="https://github.com/psf/black"><img alt="Python code style: Black" src="https://img.shields.io/badge/python_code_style-black-000000.svg"></a>
    <a href="https://github.com/astral-sh/ruff"><img alt="Python code style: Ruff" src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json"></a>
</p>
</div>

Fast extraction of access summary data from S3 logs.

Originally developed for the [DANDI Archive](https://dandiarchive.org/).

Read more about [S3 logging on AWS](https://web.archive.org/web/20240807191829/https://docs.aws.amazon.com/AmazonS3/latest/userguide/LogFormat.html).

⚠️ This package currently only supports processing of access data (`GET`-type requests); if you wish to use this package for other types of requests (`PUT`/`DELETE`/`HEAD`, etc.) please reach out by raising an issue. ⚠️



## Installation

```bash
pip install s3-log-extraction
```

**Note for Windows users:** This package requires GAWK and is not natively supported on Windows. Windows users should use [Windows Subsystem for Linux (WSL)](https://docs.microsoft.com/en-us/windows/wsl/install) to run this package.


# Workflow

```mermaid
flowchart TD
    A[Configure cache<br/><br/>Initialize home and cache directories]
    B[Extract logs<br/><br/>Process raw S3 logs and store minimal extracted data]
    C[Update IP indexes<br/><br/>Generate anonymized indexes for each IP address]
    D[Update region codes<br/><br/>Map IPs to ISO 3166 region codes using external API]
    E[Update coordinates<br/><br/>Convert region codes to latitude/longitude for mapping]
    F[Generate summaries<br/><br/>Create per-dataset summaries for reporting]
    G[Generate totals<br/><br/>Aggregate statistics across datasets or archive]
    H[Share!<br/><br/>Post the summaries and totals in a public data repository]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
```



## Generic Usage

[Optional] Configure a non-default cache directory on a mounted disk that has sufficient space (the default is placed under `~/.cache`). This will be the main location where extracted logs and other useful information will be stored.

```bash
s3logextraction config cache set <new cache directory>
```

To extract the logs:

```bash
s3logextraction extract <log directory>
```

To override the cache directory for a single extraction run (without changing global config):

```bash
s3logextraction extract <log directory> --cache <cache directory>
```

**NOTE**: If you feel like this command is taking a long time on your system, DO NOT interrupt it via `ctrl+C` or `pkill`. Instead, you can safely interrupt it by running:

```bash
s3logextraction stop
```

This will allow it to finish processing the current batch of logs and then exit gracefully.

After your logs are extracted, generate anonymized indexes for each IP address:

```bash
s3logextraction update ip indexes
````

Next, ensure some required environment variables related to external services are set:

1. **IPINFO_API_KEY**
   - Access token for the [ipinfo.io](https://ipinfo.io) service.
   - Extracts geographic region information in ISO 3166 format (e.g. "US/California") for anonymized statistics.
2. **OPENCAGE_API_KEY**
   - Access token for the [opencagedata.com](https://opencagedata.com/) service.
   - Maps the ISO 3166 codes from the first step to latitude and longitude coordinates for the geographic heat maps used in visualizations.

```bash
export IPINFO_API_KEY="your_token_here"
export OPENCAGE_API_KEY="your_token_here"
```

To update the region codes and their coordinates:

```bash
s3logextraction update ip regions
s3logextraction update ip coordinates
```

To generate top-level summaries and totals (that is, per dataset):

```bash
s3logextraction update summaries
s3logextraction update totals
```

Finally, to generate archive-wide summaries and totals:

```bash
s3logextraction update summaries --mode archive
s3logextraction update totals --mode archive
```



## How to Setup Logging

If you're new to using AWS S3 buckets and haven't yet enabled the logging this project utilizes, you can follow these simple instructions to get started.

1) Log into your AWS console.
2) Create a new **PRIVATE** S3 bucket - typically the name of the new bucket is the name of the one you wish to enable logging on with `-logs` added to the end. For example, `dandiarchive-logs`.
   - **NEVER** share this bucket publicly as it contains sensitive information.
3) Navigate back to the S3 bucket you wish to enable logging on.
4) Under the `Properties` tab, scroll down to the section called `Server access logging` and select `Edit`.
5) Toggle the selection to `Enable`, then specify the destination where logs will be stored as the new S3 bucket you created in step (2).
6) Recommended:
   - Specify the `Log object key format` as the nested pattern shown below.
   - Ensure the `Source of date used in log object key format` is the `S3 event time`.
    ![image](https://github.com/user-attachments/assets/8e8de5fc-5a58-4dda-8866-8bf71277e2e7)



## Developer Notes

Throughout the codebase, various processes are referred to in the following ways:

- parallelized: The process can be run in parallel across multiple workers, which increases throughput.
- interruptible: The process can be safely interrupted (`ctrl+C` or `pkill`) with only a very low chance of causing corruption. For parallelized interruption you may have to either `pkill` the main dispatch process or spam `ctrl+C` multiple times.
- updatable: The process can be resumed from the last checkpoint without losing any progress. It can also be run fresh at different times, such as on a CRON cycle, and it will only interact with unprocessed data.

### Performance

By leveraging `GAWK`, this version of the S3 log handling is considerably more efficient than the previous attempts.

The previous attempt used a multistep process which took several days to run (even on multiple workers). It also required an additional ~200 GB cache to allow lazy updates of the per-object bins.

This version requires no intermediate cache, stores only the minimal amount of data to be shared, and takes less than a day to do a fresh run (and is also lazy regarding daily CRON updates).

### Validation

In lieu of attempting fully validated parsing of each and every line from the log files (which is a hard, unsolved problem - see [s3-log-parser](https://github.com/dandi/s3-log-parser)), we instead validate the heuristics in a targeted manner through specific validation scripts.

These can also be used to verify the current state of the extraction process, such as warning about corrupt records or incomplete cache files.

### Submission of line decoding errors

Should you discover any lines in your S3 log files that cause failures in the codebase, please email them to the core maintainer (cody.c.baker.phd@gmail.com) before raising issues or submitting PRs contributing them as examples, to more easily correct any aspects that might require anonymization.
