Metadata-Version: 2.4
Name: tmautils
Version: 0.2.0
Summary: Internet Measurement Research Utilities (meta-package)
License-Expression: MPL-2.0
License-File: LICENSE
Author: Sulyab Thottungal Valapu
Author-email: sulyabtv@gmail.com
Requires-Python: >=3.13,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: all
Provides-Extra: bgp
Provides-Extra: dns
Provides-Extra: enrich-ip
Provides-Extra: pki
Provides-Extra: rdap
Requires-Dist: tmautils-bgp (>=0.2.0,<0.3.0) ; extra == "all"
Requires-Dist: tmautils-bgp (>=0.2.0,<0.3.0) ; extra == "bgp"
Requires-Dist: tmautils-core (>=0.2.0,<0.3.0)
Requires-Dist: tmautils-dns (>=0.2.0,<0.3.0) ; extra == "all"
Requires-Dist: tmautils-dns (>=0.2.0,<0.3.0) ; extra == "dns"
Requires-Dist: tmautils-enrich-ip (>=0.2.0,<0.3.0) ; extra == "all"
Requires-Dist: tmautils-enrich-ip (>=0.2.0,<0.3.0) ; extra == "enrich-ip"
Requires-Dist: tmautils-pki (>=0.2.0,<0.3.0) ; extra == "all"
Requires-Dist: tmautils-pki (>=0.2.0,<0.3.0) ; extra == "pki"
Requires-Dist: tmautils-rdap (>=0.2.0,<0.3.0) ; extra == "all"
Requires-Dist: tmautils-rdap (>=0.2.0,<0.3.0) ; extra == "rdap"
Project-URL: Repository, https://github.com/sulyabtv/tmautils
Description-Content-Type: text/markdown

- [tmautils](#tmautils)
    - [Installation](#installation)
    - [Quick Start](#quick-start)
        - [What's Included](#whats-included)
            - [`tmautils.core`](#tmautilscore)
            - [`tmautils.db`](#tmautilsdb)
            - [`tmautils.bgp`](#tmautilsbgp)
            - [`tmautils.rdap`](#tmautilsrdap)
            - [`tmautils.dns`](#tmautilsdns)
            - [`tmautils.enrich_ip`](#tmautilsenrich_ip)
            - [`tmautils.pki`](#tmautilspki)
        - [Writing Your First Program](#writing-your-first-program)
    - [The boring stuff](#the-boring-stuff)
        - [Why does this library exist?](#why-does-this-library-exist)
        - [Philosophy](#philosophy)
        - [Design Choices](#design-choices)
            - [Self-Contained Utilities](#self-contained-utilities)
            - [Async-First Approach](#async-first-approach)
            - [DuckDB as the Middle Layer for Database Stuff](#duckdb-as-the-middle-layer-for-database-stuff)
    - [License](#license)
    - [Contributing](#contributing)

# tmautils
A collection of Python utilities for Internet measurement research.

`tma` could stand for *Traffic Measurement and Analysis*, like the [academic conference](https://tma.ifip.org/), or *Too Much Analysis*, depending on how frustrated you are with your research. (You get to choose.)

## Installation

Install everything:

```bash
pip install "tmautils[all]"
```

Install only the sub-packages you need:

```bash
pip install "tmautils[rdap,pki]"    # specific sub-packages (+ core automatically)
pip install tmautils                # just core infrastructure
```

Available extras: `bgp`, `dns`, `enrich-ip`, `pki`, `rdap`, `all`.

You can also install sub-packages directly by name:

```bash
pip install tmautils-rdap           # same as tmautils[rdap]
pip install tmautils-bgp
pip install tmautils-dns
pip install tmautils-enrich-ip
pip install tmautils-pki
pip install tmautils-core           # same as bare tmautils
```

All packages share the `tmautils.*` namespace.
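
If you are unsure which sub-packages ended up in your environment, a quick check along the following lines may help. This is only a sketch: the sub-package names are taken from the list in this README, and `available_subpackages()` is a hypothetical helper, not part of `tmautils`.

```python
import importlib.util

# Sub-package names as listed in this README.
SUBPACKAGES = ["core", "db", "bgp", "rdap", "dns", "enrich_ip", "pki"]


def available_subpackages() -> dict[str, bool]:
    """Map each tmautils sub-package name to whether it can be located."""
    found = {}
    for name in SUBPACKAGES:
        try:
            spec = importlib.util.find_spec(f"tmautils.{name}")
        except ImportError:
            # Raised when the parent `tmautils` package cannot be imported at all.
            spec = None
        found[name] = spec is not None
    return found


if __name__ == "__main__":
    for name, present in available_subpackages().items():
        print(f"tmautils.{name}: {'installed' if present else 'missing'}")
```

Note that `find_spec()` imports the parent `tmautils` package in order to locate a sub-package, which is why the `ImportError` guard is there.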

## Quick Start
`tmautils` provides two types of APIs:
- **Utility classes**: self-contained tools that manage their own directories and logging (e.g., `RevocationChecker`, `CzdsDownloadUtil`), and
- **Standalone functions**: stateless helpers you call directly (e.g., `get_cert()`, `request_with_retry()`)

As a basic example, you can use `get_cert()` (a function) to download the TLS certificate presented by a server, and `RevocationChecker` (a utility class) to check whether said certificate has been revoked:

```python
import asyncio
from pathlib import Path

from tmautils.pki import get_cert, RevocationChecker


async def main():
    cert = await get_cert("www.example.com")  # using the get_cert() function

    checker = RevocationChecker(  # instantiating RevocationChecker
        working_root=Path("/tmp"),
    )
    result = await checker.check_cert_chain(cert)  # using RevocationChecker's API


asyncio.run(main())  # `await` only works inside a coroutine, so run one
```

### What's Included
`tmautils` is divided into sub-packages roughly based on the functionality provided by the utilities / functions they contain. Sub-packages and their user-facing utilities and functions are listed below.

#### `tmautils.core`
`tmautils.core` contains core infrastructure used by all other sub-packages, but many exports are useful for writing custom user code. Provided by `tmautils-core`.

| Utility / Function | What it does |
| --- | --- |
| `IOHelper` | Directory structure and logging for utilities |
| `LogHelper` | Logging configuration |
| `AsyncRateLimiter` | Rate limiter with concurrency and throughput limits |
| `RetryConfig` | Configuration for HTTP retry behavior |
| `get_logger_from_helper()` | Get the configured or no-op logger from LogHelper |
| `run_coro_sync()` | Run async code from sync context |
| `request_with_retry()` | HTTP requests with retry and backoff |
| `get_with_retry()` | Convenience wrapper for GET with retry |
| `gzip_file()`, `gunzip_file()` | Compress/decompress a file with gzip |
| `try_convert_ip()` | Try to convert an IP address string to an IP address object |
| `is_ipv4()`, `is_ipv6()` | Check if string is valid IPv4/IPv6 address |

#### `tmautils.db`
`tmautils.db` contains storage backends. Also provided by `tmautils-core`.

| Utility / Function | What it does |
| --- | --- |
| `BufferedWriter` | Batched writes to storage backends |
| `DuckDbStore` | DuckDB storage interface |
| `DuckDbBackend` | DuckDB backend for BufferedWriter |
| `DuckLakeStore` | DuckLake storage interface |
| `DuckLakeBackend` | DuckLake backend for BufferedWriter |
| `ParquetBackend` | Parquet backend for BufferedWriter |
| `DuckDbInetLpmIndex` | In-memory LPM index for fast IP lookups |
| `pydantic_to_arrow()` | Convert Pydantic models to Arrow tables |

#### `tmautils.bgp`
Provided by `tmautils-bgp`.

| Utility / Function | What it does |
| --- | --- |
| `PyasnUtil` | IP to ASN lookups (wraps `pyasn`) |
| `ASdbCategoryUtil` | AS categorization using Stanford ASdb |
| `CaidaAsOrgInfoUtil` | AS to Organization mapping using CAIDA's AS2Org |

#### `tmautils.rdap`
Provided by `tmautils-rdap`.

| Utility / Function | What it does |
| --- | --- |
| `RdapClient` | RDAP queries for domains, IPs, ASNs, entities, and nameservers |

#### `tmautils.dns`
Provided by `tmautils-dns`.

| Utility / Function | What it does |
| --- | --- |
| `AsyncDnsPythonUtil` | Async DNS resolution (wraps `dnspython`) |
| `CzdsDownloadUtil` | Download ICANN CZDS zone files |
| `OpenIntelZoneStreamUtil` | Subscribe to OpenINTEL ZoneStream |
| `dns_msg_semantic_hash()` | Compute semantic hash of DNS messages |

#### `tmautils.enrich_ip`
Provided by `tmautils-enrich-ip`.

| Utility / Function | What it does |
| --- | --- |
| `IPApiBatchUtil` | Interact with `ip-api.com`'s batch API |
| `IPInfoLiteUtil` | Interact with IPinfo's Lite dataset |
| `IpInfoPrivacyUtil` | Privacy detection using IPinfo's database |
| `IpInfoCarrierUtil` | Mobile carrier lookup using IPinfo's database |
| `ChromePrefetchUtil` | Check if an IP address belongs to Chrome Prefetch Proxy |

#### `tmautils.pki`
Provided by `tmautils-pki`.

| Utility / Function | What it does |
| --- | --- |
| `RevocationChecker` | Check certificate revocation (OCSP/CRL) |
| `create_ssl_context()` | Create configurable SSL context |
| `get_cert()` | Fetch TLS certificate from a server |
| `get_cert_chain()` | Fetch full certificate chain from a server |
| `fetch_issuer_cert()` | Fetch issuer certificate via AIA extension |
| `fetch_issuer_chain()` | Build certificate chain from leaf to root |

### Writing Your First Program
As you've probably noticed by now, there is no "one way" to use `tmautils`: what utility/function you use will be driven by your use case, and the options supported vary by individual utilities. However, I have tried to include useful documentation with each utility/function.

That said, there are two "patterns" you will see across the API:
- All utilities support modification of storage/logging behavior through constructor kwargs. See [here](#self-contained-utilities) for details.
- I/O-heavy APIs are written with an `async`-first approach; but in most cases a sync wrapper is provided. See [here](#async-first-approach) for details.

## The boring stuff
### Why does this library exist?
This library grew out of code written by me (Sulyab) for my PhD research. At some point, it made sense to pull out the reusable bits into a common place, and establish some directory structure for data to keep track of what goes where. Days of debugging led to the addition of logging features, days of fighting with the GIL led to the addition of multiprocessing helpers, and so on.

### Philosophy
There are three main tenets that shaped the design choices of this library:

1. *Do NOT reinvent the wheel (when sufficiently good wheels exist).* There are several great Python packages out there that can help with specific Internet measurement analysis tasks, like [pyasn](https://github.com/hadiasghari/pyasn) for IP to ASN lookups. In such cases, it is preferable to write thin wrappers around such packages (such as `PyasnUtil` for `pyasn`).

2. *However, sometimes reinventing the wheel makes more sense.* Usually this happens when the "best" Python package available for a use case is not "sufficiently" good according to certain criteria. For example, it may lack an `async` API or aggressive caching, two reasons why `RevocationChecker` exists instead of relying on [pki-tools](https://github.com/fulder/pki-tools) for certificate revocation lookups.

3. *Solve the problem at hand first; the "perfect" solution can come later.* This is an important one, and the one that may affect you, the user, the most. As [mentioned](#why-does-this-library-exist), each utility in this library was written to address specific needs at the time. As such, some utilities may support feature X but not feature Y, and some "pieces" may be more or less polished than others. **However, I am always looking to improve the code and add more features, so please consider contributing!**

### Design Choices
While this library is a grab bag of utilities, I have tried to keep some design choices consistent throughout:

#### Self-Contained Utilities
Each utility in this library is designed to be "self-contained". Typically, you pass a `working_root` parameter when you instantiate a utility, say `AbcUtil`. This action will create a directory named `AbcUtil` in `working_root`, with subdirectories such as `logs`, `raw` and `cache`. (There will be another level of subdirectories if there are multiple instances of the same utility.) The following is an example:

```python
from tmautils.dns import CzdsDownloadUtil
from pathlib import Path
czds = CzdsDownloadUtil(working_root=Path("/data"))
# Creates: /data/CzdsDownloadUtil/raw/, /data/CzdsDownloadUtil/logs/, etc.
```

Under the hood, most of this work is done by `IOHelper`, which takes care of the directory structure and instantiates `LogHelper` to take care of logging.
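
To make the layout concrete, here is a rough sketch of the kind of directory setup described above. It is not the actual `IOHelper` implementation: `make_util_dirs()` and `DEFAULT_SUBDIRS` are hypothetical names, and the real set of subdirectories may differ.

```python
from pathlib import Path
import tempfile

# Hypothetical default; the real IOHelper may create a different set.
DEFAULT_SUBDIRS = ("raw", "logs", "cache")


def make_util_dirs(
    working_root: Path,
    util_name: str,
    subdirs: tuple[str, ...] = DEFAULT_SUBDIRS,
) -> dict[str, Path]:
    """Create <working_root>/<util_name>/<subdir> for each subdir and return the paths."""
    base = working_root / util_name
    paths = {}
    for sub in subdirs:
        path = base / sub
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        paths[sub] = path
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dirs = make_util_dirs(Path(tmp), "CzdsDownloadUtil")
        for name, path in dirs.items():
            print(name, "->", path)
```

Keeping everything under one per-utility directory is what makes each utility easy to move, inspect, or delete on its own.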

You can pass additional arguments to `IOHelper` to modify some default behavior, such as making subdirectories symlinks, and turning off file logging. Example:

```python
czds = CzdsDownloadUtil(
    working_root=Path("/data"),
    # The following arguments are passed to IOHelper
    # make /data/CzdsDownloadUtil/raw a symlink to ~/downloads/czds
    raw_dir_symlink_to=Path("~/downloads/czds/").expanduser(),
    # IOHelper in turn passes the following argument to LogHelper
    logging_kwargs={
        "file_level": None, # No file logging
    }
)
```

If you wish, you can delegate the storage/logging management of your custom program to `IOHelper` by instantiating it directly:

```python
from pathlib import Path
from tmautils.core import IOHelper
io = IOHelper(
    "MyProgram",
    working_root=Path("/data"),
    # by default, IOHelper will configure the subdirectories:
    # raw/, logs/, processed/, results/
)
# Access storage
out_path = io.raw / "hello.txt" # Returns a pathlib.Path object
out_path.write_text("Hello, World!")
# Use logger
io.logger.warning("I have no clue what I am doing.")
```

#### Async-First Approach
Many utilities in this library do I/O-heavy work (e.g., downloading zone files, making HTTP requests, checking certificate revocation status). For better concurrency, I/O-heavy utilities are written with an `async`-first approach. In such cases, the API also provides a sync version in case you do not want to bother with `asyncio`:

```python
result = await util.fetch_data("param")    # Async
result = util.fetch_data_sync("param")     # Sync wrapper
```

`async` methods use the base name (`fetch_data()`), and sync wrappers append `_sync` (`fetch_data_sync()`). The sync wrappers simply call `run_coro_sync()` on the `async` method.
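
The naming convention can be sketched with nothing but the standard library. Everything here is hypothetical: `DemoUtil` is a made-up utility, and `run_sync()` is a simplified stand-in for `run_coro_sync()` that assumes no event loop is already running (the real helper may handle that case differently).

```python
import asyncio


def run_sync(coro):
    """Hypothetical stand-in for run_coro_sync(): run a coroutine to completion."""
    # asyncio.run() raises RuntimeError when called from a running event loop.
    return asyncio.run(coro)


class DemoUtil:
    """Hypothetical utility that follows the base-name / `_sync` convention."""

    async def fetch_data(self, param: str) -> str:
        await asyncio.sleep(0)  # placeholder for real I/O
        return f"data for {param}"

    def fetch_data_sync(self, param: str) -> str:
        # The sync wrapper only drives the async method; no logic is duplicated.
        return run_sync(self.fetch_data(param))


if __name__ == "__main__":
    print(DemoUtil().fetch_data_sync("param"))
```

Writing the logic once in the `async` method and deriving the sync version from it keeps the two APIs from drifting apart.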

#### DuckDB as the Middle Layer for Database Stuff
After experimenting with different database backends, the current direction is to use [DuckDB](https://duckdb.org/) as a unified intermediate layer. DuckDB supports several popular backends (CSV, Parquet, SQLite, and even remote sources) and allows executing SQL queries on those backends. This allows us to write code at the level of DuckDB connections rather than add support for each backend separately.

New code uses the following flow: Pydantic models -> Arrow table -> DuckDB insert. There is a fair amount of instrumentation around this which you can find in `tmautils.db`. For a good example of how to write a utility using this pattern, see `OpenIntelZoneStreamUtil`.

Some useful tools:
- Writing rows one-by-one to a database is slow. `BufferedWriter` accumulates rows in memory and flushes them in batches, working with multiple backends (thanks to DuckDB).
- For Longest Prefix Matching (LPM) lookups, we use `DuckDbInetLpmIndex` which builds an in-memory LPM index to speed up lookups. Once built, the index can be used in Python code or DuckDB SQL (via [UDFs](https://duckdb.org/docs/stable/clients/python/function)).
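
If LPM is unfamiliar, the idea behind the lookup can be illustrated in plain Python. This is a toy stand-in built on the standard `ipaddress` module, not how `DuckDbInetLpmIndex` is implemented; `SimpleLpmIndex` and the example prefixes and labels are made up for illustration.

```python
import ipaddress


class SimpleLpmIndex:
    """Toy longest-prefix-match index: one dict per (IP version, prefix length)."""

    def __init__(self):
        # (ip version, prefix length) -> {network address as int: value}
        self._tables = {}

    def add(self, prefix: str, value) -> None:
        net = ipaddress.ip_network(prefix, strict=False)
        key = (net.version, net.prefixlen)
        self._tables.setdefault(key, {})[int(net.network_address)] = value

    def lookup(self, ip: str):
        """Return the value of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(ip)
        addr_int = int(addr)
        max_len = addr.max_prefixlen  # 32 for IPv4, 128 for IPv6
        # Try the longest stored prefix lengths first.
        lengths = sorted(
            (length for version, length in self._tables if version == addr.version),
            reverse=True,
        )
        for length in lengths:
            mask = ((1 << length) - 1) << (max_len - length)
            table = self._tables[(addr.version, length)]
            network_int = addr_int & mask
            if network_int in table:
                return table[network_int]
        return None


if __name__ == "__main__":
    index = SimpleLpmIndex()
    index.add("10.0.0.0/8", "AS-A")
    index.add("10.1.0.0/16", "AS-B")
    print(index.lookup("10.1.2.3"))   # the /16 is more specific than the /8
    print(index.lookup("10.2.3.4"))   # only the /8 matches
    print(index.lookup("192.0.2.1"))  # no matching prefix
```

Each lookup costs at most one dictionary probe per distinct prefix length, which is the kind of saving an index gives over scanning every prefix.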

## License
This project is licensed under [MPL-2.0](LICENSE) (Mozilla Public License 2.0).

What this means in practice:
- If you modify an existing file, your modifications must remain MPL-2.0.
- You can license new files however you want. (But I won't merge them into `tmautils` unless they are MPL-2.0.)
- You can use this code alongside code under other licenses.

## Contributing
Contributions to `tmautils` are highly welcome! After all, there is much to do in Internet measurement research.

Since all the "real" code lives in subpackages, you probably want to contribute to their corresponding repos. If you want to submit a new subpackage to the `tmautils` metapackage, let me know.

Before writing code, please familiarize yourself with the [philosophy](#philosophy) and [design choices](#design-choices), and try to follow them, or talk to me about how they are stupid and we should do things differently. I want this library to be the best version of itself.

AI Policy: I don't consider AI tool usage any different from IDE usage. This also means that *you* are responsible for the code you write and *you* should inspect every line of code written by an LLM. This policy is currently (slightly) relaxed for tests.

