Metadata-Version: 2.4
Name: agent-census
Version: 0.0.1
Summary: Characterize the clients hitting a web site by analyzing its access logs.
Author-email: Mark Nottingham <mnot@mnot.net>
License-Expression: MIT
Project-URL: homepage, https://github.com/mnot/agent-census
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: tomli; python_version < "3.11"
Provides-Extra: dev
Requires-Dist: mypy; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: pylint; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-md; extra == "dev"
Requires-Dist: validate-pyproject; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: tomli; extra == "dev"
Dynamic: license-file

# agent-census

*What's hitting your site, classified by how it behaves -- not just what it claims to be.*

Most of the traffic to a typical site isn't people; it's software, and a fair bit
of it lies about what it is. agent-census reads your access log and sorts the
clients by what they actually do -- whether they pull a page's sub-resources like
a browser, walk the site like a crawler, poll a feed on a schedule, or go looking
for known-vulnerable paths. Anything claiming to be a known crawler is checked
against DNS and published address ranges, so a Googlebot arriving from some random
datacentre gets called what it is. What you end up with is your traffic broken
down by what each client is for. The User-Agent still counts -- it's just treated
as a claim to weigh against behaviour and origin, not a fact to take on trust.

[Here's a sample report](https://projects.mnot.net/agent-census/) generated from a real access log.

## Install

```
pipx install agent-census
```

## Use

The simplest case is an Apache log in the default `combined` format:

```
agent-census analyze /var/log/apache2/access.log
```

You can pass several rotated logs at once. They're pooled into one analysis, so a
client that spans the rotation is counted once:

```
agent-census analyze /var/log/httpd/access.log*
```

For a custom format, pass the `LogFormat`/`CustomLog` directive string verbatim
from your Apache config. Tab separators (`\t`), quoted fields with spaces,
`%{...}x` SSL variables, and `%{...}e` environment variables are all handled:

```
agent-census analyze access.log \
    --log-format '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %D'
```

The presets `common`, `combined`, and `vhost_combined` are available via
`--log-format-preset`. Options may appear before, after, or between the log files.
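
To see what the `combined` format hands the analysis, here is a minimal Python sketch that splits one line into the fields listed under "What to log". It is an illustration only, not agent-census's own parser: the regular expression, the `parse_combined` name, and the sample line are assumptions made for this example.

```python
import re
from datetime import datetime

# Apache's default "combined" format:
#   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
# Quoted fields may contain backslash-escaped quotes, hence (?:[^"\\]|\\.)*
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" '
    r'"(?P<user_agent>(?:[^"\\]|\\.)*)"'
)


def parse_combined(line):
    """Return a dict of fields, or None if the line doesn't match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # %b logs "-" when no body bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    # The request line is "METHOD PATH PROTOCOL"; junk requests may be shorter.
    parts = fields["request"].split()
    fields["method"] = parts[0] if parts else ""
    fields["path"] = parts[1] if len(parts) > 1 else ""
    return fields


sample = (
    '203.0.113.66 - - [10/Oct/2024:13:55:36 +0000] '
    '"GET /feed.xml HTTP/1.1" 304 - "-" "ExampleReader/1.0"'
)
record = parse_combined(sample)
```

Lines that don't match return `None` rather than raising, so a caller can count them as parse skips.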

Cloudflare Logpush logs (newline-delimited JSON) are also supported, as another
preset:

```
agent-census analyze cloudflare-logs.json --log-format-preset cloudflare
```

Cloudflare logs carry the client's AS number, so network- and ASN-based detection
works without any extra configuration.

### What to log

The Apache `combined` format already carries everything the core analysis needs.
The `common` preset drops the User-Agent and the Referer, so prefer `combined`,
or a custom format that includes them.

Required (all present in `combined`):

- **Client address** (`%h`) -- the identity everything else groups on, and the
  basis for the network, datacentre, and crawler-verification checks.
- **Timestamp** (`%t`) -- timing regularity, peak request rate, the reported time
  range, and (with `--quiescent-hours`) freeing memory mid-run.
- **Request line** (`"%r"`) -- the method and path; the most load-bearing field,
  behind vulnerability probing, feed detection, path coverage, and crawl shape.
- **Status code** (`%>s`) -- the status mix, 404 storms, `304 Not Modified` (the
  `has-cache` tag), and robots.txt compliance.
- **User-Agent** (`"%{User-Agent}i"`) -- browser, bot, and declared-crawler
  recognition.

Recommended. The first two are already in `combined`; the rest aren't in any
preset, so add them to a custom `LogFormat` (quoted) -- they're worth it:

- **Referer** (`"%{Referer}i"`, in `combined`) -- referer-following, which
  separates crawlers from scrapers and flags fabricated referers.
- **Bytes sent** (`%b` or `%B`, in `combined`) -- the bandwidth figures in the
  report.
- **AS organisation and number** (`"%{MM_ASORG}e"` and `"%{MM_ASN}e"`, MaxMind
  `mod_maxminddb`) -- name datacentre clients by their hosting organisation, and
  recognise datacentres and ASN-listed crawlers by AS number. Much of
  [Networks and hosting](#networks-and-hosting) leans on these; log **both** (the
  number drives recognition, the org names it).
- **Content-Type** (`"%{Content-Type}o"`) -- the response media type, which
  sharpens feed-reader detection (an RSS/Atom type, not just a feed-shaped URL).
- **X-Forwarded-For** (`"%{X-Forwarded-For}i"`) -- if you're behind a CDN or
  proxy, for `--identity forwarded`.

Response time (`%D` / `%T`) and the virtual host are parsed if present but not
currently used by the analysis.

### Output

Output is Markdown by default. Pass `--html` for a self-contained, styled page
(one file, no external assets) you can open in a browser. Both formats work for
`analyze` and `inspect`:

```
agent-census analyze access.log --html -o census.html
```

The report opens with a summary of each kind, then a cross-tab of where each
kind's traffic came from (see [Networks and hosting](#networks-and-hosting)),
then the notable clients in each kind. Within a kind, clients that differ only
by IP address and origin AS -- same User-Agent, same tags -- are collapsed into
one row showing their combined traffic; in the HTML report a disclosure expands
to the per-IP/ASN breakdown, and `inspect` always lists them individually.

### robots.txt compliance

To check `robots.txt` compliance, give agent-census the file. A local copy is the
default, since it should match the period the log covers:

```
agent-census analyze access.log --robots-file ./robots.txt
```

Naming a host or URL instead fetches it over the network. A live `robots.txt` may
not match the rules that applied when the log was written, so the report flags it:

```
agent-census analyze access.log --host example.com
```

The summary's robots column reads `N✓ / M✗ / K?`: respected, ignored, or too few
requests to tell (a client that hasn't yet requested a disallowed path isn't
counted either way).
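
The underlying check can be sketched with Python's standard `urllib.robotparser`: given the rules and a client's requested paths, list the ones it was told not to fetch. This is a rough illustration, not agent-census's implementation; the `disallowed_hits` helper, the sample rules, and the `ExampleBot` name are made up for the example, and how the tool turns such hits into a verdict isn't shown here.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def disallowed_hits(robots_text, user_agent, paths):
    """Return the requested paths that robots.txt disallows for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


hits = disallowed_hits(
    ROBOTS_TXT, "ExampleBot/1.0", ["/", "/private/a", "/about"]
)
```

A client with no entries in `hits` hasn't necessarily respected the rules; it may just never have been tempted.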

### Verifying declared crawlers

A User-Agent claiming Googlebot proves nothing on its own. Verification checks the
client's IP against the crawler's published address ranges and its reverse/forward
DNS. It runs by default and makes network calls (DNS lookups, and the occasional
ranges fetch); turn it off for an offline, faster run:

```
agent-census analyze access.log --no-verify-bots
```

A verified crawler's IPs collapse into one entry keyed by its domain. A client
whose IP is outside the published ranges, or whose reverse DNS doesn't check out,
is classed `impersonator`, which means a forged identity that verification has
disproved. Misbehaviour is separate: a "Googlebot" that probes for `/.env` keeps
its declared kind and gets a `probing` tag (and `ignores-robots` if it earns one),
because a real crawler can still behave badly. With verification off there's
nothing to disprove the claim, so it stays a declared crawler with those tags.
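
The reverse/forward DNS half of that check is usually called forward-confirmed reverse DNS: look up the IP's PTR name, confirm it sits under a domain the crawler's operator publishes, then resolve that name forward and confirm it leads back to the same IP. Here is a minimal sketch of the technique, not the tool's own code. The `verify_crawler` function, its parameters, and the `bot.example` suffix are assumptions; the resolvers are injectable so the demonstration runs offline.

```python
import socket


def _system_reverse(ip):
    # PTR lookup; raises socket.herror (an OSError) if there is no record.
    return socket.gethostbyaddr(ip)[0]


def _system_forward(hostname):
    # Forward lookup; the address is the first element of each sockaddr.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def verify_crawler(ip, allowed_suffixes,
                   reverse=_system_reverse, forward=_system_forward):
    """True if ip's PTR name is under an allowed suffix and resolves back."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    if not any(hostname == s or hostname.endswith("." + s)
               for s in allowed_suffixes):
        return False
    try:
        # Compares textual addresses; a fuller version would normalise
        # them with the ipaddress module, which matters for IPv6.
        return ip in forward(hostname)
    except OSError:
        return False


# Offline demonstration with stub resolvers and documentation addresses.
PTR = {
    "192.0.2.10": "crawl-1.bot.example.",
    "198.51.100.7": "crawl-1.bot.example.",  # forged PTR record
}
A = {"crawl-1.bot.example": {"192.0.2.10"}}


def stub_reverse(ip):
    if ip not in PTR:
        raise OSError("no PTR record")
    return PTR[ip]


def stub_forward(hostname):
    if hostname not in A:
        raise OSError("name does not resolve")
    return A[hostname]


genuine = verify_crawler("192.0.2.10", ["bot.example"], stub_reverse, stub_forward)
forged = verify_crawler("198.51.100.7", ["bot.example"], stub_reverse, stub_forward)
```

The forward step is what defeats a forged PTR record: anyone who controls reverse DNS for their own address block can claim any name, but they can't make the operator's zone point back at them.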

### Networks and hosting

Where a client comes from matters. A "browser" arriving from a datacentre rather
than a consumer ISP is usually automation. agent-census recognises the major
cloud and hosting providers (AWS, Google Cloud, Cloudflare, Hetzner) from their
published IP ranges, folds shared-egress traffic (iCloud Private Relay, Tor) into
one entry per network, and breaks the kinds down by origin network in a cross-tab.
In the HTML report that table is interactive: switch between raw counts, share of
each kind, and share of each network, with the busier cells shaded.

Range lists are fetched and cached weekly by default. Pass `--no-fetch-ranges` to
stay offline and use only the bundled data.
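
Matching an address against published ranges is a containment test over CIDR blocks. A small sketch using Python's standard `ipaddress` module follows; the range table, the network names, and the `origin_network` helper are hypothetical, and real lists come from each provider's published data.

```python
import ipaddress

# Hypothetical range table built from documentation prefixes.
RANGES = {
    "example-cloud": ["192.0.2.0/24", "2001:db8:1::/48"],
    "example-hosting": ["198.51.100.0/25"],
}

NETWORKS = [
    (ipaddress.ip_network(cidr), name)
    for name, cidrs in RANGES.items()
    for cidr in cidrs
]


def origin_network(ip):
    """Return the name of the most specific matching range, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [
        (net.prefixlen, name)
        for net, name in NETWORKS
        if net.version == addr.version and addr in net
    ]
    # Prefer the longest prefix when ranges overlap.
    return max(matches)[1] if matches else None
```

A linear scan is fine for a sketch. With the many thousands of prefixes the large providers publish, a sorted or tree-based lookup would be the usual choice.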

If your log carries the client's autonomous-system details (for example from
MaxMind's `mod_maxminddb`: `%{MM_ASORG}e` for the organisation and `%{MM_ASN}e`
for the number, quoted in your `LogFormat`), datacentre clients are named by their
hosting organisation. You can also list extra AS numbers to treat as datacentres
in the bundled `datacenter_ranges.toml`.

### Inspecting a client

To see why a client was classified the way it was, use `inspect`. It shows every
signal that fired (including the runners-up), the measured features, the
`robots.txt` finding, and the request trace:

```
agent-census inspect access.log --kind vuln_scanner
agent-census inspect access.log --client 203.0.113.66
agent-census inspect access.log --kind scraper --network aws
```

`--network` matches a substring of the origin-network name and composes with
`--kind`, so the two together select a single cell of the cross-tab.

### Identity

How requests are grouped into clients is configurable, since no single rule fits
every deployment. The default, `ip_ua`, groups by (IP, User-Agent). Behind a CDN,
use `forwarded` (the left-most `X-Forwarded-For` address); for IP-rotating bots in one
range, `ip_ua_subnet`. The report notes how the chosen strategy fragmented or
merged the data, so you can judge whether it fit.

```
agent-census analyze access.log --identity forwarded
```
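
One way to picture the three strategies is as different functions from a request to a grouping key. The sketch below is an assumption-laden illustration rather than the tool's code: the `client_key` function is hypothetical, the /24 and /48 subnet sizes are guesses, and keeping the User-Agent in the `forwarded` key is also an assumption.

```python
import ipaddress


def client_key(ip, user_agent, strategy="ip_ua", forwarded_for=None):
    """Return the tuple that requests are grouped on."""
    if strategy == "forwarded" and forwarded_for:
        # The left-most X-Forwarded-For entry is the original client.
        ip = forwarded_for.split(",")[0].strip()
    elif strategy == "ip_ua_subnet":
        addr = ipaddress.ip_address(ip)
        prefix = 24 if addr.version == 4 else 48  # assumed subnet sizes
        # strict=False masks off the host bits instead of raising.
        ip = str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))
    return (ip, user_agent)
```

Under `ip_ua_subnet`, `192.0.2.10` and `192.0.2.77` with the same User-Agent produce the same key, so a bot rotating through that range is counted as one client.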

### Scoping to one site

If one server's log mixes several virtual hosts, `--vhost SUBSTRING` analyses
only the lines served for a matching host (matched against the logged `%v`, or
the `Host` header if you don't log `%v`). The filtered lines are reported as
excluded, separately from parse skips. `--vhost` is repeatable -- a line is kept
if it matches any of the given hosts.

```
agent-census analyze access.log --log-format-preset vhost_combined \
    --vhost mnot.net --vhost www.mnot.net
```

This also sidesteps a CDN artefact: if a slice of your traffic was proxied to
this origin under another hostname, those requests arrive from the CDN's IPs
(so they can't be attributed or crawler-verified). Scoping to your own host
drops that slice cleanly.

### Remembered settings

Some options are sticky, so you needn't retype them. `--log-format` /
`--log-format-preset`, `--identity`, and `--robots-file` / `--robots-url` are
saved to `~/.config/agent-census/config.json` and reused when a later run omits
them. Passing one updates the saved value.

## How it works

Classification is based on behaviour, not just the User-Agent (which is easy to
forge). Each client's requests are reduced to measured features: request volume,
status mix, timing regularity, sub-resource co-loading, path coverage, and the
like. A set of independent classifiers each vote for a kind, with a confidence and
the reasons behind it. The strongest vote wins, or `unknown` if nothing clears a
threshold. Secondary tags such as `verified`, `ignores-robots`, `datacenter`, and
`has-cache` annotate the result.

The confidence weights and the threshold are hand-tuned, so check the
classifications against your own logs before trusting the headline numbers.
`inspect` shows why any client landed where it did.
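
The voting scheme described above can be sketched in a few lines. Everything specific here is hypothetical: the two classifiers, the feature names, the weights, the `feed_reader` kind name, and the `0.5` threshold are invented for illustration, and the real classifiers are more numerous and more careful.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    kind: str
    confidence: float
    reason: str


THRESHOLD = 0.5  # hypothetical value


def feed_reader_vote(features):
    if (features.get("feed_share", 0.0) > 0.8
            and features.get("timing_regularity", 0.0) > 0.7):
        return Vote("feed_reader", 0.8, "polls a feed on a regular schedule")
    return None


def vuln_scanner_vote(features):
    probes = features.get("probe_paths", 0)
    if probes:
        confidence = min(1.0, 0.4 + 0.1 * probes)
        return Vote("vuln_scanner", confidence,
                    f"{probes} known-vulnerable paths requested")
    return None


CLASSIFIERS = [feed_reader_vote, vuln_scanner_vote]


def classify(features):
    """Return (kind, votes); the strongest vote wins if it clears THRESHOLD."""
    votes = [v for c in CLASSIFIERS if (v := c(features)) is not None]
    votes.sort(key=lambda v: v.confidence, reverse=True)
    if not votes or votes[0].confidence < THRESHOLD:
        return "unknown", votes
    return votes[0].kind, votes
```

Returning every vote, not just the winner, is what lets a report show the runners-up and the reasons behind each one.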

## Contributing

Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for the
development setup, conventions, and an outline of how the code fits together.
