Metadata-Version: 2.4
Name: big-code-analysis
Version: 1.0.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Dist: pytest>=8.0 ; extra == 'dev'
Requires-Dist: mypy>=1.10 ; extra == 'dev'
Requires-Dist: ruff>=0.6,<0.16 ; extra == 'dev'
Requires-Dist: pyright>=1.1.380 ; extra == 'dev'
Requires-Dist: maturin>=1.7,<2.0 ; extra == 'dev'
Requires-Dist: jupyter>=1.0 ; extra == 'examples'
Requires-Dist: nbconvert>=7.0 ; extra == 'examples'
Requires-Dist: pandas>=2.0 ; extra == 'examples'
Requires-Dist: matplotlib>=3.7 ; extra == 'examples'
Provides-Extra: dev
Provides-Extra: examples
Summary: Python bindings for the big-code-analysis Rust library
Keywords: metrics,tree-sitter,static-analysis
Author-email: Calixte Denizet <cdenizet@mozilla.com>, Elijah Zupancic <elijah@zupancic.name>
License-Expression: MPL-2.0
Requires-Python: >=3.12
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://github.com/dekobon/big-code-analysis

# big-code-analysis (Python bindings)

Python bindings for the
[`big-code-analysis`](https://github.com/dekobon/big-code-analysis)
Rust library — compute maintainability metrics for source code in
~20 languages using the same tree-sitter parsers the Rust crate
ships with.

**Full documentation:** the book's
[Python Bindings chapter](https://dekobon.github.io/big-code-analysis/python/index.html)
covers the install matrix, batch / async / SARIF recipes, and the
full error taxonomy. The README below is the quick reference
shown on PyPI.

All nine phases of the Python bindings work (issues #265–#273;
parent #103) have landed. The crate now ships single-file
analysis, the never-raise batch entry point, the `flatten_spaces`
flat-record iterator, explicit metric selection (`metrics=`),
SARIF 2.1.0 rendering (`to_sarif`), the strict ruff / mypy /
pyright tooling gate, manylinux wheel CI on Linux x86_64 +
aarch64, the book's "Python Bindings" chapter, and the end-user
example set covered below. See the
[CHANGELOG](../CHANGELOG.md) for the per-phase changes.

## Runnable examples

`big-code-analysis-py/examples/` is the canonical collection of
copy-paste recipes. Every file is executed under CI either via
`tests/test_book_examples.py` (the `.py` examples) or via
`jupyter nbconvert --execute` (the notebook), so a renamed kwarg
or removed function fails CI before the example can rot in the
docs.

| File | What it shows |
|------|---------------|
| `quick_start.py` | Single-file analysis + headline metric. Embedded by the book's *Quick start*. |
| `batch_processing.py` | `analyze_batch` + the `AnalysisError` discriminator. Embedded by *Batch processing*. |
| `flat_records.py` | `flatten_spaces` → sqlite for one file. Embedded by *Flat-record iteration*. |
| `metric_selection.py` | `metrics=` kwarg + dependency-pull behaviour. Embedded by *Metric selection*. |
| `sarif_output.py` | Minimal SARIF rendering. Embedded by *SARIF output*. |
| `errors_taxonomy.py` | The full exception map across the entry points. Embedded by *Error handling*. |
| `async_patterns.py` | `asyncio.to_thread` (canonical) vs the in-loop anti-pattern. Embedded by *Async patterns*. |
| `cli_parity.py` | Byte-for-byte parity smoke test vs `bca metrics --output-format json`. Wired into `make py-test`. |
| `pipeline_db.py` | Directory walk → `analyze_batch` → `flatten_spaces` → sqlite top-N, with a deliberately broken file to exercise the never-raise contract. |
| `sarif_upload.py` | SARIF emission tuned for GitHub Code Scanning (`github/codeql-action/upload-sarif@v3`). |
| `jupyter_quickstart.ipynb` | Pandas DataFrame + matplotlib `cyclomatic.sum` per function + top-N. Executed in CI via `python-examples-nbconvert`. |

## Installation

The package is not yet published on PyPI. For development, build
locally via [maturin](https://www.maturin.rs/):

```bash
cd big-code-analysis-py
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev]"  # pulls maturin, pytest, mypy, ruff, pyright
maturin develop
python -c "import big_code_analysis; print(big_code_analysis.__version__)"
```

## Usage

```python
import big_code_analysis as bca

# Analyse a file by path. The returned dict matches the JSON
# emitted by `bca metrics --output-format json` for the same
# file at the `FuncSpace` boundary — same field order, same
# numeric formatting, same shape. Language detection mirrors the
# CLI (path extension, then shebang, then emacs `-*- mode -*-`).
# Pass `exclude_tests=True` to mirror `bca metrics --exclude-tests`
# (prunes Rust `#[test]` / `#[cfg(test)]` subtrees before metric
# computation). Generated files (`@generated`, `DO NOT EDIT`,
# `GENERATED CODE` markers) are skipped by default, matching the
# CLI walker — `analyze` returns `None` for them; pass
# `skip_generated=False` to opt out. See `bca.analyze.__doc__`
# for the full parity contract.
result = bca.analyze("src/main.rs")
if result is not None:
    print(result["metrics"]["cognitive"]["sum"])

# Analyse a Rust file with `#[test]` subtrees pruned out — same
# result as `bca metrics --exclude-tests --output-format json`.
prod_only = bca.analyze("src/main.rs", exclude_tests=True)

# Non-UTF-8 paths raise `ValueError` by default so the `name`
# field is always a round-trippable identifier. Pass
# `allow_lossy_path=True` to opt into the CLI's U+FFFD
# substitution behaviour (see `bca.analyze.__doc__` and #316).
# `weird_path` is a placeholder for such a path; it is not defined here.
lossy = bca.analyze(weird_path, allow_lossy_path=True)

# Force analysis of files marked `@generated` (default skips them).
forced = bca.analyze("third_party/generated.pb.go", skip_generated=False)

# Analyse an in-memory snippet (str, bytes, or bytearray accepted).
metrics = bca.analyze_source("fn main() {}\n", "rust")

# Language detection helpers. `language_for_file` reads the file
# and runs the same detection pipeline as `analyze` — path
# extension first, then shebang / emacs-mode fallback (#318) —
# so an extension-less script with a `#!/usr/bin/env python`
# leading line resolves the same way it would for `analyze`. The
# file is read on every call (parity with `analyze`), so the path
# must exist; I/O failures raise the same typed `OSError` subclass
# `analyze` does (`FileNotFoundError`, `PermissionError`, …). If
# you only need the cheap extension lookup (`.py` → `python`) and
# do not want the file read, use
# `bca.language_extensions("python")` and match the extension
# yourself.
assert bca.language_for_file("path/to/real/foo.py") == "python"
# Extension-less script with a `#!/usr/bin/env python` first line
# would resolve to "python" too (the asymmetry #318 closed).
assert "python" in bca.supported_languages()
assert "py" in bca.language_extensions("python")
```
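
For the extension-only lookup mentioned in the comments above, one option is to build a reverse map once and consult it without touching the file system. This is a sketch under assumptions: `build_extension_map` and `language_by_extension` are hypothetical helper names, extensions are taken to be dot-less strings (as `"py" in bca.language_extensions("python")` suggests), and the stand-in table replaces the real `bca` calls so the snippet runs on its own.

```python
from pathlib import PurePath


def build_extension_map(extensions_by_language):
    """Invert {language: [extension, ...]} into {extension: language}."""
    reverse = {}
    for language, extensions in extensions_by_language.items():
        for ext in extensions:
            # First language wins if two languages claim one extension.
            reverse.setdefault(ext, language)
    return reverse


def language_by_extension(path, extension_map):
    """Return the language for `path` by extension only, or None."""
    ext = PurePath(path).suffix.lstrip(".")
    return extension_map.get(ext)


# Stand-in data. With the bindings installed, the table could be built as
# {lang: bca.language_extensions(lang) for lang in bca.supported_languages()}.
ext_map = build_extension_map({"python": ["py"], "rust": ["rs"]})
print(language_by_extension("src/main.rs", ext_map))
```

Unlike `language_for_file`, this never reads the file, so the shebang and emacs-mode fallbacks do not apply.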

## Selecting metrics

Pass `metrics=[…]` to compute only a subset of the metric suite.
`metrics=None` (the default) computes every metric. Unrequested
metrics are **absent** from the result dict (not present with
`None` placeholders).

```python
import big_code_analysis as bca

# Compute only LoC and cyclomatic complexity.
result = bca.analyze("src/main.rs", metrics=["loc", "cyclomatic"])
assert result is not None
assert set(result["metrics"]) == {"loc", "cyclomatic"}

# Selecting a derived metric pulls its dependencies in automatically:
# `metrics=["mi"]` also computes loc, cyclomatic, and halstead.
mi_result = bca.analyze("src/main.rs", metrics=["mi"])
assert mi_result is not None
assert {"loc", "cyclomatic", "halstead", "mi"}.issubset(mi_result["metrics"])

# `bca.METRIC_NAMES` is a `tuple[str, ...]` enumerating every
# canonical name accepted by `metrics=` (alphabetised, lowercase).
assert "halstead" in bca.METRIC_NAMES
```

The same kwarg is honoured by `bca.analyze_source` and
`bca.analyze_batch` — the latter applies the selection uniformly to
every file in the batch. Validation runs *before* any file I/O: an
empty list or unknown name raises `ValueError` immediately and never
returns an `AnalysisError` slot for what is really a caller bug.

```python
# Compute only `cyclomatic` and `cognitive` across a batch.
results = bca.analyze_batch(
    ["src/a.py", "src/b.rs"],
    metrics=["cyclomatic", "cognitive"],
)
```

Names are case-sensitive lowercase; passing an unknown name raises
`ValueError` with the canonical list in the error message. The
`"exit"` Metric-Display spelling is accepted as an alias for the
canonical JSON-key spelling `"nexits"`; both produce a `"nexits"`
key in the output. Duplicates are silently collapsed.
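
The naming rules above can be mirrored on the caller side, for example to validate command-line input before any analysis starts. This is an illustrative sketch, not the library's implementation: `normalize_metric_names` is a hypothetical helper, and the `canonical` tuple is a stand-in for `bca.METRIC_NAMES`.

```python
def normalize_metric_names(names, canonical, aliases=None):
    """Validate metric names, resolve aliases, and drop duplicates in order."""
    aliases = {"exit": "nexits"} if aliases is None else aliases
    resolved = []
    for name in names:
        name = aliases.get(name, name)
        if name not in canonical:
            raise ValueError(
                f"unknown metric {name!r}; expected one of {sorted(canonical)}"
            )
        if name not in resolved:
            resolved.append(name)
    if not resolved:
        raise ValueError("metrics must not be empty")
    return resolved


# Stand-in subset; with the bindings installed pass bca.METRIC_NAMES instead.
canonical = ("cognitive", "cyclomatic", "loc", "nexits")
print(normalize_metric_names(["exit", "loc", "nexits"], canonical))
```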

## SARIF 2.1.0 output

`bca.to_sarif(result, *, thresholds=None)` renders an analysis
result (or an iterable of them) into a SARIF 2.1.0 JSON document
suitable for upload to GitHub Code Scanning or any other SARIF
consumer. The output is produced by the same Rust writer that
backs `bca check -O sarif`, so the schema URL, tool driver name /
version, and rule descriptions match the CLI byte-for-byte.

```python
import big_code_analysis as bca

# Single file → SARIF with a finding for every function whose
# cyclomatic complexity strictly exceeds 15 or whose logical
# line count (`loc.lloc`) strictly exceeds 200.
sarif = bca.to_sarif(
    bca.analyze("src/main.py"),
    thresholds={"cyclomatic": 15, "loc.lloc": 200},
)
with open("metrics.sarif", "w", encoding="utf-8") as fh:
    fh.write(sarif)

# Batch input — AnalysisError entries are skipped silently because
# they represent files we couldn't analyse, not findings.
batch = bca.analyze_batch(["src/a.py", "src/b.rs", "src/c.cpp"])
sarif = bca.to_sarif(batch, thresholds={"cognitive": 20})
```

Accepted threshold names mirror the CLI's `EXTRACTORS` table in
`big-code-analysis-cli/src/thresholds.rs` — e.g. `"cognitive"`,
`"cyclomatic"`, `"cyclomatic.modified"`, `"halstead.volume"`,
`"halstead.difficulty"`, `"halstead.effort"`, `"halstead.time"`,
`"halstead.bugs"`, `"loc.sloc"`,
`"loc.ploc"`, `"loc.lloc"`, `"loc.cloc"`, `"loc.blank"`, `"nom"`,
`"tokens"`, `"nexits"`, `"nargs"`, `"mi.original"`, `"mi.sei"`,
`"mi.visual_studio"`, `"abc"`, `"wmc"`, `"npm"`, `"npa"`. An
unknown name raises `ValueError` listing the accepted set, so a
typo fails fast instead of silently producing an empty SARIF run.

`thresholds=None` (the default) and `thresholds={}` both produce
a well-formed SARIF document with empty `results` and `rules`
arrays. This matches the CLI's posture: there are **no built-in
default thresholds**; every check run supplies its own limits.
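
Because `to_sarif` returns the document as a JSON string (the examples above write it straight to a file), it can be inspected with the standard `json` module before upload. The helper below is a minimal sketch, not part of the package: `count_findings_by_rule` is a hypothetical name, and it assumes each finding carries the standard SARIF `ruleId` property.

```python
import json
from collections import Counter


def count_findings_by_rule(sarif_text: str) -> Counter:
    """Count SARIF results per rule id across every run in the document."""
    document = json.loads(sarif_text)
    counts = Counter()
    for run in document.get("runs", []):
        for finding in run.get("results", []):
            # `ruleId` is the standard SARIF property naming the violated
            # rule; fall back to a placeholder if a producer omits it.
            counts[finding.get("ruleId", "<no ruleId>")] += 1
    return counts


# Hand-written stand-in for a document with no findings (not real bca output).
empty_doc = json.dumps({"version": "2.1.0", "runs": [{"results": []}]})
print(count_findings_by_rule(empty_doc))
```

An empty counter is the expected outcome whenever no thresholds were supplied.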

**Unit-level findings.** `to_sarif` emits file-scope (unit-space)
findings for every metric whose JSON headline at the unit space
matches the CLI's per-space accessor (`loc.*`, `halstead.*`,
`mi.*`, `nom`, `nargs`, `nexits`, `tokens`, `abc`, `wmc`, `npm`,
`npa`). The three exceptions — `cyclomatic`, `cyclomatic.modified`,
`cognitive` — are skipped at the unit level because the JSON only
exposes the aggregate `sum` across children while the CLI's
per-space accessor returns just the unit's own scalar; emitting
findings from the aggregate would diverge from the CLI for parent
spaces. Unit findings carry `logicalLocations: [{"fullyQualifiedName":
"<file>"}]`; nameless non-unit spaces (rare parse-failure case)
carry `"<unnamed>"` — both matching the CLI's `function_token`
placeholders.

### Upload to GitHub Code Scanning

```yaml
# .github/workflows/code-scanning.yml (excerpt)
- name: Compute metric SARIF
  run: |
    python - <<'PY'
    import big_code_analysis as bca
    with open("paths.txt", encoding="utf-8") as paths_fh:
        results = bca.analyze_batch(paths_fh.read().splitlines())
    with open("metrics.sarif", "w", encoding="utf-8") as fh:
        fh.write(bca.to_sarif(results, thresholds={"cyclomatic": 15}))
    PY
- name: Upload to Code Scanning
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: metrics.sarif
```

## Batch processing

`bca.analyze_batch(paths)` runs the same analysis as `bca.analyze`
over every path in an iterable and **never raises on per-file
errors**: each result slot is either an analysis `dict` or a
`bca.AnalysisError` describing the failure. The list has the same
length as the input and preserves order one-to-one, so callers
can `zip(inputs, results)` without losing the pairing.

```python
import big_code_analysis as bca

paths = ["src/a.py", "src/missing.py", "src/b.rs"]
results = bca.analyze_batch(paths)
for path, result in zip(paths, results):
    if isinstance(result, bca.AnalysisError):
        print(f"skipped {path}: ({result.error_kind}) {result.error}")
    else:
        process(result)  # placeholder for your own handling
```

The pattern above keeps `paths` and `results` as separate
materialised sequences. If you want to drive `analyze_batch` from
a generator (e.g. `glob.iglob('**/*.py')`) for memory efficiency,
materialise it into a list first. Otherwise
`zip(generator, analyze_batch(generator))` yields nothing, because
`analyze_batch` exhausts the generator before `zip` reads from it:

```python
import glob

paths = list(glob.iglob("src/**/*.py", recursive=True))
results = bca.analyze_batch(paths)
# now zip(paths, results) works
```
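
The exhaustion pitfall is plain Python iterator behaviour and can be reproduced without the bindings. In this sketch `fake_batch` is a hypothetical stand-in that, like `analyze_batch`, consumes its input and returns a list.

```python
def fake_batch(paths):
    """Stand-in for analyze_batch: consumes the iterable, returns a list."""
    return [f"analysed {p}" for p in paths]


def path_gen():
    yield from ("a.py", "b.py")


# Pitfall: both zip arguments are evaluated first, so fake_batch drains
# the generator before zip gets to read a single path from it.
gen = path_gen()
broken = list(zip(gen, fake_batch(gen)))

# Fix: materialise once, then reuse the list for both the call and the zip.
paths = list(path_gen())
paired = list(zip(paths, fake_batch(paths)))

print(broken)
print(paired)
```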

`bca.AnalysisError` is a frozen value type with `path: str`,
`error: str`, and `error_kind: Literal["UnsupportedLanguage",
"ParseError", "IoError"]`. It implements `__eq__`, `__hash__`,
and `__repr__`, so callers can put errors in a `set` to
deduplicate failures across runs. It is **not** an `Exception`
subclass — `analyze_batch` returns it, never raises it.
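
A sketch of the deduplication this enables. `FakeError` is a hypothetical stand-in built with `dataclasses` so the snippet runs without the extension module, and `new_failures` is an illustrative helper name; with the bindings installed the same set logic should apply to real `bca.AnalysisError` values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeError:
    """Stand-in mirroring the documented fields of bca.AnalysisError."""

    path: str
    error: str
    error_kind: str


def new_failures(previous, current):
    """Return failures in `current` that were not already in `previous`."""
    seen = set(previous)  # works because the values are hashable
    return [err for err in current if err not in seen]


# Messages are illustrative, not captured from a real run.
run_1 = [
    FakeError("src/missing.py", "No such file or directory (os error 2)", "IoError"),
]
run_2 = run_1 + [
    FakeError("src/notes.xyz", "unsupported language", "UnsupportedLanguage"),
]
print(new_failures(run_1, run_2))
```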

`analyze_batch` only raises on **programmer** errors: `TypeError`
for a non-iterable `paths` argument (or a non-path element
inside), and `ValueError` for an empty `metrics=` list or an
unknown metric name. The `metrics=` selection (see
[Selecting metrics](#selecting-metrics) above) applies uniformly
to every file in the batch; validation runs **before** the input
iterable's `__iter__` so a bad selection aborts without invoking
any side effects.

Generators are accepted and consumed lazily; materialising is only
needed when you pair inputs with results afterwards. There is no
built-in parallelism; the recommended pattern is
`concurrent.futures.ThreadPoolExecutor` around `bca.analyze` for
parallel single-file calls. `analyze_batch` also runs with the
`is_generated` walker filter **off** so every input position
yields either a `dict` or an `AnalysisError` (never `None`).
Call `bca.analyze(path)` per-file with the default
`skip_generated=True` if you need the CLI walker's skip behaviour.
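
The thread-pool pattern recommended above might look like the following sketch. `analyze_parallel` and `max_workers=4` are illustrative choices, not part of the package. The analysis function is passed in as a parameter so the snippet runs with a stub; with the bindings installed you would pass `bca.analyze`.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_parallel(analyze_fn, paths, max_workers=4):
    """Call analyze_fn(path) on a thread pool; return results in input order.

    Each slot holds either the return value or the exception that was
    raised, so one bad file does not abort the whole run.
    """

    def run_one(path):
        try:
            return analyze_fn(path)
        except Exception as exc:  # keep the failure in its slot
            return exc

    paths = list(paths)  # materialise so callers can zip against it later
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(run_one, paths))


def stub_analyze(path):
    """Stub used only for this demonstration."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot analyse {path}")
    return {"name": path}


results = analyze_parallel(stub_analyze, ["a.py", "b.bad", "c.rs"])
print(results)
```

With `bca.analyze` and its default `skip_generated=True`, a slot can also be `None` for a generated file, so check for that before reading metrics.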

## Flatten to records

`bca.flatten_spaces(result)` walks the nested `FuncSpace` tree in
pre-order and yields one flat, scalar-only `dict` per node — ready
for `sqlite3.executemany`, `pandas.DataFrame.from_records`, or any
other tabular consumer. Metric keys use the same dotted convention
as the CLI's CSV writer (`cyclomatic.modified.sum`,
`halstead.volume`, `loc.lloc_average`, …). Metric *columns* match
the CLI's `CSV_HEADER` set; the identity columns do not — CSV uses
`space_name` / `space_kind` and has no `parent_name` / `depth`,
while flat records use `name` / `kind` and add the parent / depth
pair. One metric also diverges: `tokens.*` flattens to the JSON
shape (`tokens.tokens`, `tokens.tokens_average`,
`tokens.tokens_min`, `tokens.tokens_max`), while CSV_HEADER renames
those columns to `tokens.sum` / `.average` / `.min` / `.max`.
Rename in the consumer if you need exact CSV alignment.

```python
import sqlite3
import big_code_analysis as bca

result = bca.analyze("src/lib.rs")
if result is None:  # generated/skipped file
    raise SystemExit("nothing to analyze")
records = list(bca.flatten_spaces(result))
columns = sorted({k for r in records for k in r})
# flatten_spaces keys come from a bounded alphabet (`.`, `_`,
# ASCII alnum), so f-string quoting is safe here. Sanitize if you
# ever build records by hand.
with sqlite3.connect("metrics.db") as db:
    cols = ", ".join(f'"{c}"' for c in columns)
    qs = ", ".join("?" for _ in columns)
    db.execute(f"CREATE TABLE m ({cols})")
    db.executemany(
        f"INSERT INTO m ({cols}) VALUES ({qs})",
        [tuple(r.get(c) for c in columns) for r in records],
    )
```
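
If exact CSV alignment matters, the column differences described at the top of this section can be applied per record. A minimal sketch: `CSV_RENAMES`, `FLAT_ONLY`, and `to_csv_columns` are hypothetical names, and every key not listed in the mapping is passed through unchanged.

```python
# Flat-record key -> CSV_HEADER column, per the differences listed above.
CSV_RENAMES = {
    "name": "space_name",
    "kind": "space_kind",
    "tokens.tokens": "tokens.sum",
    "tokens.tokens_average": "tokens.average",
    "tokens.tokens_min": "tokens.min",
    "tokens.tokens_max": "tokens.max",
}
# Identity columns that exist only in flat records.
FLAT_ONLY = {"parent_name", "depth"}


def to_csv_columns(record):
    """Return a copy of one flat record using the CSV column names."""
    return {
        CSV_RENAMES.get(key, key): value
        for key, value in record.items()
        if key not in FLAT_ONLY
    }


# Hand-built record for illustration (values are made up).
sample = {"name": "main", "kind": "function", "depth": 1,
          "parent_name": "lib.rs", "tokens.tokens": 42.0}
print(to_csv_columns(sample))
```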

The iterator is lazy and single-use: it walks the input once
without materialising the whole list, and a second iteration is
empty. Records always carry `path` (the analyzed file, or `None`
for `analyze_source`), `name`, `kind`, `start_line`, `end_line`,
`parent_name`, and `depth`. Anonymous spaces (Rust closures, JS
function expressions / arrows) keep their `name == "<anonymous>"`
marker verbatim — `flatten_spaces` does not normalize. Missing
metric subtrees produce no keys (absent, not `None`), matching the
"Halstead disabled" edge case for `metrics=` selection.

`parent_name` alone cannot disambiguate same-named siblings nested
under different parents (e.g. two `Inner` classes under two
different outer classes both surface as `parent_name == 'Inner'`
for their own children). Pair with `depth` and source-order
position, or rebuild the qualified name in your consumer, if you
need a fully-qualified path.
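
One way to rebuild a qualified name in the consumer is to track the ancestor chain while reading the records. This sketch assumes the records arrive in the pre-order that `flatten_spaces` produces and that `depth` grows by one per nesting level; `with_qualified_names`, the `qualified_name` key, and the `::` separator are hypothetical choices.

```python
def with_qualified_names(records, sep="::"):
    """Yield copies of pre-order flat records with a `qualified_name` key."""
    stack = []  # (depth, name) for each ancestor of the current record
    for record in records:
        depth = record["depth"]
        # Drop entries that are not ancestors of this record.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        names = [name for _, name in stack] + [record["name"]]
        yield {**record, "qualified_name": sep.join(str(n) for n in names)}
        stack.append((depth, record["name"]))


# Hand-built records: two `Inner` classes under different outer classes.
demo = [
    {"name": "f.py", "depth": 0},
    {"name": "A", "depth": 1},
    {"name": "Inner", "depth": 2},
    {"name": "run", "depth": 3},
    {"name": "B", "depth": 1},
    {"name": "Inner", "depth": 2},
    {"name": "run", "depth": 3},
]
for row in with_qualified_names(demo):
    print(row["qualified_name"])
```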

Don't mutate the input `result` while iterating: the walker keeps
references into it, so mutations to not-yet-yielded subtrees will
be observed in later records.

`flatten_spaces` raises `TypeError` if the input is not a mapping;
callers must filter `None` returns from `bca.analyze` (e.g. when
`skip_generated=True` matched a generated file) before passing.

## Errors

`bca.analyze` raises exceptions; `bca.analyze_batch` returns
`bca.AnalysisError` values inside the result list (never raised on
per-file failures — see the Batch processing section above).

Exception types raised by `bca.analyze` / `bca.analyze_source`:

- `bca.UnsupportedLanguageError` (subclass of `ValueError`) —
  raised when no supported language can be detected for a file, or when
  `analyze_source(..., language="...")` is passed an unknown
  language name.
- `bca.ParseError` (subclass of `ValueError`) — raised when the
  underlying tree-sitter parser fails on the supplied source.
- `ValueError` — raised by `bca.analyze` when the path is not
  valid UTF-8 and the default strict policy is in effect; pass
  `allow_lossy_path=True` to mirror the CLI's U+FFFD substitution
  via `Path::to_string_lossy` and accept the resulting
  non-round-trippable `name` field (#316).
- `OSError` — bubbled up from the underlying file-system read.
  Dispatches to the canonical subclass (`FileNotFoundError`,
  `PermissionError`, `IsADirectoryError`, …) based on `errno`,
  with `err.errno` and `err.filename` populated.
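
Because the typed subclasses carry `errno` and `filename`, callers can branch on the class and still log the numeric cause. A small sketch: `describe_os_error` is a hypothetical helper, and the demonstration uses Python's built-in `open()`, which raises the same `OSError` hierarchy, so it runs without the bindings.

```python
import errno


def describe_os_error(err: OSError) -> str:
    """Format an OSError as '<Subclass> [<errno name>] <filename>'."""
    code = errno.errorcode.get(err.errno, "?")
    return f"{type(err).__name__} [{code}] {err.filename}"


# Demonstrated with the built-in open(), which populates the same attributes.
try:
    open("/no/such/dir/missing.rs", encoding="utf-8")
except OSError as err:
    print(describe_os_error(err))
```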

Returned by `bca.analyze_batch` inside the result list:

- `bca.AnalysisError` — frozen value type with `path: str`,
  `error: str`, and `error_kind: Literal["UnsupportedLanguage",
  "ParseError", "IoError"]`. Not an `Exception` subclass.
  `error_kind` is a closed taxonomy: `"IoError"` covers both
  filesystem failures and the non-UTF-8 path case (kept at three
  kinds per the API contract); `"ParseError"` similarly covers
  internal JSON-serialisation failures of the resulting
  `FuncSpace` (rare; reserved upstream). The OS `errno` is
  preserved in the `error` string via Rust's `"<msg> (os error
  <N>)"` default formatting; parse with regex
  `r"\(os error (\d+)\)$"` if you need it for retry
  classification, or call `bca.analyze` per-file to get a typed
  `OSError` subclass instead.

## Type checking

The package ships PEP 561 type stubs (`py.typed` + `_native.pyi`).
`mypy --strict` and `pyright` should both pass cleanly against
client code.

## License

MPL-2.0 (matches the Rust library).

