Metadata-Version: 2.4
Name: diffhouse
Version: 2.0.3
Summary: Repository mining tool for structuring Git metadata at scale.
License-Expression: MIT
License-File: LICENSE.md
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Dist: packaging>=25.0
Requires-Dist: regex>=2025.9.18
Requires-Dist: validators>=0.35.0
Requires-Dist: xxhash>=3.6.0
Requires-Dist: pandas>=2.0.0 ; extra == 'pandas'
Requires-Dist: polars>=1.0.0 ; extra == 'polars'
Requires-Python: >=3.10
Project-URL: Documentation, https://vupdivup.github.io/diffhouse/
Project-URL: Repository, https://github.com/vupdivup/diffhouse
Provides-Extra: pandas
Provides-Extra: polars
Description-Content-Type: text/markdown

# diffhouse: Repository Mining at Scale

[![PyPI](https://img.shields.io/pypi/v/diffhouse)](https://pypi.org/project/diffhouse/) [![DOI](https://zenodo.org/badge/1052651155.svg)](https://doi.org/10.5281/zenodo.17368264) [![Test status](https://img.shields.io/github/actions/workflow/status/vupdivup/diffhouse/os-test.yml?label=tests&branch=main)](https://github.com/vupdivup/diffhouse/actions/workflows/os-test.yml)

[Documentation](https://vupdivup.github.io/diffhouse/)

<!-- home-start -->

diffhouse is a **Python solution for structuring Git metadata**, designed to enable
large-scale codebase analysis at practical speeds.

Key features are:

- 🚀 Fast access to commit data, file changes and more
- 📊 Easy integration with pandas and Polars
- 🐍 Simple-to-use Python interface

## Performance

<p align="center">
  <img src="https://raw.githubusercontent.com/vupdivup/diffhouse/assets/benchmarks/benchmark_tweenjs.png" alt="tweenjs/tween.js benchmark results" width="480px">
  <br/>
  <em>Processing times for <a href="https://github.com/tweenjs/tween.js">tween.js</a>. Lower is better.</em>
</p>

For more details, see [benchmarks](https://vupdivup.github.io/diffhouse/benchmarks/).

## Requirements

<table>
    <tr>
        <td><strong>Python</strong></td>
        <td>3.10 or higher</td>
    </tr>
    <tr>
        <td><strong>Git</strong></td>
        <td>2.22 or higher</td>
    </tr>
</table>

Git also needs to be added to the system PATH.
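
Both Git requirements can be checked ahead of time with a short script. The sketch below uses only the Python standard library; the helper names `parse_git_version` and `git_meets_minimum` are illustrative and not part of diffhouse.

```python
import re
import shutil
import subprocess

MIN_GIT = (2, 22)  # minimum Git version listed in the table above


def parse_git_version(output: str) -> tuple[int, int]:
    """Extract (major, minor) from the output of `git --version`."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized version string: {output!r}")
    return int(match.group(1)), int(match.group(2))


def git_meets_minimum() -> bool:
    """Return True if Git is on PATH and at least MIN_GIT."""
    if shutil.which("git") is None:
        return False  # Git is not on the system PATH
    try:
        result = subprocess.run(
            ["git", "--version"], capture_output=True, text=True, check=True
        )
        return parse_git_version(result.stdout) >= MIN_GIT
    except (OSError, subprocess.CalledProcessError, ValueError):
        return False  # Git could not be run or reported something unexpected


if __name__ == "__main__":
    print("Git OK" if git_meets_minimum() else "Git missing or too old")
```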

## Limitations

At its core, diffhouse is a data *extraction* tool and therefore does not calculate software metrics like code churn or cyclomatic complexity. If you need such metrics, take a look at [PyDriller](https://github.com/ishepard/pydriller) instead.

<!-- home-end -->

## User Guide

<!-- user-guide-start -->

This guide aims to cover the basic use cases of diffhouse. For a full list of objects, consider reading the
[API Reference](https://vupdivup.github.io/diffhouse/reference).

### Installation

Install diffhouse from PyPI:

```sh
pip install diffhouse
```

#### Optional Dependencies

If you plan to combine diffhouse with pandas or Polars, install the package with their respective extras:

<table>
    <tr>
        <td>pandas</td>
        <td><code>pip install diffhouse[pandas]</code></td>
    </tr>
    <tr>
        <td>Polars</td>
        <td><code>pip install diffhouse[polars]</code></td>
    </tr>
</table>

### Quickstart

```py
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    for c in r.commits:
        print(c.commit_hash[:10], c.date, c.author_email)

    if len(r.branches.to_list()) > 100:
        print('🎉')

    df = r.diffs.to_pandas()
```

To start, create a [`Repo`](https://vupdivup.github.io/diffhouse/reference/repo/) instance by passing either a Git-hosting URL or a local path as its `source` argument. Next, use the `Repo` in a `with` statement to clone the source into a local, non-persistent
location.

Inside the `with` block, you can access data through the following properties:

| Property | Description | Record Type |
| --- | --- | --- |
| [`Repo.commits`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.commits) | Commit history of the repository. | [`Commit`](https://vupdivup.github.io/diffhouse/reference/commit/) |
| [`Repo.filemods`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.filemods) | File modifications across the commit history. | [`FileMod`](https://vupdivup.github.io/diffhouse/reference/filemod/) |
| [`Repo.diffs`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.diffs) | Source code changes across the commit history. | [`Diff`](https://vupdivup.github.io/diffhouse/reference/diff/) |
| [`Repo.branches`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.branches) | Branches of the repository. | [`Branch`](https://vupdivup.github.io/diffhouse/reference/branch/) |
| [`Repo.tags`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.tags) | Tags of the repository. | [`Tag`](https://vupdivup.github.io/diffhouse/reference/tag/) |

### Querying Results

Data accessors like `Repo.commits` are [`Extractor`](https://vupdivup.github.io/diffhouse/reference/extractor/) objects and can output their results in various formats:

#### Looping Through Objects

You can use extractors in a `for` loop to process objects one by one. Data will be extracted on demand for memory efficiency:

```py
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    for c in r.commits:
        print(c.commit_hash[:10])
        print(c.author_name)

        if c.in_main:
            break
```
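
Since extractors work in a `for` loop, standard iterator tools such as `itertools.islice` should be able to cap how many records are consumed. The sketch below uses a hypothetical stand-in generator, `fake_commits`, in place of `r.commits` so that it runs without a repository. With diffhouse, the equivalent call would presumably be `islice(r.commits, 3)` inside the `with` block, assuming the extractor behaves like any other iterable.

```python
from itertools import islice


def fake_commits():
    """Hypothetical stand-in for r.commits: yields records lazily."""
    for i in range(1_000_000):
        yield {"commit_hash": f"{i:040x}"}


# islice stops pulling from the generator after three records,
# so the remaining records are never produced
first_three = list(islice(fake_commits(), 3))

for c in first_three:
    print(c["commit_hash"][:10])
```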

`iter_dicts()` is a `for` loop alternative that yields dictionaries instead of diffhouse objects. A good use case for this is writing results into a newline-delimited JSON file:

```py
import json

from diffhouse import Repo

with (
    Repo('https://github.com/user/repo') as r,
    open('commits.jsonl', 'w') as f
):
    for c in r.commits.iter_dicts():
        f.write(json.dumps(c) + '\n')
```
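
Reading such a file back is straightforward, since each line is an independent JSON document. The sketch below is standard-library only. The sample records are made up for illustration and do not reflect the full `Commit` schema, and `read_jsonl` is a hypothetical helper, not a diffhouse function.

```python
import json
import os
import tempfile

# Made-up sample records standing in for diffhouse output
sample = [
    {"commit_hash": "a" * 40, "author_name": "Ada"},
    {"commit_hash": "b" * 40, "author_name": "Grace"},
]

path = os.path.join(tempfile.mkdtemp(), "commits.jsonl")

# Write one JSON document per line
with open(path, "w", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Yield one parsed record per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


records = list(read_jsonl(path))
```

Because `read_jsonl` is a generator, large files can be processed record by record without loading everything into memory.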

#### Converting to Dataframes

pandas and Polars `DataFrame` APIs are supported out of the box. To convert result sets to dataframes, call the following methods:

- `to_pandas()` or `pd()` for pandas
- `to_polars()` or `pl()` for Polars

```py
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    df1 = r.filemods.to_pandas()  # pandas
    df2 = r.diffs.to_polars()  # Polars
```

### Preliminary Filtering

You can filter data along certain dimensions *before* processing takes place to reduce extraction time and/or network load.

> [!NOTE]
> Filters are a WIP feature. Additional options like date and branch filtering are planned for future releases.

#### Skipping File Downloads

If no blob-level data is needed, pass `blobs=False` when creating the `Repo` to skip file downloads during cloning. Note that the following will then not be populated:

- `files_changed`, `lines_added` and `lines_deleted` fields of `Repo.commits`
- `Repo.filemods`
- `Repo.diffs`

```py
from diffhouse import Repo

with Repo('https://github.com/user/repo', blobs=False) as r:
    for b in r.branches:
        pass  # business as usual

    r.filemods  # raises FilterError
```

<!-- user-guide-end -->
