Metadata-Version: 2.4
Name: repo-people
Version: 0.2.0
Summary: Collect and export full GitHub user profile data for everyone associated with a repository.
License: MIT
License-File: LICENSE
Keywords: github,users,contributors,stargazers,watchers,maintainers,python,pypi,csv,json,github-api
Author: AJ McKenna
Author-email: amckenna41@qub.ac.uk
Maintainer: AJ McKenna
Maintainer-email: amckenna41@qub.ac.uk
Requires-Python: >=3.9,<4.0
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: Free For Educational Use
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: PyGithub (>=2.0.0,<3.0.0)
Requires-Dist: aiohttp (>=3.9,<4.0)
Requires-Dist: beautifulsoup4 (>=4.12.0,<5.0.0)
Requires-Dist: requests (>=2.31.0,<3.0.0)
Project-URL: repository, https://github.com/amckenna41/repo-people
Description-Content-Type: text/markdown

# repo-people

[![PyPI version](https://badge.fury.io/py/repo-people.svg)](https://badge.fury.io/py/repo-people)
[![Platforms](https://img.shields.io/badge/platforms-linux%2C%20macOS%2C%20Windows-green)](https://pypi.org/project/repo-people/)
[![PythonV](https://img.shields.io/pypi/pyversions/repo-people?logo=2)](https://pypi.org/project/repo-people/)
[![Documentation Status](https://readthedocs.org/projects/repo-people/badge/?version=latest)](https://repo-people.readthedocs.io/en/latest/?badge=latest)
[![License: MIT](https://img.shields.io/badge/License-MIT-red.svg)](https://opensource.org/licenses/MIT)
[![Issues](https://img.shields.io/github/issues/amckenna41/repo-people)](https://github.com/amckenna41/repo-people/issues)
[![codecov](https://codecov.io/gh/amckenna41/repo-people/branch/master/graph/badge.svg?token=4PQDVGKGYN)](https://codecov.io/gh/amckenna41/repo-people)

<p align="center">
  <img src="https://raw.githubusercontent.com/amckenna41/repo-people/refs/heads/main/images/logo.png" alt="repo-people logo" width="300"/>
</p>

**repo-people** is a Python package that collects and exports the full GitHub profile for every person associated with a repository — contributors, maintainers, stargazers, watchers, issue/PR authors, fork owners, commit authors and dependents.


## Table of Contents
  * [Introduction](#introduction)
  * [Background](#background)
  * [Requirements](#requirements)
  * [Installation](#installation)
  * [Documentation](#documentation)
  * [Usage](#usage)
  * [Directories](#directories)
  * [Issues](#issues)
  * [Contact](#contact)
  * [License](#license)

---

## Introduction

**repo-people** provides a single-call pipeline to collect every GitHub user associated with a repository across 9 role categories, fetch 30+ profile fields for each person from the GitHub API, and export the results to JSON, CSV, or Markdown. It is designed for research, open-source community analysis, and developer intelligence workflows.

Key capabilities:
- Collects users from **9 role categories** in a single call
- Fetches **30+ profile fields** per user (bio, location, company, followers, orgs, languages, …)
- Computes derived metrics: account age, followers/following ratio, repos/year, recently-active flag, bot detection
- Incremental fetch with `save_each_iteration` and `resume` — safe to interrupt and restart on large repos
- Flexible filtering: `roles`, `exclude`, `exclude_bots`, `limit`, `fields`
- Concurrent fetching via `workers` — uses `ThreadPoolExecutor` to fetch multiple profiles in parallel
- Async fetching via `get_users_async()` — uses `asyncio` + `aiohttp` for high-concurrency scenarios
- Opt-in social accounts via `include_social_accounts` — fetches linked LinkedIn, Mastodon, npm, and other accounts
- Export to **JSON**, **CSV** and **Markdown** table
- Analysis helpers: `summarise()` and `top_users()`
- Token validated on startup — invalid or expired tokens raise `ConnectionError` immediately
- Rate-limit progress printed every 50 users with remaining request count and reset time
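
As a rough illustration of the derived metrics listed above, the sketch below computes three of them from raw profile fields. The `derive_metrics` helper and its formulas are assumptions made for this example, not the package's actual implementation; only the field names are taken from the Output Fields table.

```python
from datetime import datetime, timezone


def derive_metrics(profile: dict, now: datetime) -> dict:
    """Compute a few derived metrics from a raw profile dict (hypothetical helper)."""
    created = datetime.fromisoformat(profile["created_at"])
    age_days = (now - created).days
    # Guard against division by zero for accounts created today
    years = max(age_days, 1) / 365.25
    following = profile["following"]
    # Assumption: fall back to the raw follower count when following == 0
    ratio = profile["followers"] / following if following else float(profile["followers"])
    return {
        "account_age_days": age_days,
        "followers_following_ratio": ratio,
        "repos_per_year": round(profile["public_repos"] / years, 2),
    }


profile = {
    "created_at": "2020-01-01T00:00:00+00:00",
    "followers": 120,
    "following": 40,
    "public_repos": 30,
}
metrics = derive_metrics(profile, now=datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Passing `now` explicitly keeps the calculation reproducible, which is convenient when comparing snapshots taken on different days.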

---

## Background

Understanding who contributes to, uses, and maintains an open-source project is valuable for community health analysis, academic research, and competitive intelligence. GitHub exposes this information across many endpoints (contributors, stargazers, watchers, forks, issues, pull requests, CODEOWNERS, commit history), but collecting and joining it requires many paginated API calls.

**repo-people** automates that collection, deduplicates users across all roles, enriches each record with the full GitHub profile, and computes additional signals (account age, activity recency, bot detection) in a single pipeline call.
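
The deduplication step can be pictured as merging per-role username lists into one mapping. This is a minimal sketch under that assumption; `merge_roles` is a hypothetical helper and does not reflect how the package is implemented internally.

```python
from collections import defaultdict


def merge_roles(role_members: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each unique username to the sorted list of roles it appears under."""
    merged = defaultdict(set)
    for role, usernames in role_members.items():
        for username in usernames:
            # GitHub logins are case-insensitive, so normalise before merging
            merged[username.lower()].add(role)
    return {user: sorted(roles) for user, roles in merged.items()}


role_members = {
    "contributors": ["alice", "bob"],
    "stargazers": ["Alice", "carol"],
    "issue_authors": ["bob"],
}
users = merge_roles(role_members)
```

Here `alice` appears once in the result with both roles, so her profile would only need to be fetched a single time.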

---

## Requirements

- **[Python](https://www.python.org/)** ^3.9
- **[PyGithub](https://pygithub.readthedocs.io/en/latest/)** ^2.0.0 — GitHub API client
- **[requests](https://requests.readthedocs.io/en/latest/)** ^2.31.0 — HTTP requests for REST endpoints
- **[beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/)** ^4.12.0 — HTML scraping for dependents
- **[aiohttp](https://docs.aiohttp.org/en/stable/)** ^3.9 — async HTTP client for `get_users_async()`

A GitHub personal access token is strongly recommended. Unauthenticated requests are limited to 60/hour; authenticated requests allow 5,000/hour.
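
To see what these limits mean in practice, the following back-of-the-envelope sketch estimates how many hourly rate-limit windows a run would span. The figure of 3 API calls per profile is an assumption chosen purely for illustration; the real number depends on which fields are fetched.

```python
import math


def hours_needed(n_users: int, calls_per_user: int, limit_per_hour: int) -> int:
    """Number of hourly rate-limit windows needed to make all requests."""
    return math.ceil(n_users * calls_per_user / limit_per_hour)


# Assumption: 3 API calls per profile, for illustration only
unauthenticated = hours_needed(1000, calls_per_user=3, limit_per_hour=60)
authenticated = hours_needed(1000, calls_per_user=3, limit_per_hour=5000)
```

Under these assumptions, 1,000 users would take 50 hourly windows without a token and fit in a single window with one.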

---

## Installation

Install the latest version of `repo-people` from [PyPI][PyPi] using pip:

```bash
pip3 install repo-people --upgrade
```

Installation from source:
```bash
git clone -b main https://github.com/amckenna41/repo-people.git
cd repo-people
pip3 install .
```


---

## Documentation

- [Read the Docs](https://repo-people.readthedocs.io/en/latest/) — full package documentation
- [FIELDS.md](FIELDS.md) — full reference table of all 48 output fields with descriptions
- [CHANGELOG.md](CHANGELOG.md) — version history and release notes

---

## Usage

### Quick Start

```python
from repo_people import RepoPeople

rp = RepoPeople("owner", "repo", token="ghp_...")
user_data = rp.get_users(export=True)
# Returns a dict keyed by username, with 30+ profile fields per user
```

### How to get a GitHub Personal Access Token

1. Sign in to [github.com](https://github.com) and go to **Settings** → **Developer settings** → **Personal access tokens** → **Tokens (classic)**.
2. Click **Generate new token (classic)**.
3. Give the token a descriptive name and set an expiration date.
4. Select the following scopes:
   - `repo`: access to repository metadata, contributors, and collaborators (on classic tokens this scope grants full read and write access to repositories, so treat the token as sensitive)
   - `read:user`: read user profile data
   - `read:org`: read organisation membership (needed for `public_orgs`)
5. Click **Generate token** and copy it immediately; it won't be shown again.
6. Store it securely (e.g. in an environment variable or a secrets manager) and pass it via the `token` parameter:

```python
import os
from repo_people import RepoPeople

rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])
```

> **Tip:** Unauthenticated requests are limited to 60/hour. Authenticated requests allow 5,000/hour, making a token essential for any non-trivial repo.

### Authentication

```python
import os
from repo_people import RepoPeople

rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])
```

The token is validated immediately on construction — an invalid or expired token raises `ConnectionError` before any collection begins.
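
Indexing `os.environ` directly raises `KeyError` when the variable is unset. A small loader such as the one below is one way to handle that case more gently; `load_token` is a hypothetical helper for this example, not part of the package.

```python
import os
from typing import Mapping, Optional


def load_token(env: Mapping[str, str] = os.environ, name: str = "GITHUB_TOKEN") -> Optional[str]:
    """Return the token from the environment, or None if unset or blank."""
    token = env.get(name, "").strip()
    return token or None


present = load_token({"GITHUB_TOKEN": " ghp_example "})
missing = load_token({})
```

The result could then be passed as `token=`, with the constructor call wrapped in `try`/`except ConnectionError` to report a rejected token cleanly.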

### `RepoPeople()` Constructor

```python
RepoPeople(owner, repo, token=None, outdir=None, skip_codeowners=False, skip_collaborators=False)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `owner` | `str` | — | GitHub username or organisation that owns the repo. |
| `repo` | `str` | — | Repository name. |
| `token` | `str \| None` | `None` | Personal access token. Strongly recommended — validated immediately on init; raises `ConnectionError` for invalid tokens. |
| `outdir` | `str \| None` | `"{owner}_{repo}"` | Leaf directory inside `outputs/`. All output files are written under `outputs/{outdir}/`. |
| `skip_codeowners` | `bool` | `False` | Skip CODEOWNERS file when collecting maintainers. |
| `skip_collaborators` | `bool` | `False` | Skip repo collaborators when collecting maintainers. |

### `get_users()` Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `export` | `bool` | `False` | Write results to a JSON file. |
| `export_csv` | `bool` | `False` | Write results to a CSV file. |
| `save_each_iteration` | `bool` | `False` | Save after every single user fetch. |
| `limit` | `int \| None` | `None` | Cap the number of profiles to fetch. |
| `roles` | `list[str] \| None` | `None` (all 9) | Restrict which roles to collect. |
| `exclude` | `list[str] \| None` | `None` | Usernames to skip. |
| `exclude_bots` | `bool` | `False` | Skip bot accounts automatically. |
| `resume` | `bool` | `False` | Skip users already in the output file. |
| `verbose` | `bool` | `True` | Print progress to stdout. |
| `fields` | `list[str] \| str \| None` | `None` (all) | Restrict which fields appear in output. Invalid names raise `ValueError` before any fetch. |
| `include_social_accounts` | `bool` | `False` | Fetch each user's linked social accounts (LinkedIn, Mastodon, npm, …). Costs one extra API call per user. |
| `workers` | `int` | `1` | Number of concurrent fetch threads. Increase for faster collection on large repos. |

Valid `roles` values: `contributors`, `maintainers`, `stargazers`, `watchers`, `issue_authors`, `pr_authors`, `fork_owners`, `commit_authors`, `dependents`.
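
How the package reacts to an unrecognised role name is not stated here, so it may be worth checking role names up front. The sketch below does that against the list above; `check_roles` is a hypothetical helper.

```python
VALID_ROLES = {
    "contributors", "maintainers", "stargazers", "watchers", "issue_authors",
    "pr_authors", "fork_owners", "commit_authors", "dependents",
}


def check_roles(roles: list[str]) -> list[str]:
    """Return the roles unchanged, or raise ValueError naming any unknown ones."""
    unknown = sorted(set(roles) - VALID_ROLES)
    if unknown:
        raise ValueError(f"Unknown roles: {', '.join(unknown)}")
    return roles


roles = check_roles(["contributors", "stargazers"])
```

The validated list can then be passed as the `roles` argument of `get_users()`.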

### Examples

#### Filter by role

```python
# Only gather contributors and stargazers
user_data = rp.get_users(roles=["contributors", "stargazers"])
```

#### Limit, exclude, and skip bots

```python
user_data = rp.get_users(
    limit=100,
    exclude=["dependabot", "github-actions[bot]"],
    exclude_bots=True,
)
```
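
The rules behind `exclude_bots` are not described here. As an illustration of the general idea, a simple heuristic might look like the sketch below; `looks_like_bot` and the `KNOWN_BOTS` list are assumptions for this example, not the package's detection logic.

```python
KNOWN_BOTS = {"dependabot", "renovate"}  # illustrative list, not the package's


def looks_like_bot(login: str, account_type: str = "User") -> bool:
    """Heuristic: GitHub App accounts have type 'Bot' or a '[bot]' login suffix."""
    login = login.lower()
    return account_type == "Bot" or login.endswith("[bot]") or login in KNOWN_BOTS


logins = ["alice", "github-actions[bot]", "dependabot", "bob"]
humans = [login for login in logins if not looks_like_bot(login)]
```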

#### Export to JSON and CSV

```python
user_data = rp.get_users(export=True, export_csv=True)
```

#### Export to Markdown table

```python
rp.export_to_markdown(user_data, fields=["login", "name", "location", "followers"])
```
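
For readers curious what such an export involves, here is a minimal sketch that renders selected fields as a Markdown table. `to_markdown` is a hypothetical stand-in and its output layout is an assumption; the file written by `export_to_markdown()` may differ.

```python
def to_markdown(user_data: dict, fields: list[str]) -> str:
    """Render selected fields of each user record as a Markdown table."""
    lines = ["| " + " | ".join(fields) + " |", "|" + "---|" * len(fields)]
    for record in user_data.values():
        # None becomes an empty cell; pipes are escaped so they don't break the row
        cells = [
            "" if record.get(f) is None else str(record[f]).replace("|", "\\|")
            for f in fields
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


sample = {
    "alice": {"login": "alice", "name": "Alice", "followers": 10},
    "bob": {"login": "bob", "name": None, "followers": 0},
}
table = to_markdown(sample, ["login", "name", "followers"])
```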

#### Resume an interrupted run

```python
# First run
rp.get_users(save_each_iteration=True, export=True)

# Resume after interruption
rp.get_users(save_each_iteration=True, export=True, resume=True)
```
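
Conceptually, combining `save_each_iteration` with `resume` amounts to checkpointing after every fetch and skipping anything already on disk. The sketch below shows that pattern with a stand-in fetch function; `fetch_with_resume`, `fake_fetch`, and the file layout are assumptions, not the package's internals.

```python
import json
import tempfile
from pathlib import Path


def fetch_with_resume(usernames, fetch, out_path: Path) -> dict:
    """Fetch each user once, saving after every fetch and skipping saved users."""
    results = json.loads(out_path.read_text()) if out_path.exists() else {}
    for username in usernames:
        if username in results:
            continue  # already fetched in a previous run
        results[username] = fetch(username)
        out_path.write_text(json.dumps(results))  # checkpoint after every user
    return results


calls = []


def fake_fetch(username):
    """Stand-in for a real profile request; records which users were fetched."""
    calls.append(username)
    return {"login": username}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "users.json"
    fetch_with_resume(["alice", "bob"], fake_fetch, path)  # first run
    data = fetch_with_resume(["alice", "bob", "carol"], fake_fetch, path)  # resumed
```

On the second call only `carol` is fetched, because `alice` and `bob` are already in the saved file.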

#### Concurrent fetching

```python
# Speed up large repos by fetching profiles in parallel
user_data = rp.get_users(workers=4)
```
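
The general pattern behind thread-based fetching is shown below with a stub in place of the network call. `fetch_profile` and `fetch_all` are hypothetical names for this sketch; it illustrates `ThreadPoolExecutor` usage rather than the package's own code.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_profile(username: str) -> dict:
    """Stand-in for a blocking profile request (hypothetical)."""
    return {"login": username, "name": username.title()}


def fetch_all(usernames: list[str], workers: int = 4) -> dict:
    """Fetch profiles in parallel; executor.map preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        profiles = executor.map(fetch_profile, usernames)
        return {p["login"]: p for p in profiles}


profiles = fetch_all(["alice", "bob", "carol"], workers=2)
```

Threads help here because the work is I/O-bound; more workers also consume the hourly rate limit faster.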

#### Async fetching

```python
import asyncio

user_data = asyncio.run(rp.get_users_async(concurrency=10))
```
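
A common way to bound concurrency in `asyncio` code is a semaphore. The sketch below shows that pattern with a placeholder instead of a real `aiohttp` request; the function names are assumptions, and whether the package uses a semaphore internally is not stated here.

```python
import asyncio


async def fetch_profile(username: str, semaphore: asyncio.Semaphore) -> dict:
    """Stand-in for an HTTP request; the semaphore caps in-flight calls."""
    async with semaphore:
        await asyncio.sleep(0)  # placeholder for network I/O
        return {"login": username}


async def fetch_all(usernames: list[str], concurrency: int = 10) -> dict:
    """Run all fetches concurrently, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(*(fetch_profile(u, semaphore) for u in usernames))
    return {r["login"]: r for r in results}


profiles = asyncio.run(fetch_all(["alice", "bob"], concurrency=1))
```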

#### Include social accounts

```python
user_data = rp.get_users(include_social_accounts=True)
# Each record gains a 'social_accounts' dict, e.g. {'linkedin': 'https://linkedin.com/in/...'}
```

#### Dot-notation field access

`get_users()` returns a `UserDataView` — a plain `dict` subclass that additionally supports dot notation to extract a single field across every user at once:

```python
user_data = rp.get_users()

# Extract one field for all users
emails    = user_data.email_public
# {"alice": {"email_public": "alice@example.com"}, "bob": {"email_public": ""}, ...}

locations = user_data.location
followers = user_data.followers
roles     = user_data.roles
```

All standard `dict` operations still work unchanged. Accessing an unrecognised field name raises `AttributeError` listing the valid field names.
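
The behaviour described above can be reproduced with a small `dict` subclass that overrides `__getattr__`. The `FieldView` class below is a simplified sketch written for this example, not the package's `UserDataView` source.

```python
class FieldView(dict):
    """Dict subclass where attribute access extracts one field for every user."""

    def __getattr__(self, field: str) -> dict:
        # Only called when normal attribute lookup fails, so dict methods still work
        valid = sorted({key for record in self.values() for key in record})
        if field not in valid:
            raise AttributeError(f"Unknown field {field!r}; valid fields: {valid}")
        return {user: {field: record.get(field)} for user, record in self.items()}


view = FieldView({
    "alice": {"email_public": "alice@example.com", "followers": 12},
    "bob": {"email_public": "", "followers": 3},
})
emails = view.email_public
```

Because `__getattr__` is only a fallback, `view["bob"]`, `view.items()`, and other standard operations behave exactly as on a plain `dict`.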

#### Analysis helpers

```python
stats = rp.summarise(user_data, top_n=5)
# {'total': 134, 'top_locations': [('San Francisco', 18), ...], ...}

leaders = rp.top_users(user_data, n=10, by="followers")
```
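
To give a sense of what these helpers compute, the sketch below reimplements a ranking and a location count on a plain dict. `top_by` and `top_locations` are hypothetical names, and their return shapes are assumptions that may differ from `top_users()` and `summarise()`.

```python
from collections import Counter


def top_by(user_data: dict, n: int, by: str) -> list:
    """Return the n (username, value) pairs with the largest value of `by`."""
    pairs = [(user, rec.get(by) or 0) for user, rec in user_data.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


def top_locations(user_data: dict, top_n: int) -> list:
    """Count non-empty locations and return the most common ones."""
    counts = Counter(rec["location"] for rec in user_data.values() if rec.get("location"))
    return counts.most_common(top_n)


sample = {
    "alice": {"followers": 50, "location": "Belfast"},
    "bob": {"followers": 200, "location": "Belfast"},
    "carol": {"followers": 75, "location": None},
}
leaders = top_by(sample, n=2, by="followers")
locations = top_locations(sample, top_n=1)
```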

### Output Fields

Each user entry contains 30+ fields. See [FIELDS.md](FIELDS.md) for the full reference. A summary by category:

| Category | Fields |
|---|---|
| Identity | `login`, `name`, `company`, `location`, `email_public`, `blog`, `twitter`, `bio` |
| Timestamps | `created_at`, `updated_at` |
| Counters | `followers`, `following`, `public_repos`, `public_gists` |
| Flags | `has_public_email`, `has_blog`, `has_twitter`, `is_bot`, `hireable` |
| Computed | `account_age_days`, `followers_following_ratio`, `repos_per_year`, `recently_active`, `last_public_event_at` |
| Organisations | `public_orgs`, `orgs_public_count` |
| Sampled | `top_languages`, `total_public_stars_sampled`, `total_public_forks_sampled`, `ssh_keys_count`, `gpg_keys_count`, `starred_repos_sampled` |
| Social | `social_accounts` (opt-in via `include_social_accounts`) |
| Repo-specific | `is_collaborator`, `permission_on_repo` |
| Metadata | `roles` (populated by `get_users()`) |
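
Restricting output with `fields` can be thought of as validating the requested names and then trimming each record. The sketch below illustrates this; `select_fields` is a hypothetical helper, and treating a plain string as a single field name is an assumption about how the `str` form is interpreted.

```python
def select_fields(user_data: dict, fields, valid_fields: set) -> dict:
    """Keep only the requested fields in every record, validating names first."""
    if isinstance(fields, str):
        fields = [fields]  # assumption: a string means one field name
    invalid = sorted(set(fields) - valid_fields)
    if invalid:
        raise ValueError(f"Invalid field names: {invalid}")
    return {
        user: {f: record.get(f) for f in fields}
        for user, record in user_data.items()
    }


VALID = {"login", "name", "location", "followers"}  # small subset for illustration
sample = {
    "alice": {"login": "alice", "name": "Alice", "location": "Belfast", "followers": 10},
}
trimmed = select_fields(sample, ["login", "followers"], VALID)
```

Validating before any work is done mirrors the documented behaviour of raising `ValueError` ahead of the first fetch.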

---

## Directories

```
repo-people/
├── repo_people/          # Package source
│   ├── __init__.py
│   ├── repo_people.py    # RepoPeople class — main pipeline
│   ├── export.py         # Role-specific username collectors (9 functions)
│   ├── users.py          # GitHubUserInfo wrapper and UserSnapshot dataclass
│   └── utils.py          # Shared helpers: paginate(), _headers(), write_csv()
├── tests/                # Unit and integration tests
│   ├── test_repo_people.py
│   ├── test_export.py
│   └── test_users.py
├── docs/                 # Sphinx documentation source
├── outputs/              # Default output directory (created at runtime)
├── FIELDS.md             # Full output field reference
├── CHANGELOG.md          # Version history
├── pyproject.toml        # Package metadata and dependencies
└── README.md
```


## Issues
Any issues, errors or bugs can be raised via the [Issues](https://github.com/amckenna41/repo-people/issues) tab in the repository.

## Contact
If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the [Issues][Issues] tab. <br><br>

## License
Distributed under the MIT License. See [`LICENSE`][license] for more details. 



[<img src="https://img.shields.io/github/stars/amckenna41/repo-people?color=green&label=star%20it%20on%20GitHub" width="132" height="20" alt="Star it on GitHub">](https://github.com/amckenna41/repo-people)


<a href="https://www.buymeacoffee.com/amckenna41" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>

[Back to top](#TOP)

[PyPi]: https://pypi.org/project/repo-people
[Issues]: https://github.com/amckenna41/repo-people/issues
[license]: https://github.com/amckenna41/repo-people/blob/master/LICENSE

