Metadata-Version: 2.4
Name: jobhive-py
Version: 0.1.0
Summary: 3.3M+ live jobs from 400,000+ companies, scraped directly from ATS sources. The open dataset and toolkit for job market data.
Project-URL: Homepage, https://github.com/stapply-ai/ats-scrapers
Project-URL: Documentation, https://github.com/stapply-ai/ats-scrapers#readme
Project-URL: Repository, https://github.com/stapply-ai/ats-scrapers
Project-URL: Issues, https://github.com/stapply-ai/ats-scrapers/issues
Project-URL: Changelog, https://github.com/stapply-ai/ats-scrapers/blob/main/CHANGELOG.md
Project-URL: Dataset, https://storage.stapply.ai/jobhive/v1/manifest.json
Author-email: Kalil Bouzigues <kalil.bouzigues@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ashby,ats,data,greenhouse,hiring,job-scraper,job-search,jobs,lever,recruiting,smartrecruiters,workable,workday
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Office/Business
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: httpx>=0.27
Requires-Dist: pandas>=2.0
Requires-Dist: pydantic>=2.6
Provides-Extra: all
Requires-Dist: aiohttp>=3.9; extra == 'all'
Requires-Dist: beautifulsoup4>=4.12; extra == 'all'
Requires-Dist: boto3>=1.34; extra == 'all'
Requires-Dist: firecrawl-py>=1.0; extra == 'all'
Requires-Dist: html2text>=2024.0; extra == 'all'
Requires-Dist: pyarrow>=15.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pandas-stubs; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-httpx>=0.30; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: types-boto3; extra == 'dev'
Provides-Extra: discovery
Requires-Dist: firecrawl-py>=1.0; extra == 'discovery'
Provides-Extra: parquet
Requires-Dist: pyarrow>=15.0; extra == 'parquet'
Provides-Extra: publish
Requires-Dist: boto3>=1.34; extra == 'publish'
Requires-Dist: pyarrow>=15.0; extra == 'publish'
Provides-Extra: scrapers
Requires-Dist: aiohttp>=3.9; extra == 'scrapers'
Requires-Dist: beautifulsoup4>=4.12; extra == 'scrapers'
Requires-Dist: html2text>=2024.0; extra == 'scrapers'
Description-Content-Type: text/markdown

<p align="center">
  <img src="https://raw.githubusercontent.com/stapply-ai/ats-scrapers/main/assets/banner.jpeg" alt="jobhive" />
</p>

# jobhive

> **The open dataset and toolkit for global job market data.**
> 3.3M+ live jobs from 400,000+ companies, scraped directly from the ATS platforms where companies actually post. No LinkedIn, no reposts, no recruiters.

[![PyPI](https://img.shields.io/pypi/v/jobhive.svg)](https://pypi.org/project/jobhive/)
[![Python](https://img.shields.io/pypi/pyversions/jobhive.svg)](https://pypi.org/project/jobhive/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

```python
from jobhive import search

df = search(query="ml engineer", location="Paris", remote=True)
```

No API key, no auth, no rate limits. The dataset refreshes every 24 hours.

---

## Why jobhive

Most job aggregators scrape LinkedIn and Indeed — both full of duplicates,
ghost listings, and reposts. **jobhive goes one layer down**: directly to
the ATS platforms (Greenhouse, Lever, Ashby, Workday, BambooHR…) where
companies actually post.

- **Single source of truth** — every row comes from the company's own
  ATS, so titles, locations, and salaries are accurate.
- **No duplicates** — one ATS posting = one row.
- **Structured salary** when the ATS exposes it (Ashby, Greenhouse Pay
  Transparency, Lever salaryRange, etc.).
- **MIT licensed, fully open** — fork the dataset, fork the scrapers.

## Coverage

| Metric | Value |
|---|---:|
| Live jobs | **3,376,000+** |
| Companies | **406,000+** |
| ATS platforms | **31** |

Top 10 by job count:

| ATS | Jobs |
|---|---:|
| Bundesagentur (DE public-sector) | 931 049 |
| Workday | 653 041 |
| EURES (EU/EEA public-sector) | 626 783 |
| SmartRecruiters | 213 372 |
| SuccessFactors | 180 499 |
| Greenhouse | 110 071 |
| Oracle HCM | 107 464 |
| iCIMS | 92 211 |
| Lever | 60 342 |
| Phenom | 56 483 |

Counts come from the live manifest at
`https://storage.stapply.ai/jobhive/v1/manifest.json` — verify them at
any time with `jobhive list-ats`.

## Install

```bash
pip install jobhive-py
```

Distributed as `jobhive-py` on PyPI; the import name is still `jobhive`.

Optional extras:

```bash
pip install "jobhive-py[parquet]"     # faster downloads via Apache Parquet
pip install "jobhive-py[scrapers]"    # build your own pipeline
pip install "jobhive-py[all]"
```

## Two ways to use it

### 1. Query the public dataset

```python
from jobhive import search

# Free-text title + location + remote filter
df = search(query="rust", location="Berlin", remote=True, salary_min=80_000)

# Restrict to one ATS slice (smaller download)
df = search(query="data engineer", ats="ashby")

# Pandas all the way down
df.groupby("company").size().sort_values(ascending=False).head(20)
```

Every row carries:

```
url, title, company, ats_type, ats_id,
location, is_remote, lat, lon,
salary_min, salary_max, salary_currency, salary_period, salary_summary,
employment_type, commitment, experience, department, team,
description, posted_at, fetched_at, requisition_id, apply_url, raw
```

Optional fields are `None` when the source ATS doesn't expose them.
`raw` keeps any provider-specific fields the canonical schema doesn't
represent — Greenhouse `metadata`, Workday `bulletFields`, etc.
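
Because optional fields are `None` rather than guaranteed, rows are best read defensively. The sketch below uses a hand-written dict with made-up values (not real dataset output) just to show the pattern:

```python
# Hypothetical row; field names match the schema above, values are
# invented for illustration.
row = {
    "title": "ML Engineer",
    "company": "ExampleCo",
    "ats_type": "greenhouse",
    "salary_min": None,   # optional: this ATS didn't expose salary
    "salary_max": None,
    "raw": {"metadata": [{"name": "visa_sponsorship", "value": "yes"}]},
}

# Guard optional fields before formatting them.
has_salary = row["salary_min"] is not None and row["salary_max"] is not None
salary_label = (
    f'{row["salary_min"]}-{row["salary_max"]}' if has_salary else "unspecified"
)

# Provider-specific extras live under `raw`; use .get() since the
# keys present vary by ATS.
metadata = row["raw"].get("metadata", [])
```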

### 2. Scrape your own companies

```python
from jobhive.scrapers import GreenhouseScraper, LeverScraper, AshbyScraper

jobs = GreenhouseScraper("anthropic").fetch()    # → list[Job]
jobs = LeverScraper("palantir").fetch()
jobs = AshbyScraper("openai").fetch()
```

Or pick by name:

```python
from jobhive.scrapers import get_scraper

scraper = get_scraper("ashby", "openai")
```

## Scrapers

**Multi-tenant ATS** (pass the company's slug on that ATS):

`Greenhouse`, `Lever`, `Ashby`, `SmartRecruiters`, `Workable`,
`Rippling`, `Personio`, `Gem`, `JoinCom`, `iCIMS`, `JazzHR`, `Breezy`,
`Teamtailor`, `Pinpoint`, `BambooHR`, `Cornerstone`, `Recruitee`,
`Recruiterbox`, `Eightfold`, `Avature`, `Phenom`, `Workday`, `Oracle`,
`SuccessFactors`, `Taleo`, `Mercor`.

**Custom big-tech APIs** (single-tenant, slug ignored): `Amazon`,
`Apple`, `Google`, `TikTok`, `Uber`.

**National public-sector aggregators**: `Bundesagentur` (DE),
`Arbetsformedlingen` (SE), `Eures` (EU/EEA-wide).

**Hybrid jobboards**: `WelcomeToTheJungle`.

A few scrapers (`Tesla`, `Meta`) need a real browser session and ship as
placeholders pending the optional browser backend in 0.2.

## CLI

```bash
jobhive search "platform engineer" --location Paris --limit 20
jobhive scrape ashby openai
jobhive list-ats
```

## Contributing

**The goal is the largest open-source live job dataset on the
internet.** That's a forever project, and there's a clear path to make
it bigger:

- **Add a new ATS scraper** — every ATS we don't cover yet is a few
  thousand companies missing from the dataset. The scraper API is
  intentionally tiny: subclass `BaseScraper`, set `ats`, implement
  `fetch()`. See any file under `src/jobhive/scrapers/` for a 50-line
  reference, and the `Job` model in `src/jobhive/models.py` for the
  schema you populate.
- **Improve coverage on an existing ATS** — many scrapers extract
  description / salary / employment-type only when the ATS surfaces
  them. If you find a tenant where a field is structurally available
  but we're missing it, a one-line PR is welcome.
- **Discover new tenants** — we maintain a
  `{ats}/{ats}_companies.csv` per ATS. New rows = new companies in
  the dataset.
- **Report broken scrapers** — open an issue with the slug and the
  failure mode. ATS APIs drift; flagging a regression early keeps the
  dataset accurate for everyone.
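
The scraper pattern the first bullet describes can be sketched roughly as follows. The class and field names here are stand-ins inferred from the description above, not the library's exact API — check `src/jobhive/scrapers/` and `src/jobhive/models.py` for the real interfaces:

```python
from dataclasses import dataclass


# Minimal stand-ins for jobhive's Job model and BaseScraper, just to
# show the shape of a new scraper.
@dataclass
class Job:
    url: str
    title: str
    company: str
    ats_type: str


class BaseScraper:
    ats = ""  # subclasses set the ATS identifier

    def __init__(self, slug: str):
        self.slug = slug

    def fetch(self) -> list[Job]:
        raise NotImplementedError


class ExampleATSScraper(BaseScraper):
    ats = "exampleats"  # hypothetical ATS

    def fetch(self) -> list[Job]:
        # A real implementation would call the ATS's public jobs
        # endpoint (e.g. with httpx) and map each posting into a Job.
        postings = [{"url": "https://jobs.example.com/1", "title": "SRE"}]
        return [
            Job(url=p["url"], title=p["title"],
                company=self.slug, ats_type=self.ats)
            for p in postings
        ]


jobs = ExampleATSScraper("exampleco").fetch()
```

The real scrapers follow the same three steps — subclass, set `ats`, implement `fetch()` — with the network call and field mapping filled in.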

```bash
git clone https://github.com/stapply-ai/ats-scrapers
cd ats-scrapers
uv pip install -e ".[dev,scrapers]"
pytest
ruff check .
```

PRs welcome on `main`. CI runs all six combinations of {3.11, 3.12,
3.13} × {ubuntu, macos}; please keep it green.

## License

MIT.

## Acknowledgments

Built with [Reverse API Engineer](https://github.com/kalil0321/reverse-api-engineer).
