Metadata-Version: 2.4
Name: jobquest
Version: 0.1.7
Summary: Stealth job scraper for LinkedIn, Indeed & Glassdoor — powered by Scrapling
License: MIT
License-File: LICENSE
Keywords: jobs,scraper,linkedin,indeed,glassdoor,stealth,scrapling
Author: Alwin Paul
Author-email: alwin.paulpv@gmail.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Requires-Dist: beautifulsoup4 (>=4.12.2)
Requires-Dist: browserforge (>=1.0)
Requires-Dist: curl-cffi (>=0.7)
Requires-Dist: markdownify (>=0.14.0)
Requires-Dist: msgspec (>=0.18)
Requires-Dist: numpy (>=1.26.0)
Requires-Dist: pandas (>=2.1.0)
Requires-Dist: patchright (>=1.0)
Requires-Dist: playwright (>=1.40)
Requires-Dist: pydantic (>=2.3.0)
Requires-Dist: regex (>=2024.4.28)
Requires-Dist: requests (>=2.31.0)
Requires-Dist: scrapling (>=0.2)
Requires-Dist: tls-client (>=1.0.1)
Description-Content-Type: text/markdown

# JobQuest

A Python library that scrapes real job postings from LinkedIn, Indeed, and Glassdoor — and actually gets results back instead of 403 errors.

Built on top of [Scrapling](https://github.com/D4Vinci/Scrapling) for Chrome-level TLS fingerprinting and Cloudflare bypass. Every request looks like it came from a real browser: the TLS handshake and headers match Chrome's, and Cloudflare-protected pages are loaded in an actual headless Chromium.

## Why this exists

Most job scraping libraries break within weeks. Cloudflare updates its bot detection, LinkedIn starts rate-limiting harder, Glassdoor changes its GraphQL schema — and suddenly you're staring at empty DataFrames.

JobQuest handles this with stealth-first defaults:

- **Chrome TLS fingerprinting** via `curl_cffi` — your requests have the same TLS signature as Chrome 138
- **Cloudflare bypass** via Scrapling's `StealthyFetcher` when needed (full headless Chromium)
- **Realistic headers** generated by `browserforge` — not hardcoded strings from 2023
- **Automatic fallback** — if stealth isn't installed, it falls back to `requests`/`tls_client`

## Install

```bash
pip install jobquest
```

## Quick start

```python
from jobquest import scrape_jobs

jobs = scrape_jobs(
    site_name=["linkedin", "indeed", "glassdoor"],
    search_term="AI Engineer",
    location="Germany",
    results_wanted=25,
    country_indeed="germany",
)

print(f"{len(jobs)} jobs found")
print(jobs[["title", "company", "location"]])
```

That's it. `scrape_jobs` returns a pandas DataFrame with one row per job posting.

## What you get back

Each row in the DataFrame has:

| Column | Description |
|--------|-------------|
| `title` | Job title |
| `company` | Company name |
| `location` | City, state, country |
| `job_url` | Link to the posting |
| `date_posted` | When it was listed |
| `description` | Full job description (markdown by default) |
| `salary_source` | Where salary info came from |
| `min_amount` / `max_amount` | Salary range |
| `is_remote` | Remote or not |
| `job_type` | Full-time, part-time, contract, etc. |
| `company_url` | Company page link |
| `emails` | Contact emails found in the description |

Plus site-specific fields like `job_level`, `job_function`, and more.
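
Because the result is a regular DataFrame, the usual pandas filtering applies. The sketch below uses a few hand-written rows as a stand-in for real `scrape_jobs` output (the companies and salaries are made up) and relies only on columns from the table above:

```python
import pandas as pd

# Hypothetical rows standing in for the DataFrame returned by scrape_jobs().
jobs = pd.DataFrame(
    {
        "title": ["AI Engineer", "ML Engineer", "Data Engineer"],
        "company": ["Acme", "Globex", "Initech"],
        "is_remote": [True, False, True],
        "min_amount": [70000.0, None, 65000.0],
        "max_amount": [90000.0, None, 80000.0],
    }
)

# Keep remote postings that list a salary, highest maximum first.
# .eq(True) treats missing is_remote values as "not remote".
remote_with_salary = (
    jobs[jobs["is_remote"].eq(True) & jobs["min_amount"].notna()]
    .sort_values("max_amount", ascending=False)
    .reset_index(drop=True)
)

print(remote_with_salary[["title", "company", "min_amount", "max_amount"]])
```

Salary columns are often empty on real postings, so filtering on `notna()` before sorting or averaging tends to be a safe first step.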

## Supported sites

| Site | Status | Notes |
|------|--------|-------|
| **LinkedIn** | Working | Guest API, no login needed |
| **Indeed** | Working | GraphQL API, mobile app headers |
| **Glassdoor** | Working | GraphQL API, auto CSRF token |

## All the options

```python
from jobquest import scrape_jobs

jobs = scrape_jobs(
    site_name="linkedin",              # or ["linkedin", "indeed", "glassdoor"]
    search_term="machine learning",
    location="Berlin",
    distance=50,                       # km radius
    is_remote=True,
    job_type="fulltime",               # fulltime, parttime, contract, internship
    easy_apply=True,                   # LinkedIn/Glassdoor easy apply filter
    results_wanted=50,
    country_indeed="germany",          # for Indeed's country-specific API
    hours_old=72,                      # only jobs posted in last 72 hours
    linkedin_fetch_description=True,   # fetch full descriptions (slower)
    description_format="markdown",     # markdown, html, or plain
    enforce_annual_salary=True,        # normalize all salaries to yearly
    proxies=["http://user:pass@proxy:8080"],
    verbose=2,                         # 0=errors, 1=warnings, 2=info
)
```

## How the stealth works

JobQuest uses a two-tier approach:

**Tier 1: StealthSession (default for all requests)**

Every HTTP request goes through Scrapling's `FetcherSession`, which uses `curl_cffi` to impersonate Chrome's TLS fingerprint. Headers are generated by `browserforge` to match real browser patterns. This is enough for LinkedIn, Indeed, and most API calls.

**Tier 2: StealthyFetcher (for Cloudflare-protected pages)**

When a site puts up a Cloudflare challenge (like Glassdoor's CSRF page), JobQuest launches a real headless Chromium browser via `patchright` to solve it. Cookies from that session get transferred back to the lighter HTTP session for subsequent requests.
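
The escalation between the two tiers can be pictured roughly as follows. This is an illustrative sketch, not JobQuest's internal code: `fetch_with_escalation`, `light_fetch`, and `browser_fetch` are hypothetical names, and the challenge check (status 403/503 or Cloudflare's "Just a moment" interstitial text) is an assumed heuristic.

```python
def fetch_with_escalation(url, light_fetch, browser_fetch):
    """Try the cheap HTTP client first; fall through to a browser on a challenge.

    Both fetchers are hypothetical callables taking a URL and returning
    (status_code, body_text).
    """
    status, body = light_fetch(url)
    # Assumed heuristic for "the lightweight request was challenged".
    if status in (403, 503) or "Just a moment" in body:
        status, body = browser_fetch(url)
    return status, body


# Stand-in fetchers so the sketch runs without any network access.
def fake_light(url):
    return 403, "Just a moment..."


def fake_browser(url):
    return 200, "<html>csrf page</html>"


result = fetch_with_escalation("https://example.com", fake_light, fake_browser)
print(result)
```

Keeping the browser as a second step matters because launching Chromium is far slower than a plain HTTP request.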

If `scrapling` isn't installed, everything falls back to `requests` + `tls_client`, the same approach most job scrapers use. You just lose the stealth advantage.
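
To see which path your environment would take, you can check which packages are importable. A minimal sketch using only the standard library; `pick_http_backend` is a hypothetical helper, and the preference order is an assumption based on the description above rather than JobQuest's actual selection logic:

```python
from importlib.util import find_spec


def pick_http_backend(preferred=("scrapling", "tls_client", "requests")):
    """Return the first importable module name from `preferred`, or None."""
    for name in preferred:
        # find_spec returns None for a missing top-level module
        # and does not import the module itself.
        if find_spec(name) is not None:
            return name
    return None


backend = pick_http_backend()
print(f"Would use: {backend or 'no supported HTTP backend found'}")
```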

## Using proxies

```python
jobs = scrape_jobs(
    site_name="linkedin",
    search_term="data engineer",
    proxies=["http://user:pass@proxy1:8080", "socks5://proxy2:1080"],
)
```

Proxies rotate automatically between requests.
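
The rotation strategy isn't spelled out here. If you want to reason about which proxy a given request might use, plain round-robin is one simple model. The sketch below shows that idea with `itertools.cycle`; it is an assumption for illustration, not a description of JobQuest's internals, and `next_proxy` is a hypothetical helper:

```python
from itertools import cycle

proxies = ["http://user:pass@proxy1:8080", "socks5://proxy2:1080"]

# cycle() yields the proxies in order and starts over after the last one.
proxy_pool = cycle(proxies)


def next_proxy():
    """Return the proxy to use for the next request (round-robin)."""
    return next(proxy_pool)


used = [next_proxy() for _ in range(3)]
print(used)
```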

## Export to CSV / Excel

```python
import csv

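# Quote every non-numeric field so commas and newlines in descriptions stay intact.
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False)
# Writing .xlsx needs an Excel engine such as openpyxl, which jobquest does not install.
jobs.to_excel("jobs.xlsx", index=False)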
```

## Requirements

- Python 3.10+
- The stealth stack installs automatically: `scrapling`, `curl_cffi`, `browserforge`, `patchright`, `playwright`

## License

MIT

