Metadata-Version: 2.4
Name: jcp-data-manager
Version: 0.2.0
Summary: CLI toolkit for JCP session enrichment, job posting, and job expiration checks.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: azure-ai-inference
Requires-Dist: azure-core
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: polars>=1.0.0
Requires-Dist: requests>=2.31.0
Requires-Dist: deepface>=0.0.93
Requires-Dist: gender-guesser>=0.4.0
Requires-Dist: ethnicolr>=0.18.4
Requires-Dist: google-genai
Requires-Dist: pandas>=2.0.0
Requires-Dist: python-dotenv>=0.21.0
Requires-Dist: python-jobspy>=1.1.82

# jcp-data-manager

CLI toolkit for JCP session enrichment, job posting, and job expiration checks.

## What it does

- Loads a merged JCP sessions JSON export
- Normalizes nested session, LinkedIn, and survey rows into a flat table
- By default, enriches rows with image-based DeepFace analysis
- By default, enriches rows with name-based gender and ethnicity predictions
- Scrapes jobs, generates JCP-ready HTML, and creates WordPress drafts
- Checks existing WordPress drafts for dead or soft-404 source links and can move invalid posts to private

## Install

```bash
pip install jcp-data-manager
```

## Environment file

Create a local `.env` file in the project folder before using the job-posting or expiration commands. `.env` is gitignored; `.env.example` shows the required keys.

Provide your own values for:

- `WORDPRESS_BASE_URL`
- `WORDPRESS_USERNAME`
- `WORDPRESS_APP_PASSWORD`
- `WORDPRESS_FEATURED_MEDIA_ID`
- `GITHUB_MODELS_TOKEN`
- `GITHUB_MODELS_ENDPOINT`
- `GITHUB_MODELS_MODEL`
- `GEMINI_API_KEY`
- `GEMINI_MODEL`

`GITHUB_MODELS_ENDPOINT`, `GITHUB_MODELS_MODEL`, and `GEMINI_MODEL` fall back to built-in defaults when unset, but keeping them in `.env` makes the setup explicit.
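
A quick pre-flight check can catch a missing key before a command runs. The sketch below is not part of the package: `missing_env_keys` is a hypothetical helper, and it assumes that every key listed above without a default is needed. It reads from `os.environ`, so load the `.env` file first (for example with `python-dotenv`).

```python
import os
from typing import Mapping

# Keys from the list above. Treating these six as required and the other
# three as optional is an assumption based on the note about defaults.
REQUIRED_KEYS = (
    "WORDPRESS_BASE_URL",
    "WORDPRESS_USERNAME",
    "WORDPRESS_APP_PASSWORD",
    "WORDPRESS_FEATURED_MEDIA_ID",
    "GITHUB_MODELS_TOKEN",
    "GEMINI_API_KEY",
)
OPTIONAL_KEYS = ("GITHUB_MODELS_ENDPOINT", "GITHUB_MODELS_MODEL", "GEMINI_MODEL")


def missing_env_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required keys that are unset or blank (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_env_keys()
    if missing:
        print("Missing .env values:", ", ".join(missing))
```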

## Commands

### Session enrichment

The sessions file should be a top-level JSON object with a `sessions` key whose value is a list.

Each session row is expected to come from the server-side merged export and should include a nested `session` object. If present, `linkedin_rows`, `profile_data`, and `job_survey_rows` are normalized the same way as in the notebook workflow.
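
To confirm that an export matches this shape before enrichment, a check along the following lines may help. It is a sketch under the assumptions stated above (a top-level object, a `sessions` list, and a nested `session` object per row). The function names are hypothetical and are not part of the package.

```python
import json
from pathlib import Path
from typing import Any


def check_sessions_payload(payload: Any) -> list[str]:
    """Return a list of problems found in a parsed sessions export (hypothetical helper)."""
    if not isinstance(payload, dict):
        return ["top level is not a JSON object"]
    sessions = payload.get("sessions")
    if not isinstance(sessions, list):
        return ["'sessions' key is missing or is not a list"]
    problems = []
    for index, row in enumerate(sessions):
        # Each row should be an object that carries a nested `session` object.
        if not isinstance(row, dict) or not isinstance(row.get("session"), dict):
            problems.append(f"row {index} has no nested 'session' object")
    return problems


def check_sessions_file(path: str) -> list[str]:
    """Parse a sessions file from disk and run the same checks."""
    return check_sessions_payload(json.loads(Path(path).read_text(encoding="utf-8")))
```

An empty list means the file has the expected outline. It does not validate the optional `linkedin_rows`, `profile_data`, or `job_survey_rows` fields.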

```bash
jcp-data-manager enrich-sessions --sessions /content/jcpst-sessions-2026-04-21-17-27-55.json --output merged.parquet
```

The legacy invocation without a subcommand still works:

```bash
jcp-data-manager --sessions /content/jcpst-sessions-2026-04-21-17-27-55.json --output merged.parquet
```

### Job scraping and posting

This command scrapes jobs, keeps only postings that include qualification text, asks GitHub Models to format the posting HTML, saves the output dataset, and then creates WordPress drafts.

```bash
jcp-data-manager get-jobs --occupation-title "Graphic Designer" --date-posted 04/21/2026 --location "Seattle, WA"
```

By default, drafts are posted with the LinkedIn sign-in popup flow. Use `--no-linkedin` to switch to the non-LinkedIn session-store post template:

```bash
jcp-data-manager get-jobs --occupation-title "Graphic Designer" --date-posted 04/21/2026 --location "Seattle, WA" --no-linkedin
```

Use `--skip-post` if you only want the scraped and generated output file without creating WordPress drafts.
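
The `--date-posted` values in the examples above are written as `MM/DD/YYYY`. If you build the command from a script, a small helper can validate the date first. This is a sketch: `parse_date_posted` is a hypothetical name, and the format is inferred from the examples rather than from the package's own parser.

```python
from datetime import date, datetime


def parse_date_posted(value: str) -> date:
    """Parse a MM/DD/YYYY string; raises ValueError on any other format."""
    return datetime.strptime(value, "%m/%d/%Y").date()


def format_date_posted(day: date) -> str:
    """Format a date the way the examples pass it to --date-posted."""
    return day.strftime("%m/%d/%Y")
```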

### Expiration checking

This command inspects WordPress posts, fetches each footnote URL, asks Gemini for a soft-404 probability, and by default changes invalid posts to `private`.

```bash
jcp-data-manager check-job-expiration --status draft --output invalid-posts.csv
```

Use `--skip-private` if you want the report without updating WordPress post status.
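
The way a dead-link check and a soft-404 probability might combine into a single decision can be sketched as follows. This illustrates the idea only. The function name, the HTTP status rule, and the `0.5` threshold are assumptions, and the package's actual rule may differ.

```python
def looks_expired(
    status_code: int,
    soft_404_probability: float,
    threshold: float = 0.5,  # assumed cutoff, not taken from the package
) -> bool:
    """Hypothetical rule: a dead link, or a live page that is probably a soft 404."""
    if status_code >= 400:
        # The source URL returned a client or server error: treat it as dead.
        return True
    # The page loaded, so fall back to the model's soft-404 estimate.
    return soft_404_probability >= threshold
```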

## uv

The package metadata works with `uv` directly:

```bash
uv sync
uv run jcp-data-manager --help
```

## Project layout

```text
src/jcp_data_manager/
  __init__.py
  cli.py
  config.py
  enrichment.py
  expiration.py
  io.py
  jobs.py
  job_templates.py
  merge.py
```
