Metadata-Version: 2.4
Name: hack4her-review-data
Version: 0.1.0
Summary: Synthetic multilingual accommodation review data generator for Hack4Her travel-safety prototypes.
Author: Hack4Her Data Team
Project-URL: Homepage, https://hack4her.github.io/
Project-URL: Repository, https://github.com/iflashlord/hack4her-review-data
Keywords: hack4her,synthetic-data,reviews,travel-safety,booking,cli
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Provides-Extra: cli
Requires-Dist: rich>=13.7.0; extra == "cli"
Requires-Dist: typer>=0.12.0; extra == "cli"
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Requires-Dist: rich>=13.7.0; extra == "dev"
Requires-Dist: typer>=0.12.0; extra == "dev"

# Hack4Her Mock Accommodation Reviews

This repo contains a dependency-free Python generator for synthetic Booking.com-style accommodation reviews, built around the Hack4Her challenge theme: women's safety while travelling.

The generated data is mock data only. Reviews, properties, labels, and coordinates are synthetic and must not be interpreted as real Booking.com customer reviews or real safety ratings for any location.

## Generated Files

The default 1k balanced dataset has already been generated:

- `data/mock_reviews_balanced_1000.csv`
- `data/mock_reviews_balanced_1000.jsonl`
- `data/mock_reviews_balanced_1000.summary.json`
- `data/mock_review_source_context_pool_10000.csv`
- `data/mock_review_source_context_pool_10000.jsonl`

Additional 1k scenario datasets are available in:

- `data/scenarios/`
- `data/random/`

Larger 10k scenario datasets are available in:

- `data/scenarios_10k/`
- `data/random_10k/`

Pre-generated participant-ready starter packs are available in:

- `data/starter_1000/`
- `data/starter_10000/`

New generated outputs default to `data_output_generated/`, which is ignored by git.

The dataset includes multilingual reviews in English, Spanish, French, German, Dutch, Italian, Portuguese, and Arabic.

## Run

For detailed usage, see [docs/USAGE.md](docs/USAGE.md).
For PyPI publishing, see [docs/PUBLISHING.md](docs/PUBLISHING.md).

## Installable Package

After the package is published to PyPI:

```bash
python -m pip install hack4her-review-data
hack4her-data --starter-pack --records 1000
```

For the Rich/Typer visual terminal:

```bash
python -m pip install "hack4her-review-data[cli]"
hack4her-data-cli
```

## Participant Start Point

Use starter packs when teams need data to begin building without seeing organizer labels.

Fancy terminal UI:

macOS/Linux:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-cli.txt
python3 scripts/hack4her_cli.py
```

Windows PowerShell:

```powershell
py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements-cli.txt
python scripts\hack4her_cli.py
```

The fancy CLI opens a Booking.com Hack4Her branded terminal menu where teams select the dataset type, record count, output format, and output folder. Outputs generated from the visual menu automatically hide organizer/evaluation labels in the main dataset and add a separate 10% labeled golden sample for validation or scoring.

The interface is built on Rich and Typer and runs cross-platform. It shows an animated Booking.com header with smaller Hack4Her text in pink, scenario safety-mix previews, output-folder checks, generation-plan panels, written-file summaries, and animated progress bars. The dependency-free script below remains available for teams that only want Python standard library commands.

Direct fancy CLI commands also work:

```bash
python3 scripts/hack4her_cli.py menu
python3 scripts/hack4her_cli.py doctor
python3 scripts/hack4her_cli.py starter --records 1000
python3 scripts/hack4her_cli.py scenarios
```

Generate participant-ready CSV files for all deterministic scenarios:

```bash
python3 scripts/generate_mock_reviews.py --starter-pack --records 1000
```

Choose any size from `1000` to `10000` in steps of `1000`:

```bash
python3 scripts/generate_mock_reviews.py --starter-pack --records 5000
python3 scripts/generate_mock_reviews.py --starter-pack --records 10000
```

Starter packs are written to `data_output_generated/` by default.

Each starter pack contains one public CSV per scenario, one 10% labeled golden CSV per scenario, summaries, and a small README explaining how to choose a dataset.

Generate the default deterministic 1k balanced dataset:

```bash
python3 scripts/generate_mock_reviews.py
```

Generate a specific scenario (one command per scenario):

```bash
python3 scripts/generate_mock_reviews.py --records 1000 --scenario safety_heavy --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario location_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario host_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario stay_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario mostly_positive --output-dir data_output_generated
```
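
The per-scenario commands above differ only in the `--scenario` value, so they can be assembled in a loop. The following is a minimal sketch, not part of the package: `build_command` is a hypothetical helper, and the check on the record count assumes the documented choices (`1000` to `10000` in steps of `1000`); whether the script itself rejects other values is not stated here.

```python
import shlex

# Scenario names and flags are copied from the commands above.
SCENARIOS = ["safety_heavy", "location_focus", "host_focus", "stay_focus", "mostly_positive"]
# Record counts this README lists: 1000 to 10000 in steps of 1000.
ALLOWED_RECORDS = range(1000, 10001, 1000)


def build_command(scenario, records=1000, output_dir="data_output_generated"):
    """Return the argv list for one generator run (hypothetical helper)."""
    if records not in ALLOWED_RECORDS:
        raise ValueError(f"records must be one of {list(ALLOWED_RECORDS)}, got {records}")
    return [
        "python3", "scripts/generate_mock_reviews.py",
        "--records", str(records),
        "--scenario", scenario,
        "--output-dir", output_dir,
    ]


commands = [build_command(name) for name in SCENARIOS]
for argv in commands:
    # Printed rather than executed; to run one, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Building argument lists instead of shell strings avoids quoting problems if an output folder name contains spaces.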

Generate all deterministic scenarios:

```bash
python3 scripts/generate_mock_reviews.py --all-scenarios --records 1000 --output-dir data_output_generated
```

Generate all deterministic 10k scenarios:

```bash
python3 scripts/generate_mock_reviews.py --all-scenarios --records 10000 --output-dir data_output_generated
```

Generate a deliberately random set. This changes on each run unless `--seed` is provided:

```bash
python3 scripts/generate_mock_reviews.py --scenario random --records 1000 --output-dir data_output_generated
```

Generate a 10k random set:

```bash
python3 scripts/generate_mock_reviews.py --scenario random --records 10000 --output-dir data_output_generated
```

Generate a participant-facing version without helper labels:

```bash
python3 scripts/generate_mock_reviews.py --records 1000 --scenario balanced --public --format csv --output-dir data_output_generated
```

With `--public`, the main dataset omits organizer labels, and the script also writes a `_golden_10pct.csv` file that keeps the labels for 10% of rows.
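
The public/golden split can be pictured with a small sketch. This is an illustration of the idea only, not the generator's actual code: `split_public_and_golden` is a hypothetical function, the sample rows are invented, and treating exactly the four helper-label columns from "Useful Columns" as the hidden set is an assumption.

```python
import random

# Assumption: these four helper-label columns are the ones hidden in public output.
LABEL_COLUMNS = (
    "is_safety_related",
    "safety_category",
    "safety_concern_level",
    "safety_signal",
)


def split_public_and_golden(rows, fraction=0.10, seed=0):
    """Return (public_rows, golden_rows) from a list of labeled row dicts."""
    # Public copy: every row, with the label columns dropped.
    public_rows = [
        {key: value for key, value in row.items() if key not in LABEL_COLUMNS}
        for row in rows
    ]
    # Golden copy: a seeded sample of the original rows, labels kept.
    sample_size = round(len(rows) * fraction)
    golden_rows = random.Random(seed).sample(rows, sample_size)
    return public_rows, golden_rows


# Invented rows, only to exercise the function.
rows = [
    {
        "review_text": f"review {i}",
        "rating": 8,
        "is_safety_related": i % 2 == 0,
        "safety_category": "location",
        "safety_concern_level": "low",
        "safety_signal": "none",
    }
    for i in range(20)
]
public_rows, golden_rows = split_public_and_golden(rows)
print(len(public_rows), len(golden_rows))
print(sorted(public_rows[0]))
```

Teams can use the labeled golden rows to check a classifier while building against the unlabeled main file.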

## Reproducibility

Deterministic scenarios use a stable 10k source context pool and a default seed of `20260522`, so everyone running the same command gets the same records. The normal record choices are `1000`, `2000`, `3000`, `4000`, `5000`, `6000`, `7000`, `8000`, `9000`, and `10000`. The `random` scenario intentionally uses a fresh random seed unless you pass `--seed`.
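
The effect of a fixed seed can be shown with the standard library alone. This is a minimal sketch under stated assumptions: `sample_pool` is a hypothetical helper and the integer list stands in for the 10k source context pool; how the generator actually draws from its pool is not described here.

```python
import random

DEFAULT_SEED = 20260522  # default seed quoted above


def sample_pool(pool, count, seed=None):
    """Draw `count` distinct items; seed=None gives a fresh seed per call."""
    rng = random.Random(seed)
    return rng.sample(pool, count)


pool = list(range(10_000))  # stand-in for the 10k source context pool

first = sample_pool(pool, 1000, DEFAULT_SEED)
second = sample_pool(pool, 1000, DEFAULT_SEED)
print(first == second)  # True: same seed, same pool, same selection
```

Using a dedicated `random.Random` instance keeps the draw independent of any other code that touches the module-level random state.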

Use `--write-source-pool` to write the synthetic 10k source context pool:

```bash
python3 scripts/generate_mock_reviews.py --records 1000 --scenario balanced --write-source-pool --output-dir data_output_generated
```

## Scenarios

- `balanced`: mixed travel reviews with a visible safety signal.
- `safety_heavy`: many safety-related reviews across location, host, and stay.
- `location_focus`: safety around neighborhood, route, entrance, or transit.
- `host_focus`: host conduct, check-in conduct, and support response.
- `stay_focus`: room, lock, access, privacy, and on-property safety concerns.
- `mostly_positive`: mostly normal or positive reviews with sparse safety concerns.
- `random`: non-deterministic topic mix for surprise testing.

## Useful Columns

- `review_text`, `review_title`, `language`, `rating`: primary participant-facing review fields.
- `city`, `country`, `latitude`, `longitude`, `area_type`: useful for map prototypes.
- `is_safety_related`, `safety_category`, `safety_concern_level`, `safety_signal`: helper labels for testing or evaluation.
- `topic`, `sentiment`, `labels`: additional organizer-facing metadata.
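
As a starting point for a map prototype, the columns above can be read with the standard `csv` module. The sketch below uses an inline, invented sample instead of a real data file. The column names come from the list above, but the cell values and the way booleans are encoded (`true`/`false` here) are assumptions, so check them against an actual generated CSV.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; real files may encode values differently.
SAMPLE_CSV = """review_title,review_text,language,rating,city,country,latitude,longitude,is_safety_related,safety_category
Quiet stay,Felt fine walking back at night,en,9,Lisbon,Portugal,38.72,-9.14,false,
Dark entrance,The entrance was poorly lit,en,5,Lisbon,Portugal,38.71,-9.13,true,location
Cerradura rota,La cerradura no funcionaba,es,3,Madrid,Spain,40.42,-3.70,true,stay
"""


def load_rows(handle):
    """Read all rows from an open CSV handle into a list of dicts."""
    return list(csv.DictReader(handle))


def safety_points(rows):
    """Return (latitude, longitude, category) tuples for safety-related rows."""
    return [
        (float(row["latitude"]), float(row["longitude"]), row["safety_category"])
        for row in rows
        # Accept a few common truthy spellings, since the encoding is assumed.
        if row["is_safety_related"].strip().lower() in {"true", "1", "yes"}
    ]


rows = load_rows(io.StringIO(SAMPLE_CSV))
points = safety_points(rows)
by_language = Counter(row["language"] for row in rows)
print(points)
print(by_language)
```

To read a generated file instead, replace `io.StringIO(SAMPLE_CSV)` with `open(path, newline="", encoding="utf-8")`. Public datasets hide the helper labels, so `safety_points` only applies to labeled files such as the golden sample.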
