Metadata-Version: 2.4
Name: baybin_sentinel
Version: 2026.6.25.1
Summary: Baybin Sentinel: OpenSearch writer
Author-email: Hsin-Yu Liu <meteorgroup33@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/meteorgroup33/baybin_sentinel
Project-URL: Bug Tracker, https://github.com/meteorgroup33/baybin_sentinel/issues
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Python: >=3.13
Description-Content-Type: text/markdown
Requires-Dist: opensearch-py>=2.8.0
Requires-Dist: pyyaml>=6.0.2

# Baybin Sentinel

`baybin_sentinel` is a Python utility package for the **Baybin Sentiment Analysis System**. It provides platform-specific writers that streamline ingesting crawled social media, news, and search-trend data into **OpenSearch**.

Currently supported platforms: **Facebook**, **Threads**, **PTT**, **News** (RSS / Scrapy), **Google Trends**.

## Installation

### (For Crawler Developers) Install Package
```bash
pip install -U baybin_sentinel
```

### (For Package Developers) Create Virtual Environment
```bash
conda update -n base -c conda-forge conda
conda create -n sentinel python=3.13 pip -y
conda activate sentinel
cd baybin_sentinel
pip install -r requirements.txt
pip install -e .
```

## Configuration

Each writer accepts credentials either as direct parameters or via a config file.

**Option A — direct parameters:**
```python
from baybin_sentinel.platforms.ptt import PttWriter

writer = PttWriter(
    host="192.168.x.x",
    port=9200,
    user="your_username",
    password="your_password",
    verify_certs=False,
)
```

**Option B — config file (recommended for development):**
```python
from baybin_sentinel.platforms.ptt import PttWriter

writer = PttWriter(config_path="/absolute/path/to/config.yaml")
```
```yaml
# config.yaml
opensearch:
  host: "your_opensearch_ip"
  port: 9200
  user: "your_username"
  password: "your_password"
  verify_certs: false
```

**Option C — environment variable (recommended for Celery workers / containers):**

Set `BAYBIN_SENTINEL_CONFIG` to the absolute path of your config file. This variable takes priority over the `config_path` argument.
```bash
export BAYBIN_SENTINEL_CONFIG=/absolute/path/to/config.yaml
```

Config resolution order: direct params → `BAYBIN_SENTINEL_CONFIG` env var → `config_path` argument → default `"config.yaml"` (relative to CWD).
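
The resolution order above can be sketched as a small function. This is only an illustration of the documented priority, not the package's actual code: `resolve_config_source` and its parameters are hypothetical names, and treating any non-empty set of direct parameters as "direct params" is an assumption.

```python
import os

ENV_VAR = "BAYBIN_SENTINEL_CONFIG"
DEFAULT_CONFIG = "config.yaml"


def resolve_config_source(direct_params=None, config_path=None, environ=None):
    """Return ("params", dict) or ("file", path), following the documented order.

    Hypothetical helper for illustration; the package may resolve this differently.
    """
    environ = os.environ if environ is None else environ

    # 1. Direct parameters win over any config file.
    if direct_params:
        return ("params", dict(direct_params))

    # 2. The environment variable takes priority over config_path.
    env_path = environ.get(ENV_VAR)
    if env_path:
        return ("file", env_path)

    # 3. Explicit config_path argument.
    if config_path:
        return ("file", config_path)

    # 4. Fall back to config.yaml, relative to the current working directory.
    return ("file", DEFAULT_CONFIG)
```

The practical consequence is that a worker with `BAYBIN_SENTINEL_CONFIG` set ignores any `config_path` hard-coded in crawler code, which is presumably why this option suits containers.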

## Index naming convention
Each writer targets a dedicated OpenSearch index following the pattern `raw_{platform}_{content_type}s`:

| Writer | Post index | Comment index |
|---|---|---|
| `FacebookWriter` | `raw_facebook_posts` | `raw_facebook_comments` |
| `ThreadsWriter` | `raw_threads_posts` | `raw_threads_comments` |
| `PttWriter` | `raw_ptt_posts` | — |
| `NewsWriter` | `raw_news_posts` | — |
| `GoogleTrendsWriter` | `raw_google_trends_posts` | — |
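
As a quick check of the naming pattern, the index names in the table can be reproduced with a one-line formatter. `index_name` is a hypothetical helper written for this illustration; it is not part of the package's API.

```python
def index_name(platform: str, content_type: str) -> str:
    """Build an index name following raw_{platform}_{content_type}s."""
    return f"raw_{platform}_{content_type}s"


# Reproduce two rows of the table above.
facebook_comments = index_name("facebook", "comment")
google_trends_posts = index_name("google_trends", "post")
```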

## Field normalization
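Each writer takes documents whose fields use the canonical names (or are mapped to them, see "Platform field maps") and routes each field either to the document root or into `metadata` before writing to OpenSearch.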

**Canonical root-level fields (posts):**
`post_id`, `platform`, `client_id`, `source_name`, `url`, `content`, `author_name`, `language`, `timestamp`, `crawled_at`, `s3_path`

**Canonical root-level fields (comments):**
`comment_id`, `legacy_comment_id`, `post_id`, `post_url`, `platform`, `client_id`, `author_id`, `author_name`, `content`, `content_hash`, `timestamp`, `crawled_at`, `created_at`, `depth`, `s3_path`

Any field not in the canonical set is automatically moved into a nested `metadata` object.
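
The routing rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `route_fields` is a hypothetical name, the sample keys `like_count` and `board` are invented for the example, and the sketch does not special-case an input that already carries a `metadata` key.

```python
# Canonical root-level post fields, copied from the list above.
POST_ROOT_FIELDS = frozenset({
    "post_id", "platform", "client_id", "source_name", "url", "content",
    "author_name", "language", "timestamp", "crawled_at", "s3_path",
})


def route_fields(doc, root_fields=POST_ROOT_FIELDS):
    """Keep canonical fields at the root; nest everything else under "metadata"."""
    routed = {}
    metadata = {}
    for key, value in doc.items():
        if key in root_fields:
            routed[key] = value
        else:
            metadata[key] = value
    # Only add the nested object when there is something to put in it.
    if metadata:
        routed["metadata"] = metadata
    return routed


# Hypothetical input: two canonical fields plus two platform-specific extras.
example = route_fields(
    {"post_id": "p1", "platform": "ptt", "like_count": 3, "board": "Gossiping"}
)
```

Here `post_id` and `platform` stay at the root, while `like_count` and `board` end up inside `example["metadata"]`.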

## Validation
Every writer validates the document **before** writing to OpenSearch. A `ValueError` is raised immediately if any required field is missing or empty, so an invalid document is never written silently.

**Required post fields:** `post_id`, `platform`, `client_id`, `timestamp`, `crawled_at`

**Required comment fields:** `comment_id` (or `legacy_comment_id`), `post_id`, `platform`, `client_id`, `content`, `timestamp`, `crawled_at`

This means:
- You must call `normalize_post()` / `normalize_comment()` before passing data to the writer; passing a raw API response directly will raise a `ValueError`.
- `client_id` must always be present — enforces multi-tenancy at the write layer.
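
A rough sketch of what such a pre-write check might look like is shown below. The function names are hypothetical, and treating `None` and blank strings as "empty" is an assumption; the package's own rules may be stricter or looser.

```python
REQUIRED_POST_FIELDS = ("post_id", "platform", "client_id", "timestamp", "crawled_at")
REQUIRED_COMMENT_FIELDS = (
    "post_id", "platform", "client_id", "content", "timestamp", "crawled_at",
)


def _is_empty(value):
    # Assumption: "empty" means None or a blank string.
    return value is None or (isinstance(value, str) and not value.strip())


def _raise_if_missing(kind, missing):
    if missing:
        raise ValueError(f"{kind} is missing required fields: {', '.join(missing)}")


def validate_post(doc):
    """Raise ValueError if any required post field is missing or empty."""
    _raise_if_missing("post", [f for f in REQUIRED_POST_FIELDS if _is_empty(doc.get(f))])


def validate_comment(doc):
    """Raise ValueError if any required comment field is missing or empty."""
    missing = [f for f in REQUIRED_COMMENT_FIELDS if _is_empty(doc.get(f))]
    # Either identifier satisfies the comment-ID requirement.
    if _is_empty(doc.get("comment_id")) and _is_empty(doc.get("legacy_comment_id")):
        missing.insert(0, "comment_id")
    _raise_if_missing("comment", missing)
```

With a check of this shape, a post whose `client_id` is an empty string fails before any network call is made, which is the multi-tenancy guarantee described above.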

## Platform field maps

**ThreadsWriter** — accepts raw output from the internal Threads scraper:

| Raw field | Canonical field |
|---|---|
| `text` | `content` (posts and comments) |
| `post_url` | `url` (posts) |
| `author` | `author_name` (posts) |
| `reply_author` | `author_name` (comments) |
| `reply_author_id` | `author_id` (comments) |

**FacebookWriter** — expects pre-normalized post data (output of `normalize_post()`). Comment field map:

| Raw field | Canonical field |
|---|---|
| `reply_author` | `author_name` |
| `reply_author_id` | `author_id` |

**PttWriter, NewsWriter, GoogleTrendsWriter** — expect pre-normalized data with canonical field names already set.
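
The field maps in the tables above amount to a key rename. The sketch below illustrates that idea using the Threads mappings; `rename_fields` and the two map constants are hypothetical names, and the writers may apply their maps differently (for example, when a raw key and its canonical target are both present).

```python
# Threads mappings, copied from the table above.
THREADS_POST_MAP = {"text": "content", "post_url": "url", "author": "author_name"}
THREADS_COMMENT_MAP = {
    "text": "content",
    "reply_author": "author_name",
    "reply_author_id": "author_id",
}


def rename_fields(raw, field_map):
    """Return a copy of raw with mapped keys renamed; unmapped keys pass through."""
    return {field_map.get(key, key): value for key, value in raw.items()}


# Hypothetical raw scraper output for a post.
raw_post = {"text": "hello", "post_url": "https://threads.net/...", "author": "alice"}
canonical_post = rename_fields(raw_post, THREADS_POST_MAP)
```

Note that `post_url` is renamed to `url` only for posts; for comments it is already a canonical field and passes through unchanged.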

## Example (Facebook)
```python
from baybin_sentinel.platforms.facebook import FacebookWriter

writer = FacebookWriter(
    host="192.168.x.x",
    port=9200,
    user="your_username",
    password="your_password",
    verify_certs=False,
)

# Single post with its comments
writer.save(post, comments)

# Bulk posts only
writer.save_bulk_posts(posts)

# Bulk comments only
writer.save_bulk_comments(comments)
```

## Example (Threads)
```python
from baybin_sentinel.platforms.threads import ThreadsWriter

writer = ThreadsWriter(config_path="/path/to/config.yaml")

# Single post with its replies (extracted from post["replies_detail"])
writer.save(post)

# Single post with explicit comments
writer.save(post, comments)

# Bulk posts only
writer.save_bulk_posts(posts)

# Bulk comments for one post
writer.save_bulk_comments(replies, post_url="https://threads.net/...")
```

## Example (PTT)
```python
from baybin_sentinel.platforms.ptt import PttWriter

writer = PttWriter(config_path="/path/to/config.yaml")

writer.save_post(post)
writer.save_bulk_posts(posts)
```

## Example (News)
```python
from baybin_sentinel.platforms.news import NewsWriter

writer = NewsWriter(config_path="/path/to/config.yaml")

writer.save_post(post)
writer.save_bulk_posts(posts)
```

## Example (Google Trends)
```python
from baybin_sentinel.platforms.google_trends import GoogleTrendsWriter

writer = GoogleTrendsWriter(config_path="/path/to/config.yaml")

writer.save_post(post)
writer.save_bulk_posts(posts)
```

## Publishing to PyPI

If you are the maintainer, follow these steps to publish a new version:

1. **Update version** in `pyproject.toml` (e.g., `2026.6.25.1`).
2. **Install build tools**:
   ```bash
   pip install build twine
   ```
3. **Build the package**:
   ```bash
   # Remove old build artifacts (Windows cmd equivalent: rmdir /s /q dist build 2>nul)
   rm -rf dist build
   python -m build
   ```
4. **Upload to PyPI**:
   ```bash
   python -m twine upload dist/*
   ```
5. **Authentication** (when `twine` prompts for credentials):
   - **Username**: `__token__`
   - **Password**: `pypi-your-api-token-here` (including the `pypi-` prefix)
