Metadata-Version: 2.4
Name: roughsearch
Version: 0.1.1
Summary: Full-text search with zero thinking.
Author-email: sumeshi <sum3sh1@protonmail.com>
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: anyio>=4.0
Requires-Dist: duckdb>=1.0
Requires-Dist: fastapi>=0.110
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2.0
Requires-Dist: sudachidict-full
Requires-Dist: sudachipy>=0.6
Requires-Dist: tqdm>=4.66
Requires-Dist: uvicorn>=0.29
Description-Content-Type: text/markdown

# Roughsearch

![roughsearch-logo](https://gist.githubusercontent.com/sumeshi/c2f430d352ae763273faadf9616a29e5/raw/1b6cc5ab252d12d762f1911717a46be846fc0756/roughsearch.svg)

**A full-text search engine that tries to require as little thinking as possible.**

Roughsearch is a lightweight full-text search engine built on DuckDB and BM25. It targets Japanese and English only. You can use it from the CLI or call it directly from Python.

So when you want to search a large pile of text files in a decent way, what do you do?
Start by writing a Dockerfile? Spin up an Elasticsearch container? Install a morphological analysis plugin? Fire a huge number of API requests at it? Wait forever for indexing? No. Your life will end first.

Roughsearch gives up on flexibility completely. Fine-grained settings, grand scoring systems, and everything else are gone. This software has exactly one purpose: **"search roughly."** Run `$ pip install roughsearch`, and the environment is ready. Point a command at the target directory, and the indexing is done. That is all.


## Architecture
Roughsearch indexes loaded documents with the following pipeline:

1. It reads the text and runs morphological analysis with [Sudachi](https://github.com/WorksApplications/Sudachi).
2. It extracts major terms from the result, mainly nouns, verbs, and adjectives.
3. It indexes the normalized form of each extracted term and a romaji version of that term.

At search time, **the original terms score higher** and **the transliterated alphabet forms score lower**, producing weighted best-match results. All of this data is stored in a single `.duckdb` file, which makes the index highly portable.
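
The weighting idea can be illustrated with a toy scorer. This is only a sketch, not Roughsearch's actual implementation: the weights `ORIGINAL_WEIGHT` and `ROMAJI_WEIGHT`, the hand-written token lists in `DOCS`, and the textbook BM25 formula are all assumptions made for illustration. The real engine gets its terms from Sudachi and stores them in DuckDB.

```python
import math
from collections import Counter

# Hypothetical field weights: original terms count more than romaji forms.
ORIGINAL_WEIGHT = 1.0
ROMAJI_WEIGHT = 0.3

# Hand-written stand-in for analyzer output: (normalized terms, romaji terms).
DOCS = {
    "doc-001": (["風", "透き通る"], ["kaze", "sukitooru"]),
    "doc-002": (["夏", "空", "青い"], ["natsu", "sora", "aoi"]),
}


def bm25(term, tokens, all_token_lists, k1=1.2, b=0.75):
    """Textbook BM25 score of one term against one token list."""
    n_docs = len(all_token_lists)
    df = sum(1 for toks in all_token_lists if term in toks)
    if df == 0:
        return 0.0
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    tf = Counter(tokens)[term]
    avg_len = sum(len(toks) for toks in all_token_lists) / n_docs
    norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
    return idf * tf * (k1 + 1) / norm


def score(term, doc_id):
    """Weighted sum of the original-term score and the romaji-term score."""
    originals = [fields[0] for fields in DOCS.values()]
    romajis = [fields[1] for fields in DOCS.values()]
    orig_tokens, romaji_tokens = DOCS[doc_id]
    return (ORIGINAL_WEIGHT * bm25(term, orig_tokens, originals)
            + ROMAJI_WEIGHT * bm25(term, romaji_tokens, romajis))


def search(term):
    """Return (score, doc_id) pairs with a positive score, best first."""
    scored = [(score(term, doc_id), doc_id) for doc_id in DOCS]
    return sorted(((s, d) for s, d in scored if s > 0), reverse=True)
```

In this toy setup, `search("風")` and `search("kaze")` both find `doc-001`, but the romaji query gets the lower score because it only matches the down-weighted field.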


## Installation

```bash
$ pip install roughsearch
```

This software requires Python 3.11 or later.


## Usage

### Embedded Use
Use this mode when you want to embed a simple full-text search engine in your own system.

```python
import roughsearch

with roughsearch.Client("docs.duckdb", language="ja") as rs:
    rs.add("doc-001", title="いろはにほへと", body="あのイーハトーヴォのすきとおった風")
    rs.add("doc-002", title="ちりぬるを", body="夏でも底に冷たさをもつ青いそら")
    rs.reindex()

    results = rs.search("風")
    for hit in results.hits:
        print(hit.score, hit.title, hit.snippet)
```

### CLI Server
Use this mode when you just want to get a server running.  
The server exposes a REST API that any frontend can use for search.

```bash
$ roughsearch init docs.duckdb --language ja
$ roughsearch add docs.duckdb ./docs
$ roughsearch serve docs.duckdb --port 8080
```

`add`, `serve`, `search`, and `dump` normally use the default language saved by `init`. If needed, you can temporarily override it with `--language`.

### HTTP Client
Use this mode when you want to connect to a running Roughsearch server and query it.

```python
import roughsearch

rs = roughsearch.HttpClient("http://localhost:8080")
results = rs.search("空")
```


## CLI Reference

### Commands

| Command | Description |
| --- | --- |
| `init <db_path>` | Initialize and create a new database |
| `add <db_path> <path>` | Add documents from a directory and rebuild the index |
| `serve <db_path>` | Start the REST API server |
| `search <db_path> <query>` | Search from the command line |
| `reindex <db_path>` | Rebuild the FTS index, for example after adding documents |
| `reanalyze <db_path>` | Reanalyze stored documents with the current analyzer and rebuild the index, for example after a software update |
| `dump <db_path>` | Print stored documents as JSON to stdout |
| `stats <db_path>` | Show the document count |
| `inspect [text]` | Analyzer debugging command that prints tokenization results as JSON |

### Options

#### init

| Option | Default | Description |
| --- | --- | --- |
| `--language` | `ja` | Database analyzer language (`en` or `ja`) |

#### add

| Option | Default | Description |
| --- | --- | --- |
| `--glob` | `None` | Glob pattern for target files such as `*.md` |
| `--language` | `None` | Temporarily override the language for added documents |

#### serve

| Option | Default | Description |
| --- | --- | --- |
| `--language` | `None` | Temporarily override the default language used by the server |
| `--host` | `127.0.0.1` | Bind address |
| `--port` | `8080` | Port number |

#### search

| Option | Default | Description |
| --- | --- | --- |
| `--language` | `None` | Language filter for the search |
| `--limit` | `20` | Maximum number of results |

#### dump

| Option | Default | Description |
| --- | --- | --- |
| `--language` | `None` | Filter by language |
| `--limit` | `20` | Maximum number of output rows |

#### inspect

| Option | Default | Description |
| --- | --- | --- |
| `--language` | `ja` | Analyzer language |
| `--title` | `""` | Text to analyze on the title side |
| `--file` | `None` | Read the body from a file. If set, it takes precedence over the positional `text` argument |


## Examples

### Index and Search a Local Document Directory

```bash
$ pip install roughsearch

$ roughsearch init notes.duckdb --language ja
$ roughsearch add notes.duckdb ./notes --glob "*.md"
$ roughsearch search notes.duckdb "ニンジャ"
```

### Embedded Python Use with Metadata and Filters

```python
import roughsearch
from roughsearch.search.query import SearchFilters, SearchQuery

with roughsearch.Client("notes.duckdb", language="ja") as rs:
    rs.add(
        "note-001",
        title="いろはにほへと",
        body="あのイーハトーヴォのすきとおった風",
        metadata={"tags": ["note", "japanese"], "source": "handbook"},
        source_uri="handbook/note-001.md",
    )
    rs.reindex()

    # Structured query: filter by tag, highlight matches, and cap the result count.
    results = rs.search(
        SearchQuery(
            query="風",
            filters=SearchFilters(tags=["note"]),
            highlight=True,
            limit=10,
        )
    )
```

### Start the API Server and Search with curl

```bash
$ roughsearch serve docs.duckdb --port 8080 &

$ curl -s -X POST http://localhost:8080/documents \
  -H "Content-Type: application/json" \
  -d '{"id":"1","title":"いろはにほへと","body":"あのイーハトーヴォのすきとおった風"}'

$ curl -s -X POST http://localhost:8080/reindex

$ curl -s -X POST http://localhost:8080/search \
  -H "Content-Type: application/json" \
  -d '{"query":"風","limit":5}' | python -m json.tool
```

### Bulk Add

```python
import roughsearch

docs = [
    {"id": "1", "title": "いろはにほへと", "body": "あのイーハトーヴォのすきとおった風"},
    {"id": "2", "title": "ちりぬるを",  "body": "夏でも底に冷たさをもつ青いそら"},
]

with roughsearch.Client("bulk.duckdb") as rs:
    rs.add_documents(docs)
    rs.reindex()
    print(rs.search("風").total)
```

---

## Output Format

```json
{
  "query": "風",
  "total": 1,
  "hits": [
    {
      "id": "doc-001",
      "score": 8.512,
      "title": "いろはにほへと",
      "snippet": "あのイーハトーヴォのすきとおった<mark>風</mark>",
      "body": "あのイーハトーヴォのすきとおった風",
      "language": "ja",
      "source_uri": null,
      "heading_path": null,
      "parent_id": null,
      "chunk_id": null,
      "metadata": {}
    }
  ]
}
```
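
If you consume the raw JSON yourself, for example from the REST API, a small typed wrapper can make the fields easier to work with. The following is a minimal sketch: the `Hit` and `SearchResponse` classes and the `parse_response` helper are hypothetical names, not part of Roughsearch, and the types of the fields shown as `null` above are not known from the sample, so they are left as `Any`.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class Hit:
    id: str
    score: float
    title: str
    snippet: str
    body: str
    language: str
    # These are null in the sample response, so their real types are unknown.
    source_uri: Any = None
    heading_path: Any = None
    parent_id: Any = None
    chunk_id: Any = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class SearchResponse:
    query: str
    total: int
    hits: list[Hit]


def parse_response(raw: str) -> SearchResponse:
    """Parse a JSON search response, ignoring keys this sketch does not know."""
    data = json.loads(raw)
    known = {f.name for f in fields(Hit)}
    hits = [Hit(**{k: v for k, v in h.items() if k in known}) for h in data["hits"]]
    return SearchResponse(query=data["query"], total=data["total"], hits=hits)


# The sample response from above, trimmed to the fields that are not null.
RAW = """
{
  "query": "風",
  "total": 1,
  "hits": [
    {
      "id": "doc-001",
      "score": 8.512,
      "title": "いろはにほへと",
      "snippet": "あのイーハトーヴォのすきとおった<mark>風</mark>",
      "body": "あのイーハトーヴォのすきとおった風",
      "language": "ja",
      "metadata": {}
    }
  ]
}
"""

response = parse_response(RAW)
for hit in response.hits:
    print(hit.id, hit.score, hit.snippet)
```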

## REST API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| `GET` | `/health` | Health check |
| `GET` | `/stats` | Document counts by language |
| `POST` | `/documents` | Add one document |
| `POST` | `/documents/bulk` | Add multiple documents |
| `GET` | `/documents/{id}` | Fetch a document by ID |
| `DELETE` | `/documents/{id}` | Soft-delete a document |
| `POST` | `/search` | Full-text search |
| `POST` | `/reindex` | Rebuild the FTS index |
| `POST` | `/optimize` | Run a DB checkpoint and compaction |
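
The endpoints can be called with nothing but the standard library. The sketch below only builds the requests; `build_request` and `call` are hypothetical helpers, `BASE_URL` assumes the server from the `serve` examples, and `call` assumes every endpoint answers with a JSON body, which this README shows only for `/search`.

```python
import json
import urllib.request

# Assumed address; matches `roughsearch serve docs.duckdb --port 8080`.
BASE_URL = "http://localhost:8080"


def build_request(method, path, payload=None):
    """Build a request for one endpoint; JSON-encode the payload if given."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def call(request):
    """Send the request and decode the response as JSON (assumed format)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Same payload as the curl example above.
search_req = build_request("POST", "/search", {"query": "風", "limit": 5})
health_req = build_request("GET", "/health")
```

With a server running, `call(search_req)` would send the same request as the curl example.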


## Notes

* **Reindexing is required after writes.** Documents added with `add()` are stored immediately, but they will not appear in search results until you call `reindex()`. This keeps bulk imports fast.
* **Assume a single writer.** DuckDB allows only one process at a time to open a database file for writing. Run one server process and perform only one write operation at a time.
* **The server listens on localhost (`127.0.0.1`) by default.** If you need external access, place it behind a reverse proxy such as nginx.
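
The first two notes can be combined into one pattern: funnel all writes through a lock and reindex once per batch. This is a minimal sketch under stated assumptions: `SingleWriter` is a hypothetical helper, not part of Roughsearch, and `RecordingClient` is a stand-in used here so the example runs without a database. The only client methods it relies on are `add(id, title=..., body=...)` and `reindex()`, as used in the examples above.

```python
import threading


class SingleWriter:
    """Serialize writes to one client and reindex once per batch."""

    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()

    def add_batch(self, docs):
        # Hold the lock for the whole batch so writes never interleave.
        with self._lock:
            for doc in docs:
                self._client.add(doc["id"], title=doc["title"], body=doc["body"])
            # One reindex per batch instead of one per document.
            self._client.reindex()


class RecordingClient:
    """Stand-in client that only records which methods were called."""

    def __init__(self):
        self.calls = []

    def add(self, doc_id, title, body):
        self.calls.append(("add", doc_id))

    def reindex(self):
        self.calls.append(("reindex",))


client = RecordingClient()
writer = SingleWriter(client)
writer.add_batch([
    {"id": "1", "title": "いろはにほへと", "body": "あのイーハトーヴォのすきとおった風"},
    {"id": "2", "title": "ちりぬるを", "body": "夏でも底に冷たさをもつ青いそら"},
])
```

The lock only coordinates threads inside one process. It does not stop a second process from opening the same `.duckdb` file, so the single-server rule still applies.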


## License
MIT. See [LICENSE](LICENSE) for details.

Powered by [Sudachi](https://github.com/WorksApplications/Sudachi), licensed under the Apache License 2.0.
