Metadata-Version: 2.3
Name: clickzetta-semantic-model-generator
Version: 1.0.21
Summary: Curate a Semantic Model for ClickZetta Lakehouse
License: Apache Software License; BSD License
Author: qililiang
Author-email: qililiang@clickzetta.com
Requires-Python: >=3.9, !=2.7.*, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, !=3.5.*, !=3.6.*, !=3.7.*, !=3.8.*, !=3.12.*, !=3.13.*
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Provides-Extra: looker
Requires-Dist: PyYAML (>=6.0.1,<7.0.0)
Requires-Dist: clickzetta-connector-python (>=0.8.92)
Requires-Dist: clickzetta-zettapark-python (>=0.1.3)
Requires-Dist: dashscope (>=1.22.2,<2.0.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: looker-sdk (>=24.14.0,<25.0.0) ; extra == "looker"
Requires-Dist: numpy (>=1.26.4,<3.0.0)
Requires-Dist: pandas (>=2.0.1,<3.0.0)
Requires-Dist: protobuf (==5.26.1)
Requires-Dist: pyarrow (==14.0.2)
Requires-Dist: pydantic (>=2.8.2,<3.0.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: requests (>=2.32.3,<3.0.0)
Requires-Dist: ruamel.yaml (==0.17.21)
Requires-Dist: sqlglot (==25.10.0)
Requires-Dist: streamlit (==1.36.0)
Requires-Dist: streamlit-extras (==0.4.0)
Requires-Dist: strictyaml (>=1.7.3,<2.0.0)
Requires-Dist: tqdm (>=4.66.5,<5.0.0)
Requires-Dist: urllib3 (>=1.26.19,<3.0.0)
Description-Content-Type: text/markdown

# semantic-model-generator

The ClickZetta Semantic Model Generator is a Streamlit companion app built for ClickZetta teams. Use it to explore Lakehouse metadata, author and refine semantic YAML, and plug into partner workflows; everything runs against ClickZetta's Lakehouse APIs and user volumes by default.

## Requirements

- Python 3.9–3.11 (3.12 and later are not supported)
- Access to a ClickZetta workspace (service URL, instance, workspace, schema, vcluster, username, password)
- A `connections.json` file in one of the standard ClickZetta locations (`~/.clickzetta/connections.json`, `config/connections.json`, `config/lakehouse_connection/connections.json`, or `/app/.clickzetta/lakehouse_connection/connections.json`). The structure matches the template from [`mcp-clickzetta-server`](https://github.com/yunqiqiliang/mcp-clickzetta-server/blob/main/config/connections-template.json). Set `"is_default": true` on the connection the app should use.

```json
{
  "system_config": {
    "embedding": {
      "provider": "dashscope",
      "dashscope": {
        "api_key": "dashscope_api_key",
        "model": "qwen-plus-latest"
      }
    }
  },
  "connections": [
    {
      "connection_name": "dev",
      "is_default": true,
      "service": "cn-shanghai-alicloud.api.clickzetta.com",
      "instance": "your_instance",
      "workspace": "quick_start",
      "schema": "PUBLIC",
      "username": "user",
      "password": "password",
      "vcluster": "default_ap"
    }
  ]
}
```

Environment variables such as `CLICKZETTA_SERVICE`, `CLICKZETTA_USERNAME`, etc. override the JSON values when present.
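The resolution order can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual loader; the full set of `CLICKZETTA_*` variable names beyond `CLICKZETTA_SERVICE` and `CLICKZETTA_USERNAME` is an assumption based on the connection fields above.

```python
import json
import os
from pathlib import Path

# Standard search locations for connections.json (listed in Requirements above).
CANDIDATE_PATHS = [
    Path.home() / ".clickzetta" / "connections.json",
    Path("config/connections.json"),
    Path("config/lakehouse_connection/connections.json"),
    Path("/app/.clickzetta/lakehouse_connection/connections.json"),
]

# Assumed env-var names; the README confirms CLICKZETTA_SERVICE and
# CLICKZETTA_USERNAME, the rest follow the same pattern.
ENV_OVERRIDES = {
    "service": "CLICKZETTA_SERVICE",
    "instance": "CLICKZETTA_INSTANCE",
    "workspace": "CLICKZETTA_WORKSPACE",
    "schema": "CLICKZETTA_SCHEMA",
    "username": "CLICKZETTA_USERNAME",
    "password": "CLICKZETTA_PASSWORD",
    "vcluster": "CLICKZETTA_VCLUSTER",
}

def load_config() -> dict:
    """Return the first connections.json found in the standard locations."""
    for path in CANDIDATE_PATHS:
        if path.exists():
            return json.loads(path.read_text())
    return {}

def resolve_connection(config: dict) -> dict:
    """Pick the default connection, then let env vars win over JSON values."""
    connections = config.get("connections", [])
    conn = next(
        (c for c in connections if c.get("is_default")),
        connections[0] if connections else {},
    )
    resolved = dict(conn)
    for key, env_name in ENV_OVERRIDES.items():
        if os.environ.get(env_name):
            resolved[key] = os.environ[env_name]
    return resolved
```

For example, with the `dev` connection above and `CLICKZETTA_SERVICE` exported in the shell, the resolved connection keeps the JSON's `schema` but uses the service URL from the environment.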

## App Overview

The Streamlit homepage highlights how to use this toolkit alongside the ClickZetta platform:

- **Local companion for semantic modeling.** Iterate quickly on YAML, inspect metadata, and validate changes before promoting them back to your Lakehouse.
- **Keep production work in ClickZetta.** Build and manage canonical models in the ClickZetta console, then switch to this app when you need richer editing, partner integrations, or AI enrichment—both share the same volumes for frictionless workflows.
- **Why semantics matter.** A curated semantic layer standardizes measures, joins, and business logic so LLMs understand context, avoid hallucinations, and deliver consistent analytics for data teams and business users.
- **Typical workflows covered:** author and refine models from table metadata, safely edit existing YAML with ClickZetta validation, generate/test SQL through the chat assistant, and auto-enrich documentation via DashScope.
- **Use it as a sandbox.** Pull models from a volume, experiment with the editor and chat assistant, then push the refined YAML back once it passes validation.

![Semantic model generator architecture](images/semantic-model-overview.svg)

## Installation from Docker

If you prefer not to install Python dependencies locally, pull the published Docker image:

```bash
docker pull czqiliang/semantic-model-generator:latest
docker run --rm -p 8501:8501 \
  -v $(pwd)/connections.json:/app/.clickzetta/connections.json \
  czqiliang/semantic-model-generator:latest
```

Mount your `connections.json` (see the example in the Requirements section above) so the container can pick up ClickZetta and DashScope credentials.

The Streamlit UI will be available at http://localhost:8501.

Docker Compose example (`docker-compose.yml`):

```yaml
version: "3.9"
services:
  app:
    image: czqiliang/semantic-model-generator:latest
    ports:
      - "8501:8501"
    volumes:
      - ~/.clickzetta:/app/.clickzetta         # default config mount on macOS
```

Run it with `docker compose up`.

Linux hosts usually keep the ClickZetta config under `/opt/clickzetta`; in that case, change the Compose volume mapping to:

```yaml
    volumes:
      - /opt/clickzetta:/app/.clickzetta
```

## Installation from source

```bash
# optional: conda env using environment.yml
conda env create -f environment.yml
conda activate clickzetta_env

# or install via poetry/pip
poetry install
# pip install .
```

The app depends on `clickzetta-connector-python` and `clickzetta-zettapark-python`; ensure they are installed via the commands above.

## Running the Streamlit app

```bash
# inside the Poetry environment
poetry run streamlit run app.py

# or, after activating the env, run:
python -m streamlit run app.py
```

When the app launches it will:

1. Load credentials from the ClickZetta connection config or environment.
2. Default file operations to `volume:user://~/semantic_models/` inside your user volume.
3. Provide workflows for generating semantic YAML, editing YAML, validating (basic checks), and importing partner specs (dbt, etc.).

## DashScope usage notes

- Semantic enrichment calls the DashScope official SDK's default endpoint; it cannot be redirected via `base_url`.
- Even if `DASHSCOPE_BASE_URL` or a compatible endpoint is set in `connections.json` or environment variables, the app will not use those values.
- You still need to provide `DASHSCOPE_API_KEY` and a model name (e.g. `qwen-plus`); leaving the other parameters at their defaults avoids the common `InvalidParameter: url error`.
- Use an OpenAI-compatible endpoint only if you explicitly need compatibility mode; the current Streamlit app does not support compatible endpoints.

## Key behaviours

- **Volume-first uploads**: YAML import/export uses the user volume path `volume:user://~/semantic_models/` unless a different volume/stage is selected.
- **Metadata discovery**: Workspace metadata (catalogs, schemas, tables) is fetched via ClickZetta INFORMATION_SCHEMA queries. Sample values and comments are collected using ClickZetta sessions.
- **Partner integrations**: dbt helpers read YAML from the chosen volume/stage, merge metadata, and reuse ClickZetta credentials.
- **Chat/validation placeholders**: Cortex-specific validation and chat calls are not yet available in ClickZetta mode; the UI will display placeholders instead of calling external services.
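The `volume:user://…` paths used throughout the app follow a simple scheme, which a small parser makes explicit. This is a hypothetical helper for illustration only, not part of the shipped package, and the real app may handle these URIs differently.

```python
def parse_volume_uri(uri: str) -> dict:
    """Split a ClickZetta volume URI such as 'volume:user://~/semantic_models/'
    into its scheme ('volume'), volume kind ('user'), and path."""
    scheme, rest = uri.split(":", 1)   # -> 'volume', 'user://~/semantic_models/'
    kind, path = rest.split("://", 1)  # -> 'user', '~/semantic_models/'
    return {"scheme": scheme, "kind": kind, "path": path}
```

For the default upload location, this yields scheme `volume`, kind `user`, and path `~/semantic_models/`, which is why the docs describe it as living "inside your user volume".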

## Development scripts

Useful commands while iterating:

```bash
make setup        # install dependencies
make run_admin_app
make fmt_lint     # format + lint
make test         # execute pytest suite
make docker-buildx       # build multi-arch Docker image (linux/amd64, linux/arm64)
make docker-buildx-push  # build and push multi-arch image
```

## License

Apache 2.0 / BSD (dual license) – see LICENSE and LEGAL files for details.

