Metadata-Version: 2.4
Name: hep-data-llm
Version: 1.1.1
Project-URL: Documentation, https://github.com/Gordon Watts/hep-data-llm#readme
Project-URL: Issues, https://github.com/Gordon Watts/hep-data-llm/issues
Project-URL: Source, https://github.com/Gordon Watts/hep-data-llm
Author-email: Gordon Watts <gwatts@uw.edu>
License-Expression: MIT
License-File: LICENSE.txt
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.8
Requires-Dist: diskcache
Requires-Dist: dotenv
Requires-Dist: fsspec[http]
Requires-Dist: openai
Requires-Dist: pydantic
Requires-Dist: pyyaml
Requires-Dist: tenacity
Requires-Dist: tqdm
Requires-Dist: typer
Provides-Extra: results
Requires-Dist: jupyterlab; extra == 'results'
Requires-Dist: matplotlib; extra == 'results'
Requires-Dist: pandas; extra == 'results'
Requires-Dist: papermill; extra == 'results'
Requires-Dist: seaborn; extra == 'results'
Provides-Extra: test
Requires-Dist: black; extra == 'test'
Requires-Dist: coverage; extra == 'test'
Requires-Dist: flake8; extra == 'test'
Requires-Dist: hypothesis; extra == 'test'
Requires-Dist: pytest; extra == 'test'
Requires-Dist: pytest-asyncio; extra == 'test'
Requires-Dist: pytest-cov; extra == 'test'
Requires-Dist: pytest-mock; extra == 'test'
Description-Content-Type: text/markdown

# hep-data-llm

[![PyPI - Version](https://img.shields.io/pypi/v/hep-data-llm.svg)](https://pypi.org/project/hep-data-llm)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/hep-data-llm.svg)](https://pypi.org/project/hep-data-llm)

-----

## Table of Contents

- [Installation](#installation)
- [License](#license)

## Introduction

This repo contains the code used to translate english queries for plots into the actual plots using LLM's and python packages and tools like ServiceX, Awkward, Vector, and hist. This is a proof-of-concept and not meant to be production level code.

Benchmark studies with the 8 adl-index as presented in conferences can be found in the [results](results/) directory for various workflows.

Use `--help` at the root level and on commands (e.g. `hep-data-llm plot --help`) to get a complete list of options.

## Installation

To run out of the box you'll need to do the following once:

__Prerequisites__:

1. You'll need to have `docker` installed on your machine
1. Build the `docker` image to run the workflow. Which docker image is used depends on what workflow you are using.
    - ServiceX/Awkward: `docker build -t hepdatallm-awkward:latest Docker`
    - ServiceX/RDF: `docker build -t hepdatallm-rdf:latest -f Docker/Dockerfile.RDF .`
that is used to run `servicex`, `awkward`, and friends:
1. If you are running a `servicex` workflow, [get an access token](https://servicex-frontend.readthedocs.io/en/stable/connect_servicex.html). Make sure the `servicex.yaml` file is either in your home directory or your current working directory.
1. You'll need token(s) to access the LLM. Here is what the `.env` looks like. Please create this either in your local directory or your home directory. Make sure only you can read it: this is access to a paid service!

```text
api_openai_com_API_KEY=<openai-key>
api_together_xyz_API_KEY=<together.ai key>
openrouter_ai_API_KEY=<openrouter-key>
```

### Running in a local python environment

```console
pip install hep-data-llm
hep-data-llm plot "Plot the ETmiss of all events in the rucio dataset mc23_13p6TeV:mc23_13p6TeV.801167.Py8EG_A14NNPDF23LO_jj_JZ2.deriv.DAOD_PHYSLITE.e8514_e8528_a911_s4114_r15224_r15225_p6697." output.md
```

The output will be in `output.md` - view in a markdown rendering problem (I use `vscode`). A `img` directory will be created and it will contain the plot (hopefully).

Use `hep-data-llm plot --help` to see all the options you can give it. It defaults to using `gpt-5`, the most successful model in tests.

### Default questions

A `questions.yaml` file is bundled with the package containing a list of common plotting questions. To run one of these questions by number, pass the index (starting from 1) instead of the full text:

```console
hep-data-llm plot 1 output.md
```

This will execute the first question from `questions.yaml`.

### Running with `uvx`

This is great if you want to just run once or twice.

```console
uvx hep-data-llm plot "Plot the ETmiss of all events in the rucio dataset mc23_13p6TeV:mc23_13p6TeV.801167.Py8EG_A14NNPDF23LO_jj_JZ2.deriv.DAOD_PHYSLITE.e8514_e8528_a911_s4114_r15224_r15225_p6697." output.md
```

This uses the `uvx` tool to install a temporary environment. If you want to keep this around to use, you can use `uv tool install hep-data-llm`. Do remember to update it every now and then!

## Usage

### new profile

Use the `new profile <filename>` to create a new profile. It copies the default profile, and you can then modify it and update it with new prompt or other items.

### Creating a new workflow

Otherwise known as creating a new prompt, this is about creating a new prompt file and hint files and what it takes in the context of this package.

1. General preparation - you'll need a docker container with the appropriate software installed. You'll also need a good set of test instructions.
1. Use the `hep-data-llm new profile my-prompt.yaml` command to create a "dummy" prompt file.
1. Edit the new profile `yaml` file:
    - If you are editing new hint files, then replace the list of hint files with a local (relative) reference to the hint files you want to use
    - Choose a fairly cheap model to run (since you'll probably be running it a lot). Change the `model:` entry (or you can use the `--model` option).
1. When you are ready to test, use `hep-data-llm plot --profile my-prompt --ignore-cache hints <question> output.md`. Replace `<question>` with your question or a question number from the default list of questions.
    - Note the `ignore-cache` - the code always caches the hints files, even if they are located on the local disk.
    - Use `--repeat N` to record multiple, independent runs for each model. Every trial bypasses the LLM cache automatically so you get fresh outputs for each repetition without reusing earlier responses.

Notes from adding a `servicex-RDF` workflow:

- The guardrail that looked for the `png` file to be written out had to be altered
- Hint files that described `servicex` assumed `awkward` output - it had to be split in two so that there was a short hint file that described how to generate a `servicex` request and a second one that described how to take the results and turn them into a `awkward` data. The same thing then had to occur for `rdf`.
- A new `docker` container had to be built, in this case based on the ROOT container image.

## License

`hep-data-llm` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
