Metadata-Version: 2.3
Name: target_benchmark
Version: 0.1.2
Summary: Table Retrieval for Generative Tasks Benchmark
License: Apache-2.0
Author: Xingyu Ji
Author-email: jixy2012@berkeley.edu
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: llamaindex-retriever
Provides-Extra: ottqa-retriever
Requires-Dist: datasets (>=2.19.0,<3.0.0)
Requires-Dist: evaluate (>=0.4.2,<0.5.0)
Requires-Dist: func-timeout (>=4.3.5,<5.0.0)
Requires-Dist: hnswlib (>=0.8.0,<0.9.0)
Requires-Dist: langchain (>=0.1.16,<0.2.0)
Requires-Dist: langchain-community (>=0.0.34,<0.0.35)
Requires-Dist: langchain-core (>=0.1.45,<0.2.0)
Requires-Dist: langchain-openai (>=0.0.8,<0.0.9)
Requires-Dist: langchain-text-splitters (>=0.0.1,<0.0.2)
Requires-Dist: llama-index (>=0.10.58,<0.11.0) ; extra == "llamaindex-retriever"
Requires-Dist: nltk (>=3.8.1,<4.0.0) ; extra == "ottqa-retriever"
Requires-Dist: numpy (>=1.26.4,<2.0.0)
Requires-Dist: pandas (>=2.2.2,<3.0.0)
Requires-Dist: pexpect (>=4.9.0,<5.0.0) ; extra == "ottqa-retriever"
Requires-Dist: pydantic (>=2.7.4,<3.0.0)
Requires-Dist: python-dateutil (>=2.9.0,<3.0.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: qdrant-client (>=1.9.1,<2.0.0)
Requires-Dist: regex (>=2023.10.3,<2024.0.0)
Requires-Dist: rouge-score (>=0.1.2,<0.2.0)
Requires-Dist: sacrebleu (>=2.4.2,<3.0.0)
Requires-Dist: scikit-learn (>=1.3.0,<2.0.0)
Requires-Dist: scipy (>=1.13.0,<2.0.0)
Requires-Dist: spacy (>=3.7.4,<4.0.0) ; extra == "ottqa-retriever"
Requires-Dist: tqdm (>=4.65.0,<5.0.0)
Requires-Dist: transformers (>=4.41.2,<5.0.0)
Project-URL: Homepage, https://target-benchmark.github.io/
Project-URL: Repository, https://github.com/target-benchmark/target
Description-Content-Type: text/markdown

# TARGET: Table Retrieval for Generative Tasks Benchmark

## Set Up TARGET

**Install via pip**

```shell
pip install target_benchmark
```

**Install from source**

```shell
git clone https://github.com/target-benchmark/target.git

cd target

pip install -e .
```

If you want to use the default generators for generating downstream task answers, set your OpenAI API key as an environment variable:

```shell
export OPENAI_API_KEY=<your openai api key>
```

## Features
- run evaluations on TARGET's baseline retrievers
- implement your own custom retrievers and generators
- create your own custom task

## Usage Example: Evaluate Baseline Retriever

Let's see how we can run evaluation on a baseline retriever. We'll use LlamaIndex as an example:

```python
from target_benchmark.evaluators import TARGET, get_task_names
# you can run `get_task_names()` to get all available tasks
from target_benchmark.retrievers import LlamaIndexRetriever

# specify a task and a dataset to run evaluations on.
target_fetaqa = TARGET(("Table Retrieval Task", "fetaqa"))
# create a new retriever object
llamaindex_retriever = LlamaIndexRetriever()
# run the evaluation!
performance = target_fetaqa.run(retriever=llamaindex_retriever, split="test", top_k=10)

# if you'd like, you can also persist the retrieval and downstream generation results
performance = target_fetaqa.run(
    retriever=llamaindex_retriever,
    split="test",
    top_k=10,
    retrieval_results_file="./retrieval.jsonl",
    downstream_results_file="./downstream.jsonl",
)
```

## Create Retrievers

TARGET offers a simple interface for creating custom retrievers. You can either inherit from the `AbsCustomEmbeddingRetriever` class or the `AbsStandardEmbeddingRetriever` class.

### Inheriting from `AbsCustomEmbeddingRetriever` Class

Inherit from this class if your retriever uses a **custom format for embedding tables** (e.g., specific directory structures or file types). The TARGET evaluator assumes that your retriever will manage the persistence of embeddings during evaluation.

**When to Use This Class**

- **Custom Embedding Formats**: Your retriever requires specific storage formats for embeddings.
- **Self-Managed Persistence**: You handle the storage and retrieval of embeddings yourself.

**Implementing the Required Methods**

To use this class, implement the following two methods:

1. **`embed_corpus`**
   - **Parameters**:
     - `dataset_name`: Identifier for the dataset.
     - `corpus`: The dataset to embed, provided as an iterable of dictionaries.
   - **Returns**: Nothing; your retriever handles persisting the embeddings itself.

2. **`retrieve`**
   - **Parameters**:
     - `query`: The user's query string.
     - `dataset_name`: Identifier for the dataset.
     - `top_k`: Number of top results to return.
   - **Returns**: A list of tuples, where each tuple contains `(database_id, table_id)` of a retrieved table.

```python
from typing import Dict, Iterable, List, Tuple

from target_benchmark.retrievers import AbsCustomEmbeddingRetriever

class YourRetriever(AbsCustomEmbeddingRetriever):
    # You can set an `expected_corpus_format`
    # (i.e. nested array, dictionary, dataframe, etc.)
    # in your `__init__` function.
    # The corpus tables will be converted to this format
    # before being passed into the `embed_corpus` function.
    def __init__(self, **kwargs):
        super().__init__(expected_corpus_format="nested array")

    # returns a list of tuples, each being (database_id, table_id) of the retrieved table
    def retrieve(self, query: str, dataset_name: str, top_k: int) -> List[Tuple]:
        pass

    # returns nothing since the embedding persistence is dealt with within this function.
    def embed_corpus(self, dataset_name: str, corpus: Iterable[Dict]) -> None:
        pass
```
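To make the contract concrete, here is a self-contained toy sketch of the same two methods, using a plain in-memory dictionary as the "persistence" layer and simple token overlap in place of real embeddings. The class and all internals are hypothetical illustrations, not part of TARGET:

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for an AbsCustomEmbeddingRetriever subclass:
# the retriever owns its own storage (here, just a dict), and the
# evaluator only ever calls embed_corpus once and retrieve per query.
class ToyCustomRetriever:
    def __init__(self):
        self._store: Dict[str, List[Tuple[str, str, set]]] = {}

    def embed_corpus(self, dataset_name: str, corpus: Iterable[Dict]) -> None:
        # Persist a token set per table; a real retriever would write
        # embeddings to its own index or directory structure instead.
        entries = self._store.setdefault(dataset_name, [])
        for batch in corpus:
            for db_id, table_id, table in zip(
                batch["database_id"], batch["table_id"], batch["table"]
            ):
                tokens = {str(cell).lower() for row in table for cell in row}
                entries.append((db_id, table_id, tokens))

    def retrieve(self, query: str, dataset_name: str, top_k: int) -> List[Tuple[str, str]]:
        # Rank tables by overlap between query tokens and table tokens.
        q_tokens = set(query.lower().split())
        scored = sorted(
            self._store[dataset_name],
            key=lambda entry: len(q_tokens & entry[2]),
            reverse=True,
        )
        return [(db_id, table_id) for db_id, table_id, _ in scored[:top_k]]
```

The point of the sketch is the shape of the two methods: `embed_corpus` returns nothing because storage is self-managed, and `retrieve` returns `(database_id, table_id)` tuples.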


### Inheriting from `AbsStandardEmbeddingRetriever` Class
Inherit from this class if your retriever returns a vector embedding for each table and query. It automatically handles vector storage using an **in-memory Qdrant vector database**, so data is **not persisted across calls to `TARGET.run`**. (Support for persistence across evaluation runs is planned.)

**Why Inherit from This Class?**

Consider inheriting from this class instead of `AbsCustomEmbeddingRetriever` if:

- **Simple Embedding Output**: Your retriever outputs embeddings as vectors (lists of floats).
- **No Special Storage Needs**: Your retrieval system doesn't require specific persistence formats or folder structures.

**How to Use This Class**

To inherit from this class, you need to implement two methods:

1. **`embed_query`**: Returns an embedding vector for a given query.
   - **Parameters**:
     - `query`: The user's query string.
     - `dataset_name`: Identifier for the dataset.
   - **Returns**: The query embedding as a NumPy array.

2. **`embed_corpus`**: Returns an embedding vector for a corpus entry (e.g., a table or document).
   - **Parameters**:
     - `dataset_name`: Identifier for the dataset.
     - `corpus_entry`: A single entry in the corpus dataset.
   - **Returns**: The corpus entry's embedding as a NumPy array.

```python
from typing import Dict

import numpy as np

from target_benchmark.retrievers import AbsStandardEmbeddingRetriever

class YourRetriever(AbsStandardEmbeddingRetriever):
    def __init__(self, **kwargs):
        super().__init__(expected_corpus_format="nested array")

    # returns the embedding of the query as a numpy array
    def embed_query(self, query: str, dataset_name: str) -> np.ndarray:
        pass

    # returns the embedding of the passed-in table as a numpy array
    def embed_corpus(self, dataset_name: str, corpus_entry: Dict) -> np.ndarray:
        pass
```
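Once you return these vectors, the evaluator compares the query embedding against the stored corpus embeddings (via Qdrant) to pick the top-k tables. A minimal NumPy sketch of that comparison, purely for intuition (this is our illustration, not TARGET's internals):

```python
import numpy as np

def top_k_by_cosine(query_vec: np.ndarray, corpus_vecs: np.ndarray, top_k: int) -> list:
    # Normalize both sides so dot products equal cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    # Indices of the top_k highest-scoring corpus entries, best first.
    return list(np.argsort(-scores)[:top_k])
```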

### Note on `corpus` and `corpus_entry` Formatting

TARGET provides standardized formatting for the corpus datasets. More specifically, each TARGET corpus dataset includes the following columns:
- **database_id (str)**: The database that the table belongs to.
- **table_id (str)**: The table's identifier.
- **table**: The actual table contents. The default format is a nested array, but you can specify the expected format as `dictionary` or `dataframe` in your retriever's constructor. Tables are automatically converted to the expected format before being passed into the `embed_corpus` function.
- **context (dict)**: Any metadata associated with the table. For example, text-2-sql datasets' context often includes primary and foreign key information.

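To make the table formats concrete, here is a small sketch of how a nested-array table (header row first) maps to the column-oriented `dictionary` form. The helper name is ours, not TARGET's:

```python
def nested_array_to_dict(table):
    # First row is treated as the header; remaining rows are data.
    header, *rows = table
    return {col: [row[i] for row in rows] for i, col in enumerate(header)}

# A tiny nested-array table, header row first:
table = [["Year", "Candidate"], ["1982", "Thompson"], ["1982", "Stevenson"]]
```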

Both retriever classes' `embed_corpus` function takes in corpus information.
- `AbsStandardEmbeddingRetriever`: `corpus_entry` is a single entry within the corpus dataset. For example, it may look like this:
```python
{
    "database_id": "0",
    "table_id": "totto_source/train_json/example-10461.json",
    "table": <table contents in the retriever's expected format>,
    "context": {
        "table_page_title": "1982 Illinois gubernatorial election",
        "table_section_title": "Results",
    },
}
```
- `AbsCustomEmbeddingRetriever`: `corpus` is an iterable of dictionaries. Each dictionary contains a batch of corpus entries. For example:
```python
{
    "database_id": ["0", "1"],
    "table_id": ["Serbia_at_the_European_Athletics_Championships_2", "List_of_University_of_Texas_at_Austin_alumni_20"],
    "table": [<table content>, <table content>],
    "context": [{"section_title": "Indoor -- List of Medalists"}, {"section_title": "Literature , writing , and translation"}],
}
```
The length of the lists will correspond to the batch size specified when calling `TARGET.run`.
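If your `AbsCustomEmbeddingRetriever` logic is easier to write per table, you can unbatch each column-oriented dictionary with `zip`. A small illustrative helper (our sketch, not TARGET API):

```python
def iter_entries(batch: dict):
    # Turn one column-batched dict into a stream of per-table dicts.
    keys = list(batch)
    for values in zip(*(batch[k] for k in keys)):
        yield dict(zip(keys, values))
```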

## Create Custom Generators

Creating your own custom generators for downstream tasks is also straightforward. You only need to implement one function:
- **`generate`**
	- **Parameters**:
        - `table_str`: String of the retrieved table contents.
        - `query`: The natural language query.

    - **Returns**:
        - A dictionary, for flexibility. The keys it contains correlate closely with the configuration of the task that invokes the generator, since different tasks require different kinds of information.
        - For text-2-sql tasks, the dictionary is expected to contain the keys
            - `sql_query`: the generated SQL.
            - `database_id`: the id of the database to query. Why is this needed? If tables from multiple databases are passed into the generator's context, the generator must pick one of those databases when creating the query.
        - For other tasks, the dictionary is expected to contain the key
            - `content`: the generated response.



```python
from typing import Dict

from target_benchmark.generators import AbsGenerator

class YourCustomGenerator(AbsGenerator):
    # returns the answer to the query
    def generate(self, table_str: str, query: str) -> Dict:
        pass
```
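For intuition, a question-answering generator might return something shaped like this. The standalone function below is a hypothetical sketch of the return shape only; a real implementation would prompt an LLM with the table and query:

```python
from typing import Dict

def generate(table_str: str, query: str) -> Dict:
    # A real generator would call a model here; we return a canned
    # answer just to show the expected dictionary shape for QA tasks.
    answer = f"Answer to {query!r} based on the given table."
    return {"content": answer}

# A text-2-sql generator would instead return, e.g.:
# {"sql_query": "SELECT ...", "database_id": "some_db"}
```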

To use your generators, first create a task object, and pass the generator into the task object:

```python
from target_benchmark.evaluators import TARGET
from target_benchmark.tasks import QuestionAnsweringTask
qa_task = QuestionAnsweringTask(task_generator=YourCustomGenerator())
target_evaluator = TARGET(downstream_tasks=qa_task)
```
Note that instead of specifying the task by its name, here we pass in a task object with its generator set to our custom generator.

## Using the Evaluation Scripts
You can run the evaluation scripts in the `experiments` folder to run the baseline retrievers included in the TARGET paper on all tasks.

### Command-Line Arguments

- **`--retriever_name`**  
  **Type:** `str`  
  **Description:** Name of the retriever. Available options include:
  - `llamaindex`
  - `hnsw_openai`
  - `tfidf_no_title`
  - `tfidf_with_title`
  - `bm25_no_title`
  - `bm25_with_title`
  - `stella`
  - `e5`
  - `row_serial`

- **`--num_rows`**  
  **Type:** `int`  
  **Default:** `100`  
  **Description:** Number of rows to include for dense table embedding models.

- **`--persist`**  
  **Action:** `store_true`  
  **Default:** `False`  
  **Description:** Whether to persist the data.

- **`--retrieval_results_dir`**  
  **Type:** `str`  
  **Default:** `None`  
  **Description:** Folder to persist retrieval results.

- **`--downstream_results_dir`**  
  **Type:** `str`  
  **Default:** `None`  
  **Description:** Folder to persist downstream results.

- **`--top_ks`**  
  **Type:** `str` (later converted to a list of integers)  
  **Default:** `"1 5 10 25 50"`  
  **Description:** Space-separated list of top-k values, for example `"1 3 5 10"`.

- **`--num_nih_tables`**  
  **Type:** `int`  
  **Default:** `None`  
  **Description:** Number of tables to include from the NIH dataset.
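Since `--top_ks` arrives as a single space-separated string, the script presumably splits it into integers before use; a one-line sketch of that conversion:

```python
# Convert the space-separated --top_ks string into a list of ints.
top_ks = [int(k) for k in "1 5 10 25 50".split()]
```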

---

### Usage Example

Below is an example command showing how to run the script with arguments:

```bash
python experiments/retrieval_evals.py \
  --retriever_name stella \
  --num_rows 150 \
  --persist \
  --retrieval_results_dir "./results/retrieval" \
  --downstream_results_dir "./results/downstream" \
  --top_ks "1 5 10" \
  --num_nih_tables 50
```

