Metadata-Version: 2.4
Name: pyterrier-splade
Version: 0.1.0
Summary: PyTerrier wrapper for SPLADE learned sparse indexing and retrieval
Author: Craig Macdonald, Nicola Tonellotto
Author-email: Sean MacAvaney <sean.macavaney@glasgow.ac.uk>
Maintainer: Craig Macdonald
Maintainer-email: Sean MacAvaney <sean.macavaney@glasgow.ac.uk>
License: Creative Commons Attribution-NonCommercial-ShareAlike
Project-URL: Repository, https://github.com/cmacdonald/pyt_splade
Project-URL: Bug Tracker, https://github.com/cmacdonald/pyt_splade/issues
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.6.0
Requires-Dist: transformers
Requires-Dist: pyterrier>=1.0.0
Requires-Dist: pyterrier_alpha

# pyterrier_splade

An example of SPLADE learned sparse indexing and retrieval using PyTerrier transformers.

# Installation

```python
%pip install -q git+https://github.com/cmacdonald/pyt_splade.git
```

# Indexing

Indexing takes place as a pipeline. First, the SPLADE document encoder maps the raw text of each document into a dictionary of BERT WordPiece tokens and their corresponding weights. The underlying indexer, Terrier, is configured with `pretokenised=True`, so it handles arbitrary word counts and indexes the tokens unchanged, without further tokenisation.

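```python
import pyterrier as pt
import pyterrier_splade

# MS MARCO passage corpus, the same dataset used in the PISA example
dataset = pt.get_dataset('irds:msmarco-passage')

splade = pyterrier_splade.Splade()
# pretokenised=True: Terrier indexes the SPLADE tokens as given
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True)

# encode each document with SPLADE, then pass the tokens to Terrier
indxr_pipe = splade.doc_encoder() >> indexer
index_ref = indxr_pipe.index(dataset.get_corpus_iter(), batch_size=128)
```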
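
To make the pretokenised input more concrete, the following sketch shows the kind of record such an indexer could receive. It is an illustration only: the `to_pseudo_counts` helper, the scale factor of 100, the `toks` field name and the example weights are all assumptions made for this example, not the package's actual implementation.

```python
def to_pseudo_counts(weights, scale=100):
    """Turn float token weights into integer pseudo term counts.

    Hypothetical helper: the scale factor of 100 is an assumption.
    """
    counts = {}
    for token, weight in weights.items():
        count = int(round(weight * scale))
        if count > 0:  # drop tokens whose weight rounds to zero
            counts[token] = count
    return counts

# Made-up SPLADE-style weights over WordPiece tokens for one passage
weights = {"chemical": 1.37, "reaction": 2.05, "##s": 0.42, "the": 0.004}

# A record shaped like the input to a pretokenised indexer (field names assumed)
doc = {"docno": "d1", "toks": to_pseudo_counts(weights)}
print(doc)
```

Because the token weights are stored as counts, no stemming or stopword removal should be applied on top of them, which is why the tokens are indexed unchanged.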

# Retrieval

Similarly, SPLADE encodes the query into BERT WordPieces and corresponding weights.
We apply this as a query encoding transformer.

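```python
# encode the query with SPLADE, then retrieve from the index built above;
# wmodel='Tf' scores each matching token by its indexed count
splade_retr = splade.query_encoder() >> pt.terrier.Retriever('./msmarco_psg', wmodel='Tf')
```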
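
Conceptually, pairing the query encoder with the `Tf` weighting model amounts to a weighted sum: each query token's weight is multiplied by the indexed count of that token in the document. The sketch below illustrates that idea in plain Python with made-up values. The `score` function, the queries and the documents are all hypothetical, and the code does not reflect how Terrier computes scores internally.

```python
def score(query_weights, doc_counts):
    """Sum of (query weight * document count) over the tokens both share."""
    return sum(
        weight * doc_counts[token]
        for token, weight in query_weights.items()
        if token in doc_counts
    )

# Made-up query weights and document pseudo-counts (assumptions for illustration)
query = {"chemical": 2.0, "reaction": 1.5, "energy": 0.5}
docs = {
    "d1": {"chemical": 137, "reaction": 205},
    "d2": {"energy": 90, "reaction": 10},
}

# Rank documents by descending score
ranking = sorted(
    ((docno, score(query, toks)) for docno, toks in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Only tokens that appear in both the query and the document contribute, which is what allows an inverted index to serve this kind of learned sparse retrieval.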

# Scoring

SPLADE can also be used as a text scoring function.

```python

first_stage = ... # e.g., BM25, dense retrieval, etc.
splade_scorer = first_stage >> dataset.text_loader() >> splade.scorer()

```

# PISA

For faster retrieval with SPLADE, you can use the PISA retrieval backend provided by [PyTerrier_PISA](https://github.com/terrierteam/pyterrier_pisa):

```python
import pyterrier as pt
import pyterrier_splade
from pyterrier_pisa import PisaIndex

splade = pyterrier_splade.Splade()
dataset = pt.get_dataset('irds:msmarco-passage')
index = PisaIndex('./msmarco-passage-splade', stemmer='none')

# indexing
idx_pipeline = splade.doc_encoder() >> index.toks_indexer()
idx_pipeline.index(dataset.get_corpus_iter())

# retrieval
retr_pipeline = splade.query_encoder() >> index.quantized()
```

# Demo

We have a demo of PyTerrier_SPLADE at https://huggingface.co/spaces/terrierteam/splade

# Note

This package was previously named `pyt_splade` and is now named `pyterrier_splade`. It is still available
under the old name, but that name may be removed in the future.

# Credits 

 - Craig Macdonald
 - Sean MacAvaney
 - Nicola Tonellotto
