abacusai.document_retriever

Module Contents

Classes

DocumentRetriever

A vector store that stores embeddings for a list of document chunks.

class abacusai.document_retriever.DocumentRetriever(client, name=None, documentRetrieverId=None, createdAt=None, featureGroupId=None, featureGroupName=None, latestDocumentRetrieverVersion={}, documentRetrieverConfig={})

Bases: abacusai.return_class.AbstractApiClass

A vector store that stores embeddings for a list of document chunks.

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • name (str) – The name of the document retriever.

  • documentRetrieverId (str) – The unique identifier of the vector store.

  • createdAt (str) – When the vector store was created.

  • featureGroupId (str) – The feature group id associated with the document retriever.

  • featureGroupName (str) – The feature group name associated with the document retriever.

  • latestDocumentRetrieverVersion (DocumentRetrieverVersion) – The latest version of the vector store.

  • documentRetrieverConfig (DocumentRetrieverConfig) – The config for vector store creation.

__repr__()

Return repr(self).

to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

update(name=None, feature_group_id=None, document_retriever_config=None)

Updates an existing document retriever.

Parameters:
  • name (str) – The name to update the document retriever with.

  • feature_group_id (str) – The ID of the feature group to update the document retriever with.

  • document_retriever_config (DocumentRetrieverConfig) – The configuration, including chunk_size and chunk_overlap_fraction, for document retrieval.

Returns:

The updated document retriever.

Return type:

DocumentRetriever
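The chunk_size and chunk_overlap_fraction settings mentioned above control how documents are cut into chunks. The sketch below illustrates the general idea of overlapping word chunks only. It is not the library's chunking code; the function name chunk_words and the word-based splitting are assumptions made for illustration.

```python
def chunk_words(words, chunk_size, chunk_overlap_fraction):
    """Split a list of words into overlapping chunks (illustrative only).

    Each chunk holds up to `chunk_size` words, and consecutive chunks share
    roughly `chunk_size * chunk_overlap_fraction` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap_fraction < 1:
        raise ValueError("chunk_overlap_fraction must be in [0, 1)")

    # Distance between the starts of two consecutive chunks.
    step = max(1, int(chunk_size * (1 - chunk_overlap_fraction)))

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a chunk has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks
```

A larger overlap fraction yields more chunks for the same document, which increases storage but reduces the chance that a relevant passage is cut at a chunk boundary.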

create_version()

Creates a document retriever version from the latest version of the feature group that the document retriever is associated with.

Parameters:

document_retriever_id (str) – The unique ID of the document retriever to create a version for.

Returns:

The newly created document retriever version.

Return type:

DocumentRetrieverVersion

refresh()

Calls describe and refreshes the current object’s fields

Returns:

The current object

Return type:

DocumentRetriever

describe()

Describe a Document Retriever.

Parameters:

document_retriever_id (str) – A unique string identifier associated with the document retriever.

Returns:

The document retriever object.

Return type:

DocumentRetriever

list_versions(limit=100, start_after_version=None)

Lists all the versions of the document retriever with a given ID.

Parameters:
  • limit (int) – The number of vector store versions to retrieve.

  • start_after_version (str) – An offset parameter to exclude all document retriever versions up to this specified one.

Returns:

All the document retriever versions associated with the document retriever.

Return type:

list[DocumentRetrieverVersion]
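The limit and start_after_version parameters can be combined to page through all versions. The following is a minimal sketch, assuming that retriever is a DocumentRetriever obtained elsewhere and that each returned version exposes its ID as a document_retriever_version attribute (an assumption; check the DocumentRetrieverVersion reference). The helper name iter_all_versions is hypothetical.

```python
def iter_all_versions(retriever, page_size=100):
    """Yield every version of `retriever`, one page at a time.

    Assumes each version object has a `document_retriever_version`
    attribute holding its ID (not confirmed by this page).
    """
    start_after = None
    while True:
        page = retriever.list_versions(
            limit=page_size, start_after_version=start_after
        )
        if not page:
            return
        yield from page
        # A short page means there is nothing left to fetch.
        if len(page) < page_size:
            return
        # Continue after the last version seen so far.
        start_after = page[-1].document_retriever_version
```

Usage would look like `versions = list(iter_all_versions(retriever, page_size=50))`.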

get_document_snippet(document_id, start_word_index=None, end_word_index=None)

Get a snippet from documents in the document retriever.

Parameters:
  • document_id (str) – The ID of the document to retrieve the snippet from.

  • start_word_index (int) – If provided, will start the snippet at the index (of words in the document) specified.

  • end_word_index (int) – If provided, will end the snippet at the index (of words in the document) specified.

Returns:

The document snippet found in the document retriever.

Return type:

DocumentRetrieverLookupResult
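As an illustration of the word-index parameters, the sketch below requests a window of words around a position of interest. It assumes retriever is a DocumentRetriever obtained elsewhere. The helper name snippet_around and the radius parameter are hypothetical, and this page does not state whether indices are zero-based or whether end_word_index is inclusive, so the clamping at 0 is an assumption.

```python
def snippet_around(retriever, document_id, center_word_index, radius=50):
    """Fetch a snippet of roughly `2 * radius` words around a word position."""
    # Assumption: word indices start at 0, so never request a negative start.
    start = max(0, center_word_index - radius)
    end = center_word_index + radius
    return retriever.get_document_snippet(
        document_id,
        start_word_index=start,
        end_word_index=end,
    )
```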

restart()

Restart the document retriever if it is stopped.

Parameters:

document_retriever_id (str) – A unique string identifier associated with the document retriever.

wait_until_ready(timeout=3600)

Waits until the document retriever is ready.

Parameters:

timeout (int, optional) – The maximum time, in seconds, to wait for the call to finish. If the call does not finish within this time, it times out. Defaults to 3600 seconds.

get_status()

Gets the status of the document retriever.

Returns:

A string describing the status of a document retriever (pending, complete, etc.).

Return type:

str
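Conceptually, waiting for readiness amounts to polling get_status() until a terminal status or a timeout is reached. The sketch below shows that pattern in a generic form. It is not the library's implementation of wait_until_ready; the status strings "COMPLETE" and "FAILED", the poll interval, and the helper name wait_for_status are all assumptions.

```python
import time


def wait_for_status(
    retriever,
    done_statuses=("COMPLETE",),   # assumed terminal success status
    failed_statuses=("FAILED",),   # assumed terminal failure status
    timeout=3600,
    poll_interval=5.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll `retriever.get_status()` until it is done, failed, or timed out."""
    deadline = clock() + timeout
    while True:
        status = retriever.get_status()
        if status in done_statuses:
            return status
        if status in failed_statuses:
            raise RuntimeError(f"document retriever ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

The sleep and clock parameters exist only so that the loop can be exercised without real delays; in ordinary use the built-in wait_until_ready(timeout=...) is the simpler choice.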

get_matching_documents(query, filters=None, limit=None, result_columns=None, max_words=None, num_retrieval_margin_words=None, max_words_per_chunk=None)

Looks up the deployed document retriever and returns the documents that match the given query.

Original documents are split into chunks and stored in the document retriever. This lookup function returns the relevant chunks from the document retriever. If the provided settings permit, the returned chunks can be expanded to include more words from the original documents and merged when they overlap. The returned chunks are sorted by relevance.

Parameters:
  • query (str) – The query to search for.

  • filters (dict) – A dictionary mapping column names to a list of values to restrict the retrieved search results.

  • limit (int) – If provided, will limit the number of results to the value specified.

  • result_columns (list) – If provided, will limit the column properties present in each result to those specified in this list.

  • max_words (int) – If provided, will limit the total number of words in the results to the value specified.

  • num_retrieval_margin_words (int) – If provided, will add this number of words from left and right of the returned chunks.

  • max_words_per_chunk (int) – If provided, will limit the number of words in each chunk to the value specified. If the value provided is smaller than the actual chunk size on disk, which is determined during document retriever creation, the actual chunk size will be used. That is, chunks looked up from document retrievers will not be split into smaller chunks during lookup due to this setting.

Returns:

The relevant document results found in the document retriever.

Return type:

list[DocumentRetrieverLookupResult]
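The expansion and merging of chunks described above can be pictured as an operation on word ranges. The sketch below is an illustration only, not the service's algorithm: it represents each chunk as a hypothetical (start, end) word range with an exclusive end, widens every range by a margin (as num_retrieval_margin_words would), and merges ranges that overlap. It orders ranges by position and ignores the relevance ordering of the real results.

```python
def expand_and_merge(chunks, margin, doc_length):
    """Widen word ranges by `margin` on each side and merge overlaps.

    `chunks` is a list of (start, end) word ranges with `end` exclusive.
    Returns the merged ranges sorted by position in the document.
    """
    # Expand each range, clamping to the document boundaries.
    expanded = sorted(
        (max(0, start - margin), min(doc_length, end + margin))
        for start, end in chunks
    )

    merged = []
    for start, end in expanded:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, with a margin of 5 words, the ranges (10, 20) and (25, 35) grow to (5, 25) and (20, 40), which overlap and are therefore merged into a single range (5, 40).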