Metadata-Version: 2.3
Name: apophenia
Version: 0.1.4
Summary: Extract and structure all the data from a Git repository to make them usable in RAG.
Author-email: Hervé Beraud <herveberaud.pro@gmail.com>
License: MIT
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Requires-Dist: faiss-cpu
Requires-Dist: gitpython
Requires-Dist: numpy
Requires-Dist: sentence-transformers
Provides-Extra: dev
Requires-Dist: black; extra == 'dev'
Requires-Dist: build; extra == 'dev'
Requires-Dist: commitizen; extra == 'dev'
Requires-Dist: isort; extra == 'dev'
Requires-Dist: pip-tools; extra == 'dev'
Requires-Dist: pre-commit; extra == 'dev'
Requires-Dist: twine; extra == 'dev'
Description-Content-Type: text/markdown

# Apophenia

![Build](https://github.com/4383/apophenia/actions/workflows/main.yml/badge.svg)
![PyPI](https://img.shields.io/pypi/v/apophenia.svg)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/apophenia.svg)
![PyPI - Status](https://img.shields.io/pypi/status/apophenia.svg)
[![Downloads](https://pepy.tech/badge/apophenia)](https://pepy.tech/project/apophenia)
[![Downloads](https://pepy.tech/badge/apophenia/month)](https://pepy.tech/project/apophenia/month)

Apophenia gives meaning to any existing Git repository.

Apophenia extracts and structures all the data from a Git repository to make
them usable in RAG or with AI agents.

Apophenia imposes a meaningful interpretation on a nebulous stimulus (a Git
repo).

## Install

```bash
$ pip install apophenia
```

## Usage

Extract data from a given repository:

```bash
$ apophenia https://github.com/4383/niet \
  --faiss_path /tmp/results.faiss \
  --metadata_path /tmp/results.json
```
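
Before wiring the output into a RAG pipeline, it can help to check what was
extracted. The sketch below summarizes a metadata file using only the standard
library. It assumes the JSON file holds a list of entries that each carry a
`type` key, as the RAG snippet in this README suggests; the helper name
`summarize_metadata` is illustrative and not part of Apophenia.

```python
import json
from collections import Counter


def summarize_metadata(metadata_path):
    """Count metadata entries per `type` (assumed key, may differ in practice)."""
    with open(metadata_path, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    # Entries without a "type" key are counted as "unknown"
    return Counter(entry.get("type", "unknown") for entry in metadata)
```

Calling `summarize_metadata("/tmp/results.json").most_common()` would then list
the entry types from most to least frequent.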

Then use the generated data in a RAG pipeline (Python example):

```python
import faiss
import json
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FAISS index and the JSON metadata previously generated
def load_index_and_metadata(faiss_path, metadata_path):
    index = faiss.read_index(faiss_path)
    with open(metadata_path, 'r', encoding='utf-8') as f:
        metadata = json.load(f)
    return index, metadata

# Embed the user query
def embed_query(query, model):
    return model.encode(query, convert_to_tensor=True).cpu().numpy()

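# Search in the FAISS index
def search_in_faiss(index, query_embedding, metadata, k=5):
    distances, indices = index.search(np.array([query_embedding], dtype="float32"), k)
    results = []
    for distance, idx in zip(distances[0], indices[0]):
        if idx < 0:
            # FAISS pads with -1 when the index holds fewer than k vectors
            continue
        # Copy the entry so the loaded metadata is not mutated
        result = dict(metadata[idx])
        result['distance'] = float(distance)
        results.append(result)
    return results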

# Build a prompt for a generative model
def build_prompt(query, retrieved_info):
    prompt = "Answer the following question based on the retrieved information:\n\n"
    prompt += f"Question: {query}\n\n"
    prompt += "Retrieved Information:\n"
    for info in retrieved_info:
        content_type = info.get("type", "unknown")
        content_preview = info.get("content_preview", "No preview available")
        prompt += f"- {content_type.upper()}: {content_preview}\n"
    prompt += "\nYour Answer:"
    return prompt

# Generate a response with a generative model
def generate_response(prompt, model_name="EleutherAI/gpt-neo-125M", max_new_tokens=200):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_new_tokens bounds only the generated part, so a long prompt
    # does not use up the whole generation budget
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)


def run_rag_system(query, faiss_path, metadata_path, embedding_model_name, generative_model_name):
    # Load the FAISS index, the metadata, and the embedding model
    index, metadata = load_index_and_metadata(faiss_path, metadata_path)
    embedding_model = SentenceTransformer(embedding_model_name)

    query_embedding = embed_query(query, embedding_model)

    # Search in FAISS
    retrieved_info = search_in_faiss(index, query_embedding, metadata)

    prompt = build_prompt(query, retrieved_info)

    response = generate_response(prompt, model_name=generative_model_name)

    return response, retrieved_info

if __name__ == "__main__":
    # Configuration
    # Same paths as in the apophenia command above
    FAISS_PATH = "/tmp/results.faiss"
    METADATA_PATH = "/tmp/results.json"
    EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
    GENERATIVE_MODEL_NAME = "EleutherAI/gpt-neo-125M"

    query = "How does the authentication system work in this repository?"

    response, retrieved_info = run_rag_system(
        query=query,
        faiss_path=FAISS_PATH,
        metadata_path=METADATA_PATH,
        embedding_model_name=EMBEDDING_MODEL_NAME,
        generative_model_name=GENERATIVE_MODEL_NAME
    )

    print("Generated Response:")
    print(response)
    print("\nRetrieved Information:")
    for info in retrieved_info:
        print(info)
```
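
The retrieved entries can be narrowed before they reach the prompt, for example
to answer a history question from commits only. This is a minimal sketch,
assuming each result is a dict with the `type` and `distance` keys used in the
snippet above and that a smaller distance means a closer match (true for an L2
index). The type value `"commit"` and the helper name `filter_results` are
assumptions, not documented Apophenia output.

```python
def filter_results(results, wanted_types=None, max_distance=None):
    """Keep results of the wanted types that are close enough to the query."""
    kept = []
    for result in results:
        if wanted_types is not None and result.get("type") not in wanted_types:
            continue
        if max_distance is not None and result.get("distance", 0.0) > max_distance:
            continue
        kept.append(result)
    # Assumes an L2-style distance: smaller is closer, so sort ascending
    return sorted(kept, key=lambda r: r.get("distance", 0.0))


# Hypothetical results, shaped like the output of search_in_faiss
sample_results = [
    {"type": "commit", "content_preview": "Fix login bug", "distance": 0.9},
    {"type": "file", "content_preview": "auth.py", "distance": 0.2},
    {"type": "commit", "content_preview": "Add token check", "distance": 0.4},
]
commits_only = filter_results(sample_results, wanted_types={"commit"}, max_distance=0.5)
```

The filtered list can be passed to `build_prompt` in place of the raw results.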

For more details:

```bash
$ apophenia -h
```

## Applications

Here is a **list of potential applications** for Apophenia. Its generated
results (FAISS vectors and JSON metadata) can be used within a **RAG
(Retrieval-Augmented Generation)** system:

- Generate enriched answers by combining documentation, commit messages,
  and code. Example questions include: *"How do I use the `authenticate_user`
  function?"* or *"What is the structure of this project?"*
- Quickly search for specific parts of the code or documentation. Identify
  relevant functions or files based on queries like *"Where is the
  authentication logic implemented?"* or *"Which module handles network
  connections?"*
- Retrieve historical information to understand bugs or errors. Analyze recent
  changes with queries like *"What are the latest modifications in this
  file?"* or *"Which commits mention this bug?"*
- Automatically generate a changelog based on commit messages and diffs for a
  new release.
- Identify outdated dependencies or technologies and plan migrations. For
  instance, answer queries like *"Which files are using Eventlet?"* or
  *"Which commits introduced asyncio?"*
- Search for changes related to vulnerabilities or critical dependencies.
  Example questions include *"Which files use OpenSSL?"* or *"Which commits
  fixed vulnerabilities?"*
- Generate technical guides or manuals from existing code and documentation
  fragments. For example, create an installation guide from README files and
  configuration scripts.
- Understand individual contributions or file evolution by asking questions
  like *"Who wrote this function?"* or *"What are John Doe's contributions?"*
- Search for specific concepts within the project, such as *"Where is the
  caching logic handled?"* or *"Which files mention secure connections?"*
- Simplify onboarding for new developers by providing guided answers like
  *"The main features of this project are documented in `README.md`."* or
  *"`auth.py` handles authentication logic."*
- Identify which files or functions are impacted by a specific commit with
  questions like *"Which files were modified by this commit?"* or *"Which
  tests are affected by this change?"*
- Extract code examples from existing fragments in files or commits. For
  instance, generate a snippet to illustrate how to use a specific function or
  module.
- Quickly find useful information to solve a technical issue, such as
  *"Which file is responsible for this exception?"* or *"Which commit
  introduced this error?"*
- Identify the libraries used and their versions. Example questions include
  *"Which version of Django is being used?"* or *"Which commits mention
  outdated dependencies?"*
- Search for changes related to performance optimization with questions like
  *"Which commits optimized this file?"* or *"Which functions were refactored
  for better performance?"*
- Identify team members who are most active in certain areas of the project by
  asking *"Who contributes the most to the networking module?"* or *"What are
  the primary files in this project?"*
- Create customized reports on the state or evolution of a project. For
  example, generate a report on the 10 most significant recent commits or list
  the main modules and the most modified files.
- Integrate extracted data into CI/CD pipelines. For instance, identify
  critical files for a specific build task.
- Compare versions of files or branches using diffs and commits.
- Identify areas of the code that need documentation or refactoring by asking
  *"Which files lack associated documentation?"* or *"Which commits mention
  suboptimal code?"*
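
As an illustration of the changelog use case listed above, here is a rough
sketch that turns commit-like metadata entries into a Markdown changelog. It
rests on several assumptions: entries are dicts with a `type` of `"commit"` and
the message in `content_preview`, and messages may start with a simple prefix
such as `fix:` or `feat:`. The function name `render_changelog` is hypothetical.

```python
from collections import defaultdict


def render_changelog(entries, version):
    """Render a Markdown changelog from commit-like metadata entries."""
    sections = defaultdict(list)
    for entry in entries:
        if entry.get("type") != "commit":  # assumed type label
            continue
        message = entry.get("content_preview", "").strip()
        if not message:
            continue
        first_line = message.splitlines()[0]
        # Group by a plain "prefix:" (e.g. "fix:") when the message has one
        prefix, sep, rest = first_line.partition(":")
        if sep and prefix.isalpha():
            sections[prefix.lower()].append(rest.strip())
        else:
            sections["other"].append(first_line)
    lines = [f"## {version}", ""]
    for section in sorted(sections):
        lines.append(f"### {section}")
        lines.extend(f"- {item}" for item in sections[section])
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Scoped prefixes such as `feat(auth):` are not parsed here and end up in the
`other` section; a real changelog tool would handle them.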

If you recognize yourself in one of these examples, then Apophenia is for you:

```bash
$ pip install apophenia
```

## Going Further with FAISS

You can use the generated FAISS index with [LangChain](
https://python.langchain.com/docs/integrations/vectorstores/faiss/)
or with any other modern library such as [LlamaIndex](
https://docs.llamaindex.ai/en/stable/api_reference/storage/vector_store/faiss/).

- https://github.com/facebookresearch/faiss
- https://pypi.org/project/faiss/
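
To interpret the `distance` values returned by a search, it helps to know what
a flat L2 index computes: the squared Euclidean distance between the query and
every stored vector, keeping the `k` smallest. The pure-Python sketch below
reproduces that idea without FAISS. Whether Apophenia builds this exact index
type is an assumption, and `l2_search` is an illustrative helper, not a FAISS
or Apophenia function.

```python
def l2_search(vectors, query, k=5):
    """Exact nearest-neighbour search using squared Euclidean distance."""
    scored = []
    for idx, vector in enumerate(vectors):
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((distance, idx))
    # Smallest distance first, then keep the k best matches
    scored.sort()
    top = scored[:k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices


# Three toy 2-dimensional vectors and a query at the origin
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_distances, toy_indices = l2_search(toy_vectors, [0.0, 0.0], k=2)
```

FAISS performs the same kind of search far more efficiently and also offers
approximate indexes for large collections.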

## What does apophenia stand for?

Apophenia (/æpoʊˈfiːniə/) is the tendency to perceive meaningful connections
between unrelated things.

Apophenia has also come to describe a human propensity to unreasonably seek
definite patterns in random information, such as can occur in gambling.

https://en.wikipedia.org/wiki/Apophenia
