Metadata-Version: 2.4
Name: papertuner
Version: 0.1.3
Summary: A package for creating ML research assistant models through paper dataset creation and model fine-tuning
Author-email: Your Name <your.email@example.com>
Project-URL: Homepage, https://github.com/yourusername/papertuner
Project-URL: Bug Tracker, https://github.com/yourusername/papertuner/issues
Project-URL: Documentation, https://github.com/yourusername/papertuner#readme
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: huggingface_hub==0.29.3
Requires-Dist: tenacity==9.0.0
Requires-Dist: PyMuPDF>=1.22.0
Requires-Dist: arxiv>=1.4.0
Requires-Dist: google-genai>=1.7.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: requests>=2.32.3
Requires-Dist: datasets==3.4.1
Requires-Dist: sentence-transformers==3.4.1
Requires-Dist: trl>=0.15.2
Requires-Dist: vllm>=0.8.1
Requires-Dist: torch>=2.6.0
Requires-Dist: unsloth[cu124-torch260]

# PaperTuner

PaperTuner is a Python package for building research assistant models. It processes academic papers into a question-answer dataset, then fine-tunes a language model on that dataset so the model can give guidance on research methodology and approaches.

## Features

- Automated retrieval of research papers from arXiv
- Section extraction to identify problem statements, methodologies, and results
- Generation of high-quality question-answer pairs for research methodology
- Fine-tuning of language models with GRPO (Group Relative Policy Optimization)
- Integration with Hugging Face for dataset and model sharing
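
The "group relative" part of GRPO can be illustrated with a minimal sketch: several completions are sampled for the same prompt, each is scored by a reward function, and each score is normalized against the group's mean and spread. The function name `group_relative_advantages` and the `eps` value are assumptions made for illustration; this is not PaperTuner's or TRL's actual implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper for illustration only; `eps` guards against a zero
    standard deviation when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions that score above the group average get a positive advantage and are reinforced; those below the average get a negative one.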

## Installation

```bash
pip install papertuner
```

## Basic Usage

### As a Command-Line Tool

#### 1. Create a dataset from research papers

```bash
# Set up your environment variables
export GEMINI_API_KEY="your-api-key"
export HF_TOKEN="your-huggingface-token"  # Optional, for uploading to HF

# Run the dataset creation
papertuner-dataset --max-papers 100
```

#### 2. Train a model

```bash
# Train on the dataset you created or on an existing one
papertuner-train --model "Qwen/Qwen2.5-3B-Instruct" --dataset "densud2/ml_qa_dataset"
```

### As a Python Library

Here's a complete example of creating a specialized biology research model:

```python
from papertuner import ResearchPaperProcessor, ResearchAssistantTrainer

# 1. Create a dataset from biology papers
processor = ResearchPaperProcessor(
    api_key="your-gemini-api-key",
    hf_repo_id="your-username/bio-research-qa"
)

# Use a biology-focused search query
bio_query = " OR ".join([
    "molecular biology",
    "cell biology",
    "genetics",
    "biochemistry",
    "systems biology",
    "synthetic biology",
    "bioinformatics",
    "genomics",
    "proteomics",
    "metabolomics"
])

# Process papers and create dataset
papers = processor.process_papers(
    max_papers=100,
    search_query=bio_query,
    clear_processed_data=True  # Start fresh
)

# 2. Train a specialized model
trainer = ResearchAssistantTrainer(
    model_name="Qwen/Qwen2.5-3B-Instruct",  # Base model
    lora_rank=64,
    output_dir="./bio_model",
    system_prompt="""You are a biology research assistant. Follow this format:
<think>
Analyze the biological research question step-by-step, considering:
- Relevant biological mechanisms
- Experimental approaches
- Key methodological considerations
- Potential limitations
</think>

Provide a clear, scientifically-grounded answer that explains both the 'how' and 'why'
of the biological approach or method."""
)

# Train the model
results = trainer.train("your-username/bio-research-qa")

# 3. Test the model with biology questions
questions = [
    "How would you design a CRISPR experiment to study gene function in mammalian cells?",
    "What approaches can be used to study protein-protein interactions in vivo?",
    "How would you analyze single-cell RNA sequencing data to identify cell types?"
]

for question in questions:
    response = trainer.run_inference(
        results["model"],
        results["tokenizer"],
        question,
        results["lora_path"]
    )
    print(f"\nQ: {question}")
    print(f"A: {response}\n")
```
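
The system prompt above asks the model to reason inside `<think>` tags before answering. If you want to show only the final answer, the two parts can be separated with a small helper. This is a sketch that assumes `run_inference` returns the raw generated text as a string; `split_response` is a hypothetical function and not part of the PaperTuner API.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(text):
    """Return (reasoning, answer) from a response in the <think> format.

    If no <think> block is found, the whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_response(
    "<think>\nCompare knockout and knockdown.\n</think>\n\nUse a knockout line."
)
```

Falling back to the full text when the tags are missing keeps the helper usable with models that do not follow the format on every response.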

## Configuration

Configure the tool through environment variables or by passing values when initializing the classes:

- `GEMINI_API_KEY`: Gemini API key used to generate QA pairs
- `HF_TOKEN`: Hugging Face token for uploading datasets and models (optional)
- `HF_REPO_ID`: Hugging Face repository ID for the dataset
- `PAPERTUNER_DATA_DIR`: Custom directory for storing data (default: `~/.papertuner/data`)
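
As a rough sketch, the variables listed above could be collected in your own script as follows. The helper `load_settings` is hypothetical; how PaperTuner reads these variables internally is not shown here, and only the variable names and the default data directory come from the list above.

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Collect PaperTuner-related settings from a mapping (default: os.environ).

    Hypothetical helper for illustration; missing variables come back as None,
    except the data directory, which falls back to the documented default.
    """
    env = os.environ if env is None else env
    data_dir = env.get("PAPERTUNER_DATA_DIR") or "~/.papertuner/data"
    return {
        "api_key": env.get("GEMINI_API_KEY"),
        "hf_token": env.get("HF_TOKEN"),
        "hf_repo_id": env.get("HF_REPO_ID"),
        "data_dir": Path(data_dir).expanduser(),
    }


settings = load_settings()
```

The `api_key` and `hf_repo_id` values can then be passed to `ResearchPaperProcessor`, as in the library example above.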

## License

MIT License

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
