Metadata-Version: 2.4
Name: aifiles
Version: 0.1.0
Summary: A Python library for downloading sample files in various formats for AI experimentation
Author-email: Prajyot Birajdar <work.prajyotbirajadar@gmail.com>
Maintainer-email: Prajyot Birajdar <work.prajyotbirajadar@gmail.com>
License: MIT
Project-URL: Homepage, https://prajyotb.netlify.app/
Project-URL: Documentation, https://github.com/itsbilyatt/aifiles#readme
Project-URL: Repository, https://github.com/itsbilyatt/aifiles
Project-URL: Issues, https://github.com/itsbilyatt/aifiles/issues
Project-URL: Changelog, https://github.com/itsbilyatt/aifiles/blob/main/CHANGELOG.md
Keywords: ai,sample-files,generative-ai,rag,multimodal
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: faker>=8.0.0
Requires-Dist: Pillow>=8.0.0
Requires-Dist: fpdf2>=2.0.0
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: click>=8.0.0
Requires-Dist: rich>=10.0.0
Requires-Dist: PyYAML>=6.0
Provides-Extra: audio
Requires-Dist: scipy>=1.7.0; extra == "audio"
Provides-Extra: ml
Requires-Dist: numpy>=1.21.0; extra == "ml"
Requires-Dist: pandas>=1.3.0; extra == "ml"
Provides-Extra: all
Requires-Dist: aifiles[audio,ml]; extra == "all"
Dynamic: license-file

# Aifiles

[![PyPI version](https://badge.fury.io/py/aifiles.svg)](https://pypi.org/project/aifiles/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A Python library for downloading up to 10 sample files per call in any of 40+ supported formats, using real files where a public source exists and synthetically generated ones otherwise. It is aimed at generative AI, agentic AI, RAG pipelines, multimodal model testing, document parsing, prompt engineering, and general AI/ML experimentation.

## ✨ Features

- **40+ file formats** supported across documents, data, images, audio, video, code, and more
- **Smart sourcing**: Downloads real files from public repositories and falls back to synthetic generation
- **AI-optimized**: Formats supported by LangChain, LlamaIndex, OpenAI Vision, and other AI tools
- **Developer-friendly**: Simple API, rich CLI, comprehensive error handling
- **Secure**: MIME validation, path sanitization, no hardcoded credentials

## 🚀 Installation

```bash
pip install aifiles

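# For audio generation (optional); quotes keep shells like zsh from expanding the brackets
pip install "aifiles[audio]"

# For the ml extra (numpy, pandas)
pip install "aifiles[ml]"

# For all optional dependencies
pip install "aifiles[all]"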
```
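
The optional extras map onto third-party packages listed in the package metadata: `audio` requires scipy, and `ml` requires numpy and pandas. The following is a minimal sketch for checking which extras are usable in the current environment. The `EXTRAS` mapping and the `missing_extras` helper are illustrative names, not part of aifiles.

```python
from importlib.util import find_spec

# Extras and the top-level modules they pull in, taken from the package metadata.
EXTRAS = {
    "audio": ["scipy"],
    "ml": ["numpy", "pandas"],
}


def missing_extras(extras=EXTRAS):
    """Return {extra_name: [modules that cannot be found]} for incomplete extras."""
    report = {}
    for name, modules in extras.items():
        # find_spec returns None when a top-level module is not installed.
        missing = [m for m in modules if find_spec(m) is None]
        if missing:
            report[name] = missing
    return report


if __name__ == "__main__":
    for extra, modules in missing_extras().items():
        joined = ", ".join(modules)
        print(f'{extra}: missing {joined}; try: pip install "aifiles[{extra}]"')
```

An empty result means every optional module was found, so no extra needs to be installed.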

## 📖 Quick Start

### Python API

```python
from aifiles import get_files, list_formats, info, preview

# Get 5 PDF samples for RAG pipeline testing
files = get_files("pdf", count=5, output_dir="./rag_data")
print(files)
# Absolute paths, e.g. ['/abs/path/to/rag_data/sample_1.pdf', '/abs/path/to/rag_data/sample_2.pdf', ...]

# Get 3 CSV files with sales data variant
files = get_files("csv", count=3, variant="sales")
# Saved to the default ./samples directory, e.g. ['/abs/path/to/samples/sales_1.csv', ...]

# Get 10 WAV files for speech model testing
files = get_files("wav", count=10, output_dir="./audio_test")

# List all supported formats
formats = list_formats()
print(formats["documents"])  # ['pdf', 'docx', 'txt', 'md', ...]

# Get format information
meta = info("json")
print(meta)
# {
#   "mime_type": "application/json",
#   "category": "structured",
#   "use_cases": ["agent logs", "API responses", "chat history"],
#   "supported_by": ["OpenAI", "LangChain", "LlamaIndex", ...]
# }

# Preview a downloaded file
preview("./samples/sales_1.csv")
```
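
Because `get_files` returns a list of file paths, the result can be sanity-checked before it is handed to a loader. The sketch below uses only the standard library. `check_samples` is a hypothetical helper, not part of aifiles, and the temporary file stands in for a real download.

```python
import tempfile
from pathlib import Path


def check_samples(paths, expected_ext):
    """Split paths into (usable, rejected) by existence, size, and suffix."""
    usable, rejected = [], []
    suffix = "." + expected_ext.lower().lstrip(".")
    for raw in paths:
        path = Path(raw)
        ok = (
            path.is_file()
            and path.stat().st_size > 0
            and path.suffix.lower() == suffix
        )
        (usable if ok else rejected).append(str(path))
    return usable, rejected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "sample_1.csv"
        good.write_text("id,amount\n1,9.99\n")
        missing = Path(tmp) / "sample_2.csv"  # never created, so it is rejected
        usable, rejected = check_samples([good, missing], "csv")
        print("usable:", usable)
        print("rejected:", rejected)
```

In real use, the first argument would be the list returned by `get_files`.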

### Command Line

```bash
# Get 3 PDF sample files
aifiles get pdf --count 3

# Get 5 CSV files with a "sales data" variant hint, saved to ./my_samples
aifiles get csv --count 5 --variant "sales data" --output ./my_samples

# List all supported formats
aifiles list-formats

# Show format info
aifiles info json

# Preview a file
aifiles preview ./samples/sample1.csv
```
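
For scripted runs such as CI jobs, the same commands can be assembled as an argument list and passed to `subprocess.run`. The following is a minimal sketch that uses only the `get` subcommand and the flags shown above. `build_get_command` is an illustrative helper, not part of aifiles.

```python
import shlex


def build_get_command(fmt, count=1, variant=None, output=None):
    """Build the argv list for `aifiles get` using the flags shown above."""
    if not 1 <= count <= 10:
        raise ValueError("count must be between 1 and 10")
    argv = ["aifiles", "get", fmt, "--count", str(count)]
    if variant is not None:
        argv += ["--variant", variant]
    if output is not None:
        argv += ["--output", str(output)]
    return argv


if __name__ == "__main__":
    argv = build_get_command("csv", count=5, variant="sales data", output="./my_samples")
    # shlex.join quotes arguments with spaces so the line can be pasted into a shell.
    print(shlex.join(argv))
    # To execute it with aifiles installed: subprocess.run(argv, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when a variant hint contains spaces.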

## 📋 Supported Formats

### 📄 Documents
- **PDF** - Multi-page documents, invoices, research papers
- **DOCX** - Word documents, resumes, letters
- **TXT** - Plain text, prompts, logs, poetry
- **MD** - Markdown docs, README files, notes
- **RTF** - Rich text with formatting
- **ODT** - OpenDocument text files

### 📊 Structured / Data
- **CSV** - Tabular datasets, sales data, sensor readings
- **TSV** - Tab-separated data
- **JSON** - API responses, agent logs, chat histories
- **YAML** - Configurations, workflows, agent definitions
- **XML** - Structured markup, RSS feeds, SOAP data
- **XLSX** - Excel spreadsheets with charts
- **PARQUET** - Columnar data for ML pipelines
- **SQLITE** - Embedded database for agent memory

### 🖼️ Images
- **PNG** - Charts, diagrams, screenshots
- **JPG** - Photographs, real-world scenes
- **WEBP** - Modern compressed images
- **TIFF** - High-quality scanning/OCR
- **GIF** - Animated images for UI testing
- **SVG** - Vector graphics, logos, icons

### 🎵 Audio
- **WAV** - Raw speech audio for STT models
- **MP3** - Compressed speech or music
- **FLAC** - Lossless audio for high-fidelity testing
- **OGG** - Open-source compressed audio

### 🎥 Video
- **MP4** - General-purpose video for multimodal models
- **MOV** - Apple QuickTime video
- **AVI** - Legacy video format
- **MKV** - High-quality video container
- **WEBM** - Web-friendly open video format

### 💻 Code & Notebooks
- **PY** - Python scripts
- **JS** - JavaScript/Node.js scripts
- **TS** - TypeScript files
- **IPYNB** - Jupyter Notebooks with AI/ML examples
- **HTML** - Web pages for scraping/parsing
- **CSS** - Stylesheets
- **SQL** - Database queries, DDL scripts

### 📧 Email & Communication
- **EML** - Email messages with attachments
- **MSG** - Outlook email format
- **ICS** - Calendar events (iCalendar)
- **VCF** - Contact cards

### 🗜️ Archives
- **ZIP** - Compressed archive with mixed files
- **TAR** - Unix archive
- **GZ** - Gzipped content

### 🔬 Scientific / ML
- **HDF5** - Hierarchical scientific data
- **ARROW** - Apache Arrow columnar data
- **FEATHER** - Fast columnar data storage
- **NPY** - NumPy array binary format
- **PKL** - Python pickle (serialized objects)

### 🏗️ 3D / Spatial
- **OBJ** - 3D object mesh
- **STL** - 3D printing / mesh format
- **GLTF** - 3D scene for multimodal/spatial AI

### 🔐 Config / Infra
- **ENV** - Environment config files
- **TOML** - Project config (like pyproject.toml)
- **INI** - Legacy configuration files
- **DOCKERFILE** - Docker build files

## 🛠️ API Reference

### `get_files(format, count=1, output_dir="./samples", variant=None)`

Download sample files in the specified format.

**Parameters:**
- `format` (str): File format/extension (e.g., "pdf", "csv", "png")
- `count` (int): Number of files to fetch (1-10)
- `output_dir` (str): Directory to save files
- `variant` (str, optional): Content variant hint

**Returns:** List of absolute file paths

**Raises:**
- `InvalidCountError`: `count` is outside the range 1-10
- `FormatNotSupportedError`: The requested format is not supported
- `FormatNotAvailableError`: The files could not be fetched or generated
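
Since `count` is limited to 1-10 per call, a larger request has to be split across several calls. The following is a minimal sketch of that planning step. `plan_batches` is a hypothetical helper, not part of aifiles. Whether repeated calls return distinct files is not documented here, so the commented call gives each batch its own output directory as a precaution.

```python
def plan_batches(total, limit=10):
    """Split `total` files into per-call counts that each fit within 1..limit."""
    if total < 1:
        raise ValueError("total must be at least 1")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    full, rest = divmod(total, limit)
    return [limit] * full + ([rest] if rest else [])


if __name__ == "__main__":
    for i, n in enumerate(plan_batches(23), start=1):
        # With aifiles installed, each entry would map onto one call, e.g.
        # get_files("pdf", count=n, output_dir=f"./samples/batch_{i}")
        print(f"batch {i}: count={n}")
```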

### `list_formats()`

**Returns:** Dictionary of categorized formats
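
The Quick Start example indexes the result as `formats["documents"]`, which suggests a mapping from category names to lists of format names. The following is a minimal sketch of a reverse lookup over that shape. `find_category` is an illustrative helper, and the `sample` dictionary is abbreviated stand-in data, not the library's actual output.

```python
def find_category(formats, fmt):
    """Return the category whose list contains `fmt`, or None if absent."""
    wanted = fmt.lower().lstrip(".")
    for category, names in formats.items():
        if wanted in (name.lower() for name in names):
            return category
    return None


if __name__ == "__main__":
    # Abbreviated stand-in for the dictionary returned by list_formats().
    sample = {
        "documents": ["pdf", "docx", "txt", "md"],
        "images": ["png", "jpg"],
    }
    print(find_category(sample, ".PDF"))
    print(find_category(sample, "exe"))
```

A lookup like this can reject an unsupported format before any download is attempted.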

### `info(format)`

**Returns:** Dictionary with format metadata or None
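
Because `info` may return `None`, callers should guard before indexing into the result. The following is a minimal sketch that assumes the metadata keys shown in the Quick Start output, in particular `supported_by`. `supports` is a hypothetical helper, not part of aifiles.

```python
def supports(meta, tool):
    """True if `tool` is listed in meta["supported_by"]; False for None or a missing key."""
    if not meta:
        return False
    tools = meta.get("supported_by", [])
    return tool.lower() in (t.lower() for t in tools)


if __name__ == "__main__":
    # Stand-in for info("json"), abbreviated from the Quick Start example.
    meta = {
        "mime_type": "application/json",
        "category": "structured",
        "supported_by": ["OpenAI", "LangChain", "LlamaIndex"],
    }
    print(supports(meta, "langchain"))
    print(supports(None, "langchain"))
```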

### `preview(filepath)`

Prints a preview of the file, or its metadata, to the console.

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Public sample file repositories for real file sources
- Open source libraries: requests, faker, Pillow, fpdf2, etc.
- AI community for inspiration and use cases

## 👤 Author

- **Prajyot Birajdar**
- **Email:** work.prajyotbirajadar@gmail.com
- **GitHub:** https://github.com/itsbilyatt
- **LinkedIn:** https://www.linkedin.com/in/prajyot-birajdar-1b09a1173
- **Portfolio:** https://prajyotb.netlify.app/

---

**Made with ❤️ for the AI developer community**
