Metadata-Version: 2.4
Name: chATLAS_Scrape
Version: 0.0.2
Summary: A Python package for scraping input data for chATLAS.
Author: chATLAS Team
License: Apache-2.0
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.32.3
Requires-Dist: pandas>=2.0.0
Requires-Dist: matplotlib>=3.10.3
Requires-Dist: beautifulsoup4~=4.12.2
Requires-Dist: tqdm~=4.66.5
Dynamic: license-file

# chATLAS Scrape

A Python package for scraping and processing data for the chATLAS project. This package provides comprehensive tools for extracting markdown documentation from GitLab repositories and processing it for use in RAG (Retrieval-Augmented Generation) systems.

## Overview

chATLAS_Scrape is part of the chATLAS package ecosystem, designed specifically for the documentation and knowledge management needs of the ATLAS collaboration. It handles the data ingestion pipeline, extracting and processing markdown content from CERN GitLab repositories.

The GitLab scraping and preprocessing workflows are regularly executed as scheduled GitLab pipelines, ensuring that the chATLAS knowledge base stays up-to-date with the latest documentation changes across ATLAS repositories.

## Features

### GitLab Scraping
- **Multi-stage scraping pipeline**: Three-stage process for comprehensive project discovery and content extraction
- **Smart filtering**: Automatically excludes archived projects, forks, personal repositories, and irrelevant projects
- **Rate limiting**: Built-in rate limiting to respect GitLab API limits
- **Configurable**: Customizable filtering criteria, timeouts, and project selection rules
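
A client-side rate limiter along the lines described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation; the `RateLimiter` class and its parameters are hypothetical names:

```python
import time


class RateLimiter:
    """Illustrative sliding-window limiter: at most max_calls calls per period seconds."""

    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.calls: list[float] = []  # timestamps of recent calls

    def wait(self) -> None:
        """Block until a call is permitted, then record it."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the sliding window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            time.sleep(max(self.period - (now - self.calls[0]), 0))
        self.calls.append(time.monotonic())
```

Calling `limiter.wait()` before each API request keeps the request rate under the configured ceiling without tracking server-side quota state.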

### Markdown Processing
- **Intelligent chunking**: Advanced text splitting that preserves markdown structure
- **Content preprocessing**: Handles GitLab-specific markdown features (admonitions, content tabs)
- **Structural preservation**: Maintains document hierarchy and formatting during processing
- **Filtering**: Content-based filtering to exclude non-relevant files
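
Heading-aware chunking of the kind described above can be sketched like this. The function below is an illustrative stand-in, not the package's real splitter; it shows the core idea of splitting at heading boundaries so each chunk keeps its section context:

```python
import re


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Illustrative chunker: split at headings, falling back to paragraphs if needed."""
    # Lookahead split keeps each heading attached to the section that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        if not section.strip():
            continue
        if len(section) <= max_chars:
            chunks.append(section.strip())
        else:
            # Oversized section: fall back to paragraph-level splitting.
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks
```

A real splitter would also respect fenced code blocks and tables; this sketch only demonstrates the structural-preservation idea.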


## Installation

Install the package using `uv` (recommended):

```bash
cd chATLAS_Scrape
uv sync
```

## Configuration

### Environment Variables

Set up your GitLab Personal Access Token:

```bash
export GITLAB_PAT="your_gitlab_personal_access_token"
```
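
In Python code, the token is typically read from the environment and passed to GitLab's REST API via its `PRIVATE-TOKEN` header. The helper below is an illustrative sketch (the function name is an assumption, not part of this package's API):

```python
import os


def gitlab_headers() -> dict[str, str]:
    """Build auth headers from GITLAB_PAT; PRIVATE-TOKEN is GitLab's REST auth header."""
    token = os.environ.get("GITLAB_PAT")
    if token is None:
        # Fail fast with a clear message rather than getting 401s mid-scrape.
        raise RuntimeError("GITLAB_PAT is not set")
    return {"PRIVATE-TOKEN": token}
```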

### Configuration Options

The scraping behavior can be customized in `chATLAS_Scrape/gitlab/config.py`:

- **Base URL**: Defaults to `https://gitlab.cern.ch`
- **Project filtering**: Exclude projects containing specific keywords
- **Minimum file requirements**: Set minimum number of markdown files per project
- **API timeouts**: Configure REST (30s) and GraphQL (60s) request timeouts
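
The options above could be modeled roughly as the following dataclass. This is only a sketch of the settings' shape; the actual names and defaults in `chATLAS_Scrape/gitlab/config.py` may differ (only the base URL and timeout values are stated in this README):

```python
from dataclasses import dataclass, field


@dataclass
class ScrapeConfig:
    """Illustrative shape of the scraper settings (field names are assumptions)."""

    base_url: str = "https://gitlab.cern.ch"
    excluded_keywords: list[str] = field(default_factory=list)  # skip matching projects
    min_markdown_files: int = 1       # minimum markdown files per project
    rest_timeout_s: int = 30          # REST request timeout
    graphql_timeout_s: int = 60       # GraphQL request timeout
```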

## Usage

### Basic Scraping Workflow

1. **Project Discovery**: Find and filter relevant GitLab projects
2. **Content Extraction**: Download markdown files from selected projects
3. **Processing**: Clean and chunk the content for downstream use
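
The three stages above compose into a simple pipeline. The sketch below uses stubbed, hypothetical function names (`discover_projects`, `fetch_markdown`, `chunk_text`) standing in for the package's actual entry points, just to show how the stages feed each other:

```python
def discover_projects(excluded_keywords: list[str]) -> list[str]:
    """Stage 1: return project paths that pass the filters (stubbed here)."""
    return ["atlas/docs-example"]


def fetch_markdown(project: str) -> list[str]:
    """Stage 2: download a project's markdown files (stubbed here)."""
    return [f"# Docs for {project}\n\nSome content."]


def chunk_text(doc: str) -> list[str]:
    """Stage 3: split a document into chunks for the RAG index."""
    return [part.strip() for part in doc.split("\n\n") if part.strip()]


# Wire the stages together: projects -> documents -> chunks.
chunks = [
    chunk
    for project in discover_projects(["archive"])
    for doc in fetch_markdown(project)
    for chunk in chunk_text(doc)
]
```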


### Running Tests

Execute the test suite:

```bash
uv run pytest
```
