Metadata-Version: 2.4
Name: ruby-miner
Version: 1.0.2
Summary: Mine and extract complete package lists from RubyGems registry
Author-email: Rinalic <rinalic39@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/oraoraoraaa/Package-List-Miner
Project-URL: Repository, https://github.com/oraoraoraaa/Package-List-Miner
Project-URL: Issues, https://github.com/oraoraoraaa/Package-List-Miner/issues
Keywords: ruby,rubygems,gem,package-mining,data-mining,registry
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.31.0
Requires-Dist: tqdm>=4.66.0

# Ruby Gems Miner

A Python tool to mine and extract complete package lists from the RubyGems registry.

## Features

- Downloads the names of all ~180,000 gems published on RubyGems.org
- Fetches package metadata including homepage and repository URLs
- Rate-limited API calls to respect server resources (10 req/sec)
- Progress tracking with visual feedback
- Outputs standardized CSV format for cross-ecosystem analysis

## Installation

```bash
pip install ruby-miner
```

## Quick Start

```bash
ruby-miner
```

Or use as a Python module:

```python
from ruby_miner import mine_ruby
mine_ruby()
```

## Output

Generates a CSV file with gem information:

- Package ID, Platform, Name
- Homepage URL, Repository URL

## Performance

- Runtime: ~5 hours for complete dataset
- Rate limit: 10 requests per second
- Processes ~180,000 gems
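
The runtime figure follows from the rate limit alone. A rough check, assuming exactly 180,000 gems and counting only the 0.1 s delay per request (request latency adds to this, so treat it as a lower bound); `estimated_hours` is a hypothetical helper, not part of the package:

```python
# Back-of-the-envelope runtime estimate from the rate limit alone.
GEM_COUNT = 180_000      # approximate figure quoted in this README
DELAY_SECONDS = 0.1      # 10 requests per second


def estimated_hours(gem_count: int = GEM_COUNT, delay: float = DELAY_SECONDS) -> float:
    """Lower bound on runtime: sleep time only, ignoring request latency."""
    return gem_count * delay / 3600


print(f"{estimated_hours():.1f} hours")  # prints "5.0 hours"
```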

## Data Source

- Gem Names: http://rubygems.org/names
- Gem Details: https://rubygems.org/api/v1/gems/{name}.json

## License

MIT License - see LICENSE file for details

## Gem Metadata Sources

For each gem, the script fetches:

```json
{
  "name": "rails",
  "homepage_uri": "https://rubyonrails.org",
  "source_code_uri": "https://github.com/rails/rails",
  "project_uri": "https://rubygems.org/gems/rails"
}
```

The script prioritizes:

1. **Homepage**: `homepage_uri` → `project_uri` → "nan"
2. **Repository**: `source_code_uri` → `homepage_uri` → "nan"

### Error Handling

If an API call fails (timeout, 404, etc.):

- Continues processing with "nan" values
- Logs no error (fails silently)
- Ensures complete dataset even with some missing data
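
The behavior above can be sketched as follows. This is an illustration under assumptions, not the script's actual code: `fetch_json` and `fetch_gem_info` are hypothetical names, and the standard-library `urllib` is swapped in for `requests` so the block has no third-party dependencies.

```python
import json
import urllib.request

API_URL = "https://rubygems.org/api/v1/gems/{name}.json"


def fetch_json(url, timeout=10):
    """Fetch and decode a JSON document; raises on timeouts and HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def fetch_gem_info(name, fetch=fetch_json):
    """Return the gem's metadata dict, or an empty dict if the lookup fails.

    `fetch` is injectable so the function can be exercised without network access.
    """
    try:
        return fetch(API_URL.format(name=name))
    except Exception:
        # Timeouts, 404s, and malformed JSON are swallowed without logging,
        # so the caller still writes a row for this gem.
        return {}


# With an empty dict, every lookup falls through to "nan".
info = fetch_gem_info("rails", fetch=lambda url: {})
homepage = info.get("homepage_uri") or "nan"
```

Returning an empty dict keeps the caller's fallback logic unchanged: a failed gem simply looks like a gem with no metadata.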

## Files

- `mine_ruby.py`: Main script
- `requirements.txt`: Python dependencies (requests, tqdm)
- `setup.sh`: Automated setup script
- `specs.4.8.gz`: Temporary download file (deleted after use)
- `specs.4.8`: Temporary decompressed file (deleted after use)
- Output: `../../../Resource/Package/Package-List/Ruby_New.csv`

## Troubleshooting

### "Error downloading gem names"

Check that:

- You have internet connectivity
- rubygems.org is accessible: `curl -I http://rubygems.org/names`
- No firewall blocking the connection

### Script is very slow

This is expected behavior:

- Rate limiting (10 requests/second) is intentional
- With 180K+ gems, expect 5+ hours runtime
- Consider running overnight or in background

To run in background:

```bash
nohup python mine_ruby.py > output.log 2>&1 &
```

### "Permission denied" when creating output directory

Ensure you have write permissions to:

- Current directory (for temporary files)
- `Resource/Package/Package-List/` (for output)

### Incomplete data (many "nan" values)

This can occur if:

- API is temporarily unavailable
- Network issues during processing
- Some gems have incomplete metadata

**Note**: This is normal - not all gems have complete metadata on RubyGems.org.

### Virtual environment issues

If you encounter errors related to the virtual environment:

1. Delete the `venv` folder: `rm -rf venv`
2. Re-run the setup script: `./setup.sh`
3. Virtual environments cannot be moved after creation - recreate if you move the directory

## Performance Notes

- **Download Time**: Fast (gem names list is small)
- **Processing Time**: SLOW (~5 hours for 180K+ gems)
- **Memory Usage**: Low (processes one gem at a time)
- **Network Usage**: Moderate (many small API requests)

### Optimization Tips

To speed up processing (advanced users):

1. Reduce delay in `time.sleep(0.1)` (risks being rate-limited or blocked)
2. Use parallel requests (requires code modification)
3. Use the RubyGems specs index (`specs.4.8.gz`) or a database dump instead of the API (the specs index requires parsing Ruby Marshal format; the dump requires PostgreSQL)
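
As a rough illustration of tip 2, a small thread pool can overlap the waiting time of several requests. This is a sketch, not code from `mine_ruby.py`: `fetch_all` and `fetch_one` are hypothetical names, and `fetch_one` is assumed to handle its own errors. More workers means a higher request rate, so the same blocking risk as tip 1 applies.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(names, fetch_one, max_workers=4):
    """Apply fetch_one to every name using a small thread pool.

    Results are returned in the same order as `names`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, names))


# Usage with a stand-in for the real per-gem lookup:
rows = fetch_all(["rails", "rake"], lambda name: (name, "nan", "nan"))
```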

## Advantages

- **Complete Data**: Includes all public gems
- **Official API**: Uses RubyGems.org official endpoints
- **Detailed Metadata**: Gets homepage and repository URLs
- **Reliable**: Gracefully handles API failures

## Limitations

- **Slow Processing**: Rate limiting means long runtime
- **API Dependent**: Requires RubyGems.org to be available
- **Incomplete Metadata**: Not all gems have homepage/repository info
- **No Version Info**: Only captures latest gem information

## Alternative Approaches

For faster processing, consider:

1. **Database Dump**: RubyGems provides database dumps (requires PostgreSQL)
2. **Bulk API**: Some bulk endpoints may exist (check RubyGems API docs)
3. **Cached Data**: Use previously downloaded data and update incrementally
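
The cached-data approach could look roughly like the sketch below: read the names already present in a previous output file and fetch only the rest. The helper names are hypothetical, and the column header `Name` is an assumption based on the output description in this README; adjust it to match the actual CSV header.

```python
import csv
import io


def names_already_mined(csv_file, name_column="Name"):
    """Collect gem names from a previously generated CSV (open file object)."""
    reader = csv.DictReader(csv_file)
    return {row[name_column] for row in reader if row.get(name_column)}


def names_to_fetch(all_names, csv_file, name_column="Name"):
    """Return the names from all_names that are not yet in the CSV, in order."""
    done = names_already_mined(csv_file, name_column)
    return [name for name in all_names if name not in done]


# Usage with an in-memory stand-in for an earlier output file:
previous = io.StringIO("ID,Platform,Name\n1,Ruby,rails\n")
pending = names_to_fetch(["rails", "devise", "rake"], previous)
```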

---

## Code Explanation

### Architecture

The Ruby Miner uses a two-phase approach:

1. **Phase 1**: Download complete list of gem names
2. **Phase 2**: Fetch detailed metadata for each gem

This is necessary because the names endpoint returns only gem names; homepage and repository URLs are available per gem through the API, so each gem needs its own request.

### 1. Specs File Download

```python
specs_url = "https://rubygems.org/specs.4.8.gz"
download_file(specs_url, specs_gz_path)
```

**Purpose**: Downloads compact specs (currently not parsed, but available for future use).

**Note**: The specs file is in Ruby Marshal format (binary), which is complex to parse in Python. Currently, we use the simpler names endpoint instead.

### 2. Gem Names List

```python
names_url = "http://rubygems.org/names"
response = requests.get(names_url)  # fetch step assumed; not shown in the original excerpt
gem_names = response.text.strip().split('\n')
```

**Format**: Simple newline-delimited text file.

```
rails
devise
rake
...
```

**Advantages**:

- Simple to parse
- Complete list of all gems
- Fast download

### 3. Detailed Metadata Fetching

```python
for gem_name in gem_names:
    time.sleep(0.1)  # Rate limiting
    response = requests.get(f"https://rubygems.org/api/v1/gems/{gem_name}.json")
```

**Process**:

1. Iterate through each gem name
2. Wait 0.1 seconds (rate limiting)
3. Fetch JSON metadata
4. Extract homepage and repository URLs
5. Write to CSV immediately (streaming write)
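
The five steps can be sketched as a single loop. This is a minimal illustration under assumptions rather than the script itself: `mine` and `fetch_urls` are hypothetical names, `fetch_urls(name)` is assumed to return a `(homepage, repository)` pair, and the header names and the `Ruby` platform value are guesses based on the output description.

```python
import csv
import io
import time


def mine(gem_names, fetch_urls, out_file, delay=0.1, sleep=time.sleep):
    """Write one CSV row per gem as soon as its metadata is available."""
    writer = csv.writer(out_file)
    # Header names are assumed, not taken from the real output file.
    writer.writerow(["ID", "Platform", "Name", "Homepage URL", "Repository URL"])
    for idx, name in enumerate(gem_names, start=1):
        sleep(delay)                        # rate limiting
        homepage, repo = fetch_urls(name)   # fetch and extract URLs
        writer.writerow([idx, "Ruby", name, homepage, repo])
        out_file.flush()                    # streaming write: rows survive a crash


# Usage with an in-memory file, a stubbed lookup, and no real delay:
buf = io.StringIO()
mine(["rails"], lambda n: ("https://rubyonrails.org", "nan"), buf, sleep=lambda s: None)
```

Writing and flushing inside the loop keeps memory usage flat and preserves partial results if a multi-hour run is interrupted.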

### 4. URL Extraction Priority

```python
homepage_url = gem_info.get('homepage_uri', '') or gem_info.get('project_uri', '') or "nan"
repo_url = gem_info.get('source_code_uri', '') or gem_info.get('homepage_uri', '') or "nan"
```

**Fallback Chain**:

- **Homepage**: Try `homepage_uri` first, fall back to `project_uri`
- **Repository**: Try `source_code_uri` first, fall back to `homepage_uri`

**Validation**:

```python
if homepage_url and not homepage_url.startswith('http'):
    homepage_url = "nan"
```

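Keeps only values that begin with `http`. This is a prefix check rather than full URL validation, so a malformed value such as `httpfoo` would still pass.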
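
Put together, the fallback chain and the prefix check might look like the sketch below. `pick_url` and `is_http_url` are hypothetical helpers; applying the same check to the repository URL is an assumption, since the excerpt only shows it for the homepage. `is_http_url` is an optional stricter variant, not something the script is known to use.

```python
from urllib.parse import urlsplit


def pick_url(gem_info, primary, fallback):
    """Return the first non-empty field, or "nan" if it does not start with http."""
    url = gem_info.get(primary) or gem_info.get(fallback) or "nan"
    return url if url.startswith("http") else "nan"


def is_http_url(value):
    """Stricter alternative: require an http/https scheme and a host."""
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


gem_info = {"homepage_uri": "https://rubyonrails.org", "source_code_uri": None}
homepage_url = pick_url(gem_info, "homepage_uri", "project_uri")
repo_url = pick_url(gem_info, "source_code_uri", "homepage_uri")
```

Using `or` between the lookups also covers fields that are present but `null` in the JSON response, which a default argument to `.get()` alone would not.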

---
