Metadata-Version: 2.4
Name: rsmetacheck
Version: 0.3.0
Summary: Detect metadata pitfalls in software repositories
License-File: LICENSE
Author: Anas El Hounsri
Requires-Python: >=3.11,<3.13
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: requests
Requires-Dist: somef (==0.10.1)
Description-Content-Type: text/markdown

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18956787.svg)](https://doi.org/10.5281/zenodo.18956787)
![PyPI - Version](https://img.shields.io/pypi/v/rsmetacheck)

# Research Software MetaCheck (a Pitfall/Warning Detection Tool)

This project provides an automated tool for detecting common metadata quality issues (pitfalls and warnings)
in software repositories. The tool analyzes the output files of SoMEF (Software Metadata Extraction Framework)
to identify problems in repository metadata
files such as `codemeta.json`, `package.json`, `setup.py`, `DESCRIPTION`, and others.

## Overview

MetaCheck identifies **29 different types of metadata quality issues** across multiple programming languages
(Python, Java, C++, C, R, Rust). These pitfalls range from version mismatches and
license template placeholders to broken URLs and improperly formatted metadata fields.

### Supported Pitfall Types

The tool detects the following categories of issues:

- **Version-related pitfalls**: Version mismatches between metadata files and releases
- **License-related pitfalls**: Template placeholders, copyright-only licenses, missing version specifications
- **URL validation pitfalls**: Broken links for CI, software requirements, download URLs
- **Metadata format pitfalls**: Improper field formatting, multiple authors in a single field, and similar issues
- **Identifier pitfalls**: Invalid or missing unique identifiers, bare DOIs
- **Repository reference pitfalls**: Mismatched code repositories, Git shorthand usage

## Requirements

- **Python 3.11 or 3.12** (the package metadata requires `>=3.11,<3.13`)
- Required Python packages:
  - `requests` (for URL validation)
  - `pathlib` (built-in)
  - `json` (built-in)
  - `re` (built-in)
  - `somef` (for extracting metadata from the repositories)

## Installation

### Using Poetry (Recommended)

1. **Clone the repository**:

   ```bash
   git clone https://github.com/SoftwareUnderstanding/RsMetaCheck.git
   cd RsMetaCheck
   ```

2. **Install with Poetry**:

   ```bash
   poetry install
   ```

3. **Configure SoMEF** (optional but recommended):
   The installation process initially runs `somef configure -a` to set SoMEF up automatically and install the necessary packages, but the resulting API rate limit is low. If you need a higher limit, reconfigure SoMEF with the following command:
   ```bash
   poetry run somef configure
   ```
   Then add your GitHub authentication token to avoid API rate limits when analyzing repositories in batches.

### Using pip

Alternatively, you can install directly from GitHub:

```bash
pip install git+https://github.com/SoftwareUnderstanding/RsMetaCheck.git
```

## Usage

### GitHub Action

RsMetaCheck can be easily integrated into your CI/CD pipelines as a GitHub Action.

```yaml
name: RsMetaCheck

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  check-metadata:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Run RsMetaCheck
        uses: SoftwareUnderstanding/RsMetaCheck@v0.2.1 # Update to the latest version tag
        with:
          # Optional: Include passed checks in output (defaults to false)
          verbose: "false"
        env:
          # Optional: Provide token for SoMEF API rate limits
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

The action will generate `all_pitfalls_results.json`, along with the `pitfalls/` and `somef_outputs/` directories directly in your workflow workspace.

### Run the Detection Tool Locally

#### Analyze a Single Repository

```bash
poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse
```

#### Analyze a Specific Branch

You can analyze a specific branch of a repository by using the `--branch` or `-b` flag:

```bash
poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse --branch develop
```

#### Analyze Multiple Repositories from a JSON File

```bash
poetry run rsmetacheck --input repositories.json
```

The `repositories.json` file should be structured as follows:

```json
{
  "repositories": [
    "https://gitlab.com/example/example_repo_1",
    "https://gitlab.com/example/example_repo_2",
    "https://github.com/example/example_repo_3"
  ]
}
```
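
For long repository lists, the input file can be generated rather than written by hand. The following is a minimal sketch, assuming only the `{"repositories": [...]}` structure shown above; the helper name `write_repository_list` is hypothetical and not part of RsMetaCheck.

```python
import json
from pathlib import Path


def write_repository_list(urls, path="repositories.json"):
    """Write repository URLs in the {"repositories": [...]} layout shown above.

    Hypothetical helper (not part of RsMetaCheck): it strips whitespace,
    drops empty entries, and removes duplicates while preserving order.
    Returns the list of URLs that was written.
    """
    cleaned = (url.strip() for url in urls)
    unique = list(dict.fromkeys(url for url in cleaned if url))
    payload = {"repositories": unique}
    Path(path).write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return unique
```

Calling `write_repository_list([...])` with your URLs produces a file that can be passed to `poetry run rsmetacheck --input repositories.json`.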

#### Customize Output Paths

```bash
poetry run rsmetacheck --input repositories.json \
  --somef-output ./results/somef \
  --pitfalls-output ./results/pitfalls \
  --analysis-output ./results/summary.json
```

#### Skip SoMEF and Analyze Existing Outputs

If you've already run SoMEF separately:

```bash
poetry run rsmetacheck --skip-somef --input somef_outputs/*.json
```

Or for multiple paths:

```bash
poetry run rsmetacheck --skip-somef --input my_somef_outputs_1/*.json my_somef_outputs_2/*.json
```

#### Verbose Output for Passed Checks

By default, the JSON-LD files generated by RsMetaCheck will only contain information about pitfalls and warnings that were actually detected. If you want to include all tests in the final JSON-LD, even tests that the repository successfully passed, use the `--verbose` flag:

```bash
poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse --verbose
```

### Output

The tool will:

- Process all JSON files in the SoMEF output directory (by default `somef_outputs`, created by the tool)
- Display progress messages showing detected pitfalls
- Generate JSON-LD files detailing the pitfalls and warnings detected (`output_1_pitfalls.jsonld`,
  `output_2_pitfalls.jsonld`, etc.) in the `pitfalls` directory (created by the tool by default)
- Generate a comprehensive report in `all_pitfalls_results.json`

The output file contains:

- EVERSE standardized JSON-LD output of each repository
- Summary statistics of analyzed repositories
- Count and percentage for each pitfall type
- Language-specific breakdown for repositories with target languages
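
Because JSON-LD is plain JSON, the per-repository files can be loaded with the standard library for further processing. The sketch below assumes only the default `pitfalls` directory and the `.jsonld` extension mentioned above; it makes no assumption about the fields inside each file, and the function name `load_pitfall_reports` is hypothetical.

```python
import json
from pathlib import Path


def load_pitfall_reports(directory="pitfalls"):
    """Load every *.jsonld file in `directory` into a dict keyed by file name.

    Hypothetical helper (not part of RsMetaCheck). The parsed content is
    returned as-is; no particular JSON-LD schema is assumed.
    """
    reports = {}
    for path in sorted(Path(directory).glob("*.jsonld")):
        with path.open(encoding="utf-8") as handle:
            reports[path.name] = json.load(handle)
    return reports


if __name__ == "__main__":
    # List the files found and the type of their top-level JSON value.
    for name, content in load_pitfall_reports().items():
        print(name, type(content).__name__)
```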

## Troubleshooting

### Common Issues

1. **"There is no valid repository URL" error**: Ensure the JSON file that contains the repositories
   has a valid structure and that you are passing the correct path
2. **Network timeouts**: Some pitfalls validate URLs and may time out; this is normal behavior
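
To see why a timeout is not the same as a broken link, the sketch below separates the two outcomes. It is a generic illustration built on the standard library's `urllib`, not RsMetaCheck's actual validation code (which relies on `requests`); the function name `check_url`, the 5-second default, and the three result labels are assumptions made for this example.

```python
import socket
import urllib.error
import urllib.request


def check_url(url, timeout=5.0, opener=urllib.request.urlopen):
    """Classify a URL as "ok", "broken", or "timeout" (illustrative only).

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return "ok" if response.status < 400 else "broken"
    except urllib.error.HTTPError:
        # The server answered with a 4xx/5xx status code.
        return "broken"
    except (socket.timeout, TimeoutError):
        # No answer within `timeout` seconds: inconclusive, not broken.
        return "timeout"
    except urllib.error.URLError as exc:
        # urllib may wrap a timeout inside URLError.reason.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "broken"
    except ValueError:
        # Malformed URL, e.g. a missing scheme.
        return "broken"
```

A result of `"timeout"` only says that the server did not answer in time, so rerunning the analysis later may give a different outcome.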

### Performance Notes

- URL validation pitfalls may take longer due to network requests
- Large datasets may require several minutes to complete analysis
- Progress is displayed in real time, showing which pitfalls are found

## Contributing

The system is designed with modularity in mind. Each pitfall detector is implemented as a
separate module in the `scripts/` directory, making it easy to add new pitfall types or modify
existing detection logic.

