Metadata-Version: 2.4
Name: rsmetacheck
Version: 0.3.2
Summary: Detect metadata pitfalls in software repositories
License-File: LICENSE
Author: Anas El Hounsri
Requires-Python: >=3.11,<3.13
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: requests
Requires-Dist: somef (==0.10.3)
Description-Content-Type: text/markdown

[![Documentation Status](https://readthedocs.org/projects/rsmetacheck/badge/?version=latest)](https://rsmetacheck.readthedocs.io/en/latest/?badge=latest)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18956787.svg)](https://doi.org/10.5281/zenodo.18956787)
[![PyPI - Version](https://img.shields.io/pypi/v/rsmetacheck)](https://pypi.org/project/rsmetacheck/)

# Research Software MetaCheck (a Pitfall/Warning Detection Tool)

This project provides an automated tool for detecting common metadata quality issues (pitfalls and warnings)
in software repositories. The tool analyzes SoMEF (Software Metadata Extraction Framework) output
files to identify problems in repository metadata
files such as `codemeta.json`, `package.json`, `setup.py`, `DESCRIPTION`, and others.

## Overview

MetaCheck identifies **29 different types of metadata quality issues** across multiple programming languages
(Python, Java, C++, C, R, Rust). These pitfalls range from version mismatches and
license template placeholders to broken URLs and improperly formatted metadata fields.

Visit our [catalog](https://softwareunderstanding.github.io/RsMetaCheck/) to see in detail what these pitfalls are, where they are usually detected, and how to fix them.

### Supported Pitfall Types

The tool detects the following categories of issues:

- **Version-related pitfalls**: Version mismatches between metadata files and releases
- **License-related pitfalls**: Template placeholders, copyright-only licenses, missing version specifications
- **URL validation pitfalls**: Broken links for CI, software requirements, download URLs
- **Metadata format pitfalls**: Improper field formatting, multiple authors in a single field, etc.
- **Identifier pitfalls**: Invalid or missing unique identifiers, bare DOIs
- **Repository reference pitfalls**: Mismatched code repositories, Git shorthand usage

## Requirements

- **Python 3.11 or 3.12** (the package declares `Requires-Python: >=3.11,<3.13`)
- Required Python packages:
  - `requests` (for URL validation)
  - `pathlib` (built-in)
  - `json` (built-in)
  - `re` (built-in)
  - `somef` (for extracting metadata from the repositories)

## Installation

### Using Poetry (Recommended)

1. **Clone the repository**:

   ```bash
   git clone https://github.com/SoftwareUnderstanding/RsMetaCheck.git
   cd RsMetaCheck
   ```

2. **Install with Poetry**:

   ```bash
   poetry install
   ```

3. **Configure SoMEF** (optional but recommended):
   The installation process initially runs `somef configure -a` to set SoMEF up automatically and install the necessary packages, but the resulting API rate limit is low. If you need a higher limit, reconfigure SoMEF with the following command:
   ```bash
   poetry run somef configure
   ```
   Then add your GitHub authentication token to avoid API rate limits when analyzing repositories in batches.

### Using pip

Alternatively, you can install directly from GitHub:

```bash
pip install git+https://github.com/SoftwareUnderstanding/RsMetaCheck.git
```

## Usage

### GitHub Action

RsMetaCheck can be easily integrated into your CI/CD pipelines as a GitHub Action. The action is set up in the [rs-metacheck-action](https://github.com/SoftwareUnderstanding/rs-metacheck-action) repository and is available on the GitHub Marketplace as [rsmetacheck actions](https://github.com/marketplace/actions/rsmetacheck).

The action will generate `all_pitfalls_results.json`, along with the `pitfalls/` and `somef_outputs/` directories directly in your workflow workspace.

### Run the Detection Tool locally

#### Analyze a Single Repository

```bash
poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse
```

#### Analyze a Specific Branch

You can analyze a specific branch of a repository by using the `--branch` or `-b` flag:

```bash
poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse --branch develop
```

#### Analyze Multiple Repositories from a JSON File

```bash
poetry run rsmetacheck --input repositories.json
```

The `repositories.json` file should be structured as follows:

```json
{
  "repositories": [
    "https://gitlab.com/example/example_repo_1",
    "https://gitlab.com/example/example_repo_2",
    "https://github.com/example/example_repo_3"
  ]
}
```
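
If you keep your repository URLs in a script or spreadsheet export, a file in this layout can be generated programmatically. The following is a minimal sketch, not part of RsMetaCheck itself; the function name `write_repository_list` is hypothetical, and it only assumes the `{"repositories": [...]}` structure shown above.

```python
import json
from pathlib import Path


def write_repository_list(urls, path="repositories.json"):
    """Write URLs in the {"repositories": [...]} layout (hypothetical helper)."""
    # Drop blank entries and duplicates while keeping the original order.
    unique_urls = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    payload = {"repositories": unique_urls}
    Path(path).write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return payload
```

The resulting file can then be passed to `--input` as in the command above.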

#### Customize Output Paths

```bash
poetry run rsmetacheck --input repositories.json \
  --somef-output ./results/somef \
  --pitfalls-output ./results/pitfalls \
  --analysis-output ./results/summary.json \
  --notes-output ./results/notes.json
```
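
When scripting batch runs from Python, the same command line can be assembled as an argument list. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, and it uses only the output flags shown in the command above.

```python
import shlex


def build_command(input_path, somef_output=None, pitfalls_output=None,
                  analysis_output=None, notes_output=None):
    """Assemble an rsmetacheck argument list (hypothetical helper)."""
    command = ["poetry", "run", "rsmetacheck", "--input", input_path]
    # Map each documented output flag to the value supplied by the caller.
    options = {
        "--somef-output": somef_output,
        "--pitfalls-output": pitfalls_output,
        "--analysis-output": analysis_output,
        "--notes-output": notes_output,
    }
    for flag, value in options.items():
        if value is not None:  # omit flags the caller left unset
            command += [flag, value]
    return command


# Show the command as a single shell-quoted string for inspection.
print(shlex.join(build_command("repositories.json",
                               analysis_output="./results/summary.json")))
```

The returned list can be handed to `subprocess.run(command, check=True)` if you want to launch the tool from a script.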

#### Version Discrepancy Notes

When a metadata version differs from the release version by a small margin (all version components differ by less than 2, e.g., `0.4.3.dev1` vs `0.4.2`), MetaCheck records a **note** rather than a full pitfall. To capture these observations, use the `--notes-output` flag:

```bash
poetry run rsmetacheck --input https://github.com/example/repo --notes-output ./notes.json
```

The notes file is only created when there are observations to report and the `--notes-output` path is specified. Its structure is:

```json
{
  "total_notes": 1,
  "notes": [
    {
      "repository": "example/repo",
      "file_name": "repo_output.json",
      "code": "P001",
      "note": "Version discrepancy: metadata '0.4.3.dev1' vs release '0.4.2'"
    }
  ]
}
```

If the version difference is significant (any component differs by 2 or more, e.g., `0.12.4` vs `0.12.1`), it is still flagged as a pitfall.
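
The note-versus-pitfall rule can be illustrated with a short sketch. This is not the tool's actual implementation: the function names are hypothetical, and the sketch assumes that only the leading numeric components are compared (so `0.4.3.dev1` is read as `0.4.3`) and that missing components count as `0`.

```python
import re


def numeric_components(version):
    """Return the leading numeric components, e.g. '0.4.3.dev1' -> [0, 4, 3]."""
    components = []
    for part in version.strip().lstrip("vV").split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break  # a non-numeric part such as "dev1" ends the prefix
        components.append(int(match.group()))
        if match.group() != part:
            break  # a suffix such as "0rc1" also ends the prefix
    return components


def classify_version_gap(metadata_version, release_version, threshold=2):
    """Classify a version pair as 'match', 'note', or 'pitfall' (assumed rule)."""
    if metadata_version == release_version:
        return "match"
    left = numeric_components(metadata_version)
    right = numeric_components(release_version)
    # Pad the shorter list with zeros so components line up (assumption).
    width = max(len(left), len(right))
    left += [0] * (width - len(left))
    right += [0] * (width - len(right))
    if all(abs(a - b) < threshold for a, b in zip(left, right)):
        return "note"
    return "pitfall"


print(classify_version_gap("0.4.3.dev1", "0.4.2"))  # note
print(classify_version_gap("0.12.4", "0.12.1"))     # pitfall
```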

#### Skip SoMEF and Analyze Existing Outputs

If you've already run SoMEF separately:

```bash
poetry run rsmetacheck --skip-somef --input somef_outputs/*.json
```

Or for multiple paths:

```bash
poetry run rsmetacheck --skip-somef --input my_somef_outputs_1/*.json my_somef_outputs_2/*.json
```

#### Verbose Output for Passed Checks

By default, the JSON-LD files generated by RsMetaCheck will only contain information about pitfalls and warnings that were actually detected. If you want to include all tests in the final JSON-LD, even tests that the repository successfully passed, use the `--verbose` flag:

```bash
poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse --verbose
```

### Output

The tool will:

- Process all JSON files in the SoMEF output directory (by default `somef_outputs`, created by the tool)
- Display progress messages showing detected pitfalls
- Generate JSON-LD files with the detailed pitfalls and warnings it detects (`output_1_pitfalls.jsonld`,
  `output_2_pitfalls.jsonld`, etc.) in the `pitfalls` directory (created by the tool by default)
- Generate a comprehensive report in `all_pitfalls_results.json`

The output file contains:

- EVERSE standardized JSON-LD output of each repository
- Summary statistics of analyzed repositories
- Count and percentage for each pitfall type
- Language-specific breakdown for repositories with target languages
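
To get a quick overview of what a run produced, the generated files can be inspected from Python. The sketch below makes no assumption about the internal schema of the report: it only lists the `.jsonld` files and the top-level keys of the summary file. The function name `summarize_outputs` is hypothetical, and the default paths are the default names mentioned above.

```python
import json
from pathlib import Path


def summarize_outputs(pitfalls_dir="pitfalls",
                      report_path="all_pitfalls_results.json"):
    """List JSON-LD files and the report's top-level keys (hypothetical helper)."""
    # Collect the per-run JSON-LD files written to the pitfalls directory.
    jsonld_files = sorted(p.name for p in Path(pitfalls_dir).glob("*.jsonld"))
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    # Only report the keys; their meaning is defined by the tool, not here.
    report_keys = sorted(report) if isinstance(report, dict) else []
    return {"jsonld_files": jsonld_files, "report_keys": report_keys}
```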

## Troubleshooting

### Common Issues

1. **"There is no valid repository URL" error**: Ensure the JSON file that contains the repositories
   has a valid structure and that you are passing the correct path.
2. **Network timeouts**: Some pitfalls validate URLs and may time out; this is normal behavior.
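
For the first issue, the input file can be checked before running the tool. This is a minimal sketch and not RsMetaCheck's own validation: `check_repository_list` is a hypothetical helper that assumes the `{"repositories": [...]}` layout described in the Usage section and treats only `http` and `https` URLs as valid.

```python
import json
from urllib.parse import urlparse


def check_repository_list(text):
    """Return a list of problems found in a repositories.json document."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error}"]
    if not isinstance(data, dict) or not isinstance(data.get("repositories"), list):
        return ['expected an object with a "repositories" list']
    problems = []
    for index, url in enumerate(data["repositories"]):
        parsed = urlparse(url) if isinstance(url, str) else None
        # Assumption: a usable entry has an http(s) scheme and a host name.
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"entry {index} is not an http(s) URL: {url!r}")
    return problems
```

An empty list means the file matches the expected layout; read the file with `Path("repositories.json").read_text()` and pass the text to the function.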

### Performance Notes

- URL validation pitfalls may take longer due to network requests
- Large datasets may require several minutes to complete analysis
- Progress is displayed in real time, showing which pitfalls are found

## Contributing

The system is designed with modularity in mind. Each pitfall detector is implemented as a
separate module in the `scripts/` directory, making it easy to add new pitfall types or modify
existing detection logic.

