Metadata-Version: 2.4
Name: sonad
Version: 0.1.1
Summary: Software name disambiguation pipeline
Home-page: https://github.com/jelenadjuric01/Software-Disambiguation
Author: Jelena Duric
Author-email: Jelena Duric <djuricjelena611@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/jelenadjuric01/Software-Disambiguation
Project-URL: Issues, https://github.com/jelenadjuric01/Software-Disambiguation/issues
Keywords: software,disambiguation,metadata,machine learning,NLP
Requires-Python: ==3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas<2.3,>=1.3
Requires-Dist: numpy<1.27,>=1.21
Requires-Dist: cloudpickle>=2.0
Requires-Dist: scikit-learn<1.4,>=1.0
Requires-Dist: xgboost<2.2,>=1.5
Requires-Dist: lightgbm<4.1,>=3.3
Requires-Dist: sentence-transformers<3.0,>=2.2
Requires-Dist: textdistance>=4.2
Requires-Dist: beautifulsoup4<5.0,>=4.9
Requires-Dist: requests<2.33,>=2.25
Requires-Dist: SPARQLWrapper<2.1,>=1.8
Requires-Dist: lxml<6.0,>=4.9
Requires-Dist: elementpath==4.0.0
Requires-Dist: somef>=0.9.11
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# SONAD: Software Name Disambiguation

![License](https://img.shields.io/badge/license-MIT-blue.svg)  
![Python](https://img.shields.io/badge/python-3.9-blue.svg)

**SONAD** (Software Name Disambiguation) is a command-line tool and Python package that links software mentions in scientific papers to their corresponding repository URLs. It leverages NLP, third-party tools like SOMEF, and metadata to resolve software names accurately. It is limited to fetching URLs from GitHub, PyPI and CRAN.

---

## Installation

Install using pip:

```
pip install sonad
```


---

## Initial Configuration

Before using SONAD, you **must install and configure SOMEF**  
(https://github.com/KnowledgeCaptureAndDiscovery/somef/?tab=readme-ov-file),  
which is used for software metadata extraction.

Follow their installation instructions to make sure `somef` runs correctly on your system.

It is also **strongly recommended to provide a GitHub API token** to avoid rate limits when querying GitHub. You can configure this once using:

```
sonad configure
```

Your token will be saved for future runs.

---

## Requirements

SONAD requires Python 3.9. All dependencies are installed automatically.

Some key libraries:
- pandas
- scikit-learn
- xgboost
- sentence-transformers
- beautifulsoup4
- requests
- SPARQLWrapper
- somef
- textdistance
- lxml
- cloudpickle

---

## Usage

After installation, you can run the main command:

```
sonad process -i <input_file.csv> -o <output_file.csv> [-t <temp_folder>] 
```

### Parameters

- `-i`, `--input` (required): Path to the input CSV file.
- `-o`, `--output` (required): Path where the output CSV will be saved.
- `-t`, `--temp` (optional): Folder where temporary files will be written it the folder is provided.

---

## Input Format

The input CSV must contain the following columns:

- `name`: The software name mentioned in the paper.
- `doi`: The DOI of the paper.
- `paragraph`: The paragraph in which the software is mentioned.

Optionally, it can include:

- `candidate_urls`: A comma-separated list of candidate software URLs that might correspond to the software.

### Example:

```
name,doi,paragraph,candidate_urls
Scikit-learn,10.1000/xyz123,"We used Scikit-learn for classification.","https://github.com/scikit-learn/scikit-learn"
```

---


## License

MIT License © Jelena Djuric  
https://github.com/jelenadjuric01

---

## Contributions

Feel free to submit issues or pull requests on the GitHub repository:  
https://github.com/jelenadjuric01/Software-Disambiguation
