Metadata-Version: 2.4
Name: sci-soft-models
Version: 0.2.0
Summary: Computational Models for Understanding Scientific Software
Author-email: Eva Maxfield Brown <evamaxfieldbrown@gmail.com>
License: MIT License
Project-URL: Homepage, https://github.com/evamaxfield/sci-soft-models
Project-URL: Bug Tracker, https://github.com/evamaxfield/sci-soft-models/issues
Project-URL: Documentation, https://evamaxfield.github.io/sci-soft-models
Project-URL: User Support, https://github.com/evamaxfield/sci-soft-models/issues
Classifier: Development Status :: 4 - Beta
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: dataclasses-json<0.7,>=0.6
Requires-Dist: numpy<3,>=1
Requires-Dist: requests<3,>=2
Requires-Dist: torch<3,>=2
Requires-Dist: transformers[sentencepiece,torch]<5,>=4.48
Provides-Extra: lint
Requires-Dist: pre-commit>=2.20.0; extra == "lint"
Provides-Extra: dev
Requires-Dist: ipython; extra == "dev"
Requires-Dist: jupyterlab; extra == "dev"
Provides-Extra: data
Requires-Dist: dvuploader>=0.3.0; extra == "data"
Requires-Dist: easyDataverse<0.5,>=0.4; extra == "data"
Provides-Extra: test
Requires-Dist: pytest<9,>=8; extra == "test"
Provides-Extra: training
Requires-Dist: datasets<3,>=2; extra == "training"
Requires-Dist: einops<0.9,>=0.8; extra == "training"
Requires-Dist: matplotlib<4,>=3; extra == "training"
Requires-Dist: scikit-learn<2,>=1; extra == "training"
Requires-Dist: python-dotenv==1.0.1; extra == "training"
Requires-Dist: pandas<3,>=1; extra == "training"
Requires-Dist: protobuf<6,>=5; extra == "training"
Requires-Dist: pyarrow>=12; extra == "training"
Requires-Dist: tabulate<0.10,>=0.9; extra == "training"
Requires-Dist: tiktoken<0.10,>=0.9; extra == "training"
Requires-Dist: tqdm<5,>=4; extra == "training"
Requires-Dist: typer<0.16,>=0.15; extra == "training"
Provides-Extra: coiled
Requires-Dist: bokeh<4,>=3; extra == "coiled"
Requires-Dist: coiled<2,>=1.76; extra == "coiled"
Requires-Dist: distributed; extra == "coiled"
Dynamic: license-file

# Predictive Models for Researching Scientific Software

Computational predictive models to assist in the identification, classification, and study of scientific software.

## Models

### Developer-Author Entity Matching

This model is a binary classifier that predicts whether a developer and an author are the same person. It was trained on a dataset of 3,000 developer-author pairs, each annotated as either matching or not matching.

#### Usage

Given a set of developers and a set of authors, the model is applied to every possible developer-author pair to predict whether the two are the same person. Only the pairs predicted to match are returned, as a list of `MatchedDevAuthor` objects, each containing the developer, the author, and the confidence of the prediction.

```python
from sci_soft_models import dev_author_em

devs = [
    dev_author_em.DeveloperDetails(
        username="evamaxfield",
        name="Eva Maxfield Brown",
    ),
    dev_author_em.DeveloperDetails(
        username="nniiicc",
    ),
]

authors = [
    "Eva Brown",
    "Nicholas Weber",
]

matches = dev_author_em.match_devs_and_authors(devs=devs, authors=authors)
print(matches)
# [
#   MatchedDevAuthor(
#       dev=DeveloperDetails(
#           username='evamaxfield',
#           name='Eva Maxfield Brown',
#           email=None,
#       ),
#       author='Eva Brown',
#       confidence=0.9851127862930298
#   )
# ]
```
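To make the "every possible pair" step concrete, here is a minimal sketch of the enumerate-score-filter pattern. It does not use the trained model or the library's internals. The `Dev` dataclass, the `toy_score` token-overlap scorer, the `match_all_pairs` helper, and the `threshold` parameter are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass
class Dev:
    # Hypothetical stand-in for dev_author_em.DeveloperDetails
    username: str
    name: Optional[str] = None


def toy_score(dev: Dev, author: str) -> float:
    """Jaccard overlap of name tokens. A placeholder, NOT the trained model."""
    # Fall back to the username when no display name is available
    dev_tokens = set((dev.name or dev.username).lower().split())
    author_tokens = set(author.lower().split())
    union = dev_tokens | author_tokens
    return len(dev_tokens & author_tokens) / len(union) if union else 0.0


def match_all_pairs(devs, authors, threshold=0.5):
    """Score every (developer, author) pair and keep those at or above threshold."""
    # product() yields len(devs) * len(authors) combinations
    scored = (
        (dev, author, toy_score(dev, author))
        for dev, author in product(devs, authors)
    )
    return [
        (dev.username, author, score)
        for dev, author, score in scored
        if score >= threshold  # hypothetical cutoff, chosen for this sketch only
    ]


devs = [Dev("evamaxfield", "Eva Maxfield Brown"), Dev("nniiicc")]
authors = ["Eva Brown", "Nicholas Weber"]
pairs = match_all_pairs(devs, authors)
print(pairs)
```

Because the number of comparisons grows as the product of the two list sizes, large inputs mean many model calls. The toy scorer also misses the `nniiicc` and `Nicholas Weber` pair, since it only compares name tokens.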

<h2>Extra Notes</h2>
<details>

### Developer-Author-EM Dataset

This model was originally created and managed as part of [rs-graph](https://github.com/evamaxfield/rs-graph). To regenerate the dataset for annotation, run the following steps:

```bash
git clone https://github.com/evamaxfield/rs-graph.git
cd rs-graph
git checkout c1d8ec89
pip install -e .
rs-graph-modeling create-developer-author-em-dataset-for-annotation
```

[Link to annotation set creation function](https://github.com/evamaxfield/rs-graph/blob/c1d8ec8999a7a26e5d1669e9531adaad13245393/rs_graph/bin/modeling.py#L168).

</details>
