Metadata-Version: 2.4
Name: provesid
Version: 0.3.0
Summary: A Python package for chemical identifier resolution and experimental property extraction
Project-URL: Homepage, https://github.com/USEtox/PROVESID
Project-URL: Documentation, https://usetox.github.io/PROVESID/
Project-URL: Repository, https://github.com/USEtox/PROVESID
Project-URL: Bug Tracker, https://github.com/USEtox/PROVESID/issues
Author-email: "Ali A. Eftekhari and USEtox team" <e.eftekhari@gmail.com>
License: MIT
License-File: LICENSE
Keywords: api,chemical-identifiers,cheminformatics,chemistry,pubchem
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Requires-Dist: chembl-structure-pipeline>=1.2.4
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: polars>=1.39.3
Requires-Dist: py2opsin>=1.1.0
Requires-Dist: rapidfuzz>=3.0.0
Requires-Dist: rdkit>=2022.09.0
Requires-Dist: requests>=2.25.0
Requires-Dist: tqdm>=4.60.0
Provides-Extra: dev
Requires-Dist: black>=21.0; extra == 'dev'
Requires-Dist: flake8>=3.8; extra == 'dev'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'dev'
Requires-Dist: mkdocs>=1.6.1; extra == 'dev'
Requires-Dist: mkdocstrings[python]>=0.19; extra == 'dev'
Requires-Dist: pytest>=6.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: jupytext>=1.16.0; extra == 'docs'
Requires-Dist: mkdocs-autorefs>=0.4.0; extra == 'docs'
Requires-Dist: mkdocs-jupyter>=0.24.0; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'docs'
Requires-Dist: mkdocs>=1.6.1; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.19; extra == 'docs'
Provides-Extra: test
Requires-Dist: pytest-cov>=2.0; extra == 'test'
Requires-Dist: pytest-mock>=3.0; extra == 'test'
Requires-Dist: pytest-timeout>=2.0; extra == 'test'
Requires-Dist: pytest-xdist>=2.0; extra == 'test'
Requires-Dist: pytest>=6.0; extra == 'test'
Description-Content-Type: text/markdown

# PROVESID

[![Documentation Status](https://github.com/USEtox/PROVESID/actions/workflows/mkdocs-deploy.yml/badge.svg)](https://usetox.github.io/PROVESID/)
[![Tests](https://github.com/USEtox/PROVESID/actions/workflows/test.yml/badge.svg)](https://github.com/USEtox/PROVESID/actions/workflows/test.yml)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

`PROVESID` is a member of `PROVES`, the family of tools for the pre**PRO**cessing and **VE**rification of **S**ubstance data. It provides Pythonic access to online services for chemical identifiers and data. The goal is a clean, simple, intuitive, documented, up-to-date, and extensible interface to the most important online databases. We offer interfaces to [PubChem](https://pubchem.ncbi.nlm.nih.gov/), [NCI chemical identifier resolver](https://cactus.nci.nih.gov/chemical/structure), [CAS Common Chemistry](https://commonchemistry.cas.org/), [IUPAC OPSIN](https://www.ebi.ac.uk/opsin/), [ChEBI](https://www.ebi.ac.uk/chebi/beta/), and [ClassyFire](http://classyfire.wishartlab.com/). We recommend that new users jump head-first into the [examples folder](./examples/) and get started by playing with the code. Old and new functionality is documented [here](https://usetox.github.io/PROVESID/). The package also aims to provide an offline platform when data files are available from these online tools.

# Installation

The package can be installed from PyPI by running

```
pip install provesid
```

To install the latest development version (for developers and enthusiasts, and also for the latest features), clone or download this repository, go to the root folder, and install it with

```
pip install -e .
```

We very strongly recommend using [uv](https://docs.astral.sh/uv/getting-started/installation/). `PROVESID` has a small Python codebase, but its data files, when fully downloaded at the user's request, can occupy more than 30 GB of disk space! `uv` makes sure that the package is installed only once and linked in other virtual environments. It barely changes your `pip` workflow and is much faster (and more pleasant) to use. After installing `uv`, simply type:

```
uv pip install provesid
```

or for the development version (recommended for now):

```
uv pip install git+https://github.com/USEtox/PROVESID
```

# Examples

**PubChem**

```python
from provesid.pubchem import PubChemAPI
pc = PubChemAPI()  # Now with unlimited caching!
cids_aspirin = pc.get_cids_by_name('aspirin')
res_basic = pc.get_basic_compound_info(cids_aspirin[0])
```

which returns

```json
{
  "CID": 2244,
  "MolecularFormula": "C9H8O4",
  "MolecularWeight": "180.16",
  "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O",
  "InChI": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
  "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
  "IUPACName": "2-acetyloxybenzoic acid",
  "success": true,
  "cid": 2244,
  "error": null
}
```
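
Because the result carries `success` and `error` fields, it is worth checking them before reading the other keys. The sketch below works on a plain dictionary with the shape shown above; `describe_compound` is a hypothetical helper and not part of `PROVESID`, and it assumes that a failed lookup sets `success` to a false value and fills `error`.

```python
from typing import Any, Dict


def describe_compound(info: Dict[str, Any]) -> str:
    """Return a one-line summary, or raise if the lookup failed."""
    # Assumption: failed lookups set "success" to a false value and fill "error".
    if not info.get("success"):
        raise ValueError(f"PubChem lookup failed: {info.get('error')}")
    return f"CID {info['CID']}: {info['IUPACName']} ({info['MolecularFormula']})"


# A dictionary with the same shape as the result shown above,
# trimmed to the keys this sketch reads.
example = {
    "CID": 2244,
    "MolecularFormula": "C9H8O4",
    "IUPACName": "2-acetyloxybenzoic acid",
    "success": True,
    "error": None,
}
print(describe_compound(example))
```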

**PubChem View for data**

```python
from provesid import PubChemView, get_property_table
# cids_aspirin comes from the PubChem example above
logp_table = get_property_table(cids_aspirin[0], "LogP")
logp_table
```

which returns a table with the reported values of `logP` for aspirin (including the references for each data point).

**Chemical Identifier Resolver**

```python
from provesid import NCIChemicalIdentifierResolver
resolver = NCIChemicalIdentifierResolver()
# smiles for formaldehyde
smiles = resolver.resolve("50-00-0", 'smiles')
print(f"SMILES for CASRN 50-00-0 is {smiles}") # SMILES for CASRN 50-00-0 is C=O
# inchi for aspirin
inchi = resolver.resolve("50-78-2", "stdinchi")
print(f"InChI for 50-78-2 is {inchi}") # InChI for 50-78-2 is InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
```
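
Since the resolver is queried with CAS Registry Numbers here, it can help to catch mistyped numbers locally before calling a web service. The following is a minimal sketch of the standard CAS check-digit rule (the last digit equals the weighted sum of the preceding digits modulo 10); `is_valid_cas` is a hypothetical helper and not part of `PROVESID`.

```python
import re

# Two to seven digits, two digits, and one check digit, separated by hyphens.
_CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    match = _CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


for cas in ("50-00-0", "50-78-2", "50-00-1"):
    print(cas, is_valid_cas(cas))
```

A passing check only means the number is well formed; it does not guarantee that the number is assigned to a substance.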

**OPSIN**

This is the online `OPSIN` interface. A local interface also exists that uses the `py2opsin` Python package and the Java executables of the `OPSIN` library. You can use the local version (recommended) by loading the `PYOPSIN` class instead of `OPSIN`.

```python
from provesid import OPSIN
opsin = OPSIN()
methane_result = opsin.get_id("methane")
```

which returns:

```python
{'status': 'SUCCESS',
 'message': '',
 'inchi': 'InChI=1/CH4/h1H4',
 'stdinchi': 'InChI=1S/CH4/h1H4',
 'stdinchikey': 'VNWKTOKETHGBQD-UHFFFAOYSA-N',
 'smiles': 'C'}
```

**CAS Common Chemistry**

```python
# One-time API key setup
from provesid import set_cas_api_key
set_cas_api_key("your-cas-api-key")  # Configure once

# Then use anywhere without specifying API key
from provesid import CASCommonChem
ccc = CASCommonChem()  # Automatically uses stored API key
water_info = ccc.cas_to_detail("7732-18-5")
print("Water (7732-18-5):")
print(f"  Name: {water_info.get('name')}")
print(f"  Molecular Formula: {water_info.get('molecularFormula')}")
print(f"  Molecular Mass: {water_info.get('molecularMass')}")
print(f"  SMILES: {water_info.get('smile')}")
print(f"  InChI: {water_info.get('inchi')}")
print(f"  Status: {water_info.get('status')}")
```

which returns

```
Water (7732-18-5):
  Name: Water
  Molecular Formula: H<sub>2</sub>O
  Molecular Mass: 18.02
  SMILES: O
  InChI: InChI=1S/H2O/h1H2
  Status: Success
```
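
Note that the molecular formula in the output above contains HTML subscript tags. If a plain-text formula is needed, the tags can be stripped with a regular expression. This is a minimal sketch; `plain_formula` is a hypothetical helper, not part of `PROVESID`, and it only handles the `<sub>` markup seen in this example.

```python
import re


def plain_formula(formula: str) -> str:
    """Remove <sub> and </sub> tags from a formula string."""
    return re.sub(r"</?sub>", "", formula)


print(plain_formula("H<sub>2</sub>O"))  # H2O
```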

**ChEBI**

This interface provides access to the European Bioinformatics Institute ChEBI (Chemical Entities of Biological Interest) database. See the [tutorial notebook](./examples/ChEBI/ChEBI_tutorial.ipynb).

**ZeroPM Global Chemical Inventory**

PROVESID now includes access to the [ZeroPM](https://database.zeropm.eu/) global chemical inventory database, which provides information about chemicals listed in regulatory inventories worldwide. The database is automatically downloaded on first use:

```python
from provesid.zeropm import ZeroPM

# Initialize - database downloads automatically if not present
zpm = ZeroPM()

# Query by CAS number
query_id = zpm.query_cas("50-00-0")  # Formaldehyde

# Get SMILES from CAS
smiles = zpm.get_smiles_from_cas("50-00-0")

# Search by chemical name
results = zpm.query_similar_name("formaldehyde", threshold=80)

# Query by regulatory inventory
eu_chemicals = zpm.query_by_inventory(inventory_name="REACH")

# Query by country
us_chemicals = zpm.query_by_country(country_name="United States")

# Get all available inventories
inventories = zpm.get_all_inventories()

# Get database statistics
stats = zpm.get_database_stats()
```

The database file (~400 MB) is downloaded automatically from [GitHub](https://github.com/ZeroPM-H2020/global-chemical-inventory-database) on first use and cached locally. You can also download it manually:

```python
# Manual download (only needed if auto-download fails)
zpm = ZeroPM(auto_download=False)  # Skip auto-download
zpm.download_database()  # Manually trigger download
```

See the [ZeroPM tutorial notebook](./examples/zeropm/zeropm-example.ipynb) for more examples.

**ClassyFire**

See the [tutorial notebook](./examples/ClassyFire/classyfire_tutorial.ipynb).

# Other tools

Several other packages and code samples, in Python and other languages, are available. We are inspired by them and have tried to improve upon them based on our personal experience working with chemical identifiers and data.

  - [PubChemPy](https://github.com/mcs07/PubChemPy) and [docs](https://docs.pubchempy.org/en/latest/)  
  - [CIRpy](https://github.com/mcs07/CIRpy) and [docs](https://cirpy.readthedocs.io/en/latest/)  
  - [IUPAC cookbook](https://iupac.github.io/WFChemCookbook/intro.html) for a tutorial on using various web APIs.  
  - more?

# TODO list

We will provide Python interfaces to more online services. Please [open an issue](https://github.com/USEtox/PROVESID/issues) and let us know what else you would like to have included.  

  - Add data and a tool for the [ChEBI ontology](https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/) using [pronto](https://github.com/althonos/pronto).
  - Add an interface to the [ChEMBL standardization pipeline](https://link.springer.com/article/10.1186/s13321-020-00456-1) using its [Python package](https://github.com/chembl/ChEMBL_Structure_Pipeline); this feature may be added to `IMPROVES`.
  - Add the [UniChem](https://www.ebi.ac.uk/unichem/api/docs) API.