Metadata-Version: 2.4
Name: PFASgroups
Version: 2.2.3
Summary: A comprehensive cheminformatics package for automated detection, classification, and analysis of per- and polyfluoroalkyl substances (PFAS). Combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding to identify 55 PFAS group classifications (28 OECD-defined groups and 27 generic categories) with fluorinated chain length determination.
Author: Luc Miaz
Author-email: Luc Miaz <luc@miaz.ch>
Maintainer-email: Luc Miaz <luc@miaz.ch>
License: CC-BY-NC-4.0
Project-URL: Homepage, https://github.com/lucmiaz/PFASGroups
Project-URL: Repository, https://github.com/lucmiaz/PFASGroups
Project-URL: Documentation, https://pfasgroups.readthedocs.io
Project-URL: Bug Tracker, https://github.com/lucmiaz/PFASGroups/issues
Keywords: PFAS,per- and polyfluoroalkyl substances,chemical classification,cheminformatics,environmental chemistry,molecular structure
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: rdkit
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: networkx
Requires-Dist: tqdm
Requires-Dist: svgutils
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Provides-Extra: database
Requires-Dist: sqlalchemy>=1.4.0; extra == "database"

# PFASgroups

A comprehensive cheminformatics package for automated detection, classification, and analysis of per- and polyfluoroalkyl substances (PFAS) in chemical databases.

## Overview

PFASgroups combines SMARTS pattern matching, molecular formula constraints, and graph-based pathfinding (using RDKit and NetworkX) to identify and classify PFAS compounds. The package enables systematic PFAS universe mapping and environmental monitoring applications.

## Key Features

### Core Capabilities
- **PFAS Group Identification**: Automated detection of 113 functional groups:
  - 72 non-telomer groups (OECD-defined and generic categories)
  - 40 fluorotelomer groups with linker validation (Groups 69-112)
  - 1 aggregate pattern-matching group (Group 113: Telomers)
- **Atom Reference Requirement**: For non-telomer groups, SMARTS patterns must match atoms that are part of or directly connected to the fluorinated component (per/polyfluorinated carbons), respecting the `max_dist_from_CF` constraint
- **Linker Validation**: CH₂-specific validation for the 40 fluorotelomer groups distinguishes them from direct-attachment analogues. Telomer groups use `linker_smarts` so that functional groups may be separated from the perfluorinated chain by a non-fluorinated linker
- **Aggregate Groups**: Pattern-matching groups that collect related PFAS groups via regex (e.g., Group 113 matches all "telomer" groups)
- **Component Length Analysis**: Quantification of per- and polyfluorinated alkyl components with CF₂ unit counting
- **Graph Metrics**: Comprehensive structural characterization (branching, eccentricity, diameter, resistance, centrality)
- **Customizable Definitions**: Easy extension to additional PFAS groups and halogenated chemical classes via JSON configuration
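
As a rough illustration of how an aggregate group could collect other groups by name, the sketch below applies a case-insensitive regular expression to a small mapping of group IDs to names. The IDs, the names, and the `collect_aggregate` helper are hypothetical and are not taken from the package's group definitions.

```python
import re

# Hypothetical group names for illustration only; not the package's real definitions.
GROUP_NAMES = {
    1: "Perfluoroalkyl carboxylic acids",
    70: "Fluorotelomer alcohols",
    71: "Fluorotelomer sulfonic acids",
    90: "Fluorotelomer acrylates",
}


def collect_aggregate(pattern, group_names):
    """Return the sorted IDs of all groups whose name matches the regex."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return sorted(gid for gid, name in group_names.items() if regex.search(name))


telomer_ids = collect_aggregate(r"telomer", GROUP_NAMES)
print(telomer_ids)  # [70, 71, 90]
```

Matching on names keeps the aggregate definition short, at the cost of depending on a consistent naming scheme across groups.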

### Additional Tools
- **Homologue Series Generation**: Iterative component shortening to explore theoretical chemical space
- **Fingerprint Generation**: PFAS fingerprints for machine learning applications
- **Visualization**: Assign and visualize PFAS groupings
- **Multiple Interfaces**: Python API, command-line tool, and browser-based JavaScript version (RDKitJS)
- **Batch Processing**: Efficient analysis of large chemical databases
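
One way to picture iterative component shortening is to drop one CF₂ unit at a time from a linear perfluoroalkyl chain. The sketch below does this at the SMILES-string level only. It is not the package's algorithm: it assumes a linear chain written with repeated `C(F)(F)` units, and the `shorten_homologues` helper is hypothetical.

```python
from typing import List


def shorten_homologues(smiles: str, unit: str = "C(F)(F)") -> List[str]:
    """Remove one CF2 unit per step, keeping the last one (the CF3 terminus)."""
    series = [smiles]
    current = smiles
    while current.count(unit) > 1:
        # Drop the first remaining unit; the rest of the string is unchanged.
        current = current.replace(unit, "", 1)
        series.append(current)
    return series


series = shorten_homologues("FC(F)(F)C(F)(F)C(F)(F)C(=O)O")
for member in series:
    print(member)
# FC(F)(F)C(F)(F)C(F)(F)C(=O)O
# FC(F)(F)C(F)(F)C(=O)O
# FC(F)(F)C(=O)O
```

A structure-aware implementation would operate on the molecular graph rather than on text, so that branched and cyclic components are handled correctly.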

## Installation

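Clone the repository and install the package in editable mode (this also installs its dependencies):

```sh
git clone https://github.com/lucmiaz/PFASGroups.git && cd PFASGroups
pip install -e .
```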

After installation, the `pfasgroups` command will be available in your terminal.

## Benchmark Summary (Feb 2026)

Benchmarks were run on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (4C/8T) with 15.5 GB RAM, using Python 3.9.23, RDKit 2025.09.2, and NetworkX 3.2.1. The older Python version was chosen for compatibility with PFAS-Atlas.

| Dataset/Profile | Count | Atom range | PFASgroups mean/median (ms) | PFAS-Atlas mean/median (ms) | Relative speed | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| OECD reference (real compounds) | 3,414 | Typical small/medium | 19.2 / 14.8 | 38.8 / 37.7 | 2.02x faster | Real-world dataset representing existing compounds. |
| Timing stress-test (full metrics) | 2,500 | 11-625 | 251.8 / 24.4 | 58.5 / 34.2 | 0.23x | Synthetic stress-test with large molecules; heavy-tail runtime. |
| Timing stress-test (no resistance) | 2,500 | 11-619 | 176.7 / 24.8 | N/A | 1.43x faster vs full | Disables effective graph resistance only. |
| Timing stress-test (no metrics) | 2,500 | 11-619 | 97.7 / 19.7 | N/A | 2.58x faster vs full | Disables all component graph metrics. |

Timing profile plots (full vs no resistance vs no metrics):
- [benchmark/reports/timing_profiles_comparison.png](benchmark/reports/timing_profiles_comparison.png)
- [benchmark/reports/timing_profiles_residuals.png](benchmark/reports/timing_profiles_residuals.png)
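
The "Relative speed" column in the table above appears to be a ratio of mean runtimes. This is an inference from the numbers rather than a documented formula. A short check, using the means copied from the table:

```python
# Mean runtimes in ms, copied from the benchmark table.
oecd_pfasgroups, oecd_atlas = 19.2, 38.8
stress_full, stress_atlas = 251.8, 58.5
stress_no_resistance, stress_no_metrics = 176.7, 97.7

# Assumption: relative speed = reference mean / measured mean.
ratios = {
    "OECD vs PFAS-Atlas": oecd_atlas / oecd_pfasgroups,
    "stress full vs PFAS-Atlas": stress_atlas / stress_full,
    "no resistance vs full": stress_full / stress_no_resistance,
    "no metrics vs full": stress_full / stress_no_metrics,
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x")
# OECD vs PFAS-Atlas: 2.02x
# stress full vs PFAS-Atlas: 0.23x
# no resistance vs full: 1.43x
# no metrics vs full: 2.58x
```

Because the stress-test runtimes are heavy-tailed, ratios of medians would give a different picture from these ratios of means.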

Disable or limit graph metrics in the Python API:

```python
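from PFASgroups import parse_smiles

# Example input; replace with your own SMILES strings
smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O"]
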
# Skip all component graph metrics (fastest)
parse_smiles(smiles_list, compute_component_metrics=False)

# Keep metrics but skip effective graph resistance entirely
parse_smiles(smiles_list, limit_effective_graph_resistance=0)

# Compute resistance only for components below a size threshold
parse_smiles(smiles_list, limit_effective_graph_resistance=200)
```

CLI equivalents:

```bash
# Skip all component graph metrics (fastest)
pfasgroups parse --no-component-metrics "C(C(F)(F)F)F"

# Skip effective graph resistance entirely
pfasgroups parse --limit-effective-graph-resistance 0 "C(C(F)(F)F)F"

# Compute resistance only for components below a size threshold
pfasgroups parse --limit-effective-graph-resistance 200 "C(C(F)(F)F)F"
```

## Quick Start

### Python API

```python
from PFASgroups import parse_smiles, generate_fingerprint

# Parse PFAS structures
smiles_list = ["C(C(F)(F)F)F", "FC(F)(F)C(F)(F)C(=O)O"]
results = parse_smiles(smiles_list)

# Generate fingerprints
fingerprints, group_info = generate_fingerprint(smiles_list)
```

### Command Line

```bash
# Parse SMILES strings
pfasgroups parse "C(C(F)(F)F)F" "FC(F)(F)C(F)(F)C(=O)O"

# Generate fingerprints
pfasgroups fingerprint "C(C(F)(F)F)F" --format dict

# List available PFAS groups
pfasgroups list-groups
```

## Custom Configuration

Use custom component SMARTS definitions (path types) and custom PFAS groups:

```python
# Load custom files entirely
from PFASgroups import get_componentSmartss, get_PFASGroups, parse_smiles

custom_paths = get_componentSmartss(filename='my_component_smartss.json')
custom_groups = get_PFASGroups(filename='my_groups.json')

results = parse_smiles(
    ["C(C(F)(F)F)F"],
    componentSmartss=custom_paths,
    pfas_groups=custom_groups
)
```

```python
# Or extend defaults with your custom groups
from PFASgroups import get_PFASGroups, PFASGroup, parse_smiles, compile_componentSmarts, get_componentSmartss

# Add custom PFAS groups
groups = get_PFASGroups()  # Get defaults
groups.append(PFASGroup(
    id=999,
    name="My Custom Group",
    smarts1="[C](F)(F)F",
    smarts2="[N+](=O)[O-]",
    componentSmarts="Perfluoroalkyl",
    constraints={"nF": [3, None]}
))

results = parse_smiles(["FC(F)(F)C(F)(F)[N+](=O)[O-]"], pfas_groups=groups)

# Custom max_dist_from_CF parameter
# For functional groups without formula constraints, when bycomponent=True,
# the max_dist_from_CF parameter limits the maximum bond distance between
# a functional group match and a fluorinated carbon terminal atom (default: 0)
groups.append(PFASGroup(
    id=998,
    name="Extended Distance Group",
    smarts1="[#6$([#6][OH1])]",
    smarts2=None,
    componentSmarts=None,
    constraints={},
    max_dist_from_CF=3  # Allow up to 3 bonds from fluorinated carbon
))

# Add custom path types (e.g., chlorinated analogs)
paths = get_componentSmartss()
paths['Perchlorinated'] = compile_componentSmarts(
    "[C;X4](Cl)(Cl)!@!=!#[C;X4](Cl)(Cl)",  # component pattern
    "[C;X4](Cl)(Cl)Cl"                     # end pattern
)

results = parse_smiles(["ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"], componentSmartss=paths)
```
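
To make the bond-distance idea behind `max_dist_from_CF` concrete, the sketch below runs a breadth-first search over a hand-written adjacency list. It is a simplified illustration and not the package's implementation: the toy molecule, the atom indices, and the `bond_distance` and `within_max_dist` helpers are assumptions. In particular, the package may count distance zero differently from this sketch, which counts bonds to the nearest fluorinated carbon.

```python
from collections import deque

# Toy heavy-atom graph for CF3-CF2-CH2-CH2-OH (fluorine and hydrogen omitted).
# Indices: 0 = CF3 carbon, 1 = CF2 carbon, 2 and 3 = CH2 carbons, 4 = oxygen.
ADJACENCY = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
FLUORINATED_CARBONS = {0, 1}


def bond_distance(adjacency, start, targets):
    """Breadth-first search: number of bonds from start to the nearest target."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom in targets:
            return dist
        for neighbour in adjacency[atom]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # no fluorinated carbon reachable


def within_max_dist(adjacency, matched_atom, targets, max_dist_from_CF=0):
    """True if the matched atom lies within the allowed number of bonds."""
    dist = bond_distance(adjacency, matched_atom, targets)
    return dist is not None and dist <= max_dist_from_CF


# Atom 3 is the carbon bearing the hydroxyl group, two bonds from the CF2 carbon.
print(bond_distance(ADJACENCY, 3, FLUORINATED_CARBONS))  # 2
print(within_max_dist(ADJACENCY, 3, FLUORINATED_CARBONS))  # False
print(within_max_dist(ADJACENCY, 3, FLUORINATED_CARBONS, max_dist_from_CF=3))  # True
```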

```bash
# Via command line
pfasgroups parse --groups-file my_custom_groups.json "C(C(F)(F)F)F"

# List available groups and paths
pfasgroups list-groups
pfasgroups list-paths
```

## Documentation

- **[USER_GUIDE.md](USER_GUIDE.md)** - Complete documentation with examples
- **[QUICK_REFERENCE.md](QUICK_REFERENCE.md)** - Quick reference for common tasks

## Usage Examples

See [USER_GUIDE.md](USER_GUIDE.md) for comprehensive examples including:
- Basic PFAS parsing and analysis
- Fingerprint generation for machine learning
- Custom configuration files
- Batch processing
- Integration with pandas and scikit-learn

## Summary of changes by version

- **Version 2.2 (Feb 2026)**: Added the `linker_smarts` option to restrict the path between SMARTS matches and the fluorinated component. Added new PFAS groups (telomers).
  - **v2.2.2**: Fixed telomers and added examples and counter-examples to each PFAS group. Removed boundary O in fluorinated components (for both per- and polyfluoroalkyl components).
  - **v2.2.3**: Added `resultsModel` to make plotting and summarising results easier.

- **Version 2.1 (Jan 2026)**: Added support for multiple smarts, with individual minimum count, per PFASgroup.

- **Version 2.0 (Jan 2026)**: Major expansion of graph‑based component metrics, new coverage statistics, schema updates, and richer per‑component outputs.

### Version 2.0 (January 2026) - Comprehensive Graph Metrics

Major enhancement adding comprehensive NetworkX graph theory metrics for detailed component analysis:

**New Features:**
- **Component-Level Metrics**: Each fluorinated component now includes 15+ graph metrics:
  - `diameter` and `radius` - Graph eccentricity bounds
  - `center`, `periphery`, `barycenter` - Structural node sets
  - `effective_graph_resistance` - Sum of resistance distances
  - `component_fraction` - Fraction of molecule covered by component (includes all attached H, F, Cl, Br, I)
  - Distance metrics from functional groups to structural features
- **Molecular Coverage Metrics**: New fraction-based metrics quantify fluorination extent:
  - `mean_component_fraction` - Average coverage per component
  - `total_components_fraction` - Total coverage by union of all components (accounts for overlaps)
- **Summary Statistics**: Aggregated metrics across all components per PFAS group
- **Enhanced Database Models**: New `Components` model stores individual component data with all metrics
- **Improved Analysis**: Better understanding of molecular topology, branching, functional group positioning, and fluorination extent
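
The coverage metrics can be read as set arithmetic on atom indices. The sketch below assumes each component is given as a set of atom indices (including attached H and halogen atoms) and that the total atom count of the molecule is known. The function bodies and the example numbers are hypothetical and only mirror the metric names listed above.

```python
def component_fraction(component_atoms, n_atoms_total):
    """Fraction of the molecule's atoms that belong to one component."""
    return len(component_atoms) / n_atoms_total


def total_components_fraction(components, n_atoms_total):
    """Fraction covered by the union of all components; shared atoms count once."""
    covered = set().union(*components)
    return len(covered) / n_atoms_total


# Hypothetical molecule with 16 atoms and two components sharing atoms 6 and 7.
components = [set(range(0, 8)), set(range(6, 10))]
n_atoms = 16

fractions = [component_fraction(c, n_atoms) for c in components]
mean_component_fraction = sum(fractions) / len(fractions)
total_fraction = total_components_fraction(components, n_atoms)

print(fractions)                # [0.5, 0.25]
print(mean_component_fraction)  # 0.375
print(total_fraction)           # 0.625
```

The total (0.625) is smaller than the sum of the individual fractions (0.75) because the two overlapping atoms are counted only once.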

**Breaking Changes:**
- `parse_mols` output now includes additional summary metric fields (`mean_diameter`, `mean_radius`, etc.)
- Database schema changes require migration (see `DATABASE_MIGRATION_GUIDE.md`)

**Metrics Explained:**
- `branching` (0-1): Measures linearity (1.0 = linear, 0.0 = highly branched) - renamed from "eccentricity"
- `mean_eccentricity`, `median_eccentricity`: Graph-theoretic eccentricity statistics for component nodes
- `smarts_centrality` (0-1): Functional group position (1.0 = central, 0.0 = peripheral)
- `component_fraction` (0-1): Fraction of total molecule atoms in this component (includes all attached atoms)
- `total_components_fraction` (0-1): Fraction of molecule covered by union of all components
- `diameter`: Maximum distance between any two atoms in component
- `radius`: Minimum eccentricity across component nodes
- `barycenter`: Nodes minimizing total distance to all other nodes
- `center`: Nodes with minimum eccentricity
- `periphery`: Nodes with maximum eccentricity

See `COMPREHENSIVE_METRICS_SUMMARY.md` for complete documentation.
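
The eccentricity-based definitions above can be reproduced on a small graph with a breadth-first search. The sketch below uses a plain adjacency dictionary rather than NetworkX so that it runs without dependencies. The five-node chain and the helper names are illustrative assumptions, the graph is assumed to be connected, and the package itself may compute these values differently.

```python
from collections import deque


def distances_from(adjacency, source):
    """Shortest-path length (in bonds) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def graph_metrics(adjacency):
    """Diameter, radius, center, periphery and barycenter of a connected graph."""
    all_dist = {n: distances_from(adjacency, n) for n in adjacency}
    ecc = {n: max(d.values()) for n, d in all_dist.items()}    # eccentricity per node
    total = {n: sum(d.values()) for n, d in all_dist.items()}  # summed distance per node
    diameter, radius = max(ecc.values()), min(ecc.values())
    min_total = min(total.values())
    return {
        "diameter": diameter,
        "radius": radius,
        "center": sorted(n for n in ecc if ecc[n] == radius),
        "periphery": sorted(n for n in ecc if ecc[n] == diameter),
        "barycenter": sorted(n for n in total if total[n] == min_total),
    }


# Linear five-carbon chain: 0-1-2-3-4
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
metrics = graph_metrics(chain)
print(metrics)
# {'diameter': 4, 'radius': 2, 'center': [2], 'periphery': [0, 4], 'barycenter': [2]}
```

For a linear chain the center and barycenter coincide at the middle atom; in branched components they can differ.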

- **Version 1.x**: Shift to component‑based analysis with improved SMARTS matching and better handling of branched/cyclic structures.

### Version 1.x - Component-Based Analysis

- Replaced chain-finding with connected component analysis
- Added support for branched and cyclic structures
- Improved SMARTS pattern matching for diverse PFAS classes

### Version 0.x - Path-Based Analysis

- Found SMARTS matches connected to either a second SMARTS match or a default path-related SMARTS, using NetworkX `shortest_path`.

## Licence
<a rel="license" href="http://creativecommons.org/licenses/by-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nd/4.0/">Creative Commons Attribution-NoDerivatives 4.0 International License</a>.

Contact me in case you want an exception to the No Derivatives term.

## Acknowledgments
This project is part of the [ZeroPM project](https://zeropm.eu/) (WP2) and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756. This work was developed at the [Department of Environmental Science](https://aces.su.se) at Stockholm University.<br />


<img alt="EU logo" src="https://zeropm.eu/wp-content/uploads/2021/12/flag_yellow_low.jpg" width="100" />     <a rel='zeropm_web' href="https://zeropm.eu/"><img alt="ZeroPM logo" src="https://zeropm.eu/wp-content/uploads/2022/01/ZeroPM-logo.png" width="250" /></a><a rel='zeropm_web' href="https://su.se/"><img alt="Stockholm University logo" src="https://eu01web.zoom.us/account/branding/p/5065401a-9915-4baa-9c16-665dcd743470.png" width="200" /></a>

[![Powered by RDKit](https://img.shields.io/badge/Powered%20by-RDKit-3838ff.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQBAMAAADt3eJSAAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAAFVBMVEXc3NwUFP8UPP9kZP+MjP+0tP////9ZXZotAAAAAXRSTlMAQObYZgAAAAFiS0dEBmFmuH0AAAAHdElNRQfmAwsPGi+MyC9RAAAAQElEQVQI12NgQABGQUEBMENISUkRLKBsbGwEEhIyBgJFsICLC0iIUdnExcUZwnANQWfApKCK4doRBsKtQFgKAQC5Ww1JEHSEkAAAACV0RVh0ZGF0ZTpjcmVhdGUAMjAyMi0wMy0xMVQxNToyNjo0NyswMDowMDzr2J4AAAAldEVYdGRhdGU6bW9kaWZ5ADIwMjItMDMtMTFUMTU6MjY6NDcrMDA6MDBNtmAiAAAAAElFTkSuQmCC)](https://www.rdkit.org/)
    
