PFASGroups GUI
Desktop application for PFAS structure classification, prioritisation, and modelling
1 Classification Tab
Purpose
Load a molecular dataset and classify each molecule against the built-in PFAS group
definitions. Matched groups are highlighted as colour-coded fragments in the
structure view (Tab 2).
Data sources
- File – CSV, Excel (.xlsx/.xls), or SQLite (.db/.sqlite). Browse for the
file, choose a sheet (Excel) and the SMILES / name columns.
- Database – SQLite, MariaDB, or PostgreSQL. Fill in the connection form
(host, port, database name, user, password, table name).
- Manual – Paste SMILES directly. One per line, or comma-separated with
optional names (Name: SMILES).
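As a rough illustration of the Manual format, the sketch below parses pasted
text into (name, SMILES) pairs. The function name parse_manual_input is
hypothetical, and the rules it applies (split on newlines and commas, take
the text before the first colon as the name) are assumptions based on the
description above, not the GUI's actual parser. An unnamed SMILES that itself
contains a colon (for example an atom-map label) would be mis-split by this
simple approach.

```python
def parse_manual_input(text):
    """Parse pasted text into (name, smiles) pairs.

    Assumed rules: entries are separated by newlines or commas, and an
    optional name precedes the first colon of an entry.
    """
    entries = []
    for line in text.splitlines():
        for chunk in line.split(","):
            chunk = chunk.strip()
            if not chunk:
                continue  # skip blank lines and stray commas
            name, sep, smiles = chunk.partition(":")
            if sep:
                entries.append((name.strip(), smiles.strip()))
            else:
                entries.append((None, chunk))  # unnamed entry
    return entries


pasted = "TFA: OC(=O)C(F)(F)F\nFC(F)(F)C(F)(F)F"
for name, smiles in parse_manual_input(pasted):
    print(name, smiles)
```

Unnamed entries come back with None as the name, so the caller can decide how
to label them.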
Options
- Halogens – Which halogens to include in the analysis. Start with
F only for classical PFAS; broaden to include Cl/Br/I for expanded
screening.
- Saturation filter – Perf. (CF₂/CF₃) restricts to perfluorinated chains;
All fluorinated includes partially fluorinated structures.
- Compute metrics – Calculate graph-theoretic component metrics (needed
for Tabs 4–6).
- Check definitions – Additionally test each molecule against the PFAS
definitions ticked below the option.
Results automatically propagate to all other tabs once classification completes.
2 Results Tab
Purpose
Browse the classified molecules as interactive compound cards. Each card shows:
- The 2-D structure with matched PFAS groups highlighted in group-specific colours.
- A list of matched group names and component counts.
- PFAS definition badges (pass/fail) if definition checking was enabled.
Controls
- Search – filter cards by molecule name.
- Show All / Hide All – toggle structure visibility for all cards.
- Export CSV – save a summary table (name, SMILES, matched groups,
definition results) to a CSV file.
3 Definition & SMARTS Tester Tab
Sub-tab A – PFAS Definitions
Enter a SMILES string and test it against:
- Built-in definitions (default, when the custom JSON field is left blank).
- Custom PFASDefinition JSON – paste an array of definition objects. Each
must have id, name, smarts (list of SMARTS strings), and optionally
fluorineRatio and description.
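The sketch below builds a one-element definition array with the required
keys listed above and checks its shape before pasting. All values (the id,
the name, the two SMARTS patterns) are placeholders chosen for illustration,
and validate_definitions is a hypothetical helper, not part of PFASGroups.
fluorineRatio is left out because its expected type is not stated here.

```python
import json

REQUIRED_KEYS = {"id", "name", "smarts"}


def validate_definitions(raw_json):
    """Return the parsed definition list, raising ValueError on a bad shape."""
    definitions = json.loads(raw_json)
    if not isinstance(definitions, list):
        raise ValueError("expected a JSON array of definition objects")
    for i, definition in enumerate(definitions):
        if not isinstance(definition, dict):
            raise ValueError(f"entry {i} is not a JSON object")
        missing = REQUIRED_KEYS - definition.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        smarts = definition["smarts"]
        if not isinstance(smarts, list) or not all(isinstance(s, str) for s in smarts):
            raise ValueError(f"entry {i}: 'smarts' must be a list of strings")
    return definitions


# Placeholder values for illustration only.
example = json.dumps(
    [
        {
            "id": 1,
            "name": "Example CF3/CF2 definition",
            "smarts": ["[CX4](F)(F)F", "[CX4](F)F"],
            "description": "Illustrative entry, not a built-in definition.",
        }
    ],
    indent=2,
)

validate_definitions(example)
print(example)
```

The printed JSON is what would be pasted into the custom definition field.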
Sub-tab B – Custom Group (SMARTS)
Define a custom HalogenGroup and test a molecule against it. Fields:
- componentSmarts – name of the underlying fluorine component pattern
(e.g. Perfluoroalkyl).
- SMARTS dict (JSON) – mapping of SMARTS pattern → required minimum count,
e.g. {"[CX3](=O)[OX2H1]": 1}.
- Constraints (JSON) – element-count constraints,
e.g. {"gte": {"F": 4}}.
Diagnostics are printed as JSON in the right panel, showing which stage passed
or failed (component detection, SMARTS matching, constraint checking).
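To make the Constraints field concrete, here is a minimal sketch of how a
constraint such as {"gte": {"F": 4}} could be evaluated against a molecule's
element counts. The helper satisfies_constraints is hypothetical. Only the
"gte" operator shown above is handled, since other operators are not
documented here; the element counts are typed in by hand rather than derived
from SMILES.

```python
def satisfies_constraints(element_counts, constraints):
    """Check element-count constraints of the form {"gte": {"F": 4}}.

    Only "gte" (greater than or equal) is supported in this sketch.
    """
    for operator, limits in constraints.items():
        if operator != "gte":
            raise ValueError(f"unsupported operator in this sketch: {operator}")
        for element, minimum in limits.items():
            if element_counts.get(element, 0) < minimum:
                return False
    return True


constraints = {"gte": {"F": 4}}

# Element counts entered by hand from the molecular formulas.
tfa = {"C": 2, "H": 1, "F": 3, "O": 2}   # trifluoroacetic acid, C2HF3O2
pfba = {"C": 4, "H": 1, "F": 7, "O": 2}  # perfluorobutanoic acid, C4HF7O2

print(satisfies_constraints(tfa, constraints))
print(satisfies_constraints(pfba, constraints))
```

With this constraint, trifluoroacetic acid (three fluorines) fails and
perfluorobutanoic acid (seven fluorines) passes.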
4 Prioritise Tab
Purpose
Rank the classified molecules to identify the most structurally diverse or
representative candidates.
Modes
- Intrinsic: total component size – sum of all matched fragment sizes.
- Intrinsic: max component size – largest single matched fragment.
- Intrinsic: total + max – weighted combination;
use α and β to balance the two terms.
- Reference-based KL divergence – divergence from a supplied reference
dataset. Supply a CSV/Excel/SQLite file as the reference.
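The three intrinsic modes can be sketched as follows. The combined score is
written here as α·total + β·max, which is an assumption about how the
"weighted combination" is formed; the function name, the molecule names, and
the example fragment sizes are all illustrative.

```python
def intrinsic_scores(component_sizes, alpha=1.0, beta=1.0):
    """Return the three intrinsic scores for one molecule.

    component_sizes: sizes of the matched fragments for that molecule.
    The combined form alpha * total + beta * max is an assumption.
    """
    total = sum(component_sizes)
    largest = max(component_sizes, default=0)
    return {
        "total": total,
        "max": largest,
        "combined": alpha * total + beta * largest,
    }


# Hypothetical molecules and fragment sizes.
molecules = {"mol_a": [4, 7, 2], "mol_b": [10], "mol_c": [3, 3]}

ranking = sorted(
    molecules,
    key=lambda name: intrinsic_scores(molecules[name], alpha=1.0, beta=0.5)["combined"],
    reverse=True,
)
print(ranking)
```

Raising β relative to α favours molecules with one large fragment over
molecules with many small ones.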
Output
A sortable ranking table and a horizontal bar chart showing the top-30
priority scores. Higher scores indicate higher priority (more structural
novelty or complexity relative to the reference).
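For the reference-based mode, the underlying quantity is the
Kullback-Leibler divergence between two discrete distributions. The sketch
below computes KL(P || Q) from two count tables (for example, how often each
PFAS group occurs in a dataset and in the reference). How the GUI builds
these distributions and turns them into per-molecule scores is not described
here, so the count tables, the additive smoothing, and the function name are
assumptions.

```python
import math


def kl_divergence(p_counts, q_counts, pseudocount=1.0):
    """KL(P || Q) in nats over the union of keys.

    A pseudocount is added to every key so that keys missing from one
    table do not produce a division by zero or log(0).
    """
    keys = sorted(set(p_counts) | set(q_counts))
    if not keys:
        return 0.0
    p_total = sum(p_counts.get(k, 0) for k in keys) + pseudocount * len(keys)
    q_total = sum(q_counts.get(k, 0) for k in keys) + pseudocount * len(keys)
    divergence = 0.0
    for key in keys:
        p = (p_counts.get(key, 0) + pseudocount) / p_total
        q = (q_counts.get(key, 0) + pseudocount) / q_total
        divergence += p * math.log(p / q)
    return divergence


# Hypothetical group counts for a dataset and a reference.
dataset = {"group_x": 9, "group_y": 1}
reference = {"group_x": 1, "group_y": 9}
print(kl_divergence(dataset, reference))
```

The divergence is zero when the two tables have identical proportions and
grows as they diverge.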
5 Chemical Space Tab
Purpose
Visualise the PFASGroups fingerprint space as an interactive 2-D scatter plot.
Methods
- UMAP – fast, topology-preserving. Adjust n_neighbours
(local structure) and min_dist (cluster compactness).
- PCA – linear; no hyper-parameters.
- t-SNE – non-linear; adjust perplexity.
Fingerprint presets
Select a FINGERPRINT_PRESETS key. best (binary +
effective_graph_resistance) gives the best inter-group discrimination according
to the MQG benchmark, but any preset can be used.
Colour by
By default, points are coloured by the dominant PFAS group. If a label
column was loaded with the data, select it here to colour by an arbitrary
property.
6 ML Modelling Tab
Purpose
Compare fingerprint descriptors for binary property prediction using
HistGradientBoostingClassifier with repeated stratified k-fold CV.
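Repeated stratified k-fold CV splits the data into k folds that each keep
roughly the same class ratio as the full dataset, then repeats the split with
a different shuffle. scikit-learn provides this as RepeatedStratifiedKFold;
the standard-library sketch below only illustrates what such a splitter does.
The defaults (5 splits, 3 repeats) are placeholders, not the GUI's settings.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_indices, test_indices) pairs.

    Within each repeat, every sample appears in exactly one test fold,
    and each class is dealt evenly across the folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal the shuffled class members round-robin into the folds.
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


labels = [0] * 10 + [1] * 5  # imbalanced binary target
for train, test in repeated_stratified_kfold(labels, n_splits=5, n_repeats=1):
    print(len(train), len(test), sum(labels[i] for i in test))
```

Each repeat yields n_splits train/test pairs, so the total number of scores
per fingerprint set is n_splits × n_repeats.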
Target column
Choose a column from your loaded data that contains binary labels (0/1 or
True/False). Common examples: bioactive, toxic, active.
Fingerprint sets
- PFASGroups presets – any combination of the benchmark-validated
fingerprint presets (binary, best, best_2, …).
- Morgan (RDKit) – radius-2 Morgan fingerprint, 512 bits. Useful
baseline.
- ToxPrint (729 bits) – requires pyCSRML to be installed.
- TxP_PFAS (129 bits) – PFAS-specific ToxPrint subset; requires pyCSRML.
- Custom TSV/CSV – rows = molecules (same order as input), columns = bits.
No header row.
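Before loading a custom fingerprint file, it can help to check that it
matches the layout described above. The sketch below reads a headerless
delimited file, verifies that the row count equals the number of input
molecules, and verifies that every row has the same number of columns. The
helper load_bit_matrix is hypothetical, and the requirement that every value
is 0 or 1 is an assumption drawn from "columns = bits".

```python
import csv
import io


def load_bit_matrix(handle, n_molecules, delimiter="\t"):
    """Read a headerless bit matrix: one row per molecule, one column per bit."""
    rows = []
    for line_no, row in enumerate(csv.reader(handle, delimiter=delimiter), start=1):
        if not row:
            continue  # ignore blank lines
        try:
            bits = [int(value) for value in row]
        except ValueError:
            raise ValueError(f"line {line_no}: non-integer value (header row present?)")
        if any(bit not in (0, 1) for bit in bits):
            raise ValueError(f"line {line_no}: values must be 0 or 1")
        rows.append(bits)
    if len(rows) != n_molecules:
        raise ValueError(f"expected {n_molecules} rows, found {len(rows)}")
    if len({len(bits) for bits in rows}) > 1:
        raise ValueError("rows have differing numbers of columns")
    return rows


sample = io.StringIO("1\t0\t1\n0\t0\t1\n")
print(load_bit_matrix(sample, n_molecules=2))
```

For a comma-separated file, pass delimiter="," instead of the tab default.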
Bayesian correlated t-test
All fingerprint-set pairs are compared using the Bayesian correlated t-test
(Benavoli et al. 2017) with ROPE = 0.01 ROC-AUC. The table reports:
- P(A>B) – probability that set A is meaningfully better than B.
- P(ROPE) – probability that the two sets are practically equivalent.
- P(B>A) – probability that set B is meaningfully better than A.
A P(ROPE) > 0.9 indicates the two fingerprint sets perform equivalently
on this dataset.
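The three probabilities can be sketched with the standard library alone. In
the Bayesian correlated t-test, the posterior of the mean score difference is
a Student t distribution with n − 1 degrees of freedom, centred on the mean
difference, with a variance inflated by the correlation ρ = 1/k between
overlapping training sets (the Nadeau and Bengio correction). The function
names below are hypothetical, and the Student t CDF is computed by numerical
integration, so treat this as an illustration rather than the GUI's code.

```python
import math
from statistics import mean, variance


def student_t_cdf(z, df, steps=2000):
    """Standard Student t CDF via Simpson's rule.

    Uses the substitution t = sqrt(df) * tan(theta), which maps the
    integral onto a bounded interval. steps must be even.
    """
    theta_max = math.atan(z / math.sqrt(df))
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(math.pi)
    h = theta_max / steps
    area = 1.0 + math.cos(theta_max) ** (df - 1)  # endpoints theta=0 and theta_max
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    return 0.5 + math.exp(log_c) * area * h / 3


def bayesian_correlated_ttest(scores_a, scores_b, n_splits, rope=0.01):
    """Return (P(A>B), P(ROPE), P(B>A)) from paired per-fold scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    centre = mean(diffs)
    rho = 1.0 / n_splits  # correlation induced by overlapping training sets
    scale = math.sqrt((1.0 / n + rho / (1.0 - rho)) * variance(diffs))
    if scale == 0.0:
        # Degenerate case: all differences identical.
        return (float(centre > rope), float(abs(centre) <= rope), float(centre < -rope))

    def cdf(x):
        return student_t_cdf((x - centre) / scale, n - 1)

    p_b_better = cdf(-rope)
    p_rope = cdf(rope) - cdf(-rope)
    p_a_better = 1.0 - cdf(rope)
    return p_a_better, p_rope, p_b_better


# Hypothetical ROC-AUC scores from one 5-fold run.
auc_a = [0.90, 0.91, 0.89, 0.92, 0.90]
auc_b = [0.80, 0.80, 0.80, 0.80, 0.80]
print(bayesian_correlated_ttest(auc_a, auc_b, n_splits=5))
```

The three values sum to 1, so a large P(ROPE) necessarily leaves little
probability for either set being meaningfully better.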
Keyboard shortcuts
Ctrl+Q – quit the application.
Ctrl+W – close (same as Ctrl+Q on Windows).
Installation
Install the GUI dependencies into your environment:
pip install "PFASGroups[gui]"
Or from the repository:
pip install -e ".[gui]"
Then launch with:
pfasgroups-gui
References
- Benavoli A et al. (2017) Time for a Change: a Tutorial for Comparing
Multiple Classifiers Through Bayesian Analysis. JMLR 18(1):2653–2688.
- Nadeau C & Bengio Y (2003) Inference for the Generalization Error.
Machine Learning 52(3):239–281.