mcp-molecules TODO
==================

Format:
    [ ] N      top-level item (open)
    [x] N      top-level item (done)
        [ ] N.M     sub-item
            [ ] N.M.K   sub-sub-item
Keep every item very short (1-5 sentences).


[ ] 0 Molecular weight calculator updates
      MCP tool: molecular_weight_calculator (args: formula, unit=g/mol|kg/mol|
      Da|u|kDa, uncertainty, monoisotopic, composition). Every call returns all
      three mass flavors (nominal/average/monoisotopic) under "masses"; the
      monoisotopic flag picks which one is mirrored at the top level ("primary").
      Example:
        request:  {"formula": "H2O"}
        reply:    {"formula": "H2O", "unit": "g/mol", "weight": 18.01535,
                   "uncertainty": null, "monoisotopic": false,
                   "primary": "average", "atoms": {"H": 2, "O": 1},
                   "masses": {
                     "nominal":      {"weight": 18.0,     "uncertainty": null,
                                      "formatted": "18.00 g/mol"},
                     "average":      {"weight": 18.01535, "uncertainty": null,
                                      "formatted": "18.02 g/mol"},
                     "monoisotopic": {"weight": 18.01056, "uncertainty": null,
                                      "formatted": "18.0106 g/mol"}},
                   "formatted": "18.02 g/mol"}
    [x] 0.1 Accept Unicode subscript digits (U+2080-2089) in formulae in
            addition to normal ASCII numbers, e.g. parse "H₂O" the same as
            "H2O". Normalize subscripts to ASCII before parsing.
    [x] 0.2 Return all mass flavors at once. molecular_weight_calculator
            originally supported two flavors, selected one at a time by the
            monoisotopic flag: the standard atomic weight (average molar mass,
            the default) and the monoisotopic mass (exact mass of the most
            abundant natural isotope of each element). Getting both forced a
            second round-trip. To avoid that extra AI -> MCP server -> AI
            turnaround, compute and return every flavor on every call (keep the
            flag for back-compat and to choose which one is "primary"), so the
            model never has to re-ask for the other.
            DONE: output now carries a "masses" dict with every flavor (each
            {weight, uncertainty, formatted}) plus a "primary" key; the
            monoisotopic flag only selects which flavor is mirrored at the top
            level. avg + mono summed from load_weights(False/True).
    [x] 0.3 Add the remaining mass flavors from 4.4 to the all-flavors output
            (0.2). The three distinct, clearly-labeled flavors are:
              - nominal      sum of integer mass numbers of the most abundant
                             isotopes
              - average      standard atomic weight (average molar mass)
              - monoisotopic exact mass of the most abundant isotopes
            0.2 already covers average and monoisotopic; nominal is the new one
            to add. Return all three together so callers never conflate them.
            DONE: new weights.load_nominal() builds {symbol: mass number} (most
            abundant / most-stable isotope; D=2, T=3); summed into the "nominal"
            flavor. All three flavors returned together under "masses".
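            Sketch of the all-flavors shape from 0.2/0.3, not the shipped code.
            ELEMENTS and all_masses are hypothetical names, and the H/O values
            are illustrative stand-ins rather than the bundled NIST table; the
            real sums come from load_weights / load_nominal.

```python
# Illustrative per-element data (assumed values, not the bundled table):
# symbol -> (nominal mass number, average atomic weight, monoisotopic mass)
ELEMENTS = {
    "H": (1, 1.007975, 1.00782503),
    "O": (16, 15.9994, 15.99491462),
}


def all_masses(atoms, monoisotopic=False):
    """Sum every flavor in one pass; the flag only picks the primary one."""
    totals = {"nominal": 0.0, "average": 0.0, "monoisotopic": 0.0}
    for symbol, count in atoms.items():
        nominal, average, mono = ELEMENTS[symbol]
        totals["nominal"] += nominal * count
        totals["average"] += average * count
        totals["monoisotopic"] += mono * count
    masses = {name: {"weight": round(value, 5)} for name, value in totals.items()}
    primary = "monoisotopic" if monoisotopic else "average"
    # Mirror the primary flavor at the top level, keep all three under "masses".
    return {"primary": primary, "weight": masses[primary]["weight"], "masses": masses}


water = all_masses({"H": 2, "O": 1})
print(water["primary"], water["masses"])
```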

[ ] 1 Molecule name database tool (name -> formula and formula -> name)
      MCP tool: find_chemical_compound (args: query, by=auto|name|formula,
      limit). Examples:
        request:  {"query": "aspirin"}
        reply:    {"query": "aspirin", "interpreted_as": "name",
                   "matches": [{"name": "Aspirin", "formula": "C9H8O4"}],
                   "source": "pubchem", "license": "public-domain"}
        request:  {"query": "C6H12O6", "by": "formula", "limit": 3}
        reply:    {"query": "C6H12O6", "interpreted_as": "formula",
                   "matches": [{"name": "An inositol", "formula": "C6H12O6"},
                               {"name": "D-Glucose", "formula": "C6H12O6"},
                               {"name": "Hexose", "formula": "C6H12O6"}],
                   "source": "pubchem", "license": "public-domain"}
    [ ] 1.1 Design the storage format. The dataset is large (millions of
            names; many aliases per compound), so a flat CSV is a poor fit
            for fast lookups. Decide on an efficient, bundleable store that
            supports both directions (name -> formula, formula -> name) and
            ships inside the package. Evaluate options before collecting data.
        [x] 1.1.1 Bundled SQLite (recommended candidate): single file, no
                  deps (stdlib sqlite3), indexed name and formula columns,
                  read-only at runtime. Consider FTS5 for fuzzy/partial name
                  search. Check on-disk size for the full dataset.
                  DONE: stdlib sqlite3, read-only (mode=ro&immutable=1) via
                  importlib.resources as_file. FTS5 deferred (not guaranteed
                  on all platforms). WITHOUT ROWID lookup tables avoid a
                  duplicate name index. Full set too big to bundle -> ship a
                  bounded subset; current bundled DB = PubChem CIDs 1..100k,
                  ~14MB (names+formula).
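                  Sketch of the read-only open and a WITHOUT ROWID lookup
                  table, assuming a hypothetical one-table schema (names) and
                  helper names (build_demo_db, open_readonly). The real code
                  resolves the path via importlib.resources as_file, which is
                  not shown here.

```python
import sqlite3
import tempfile
from pathlib import Path


def build_demo_db(path: Path) -> None:
    """Create a tiny stand-in for the bundled DB (hypothetical schema)."""
    conn = sqlite3.connect(path)
    # WITHOUT ROWID: the PRIMARY KEY is the table's own b-tree, so no
    # separate index on norm_name is stored.
    conn.execute(
        "CREATE TABLE names (norm_name TEXT PRIMARY KEY, formula TEXT NOT NULL)"
        " WITHOUT ROWID"
    )
    conn.execute("INSERT INTO names VALUES ('water', 'H2O')")
    conn.commit()
    conn.close()


def open_readonly(path: Path) -> sqlite3.Connection:
    """Open with mode=ro (no writes) and immutable=1 (file never changes)."""
    uri = f"{path.resolve().as_uri()}?mode=ro&immutable=1"
    return sqlite3.connect(uri, uri=True)


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "names_demo.db"
    build_demo_db(db_path)
    conn = open_readonly(db_path)
    row = conn.execute(
        "SELECT formula FROM names WHERE norm_name = ?", ("water",)
    ).fetchone()
    try:
        conn.execute("INSERT INTO names VALUES ('ice', 'H2O')")
        write_refused = False
    except sqlite3.OperationalError:
        write_refused = True  # read-only connection rejects the write
    conn.close()

print(row, write_refused)
```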
        [ ] 1.1.2 Alternatives to weigh: compressed columnar (Parquet) for
                  size, on-disk key-value (dbm/LMDB), or a prebuilt sorted +
                  mmap'd index. Compare size, lookup speed, build complexity,
                  and zero-dependency installs.
        [x] 1.1.3 Schema: normalize names (lowercase, strip annotations like
                  "(9CI)"), store canonical name + aliases -> formula, keep
                  source/license tag per row. formula -> name returns the
                  canonical (preferred) name. Decide what to bundle vs build.
                  DONE: normalized 3-table schema (compounds/names/formulas);
                  shared naming.normalize_name + hill_formula; source/license
                  on compounds + meta. Preferred name via insertion order
                  (rank); CID proxy is rough (e.g. C6H12O6 -> "Hexose") --
                  proper notability rank is a later enhancement. Builder
                  tools/build_namedb.py (build), prebuilt .db shipped (bundle).
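                  Sketch of the name normalization step, assuming only the two
                  rules stated above (lowercase, strip "(9CI)"-style tags) plus
                  NFKC and whitespace folding. normalize_name_sketch is a
                  hypothetical stand-in; naming.normalize_name may do more.

```python
import re
import unicodedata

# Matches CAS collective-index tags such as "(9CI)" or "(8CI)".
_INDEX_TAG = re.compile(r"\s*\(\d+CI\)", re.IGNORECASE)


def normalize_name_sketch(name: str) -> str:
    """Return a lookup key: NFKC, tag stripped, single spaces, case-folded."""
    text = unicodedata.normalize("NFKC", name)
    text = _INDEX_TAG.sub("", text)
    return " ".join(text.split()).casefold()


print(normalize_name_sketch("Acetic acid (9CI)"))
print(normalize_name_sketch("  D-Glucose "))
```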
        [ ] 1.1.4 Keep each source in a separate SQLite file (one DB per
                  source), so the local cache (item 3) can add/update sources
                  independently and the query layers them. Support loading DB
                  files from the user's home (e.g. ~/.local/share or a
                  configurable dir), not only the bundled package location.

    [ ] 1.2 Shipped pre-fetched cache (bundled partial database): a curated,
            size-bounded subset shipped in the package so common lookups work
            fully offline with zero setup. This IS the Tier-1 cache (item 3),
            read before any network call; misses fall through to on-demand
            per-record query (item 2). Decide subset criteria and a max on-disk
            size budget for the wheel. Sources (all public domain): Wikidata
            and/or PubChem.
        [ ] 1.2.1 Wikidata -- CC0 1.0 (public domain), no obligations. P274
                  (formula) + English label/aliases; ~1.43M items have P274.
                  Query SPARQL (query.wikidata.org/sparql) or JSON dump; key
                  on P274, not class Q11173.
        [ ] 1.2.2 PubChem (NCBI/NLM) -- public domain (US gov). Largest set
                  (~119M compounds). Names: CID-Synonym-filtered.gz,
                  CID-Title.gz, CID-IUPAC.gz (FTP /pubchem/Compound/Extras/).
                  Formula not in Extras: extract PUBCHEM_MOLECULAR_FORMULA
                  from CURRENT-Full/SDF and join on CID, or use PUG-REST for
                  a curated subset. Acknowledgment requested, not required.

    [ ] 1.3 Additional data sources (license-checked reference). We are NOT
            bulk-downloading whole databases. Coverage beyond the bundled
            subset (1.2) comes from on-demand per-record query (item 2), cached
            locally (item 3). The sources below are vetted for licensing -- use
            only SAFE ones, whether for querying or for building future bundled
            subsets.

            SAFE TO REDISTRIBUTE (public domain / GPL-compatible):
        [ ] 1.3.1 EPA DSSTox / CompTox -- US public domain. >1M chemicals
                  with preferred name, synonyms, formula, SMILES, InChI.
                  Bulk CSV/SDF on epa.figshare.com. Best bundleable bulk set.
        [ ] 1.3.2 ChEBI (EBI) -- CC BY 4.0 (attribution). ~195k curated
                  entries, high quality, name + formula + synonyms. Bulk
                  SDF/OBO/OWL at ftp.ebi.ac.uk/pub/databases/chebi/.
        [ ] 1.3.3 ChEMBL (EBI) -- CC BY-SA 3.0 (attribution + share-alike).
                  ~2.4M compounds. OK but the data subset stays CC BY-SA
                  (cannot relicense); keep it clearly labeled if used.

            GENERATE INSTEAD OF SHIP (tools, no DB to redistribute):
        [ ] 1.3.4 OPSIN (MIT) parses systematic IUPAC names -> SMILES;
                  RDKit (BSD) CalcMolFormula -> formula. Runtime fallback for
                  systematic names; does NOT cover trivial names (aspirin,
                  caffeine) -- those still need a lookup table.

            DO NOT BUNDLE (license-incompatible / unavailable):
        [ ] 1.3.5 Avoid: DrugBank full (CC BY-NC / proprietary), NIST
                  Chemistry WebBook (SRD, all rights reserved), CAS Common
                  Chemistry (CC BY-NC), ChemIDplus (retired 2022 -> PubChem).

    [ ] 1.4 Tool implementation -- the layered query layer behind the MCP tool
            find_chemical_compound (the bidirectional lookup that replaced the
            directional name_to_formula / formula_to_name). A Source chain in
            src/mcp_molecules/names.py is searched in tier order; all three
            tiers (bundled, cache, remote) are now wired, see 1.4.1-1.4.3.
        [x] 1.4.1 Search the local, package-shipped databases (Tier 1, the
                  bundled subset from 1.2). Read-only; the first tier hit in the
                  lookup order. This is the always-available offline path.
                  DONE: Source protocol + BundledSource + find_compound in
                  names.py (auto name/formula routing with fallback); exposed via
                  the find_chemical_compound MCP tool.
        [x] 1.4.2 Search the cache databases (Tier 2, the writable user-dir
                  store from 1.1.4). Consulted after the bundled subset and
                  before any network call; holds records fetched in 1.4.3.
                  DONE: src/mcp_molecules/cache.py -- writable SQLite at
                  $MCP_MOLECULES_CACHE_DB or $XDG_DATA_HOME/mcp-molecules/
                  names_cache.db, created lazily on first write (pure reads of a
                  missing cache are a no-op). Bundled-mirror schema + per-row
                  source/license (cache mixes sources) + negcache table. Wired as
                  CacheSource in names.py SOURCES, between bundled and remote.
        [x] 1.4.3 Fetch records from the remote databases (Tier 3, online
                  fallback per item 2) on a miss in 1.4.1+1.4.2. Opt-in network,
                  fail soft to "not found" when offline, and write what comes
                  back into the Tier-2 cache (1.4.2).
                  DONE: RemoteSource in names.py -> remote.py (item 2.2). Network
                  opt-in via $MCP_MOLECULES_ONLINE; offline/errors degrade to []
                  (no raise). Hits are written to the Tier-2 cache; genuine "not
                  found" goes to the negcache (TTL, $MCP_MOLECULES_NEGCACHE_TTL,
                  default 1 week) so misses don't re-query; network errors are
                  NOT remembered, so a flaky link retries later.
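                  Sketch of the tier order described in 1.4.1-1.4.3, using
                  in-memory stand-ins. DictSource and find_tiered are
                  hypothetical names, OSError stands in for any network
                  failure, and the negcache TTL is left out for brevity.

```python
class DictSource:
    """Toy stand-in for one tier; real tiers would wrap SQLite or HTTP."""

    def __init__(self, name, records=None, fail=False):
        self.name = name
        self.records = dict(records or {})
        self.fail = fail

    def lookup(self, query):
        if self.fail:
            raise OSError("network unreachable")
        return list(self.records.get(query.casefold(), []))

    def store(self, query, matches):
        self.records[query.casefold()] = list(matches)


def find_tiered(query, bundled, cache, remote, negcache):
    """Search bundled -> cache -> remote; write remote hits back to cache."""
    for tier in (bundled, cache):
        matches = tier.lookup(query)
        if matches:
            return {"source": tier.name, "matches": matches}
    key = query.casefold()
    if key in negcache:  # known miss: skip the network
        return {"source": None, "matches": []}
    try:
        matches = remote.lookup(query)
    except OSError:
        # Fail soft; the error is not remembered, so a later call retries.
        return {"source": None, "matches": []}
    if matches:
        cache.store(query, matches)
        return {"source": remote.name, "matches": matches}
    negcache.add(key)  # genuine "not found"
    return {"source": None, "matches": []}


bundled = DictSource("bundled", {"aspirin": [{"name": "Aspirin", "formula": "C9H8O4"}]})
cache = DictSource("cache")
remote = DictSource("remote", {"caffeine": [{"name": "Caffeine", "formula": "C8H10N4O2"}]})
negcache = set()

first = find_tiered("caffeine", bundled, cache, remote, negcache)
second = find_tiered("caffeine", bundled, cache, remote, negcache)
print(first["source"], second["source"])
```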

[ ] 2 Per-request download of a chemical database record (online fallback).
      On a miss in the pre-fetched cache (item 3), query the live database(s)
      per request, return the record, and cache it. Opt-in network; graceful
      offline fallback; negative caching with a TTL. See databases.txt (Tier 3).
    [x] 2.0 Per-source cache files: store each source's fetched records in its
            own SQLite DB (e.g. names_cache_<source>.db) instead of the single
            mixed-source names_cache.db, so sources can be added/updated/dropped
            independently and the query layers them. Closes the 1.1.4 gap (one
            DB per source): per-file provenance replaces the per-row source/
            license columns; CacheSource (names.py) iterates the per-source
            files in the user dir.
            DONE: cache.py is now directory-oriented -- cache_dir()
            ($MCP_MOLECULES_CACHE_DIR, else $XDG_DATA_HOME/mcp-molecules) holds
            one names_cache_<source>.db per source; cache_path(source) +
            list_sources() discover them. Dropped the compounds.source/license
            columns (UNIQUE now on source_ref alone); provenance moved to each
            file's meta table, read via source_license(source). store/lookup_*/
            is_negative/remember_miss all take a leading source arg; negcache is
            per-source. names.CacheSource iterates list_sources() and returns the
            first file with a hit (with its provenance); RemoteSource writes/
            negcaches under remote.SOURCE. Env var renamed CACHE_DB -> CACHE_DIR.
    [ ] 2.1 PubChem (NCBI/NLM) -- public domain. Live PUG-REST lookup:
            name -> /compound/name/{name}/property/MolecularFormula,Title/JSON;
            formula -> /compound/fastformula/{formula}/cids/JSON; partial ->
            /autocomplete/compound/{prefix}/json. Resolves trivial + systematic
            names + CAS numbers. Reliable; rate limit 5 req/s, 400 req/min.
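            Sketch of URL builders for the three routes above (no request is
            sent). The function names and MIN_INTERVAL_S are hypothetical; the
            paths are the ones listed in 2.1 under the public PUG-REST and
            autocomplete base URLs.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
AUTOCOMPLETE = "https://pubchem.ncbi.nlm.nih.gov/rest/autocomplete"

# 5 req/s from the limit quoted above -> at least 0.2 s between calls.
MIN_INTERVAL_S = 1 / 5


def name_url(name: str) -> str:
    """name -> formula + title."""
    encoded = quote(name, safe="")
    return f"{PUG_REST}/compound/name/{encoded}/property/MolecularFormula,Title/JSON"


def formula_url(formula: str) -> str:
    """formula -> matching CIDs."""
    return f"{PUG_REST}/compound/fastformula/{quote(formula, safe='')}/cids/JSON"


def autocomplete_url(prefix: str) -> str:
    """partial name -> suggestions."""
    return f"{AUTOCOMPLETE}/compound/{quote(prefix, safe='')}/json"


print(name_url("acetic acid"))
print(formula_url("C6H12O6"))
print(autocomplete_url("aspi"))
```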
    [x] 2.2 Wikidata -- CC0 1.0 (public domain, no attribution obligation).
            A single-record lookup is light, so the bulk-extraction flakiness
            from 1.2.1 mostly does not bite here. name -> entity via the API
            (action=wbsearchentities&search={name}&language=en&type=item), then
            read P274 (chemical formula) from its claims (action=wbgetentities&
            props=claims|labels). formula -> name via SPARQL (?item wdt:P274
            "{hill}"; rdfs:label @en). Normalize P274 through hill_formula; send
            a descriptive User-Agent (WMF policy).
            DONE: src/mcp_molecules/remote.py. wikidata_by_name uses
            wbsearchentities -> wbgetentities (props=claims|labels|aliases),
            keeping only candidates whose en label/alias normalizes to the query
            and that carry P274. wikidata_by_formula uses SPARQL on the Hill key.
            P274 normalized via hill_formula (raw NFKC fallback); descriptive UA;
            fail-soft _get_json (returns None on any timeout/HTTP/parse error).
    [ ] 2.3 EPA DSSTox / CompTox -- US public domain (see 1.3.1). Live CCTE
            Chemical API at api-ccte.epa.gov (free x-api-key header). name ->
            /chemical/search/equal/{name} (or start-with/contain) returns
            DTXSID hits; then /chemical/detail/search/by-dtxsid/{dtxsid} for
            molFormula + preferredName + casrn. formula -> DTXSIDs via the
            equivalent formula search. Normalize molFormula through hill_formula;
            same fail-soft pattern as 2.2 (degrade to [] on any error). API key
            from $MCP_MOLECULES_EPA_API_KEY (sent as x-api-key); unset -> skip
            this source (treat like offline), so the keyless sources still work.
    [ ] 2.4 ChEBI (EBI) -- CC BY 4.0 (attribution; see 1.3.2). Live ChEBI Web
            Services at www.ebi.ac.uk/webservices/chebi/2.0. name ->
            getLiteEntity?search={name}&searchCategory=ALL returns ChEBI IDs;
            then getCompleteEntity?chebiId={id} for Formulae + chebiAsciiName.
            formula -> getLiteEntity with searchCategory=FORMULA. Normalize
            Formulae through hill_formula; keyless; fail-soft like 2.2 (degrade
            to [] on any error). Attribution required: carry source/license
            through to the result so it surfaces (the cache already records it
            per record / per 2.0 per-source file).

[ ] 3 Pre-fetched cache databases (bundled + user cache).
      Small SQLite stores read first, before any network call. Bundled subset
      ships in the wheel (read-only); fetched records (item 2) go to a writable
      user-dir cache. See databases.txt (Tier 1 + Tier 2).
    [x] 3.1 PubChem (NCBI/NLM) -- public domain. Bundled everyday subset BUILT:
            src/mcp_molecules/data/names_pubchem.db -- PubChem CIDs 1..100k,
            synonym-count rank>=50 + clean-name heuristic, 6,943 compounds,
            3.3 MB. Reproducible: fetch_pubchem -> curate -> build_namedb.
            DONE: committed to the repo (a .gitignore negation tracks just this
            .db while *.db stays ignored), so it is version-tracked and hatchling
            ships it in the wheel by default. Regenerate then commit on refresh.

[ ] 4 Calculator extensions (offline, deterministic, formula-in). New
      computations derivable from the parsed formula + bundled NIST atomic data,
      same ethos as the existing molecular_weight_calculator. No structure/
      connectivity needed.
    [x] 4.1 Isotope distribution (isotope_distribution tool): full isotopic
            pattern (M, M+1, M+2 ... relative intensities), not just the
            monoisotopic mass. The key output for mass-spec comparison. Built
            from NIST isotope abundances.
            DONE: src/mcp_molecules/isotopes.py -- per-element (mass, abundance)
            polynomials raised to the atom count by binary exponentiation and
            convolved across elements, with pruning to stay bounded. Explicit D/T
            pin a single isotope; no-natural-abundance elements (Tc) fall back to
            the most stable isotope. grouping='unit' (nominal centroid) or
            'exact' (resolved isotopologues); threshold/limit trim peaks. Also
            returns monoisotopic + average mass. Exposed as the
            isotope_distribution MCP tool; tests in tests/test_server.py.
            Example -- isotope_distribution("C8H10N4O2")  (caffeine):
                {
                  "formula": "C8H10N4O2",
                  "charge": 0,
                  "grouping": "unit",
                  "monoisotopic_mass": 194.08038,
                  "average_mass": 194.19092,
                  "monoisotopic_mz": null,
                  "base_peak": {
                    "nominal": 194, "mass": 194.08038,
                    "relative": 100.0, "abundance": 0.8988
                  },
                  "peaks": [
                    {"nominal": 194, "mass": 194.08038, "relative": 100.0,  "abundance": 0.8988},
                    {"nominal": 195, "mass": 195.08287, "relative": 10.305, "abundance": 0.0926},
                    {"nominal": 196, "mass": 196.08497, "relative": 0.892,  "abundance": 0.0080}
                  ],
                  "formatted": "194 (100%), 195 (10%), 196 (1%)",
                  "source": "NIST Atomic Weights and Isotopic Compositions",
                  "license": "..."
                }
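            Sketch of the unit-grouped convolution idea behind 4.1 (per-element
            distributions raised to the atom count by binary exponentiation,
            then convolved across elements). Not the isotopes.py code: the
            names are hypothetical, exact masses are omitted, and ABUNDANCES
            holds rounded illustrative values for four elements only.

```python
# Natural isotopic abundances keyed by nominal-mass offset from the lightest
# stable isotope (rounded values; treat them as illustrative here).
ABUNDANCES = {
    "H": {0: 0.999885, 1: 0.000115},
    "C": {0: 0.9893, 1: 0.0107},
    "N": {0: 0.99636, 1: 0.00364},
    "O": {0: 0.99757, 1: 0.00038, 2: 0.00205},
}


def convolve(a, b, prune=1e-12):
    """Multiply two offset->probability polynomials, dropping tiny terms."""
    out = {}
    for shift_a, p_a in a.items():
        for shift_b, p_b in b.items():
            key = shift_a + shift_b
            out[key] = out.get(key, 0.0) + p_a * p_b
    return {k: p for k, p in out.items() if p >= prune}


def element_power(dist, count):
    """Raise one element's distribution to `count` by repeated squaring."""
    result = {0: 1.0}
    base = dict(dist)
    while count:
        if count & 1:
            result = convolve(result, base)
        count >>= 1
        if count:
            base = convolve(base, base)
    return result


def unit_pattern(atoms):
    """Return {offset: (abundance, relative %)} with the top peak at 100."""
    total = {0: 1.0}
    for symbol, count in atoms.items():
        total = convolve(total, element_power(ABUNDANCES[symbol], count))
    top = max(total.values())
    return {shift: (p, 100.0 * p / top) for shift, p in sorted(total.items())}


caffeine = unit_pattern({"C": 8, "H": 10, "N": 4, "O": 2})
for shift in (0, 1, 2):
    abundance, relative = caffeine[shift]
    print(f"M+{shift}: abundance={abundance:.4f} relative={relative:.3f}")
```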
    [ ] 4.2 Adduct / charge-state m/z: report m/z for common ions ([M+H]+,
            [M-H]-, [M+Na]+, multiply-charged). For (de)protonation this is
            (M +/- n*1.00728)/n, with n the number of charges and 1.00728 u the
            proton mass. Pairs with 4.1. PARTLY DONE in 4.1: the
            isotope_distribution tool's `charge` param already gives
            protonation/deprotonation m/z at any |n| ([M+nH]n+/[M-nH]n-).
            Remaining: named non-proton adducts (Na/K/NH4/...), and surfacing
            m/z on molecular_weight_calculator.
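            Sketch of the (de)protonation m/z formula from 4.2. protonated_mz
            is a hypothetical helper; a signed charge selects [M+nH] (positive)
            or [M-nH] (negative). Non-proton adducts are not covered.

```python
PROTON_MASS = 1.00728  # u, the value quoted in 4.2


def protonated_mz(neutral_mass: float, charge: int) -> float:
    """m/z after adding (charge > 0) or removing (charge < 0) |charge| protons."""
    if charge == 0:
        raise ValueError("charge must be non-zero for an m/z value")
    return (neutral_mass + charge * PROTON_MASS) / abs(charge)


caffeine = 194.08038  # monoisotopic mass from the 4.1 example
print(round(protonated_mz(caffeine, +1), 5))
print(round(protonated_mz(caffeine, -1), 5))
print(round(protonated_mz(caffeine, +2), 5))
```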
    [ ] 4.3 Degree of unsaturation (rings + double bonds):
            (2C + 2 + N - H - X) / 2, where C, N, H are atom counts and X is
            the halogen count (O and S do not contribute). One-line
            saturation/aromaticity hint.
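            Sketch of the 4.3 formula on a parsed {symbol: count} dict.
            degree_of_unsaturation is a hypothetical name, and counting only
            F/Cl/Br/I as X is an assumption.

```python
HALOGENS = ("F", "Cl", "Br", "I")  # assumed set for X


def degree_of_unsaturation(atoms: dict) -> float:
    """Rings + double bonds: (2C + 2 + N - H - X) / 2."""
    c = atoms.get("C", 0)
    h = atoms.get("H", 0)
    n = atoms.get("N", 0)
    x = sum(atoms.get(symbol, 0) for symbol in HALOGENS)
    return (2 * c + 2 + n - h - x) / 2


print(degree_of_unsaturation({"C": 6, "H": 6}))                    # benzene
print(degree_of_unsaturation({"C": 8, "H": 10, "N": 4, "O": 2}))   # caffeine
```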
    [x] 4.4 Distinct, clearly-labeled mass flavors: nominal vs average vs
            monoisotopic (exact) mass in one response -- people conflate them.
            DONE via 0.2 + 0.3: molecular_weight_calculator returns all three
            under "masses" on every call.
    [ ] 4.5 Expose Hill-system formula normalization (canonicalize an input
            formula). hill_formula already exists internally.
    [ ] 4.6 Formula arithmetic: add/subtract formulae (e.g. glucose - H2O, or
            sum a reaction's reactants) and report the net formula + mass.
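            Sketch of formula add/subtract on parsed {symbol: count} dicts.
            combine is a hypothetical helper; raising on a negative count is a
            design assumption (a bare Counter subtraction would silently drop
            it).

```python
from collections import Counter


def combine(a: dict, b: dict, sign: int = 1) -> dict:
    """Return a + sign*b as atom counts; sign=-1 subtracts."""
    result = Counter(a)
    for symbol, count in b.items():
        result[symbol] += sign * count
    negative = sorted(s for s, n in result.items() if n < 0)
    if negative:
        raise ValueError(f"cannot remove more atoms than present: {negative}")
    return {s: n for s, n in result.items() if n > 0}


glucose = {"C": 6, "H": 12, "O": 6}
water = {"H": 2, "O": 1}
print(combine(glucose, water, sign=-1))
```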
    [ ] 4.7 Reaction balancing + stoichiometry: solve coefficients for an
            unbalanced equation (nullspace of the element matrix), then
            mol<->gram conversion, limiting reagent, theoretical yield.
    [ ] 4.8 Solution/prep math: molarity <-> mass <-> volume, dilution
            (C1V1=C2V2), ppm/molal. Needs only molar mass.
    [ ] 4.9 Inverse elemental analysis: given target mass %s (CHN), find the
            best-fit formula. Inverse of percent composition.

[ ] 5 Calculator extensions needing a small bundled static table (still
      offline). A tiny per-element properties table, no network.
    [ ] 5.1 Empirical -> molecular formula given a measured molar mass
            (scale the empirical unit by the nearest integer ratio of measured
            molar mass to empirical-unit mass).
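            Sketch of 5.1. molecular_from_empirical is a hypothetical helper;
            the 5% tolerance on the ratio is an arbitrary assumption, and the
            empirical-unit mass is passed in rather than computed.

```python
def molecular_from_empirical(empirical, empirical_mass, measured_mass, tolerance=0.05):
    """Scale the empirical unit by round(measured / empirical mass)."""
    ratio = measured_mass / empirical_mass
    multiple = round(ratio)
    if multiple < 1 or abs(ratio - multiple) > tolerance * multiple:
        raise ValueError(f"ratio {ratio:.3f} is not close to an integer")
    return {symbol: count * multiple for symbol, count in empirical.items()}


# CH2O unit (about 30.026 g/mol) with a measured molar mass of 180.16 g/mol.
print(molecular_from_empirical({"C": 1, "H": 2, "O": 1}, 30.026, 180.16))
```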
    [ ] 5.2 Per-element property lookups: valence, electronegativity,
            group/period, common oxidation states. Enables oxidation-state
            assignment for simple formulae.
    [ ] 5.3 Formula validity / sanity checks: nitrogen rule, even-electron
            rule, plausible valence balance; flag impossible compositions for a
            given mass.

[ ] 6 Out of scope for the calculator (structure-dependent). logP, pKa,
      boiling/melting point, density, solubility, SMILES/InChI generation all
      need connectivity a bare formula lacks -- these belong to the lookup side
      (items 1-2), not the calculator. OPSIN/RDKit (1.3.4) is the formula<->
      structure bridge if ever needed. Recorded here so we stop reconsidering.

[ ] 7 Input validation and error-handling fixes. Rough edges surfaced by the
      functional test pass (tests/test_*_functional.py); the tests currently
      pin the present behavior, so update them when each is fixed.
    [ ] 7.1 parse_formula rejects any whitespace: " H2O", "H2O ", and "H2 O"
            all raise FormulaError. Trim leading/trailing whitespace before
            parsing (decide whether internal spaces stay an error).
    [ ] 7.2 parse_formula("H0") returns [("H", 0)] instead of erroring. Reject
            zero atom counts (and confirm explicit "H1" is handled as intended).
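            Sketch of the input handling proposed in 7.1/7.2 together with the
            subscript normalization from 0.1, on a deliberately flat grammar
            (no parentheses, hydrates, or charges). parse_flat_formula is a
            hypothetical stand-in for parse_formula; it raises ValueError where
            the real code raises FormulaError, keeps internal spaces an error,
            and does not check that symbols are real elements.

```python
import re

# U+2080..U+2089 (subscript digits) -> ASCII 0-9.
_SUBSCRIPTS = {0x2080 + digit: str(digit) for digit in range(10)}
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
_WHOLE = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def parse_flat_formula(text: str) -> list:
    """Return [(symbol, count), ...] after trimming and normalizing digits."""
    cleaned = text.strip().translate(_SUBSCRIPTS)
    if not _WHOLE.fullmatch(cleaned):
        raise ValueError(f"not a flat formula: {text!r}")
    atoms = []
    for symbol, digits in _TOKEN.findall(cleaned):
        count = int(digits) if digits else 1
        if count == 0:
            raise ValueError(f"zero atom count for {symbol!r} in {text!r}")
        atoms.append((symbol, count))
    return atoms


print(parse_flat_formula(" H2O "))
print(parse_flat_formula("H\u2082O"))
```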
    [ ] 7.3 The bundled name DB indexes some formula strings as name aliases
            (e.g. "C2H6O" -> Ethanol, "h2o" -> Water), so those queries resolve
            as interpreted_as="name" rather than "formula". Decide if this is
            wanted; if not, stop indexing bare formulae as names so formula
            queries route through the formula path.
    [x] 7.4 Investigate concurrent access: when multiple instances of this MCP
            server run at once, can a race condition corrupt or break the shared
            Tier-2 cache (writes/negcache) or the per-source SQLite files?
            Determine the failure modes and what locking/isolation is needed.
            DONE: no corruption (SQLite file locking protects the local-FS cache),
            but the default rollback journal raised "database is locked" under
            concurrent writers (reproduced: 6 procs x 200 writes), and those
            errors were NOT fail-soft -- cache.store/remember_miss are called
            unguarded by names.RemoteSource, so a lock timeout broke the whole
            find_chemical_compound call. Fix: cache._connect now opens WAL +
            15s busy_timeout (readers lock-free, writers serialize), and a
            _fail_soft wrapper degrades any residual OperationalError on every
            cache read/write to a skip-the-cache no-op. Regression tests in
            test_cache_remote.py (spawned multi-proc writers: no raise, 200/200
            rows, integrity ok). NFS/SMB caches stay out of scope (locking there
            is unreliable regardless).
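            Sketch of the 7.4 fix pattern: WAL journal, a 15 s busy timeout,
            and a wrapper that turns sqlite3.OperationalError into a
            skip-the-cache default. connect_cache, fail_soft, and store are
            hypothetical names with a toy table, not the cache.py code.

```python
import functools
import sqlite3
import tempfile
from pathlib import Path


def connect_cache(path):
    """Open a writable cache DB in WAL mode with a 15 s lock wait."""
    conn = sqlite3.connect(path, timeout=15.0)  # timeout sets the busy wait
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def fail_soft(default=None):
    """Decorator: return `default` instead of raising OperationalError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except sqlite3.OperationalError:
                return default
        return wrapper
    return decorator


@fail_soft(default=False)
def store(conn, name, formula):
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT OR REPLACE INTO names VALUES (?, ?)", (name, formula))
    return True


with tempfile.TemporaryDirectory() as tmp:
    conn = connect_cache(Path(tmp) / "names_cache_demo.db")
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    # No table yet: the OperationalError is swallowed and False comes back.
    missing_table = store(conn, "water", "H2O")
    conn.execute("CREATE TABLE names (name TEXT PRIMARY KEY, formula TEXT)")
    stored = store(conn, "water", "H2O")
    conn.close()

print(journal_mode, missing_table, stored)
```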
