Metadata-Version: 2.4
Name: metagenomescope
Version: 1.3.0
Summary: Visualization tool for (meta)genome assembly graphs
Project-URL: homepage, https://marbl.github.io/MetagenomeScope/
Project-URL: source, https://github.com/marbl/MetagenomeScope
Author-email: MetagenomeScope Development Team <mfedarko@umd.edu>
Maintainer-email: Marcus Fedarko <mfedarko@umd.edu>
License-Expression: GPL-3.0-only
License-File: COPYING.txt
Keywords: assembly,bioinformatics,bubble,graph,metagenome
Classifier: Development Status :: 3 - Alpha
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Visualization
Requires-Python: >=3.8
Requires-Dist: click
Requires-Dist: dash
Requires-Dist: dash-ag-grid>=33.3.3
Requires-Dist: dash-bootstrap-components
Requires-Dist: networkx
Requires-Dist: pandas
Requires-Dist: plotly
Requires-Dist: pyfastg>=0.2.0
Requires-Dist: pygraphviz
Provides-Extra: dev
Requires-Dist: black>=22.1.0; extra == 'dev'
Requires-Dist: flake8; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Description-Content-Type: text/markdown

# <a href="https://marbl.github.io/MetagenomeScope/"><img src="https://raw.githubusercontent.com/fedarko/MetagenomeScope-1/refs/heads/desk/metagenomescope/assets/favicon.ico" alt="Icon" /></a> MetagenomeScope

<div align="center">
<a href="https://github.com/marbl/MetagenomeScope/actions/workflows/main.yml"><img src="https://github.com/marbl/MetagenomeScope/actions/workflows/main.yml/badge.svg" alt="CI" /></a>
<a href="https://codecov.io/gh/marbl/MetagenomeScope"><img src="https://codecov.io/gh/marbl/MetagenomeScope/branch/main/graph/badge.svg" alt="Code Coverage" /></a>
<a href="https://pypi.org/project/metagenomescope"><img src="https://img.shields.io/pypi/v/metagenomescope?color=0073b7&labelColor=003d63" alt="PyPI" /></a>
<a href="https://anaconda.org/bioconda/metagenomescope"><img src="https://img.shields.io/conda/vn/bioconda/metagenomescope.svg?color=3eb049&labelColor=005500" alt="bioconda" /></a>
</div>

Interactive visualization tool for (meta)genome assembly graphs.

MetagenomeScope decomposes the graph into **structural patterns** and
highlights these as annotations on the graph. By default it lays out the graph
[**hierarchically**](https://en.wikipedia.org/wiki/Layered_graph_drawing),
using [Graphviz](https://graphviz.org/)'
[_dot_](https://graphviz.org/docs/layouts/dot/) algorithm.
These and other approaches help simplify the investigation of fine-grained
structures within assembly graphs.

MetagenomeScope also includes features for visualizing assembly
graphs at larger scales -- for example, highlighting scaffold paths on the graph and
drawing summary plots of the graph's structure.

MetagenomeScope supports the outputs of most modern assemblers,
can handle large graphs including tens of thousands of nodes,
and is backed by over five hundred automatic software tests.

The tool is under active development, so please let us know if you have any feedback!

## Screenshots

<table>
  <tbody>
    <tr align="center">
      <td><b>Stool metagenome assembly (<a href="https://github.com/marbl/MetaCarvel">MetaCarvel</a>)</b></td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/aug1cc2.png" alt="Second-largest component in a metagenome scaffold graph, showing various identified structural patterns." /></td>
    </tr>
    <tr align="center">
      <td><i>Data source: <a href="https://www.ncbi.nlm.nih.gov/bioproject/?term=SRS049959"><tt>SRS049959</tt></a></i></td>
    </tr>
  </tbody>
</table>

<table>
  <thead>
    <tr>
      <th>Human genome assembly (HG002, <a href="https://github.com/marbl/verkko">Verkko</a> v1.1)</th>
      <th>Yeast genome assembly (<a href="https://github.com/mikolmogorov/Flye">Flye</a>)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/hg002.png" alt="Entire HG002 (human genome) assembly graph." /></td>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/flye_yeast.png" alt="Part of a yeast genome assembly graph, showing the visualization of a scaffold of edge sequences." /></td>
    </tr>
    <tr align="center">
      <td><i>Data source: <a href="https://github.com/marbl/HG002#downloads">T2T Consortium</a></i></td>
      <td><i>Data source: <a href="https://github.com/almiheenko/AGB/tree/master/test_data/flye_yeast">AGB</a></i></td>
    </tr>
  </tbody>
</table>

<table>
  <thead>
    <tr>
      <th>Summarizing graph structure in a <a href="https://en.wikipedia.org/wiki/Treemapping">treemap</a></th>
      <th>Interactive charts of graph statistics</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/aug1treemap.png" alt="Treemap of node counts per component." /></td>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/aug1_covlen.png" alt="Scatterplot comparing total node length with average edge bundle sizes in a scaffold graph." /></td>
    </tr>
    <tr align="center">
      <td colspan="2"><i>Data source: <a href="https://www.ncbi.nlm.nih.gov/bioproject/?term=SRS049959"><tt>SRS049959</tt></a></i></td>
    </tr>
  </tbody>
</table>

## Installation

### Using [conda](https://conda.io/) or [mamba](https://mamba.readthedocs.io/)

<!-- yeah yeah yeah in theory you shouldn't need to say "-c conda-forge" but i
can't get this to work on my system without using that so i assume there is
some inherent jank that makes this occasionally required -->

```bash
mamba install -c bioconda -c conda-forge metagenomescope
```

### Using [pip](https://pip.pypa.io/)

First, you need to make sure that [Graphviz](https://graphviz.org/)
and [PyGraphviz](https://github.com/pygraphviz/pygraphviz) are installed
properly (so that PyGraphviz knows where to find Graphviz).
See PyGraphviz'
[`INSTALL.txt`](https://github.com/pygraphviz/pygraphviz/blob/main/INSTALL.txt)
for details.

(Probably the most consistent way to do this is just installing Graphviz and
PyGraphviz from conda-forge, but at that point you might as
well do the entire installation from within conda...)

Anyway, once Graphviz and PyGraphviz are installed, you should be able to just
run:

```bash
pip install metagenomescope
```

## Usage

```bash
mgsc -g graph.gfa
```

... where `graph.gfa` is a path to the assembly graph you want to visualize
(see information below on supported graph filetypes).

This will start a server using Dash.
The port number of the server defaults to `8050`, so navigate
to `localhost:8050` in a web browser to access the visualization.

### All command-line options

```
Usage: mgsc [OPTIONS]

  Visualizes an assembly graph.

  Please visit https://github.com/marbl/MetagenomeScope for more information.

Options:
  -g, --graph FILE          In GFA, FASTG, DOT, GML, or LastGraph format.  [required]
  -a, --agp FILE            AGP file describing paths.
  -t, --vtsv FILE           Verkko assembly.paths.tsv file describing paths.
  -i, --info FILE           Flye assembly_info.txt file describing contigs/scaffolds.
  -p, --port INTEGER RANGE  Server port number.  [default: 8050; 1024<=x<=65535]
  --rmdup [gfaonly|y|n]     Remove parallel edges.  [default: gfaonly]
  --decomp / --no-decomp    Do pattern decomposition.  [default: decomp]
  --dcheck / --no-dcheck    Do post-decomposition sanity check.  [default: no-dcheck]
  --debug / --no-debug      Use Dash's debug mode.  [default: no-debug]
  --verbose / --no-verbose  Log extra details.  [default: no-verbose]
  -v, --version             Show the version and exit.
  -h, --help                Show this message and exit.
```

### Supported assembly graph filetypes (`-g`)

| Filetype | Generated by | Notes |
| -------- | ------------ | ----- |
| **[GFA](https://gfa-spec.github.io/GFA-spec/) (`.gfa`)** | [Flye](https://github.com/mikolmogorov/Flye), [LJA](https://github.com/AntonBankevich/LJA), [hifiasm](https://github.com/chhylp123/hifiasm), [Verkko](https://github.com/marbl/verkko), ... | Both GFA 1 and GFA 2 files are accepted. [Currently](https://github.com/marbl/MetagenomeScope/issues/147) we visualize segments (`S`-lines), links (`L`-lines in GFA 1), [dovetail edges](https://github.com/GFA-spec/GFA-spec/issues/133) (some `E`-lines in GFA 2), and paths of segments (`P`-lines in GFA 1, `O`-lines in GFA 2). |
| **[FASTG](https://github.com/fedarko/pyfastg#the-fastg-file-format) (`.fastg`)** | [SPAdes](https://github.com/ablab/spades), [MEGAHIT](https://github.com/voutcn/megahit) | [Expects](https://github.com/fedarko/pyfastg) FASTG files produced by SPAdes or MEGAHIT. |
| **[DOT](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) (`.dot`, `.gv`)** | [Flye](https://github.com/mikolmogorov/Flye), [LJA](https://github.com/AntonBankevich/LJA) | Expects DOT files produced by Flye or LJA. See "What filetype should I use for de Bruijn graphs?" in the FAQs below. |
| **[GML](https://networkx.org/documentation/stable/reference/readwrite/gml.html) (`.gml`)** | [MetaCarvel](https://github.com/marbl/MetaCarvel) | Expects GML files produced by MetaCarvel. |
| **[LastGraph](https://github.com/dzerbino/velvet/blob/master/Manual.pdf) (`.LastGraph`)** | [Velvet](https://github.com/dzerbino/velvet) | [Currently](https://github.com/marbl/MetagenomeScope/issues/147) we just visualize the raw structure (nodes and arcs). |

Should you run into [additional](https://xkcd.com/927/) assembly graph filetypes you'd like us to
support, feel free to open a GitHub issue.

### Displaying paths on the graph

Paths can optionally be specified through any of the following inputs:

<details>
  <summary><strong>AGP files (<code>-a</code>)</strong></summary>

<hr/>

_See the [AGP specification](https://www.ncbi.nlm.nih.gov/genbank/genome_agp_specification/) for details._

**If your graph is in DOT format:**
  - We assume the `component_id`s in column 6a of the AGP file correspond to edge IDs.

**Otherwise:**
  - We assume the `component_id`s correspond to node IDs.
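
As a rough illustration of what "the `component_id`s" means here, the following minimal sketch (not MetagenomeScope's actual parser) pulls the ordered `component_id`s out of an AGP file, one list per object. The function name `agp_paths` and the example scaffold / contig names are hypothetical, and the sketch assumes that lines appear in path order.

```python
def agp_paths(agp_text):
    """Map each AGP object (column 1) to its ordered list of component_ids."""
    paths = {}
    for line in agp_text.splitlines():
        # Skip blank lines and comment / header lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        obj, component_type = cols[0], cols[4]
        # Gap lines (component_type N or U) hold a gap length in column 6,
        # not a component_id, so they are not part of the path
        if component_type in ("N", "U"):
            continue
        paths.setdefault(obj, []).append(cols[5])
    return paths


# Hypothetical AGP content: two components separated by a gap
example = (
    "# hypothetical example\n"
    "scaffold_1\t1\t100\t1\tW\tcontig_5\t1\t100\t+\n"
    "scaffold_1\t101\t150\t2\tN\t50\tscaffold\tyes\tpaired-ends\n"
    "scaffold_1\t151\t400\t3\tW\tcontig_9\t1\t250\t-\n"
)
print(agp_paths(example))  # {'scaffold_1': ['contig_5', 'contig_9']}
```

Each resulting list would then be matched against edge IDs (for DOT graphs) or node IDs (for other graphs), as described above.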

<hr/>
</details>

<details>
  <summary><strong>Verkko <code>assembly.paths.tsv</code> files (<code>-t</code>)</strong></summary>

<hr/>

_See [Verkko's documentation](https://github.com/marbl/verkko#outputs) for details._

**If your graph is in DOT format:**
  - We assume names on each path correspond to edge IDs.

**Otherwise:**
  - We assume names on each path correspond to node IDs.

<hr/>
</details>

<details>
  <summary><strong>Flye <code>assembly_info.txt</code> files (<code>-i</code>)</strong></summary>

<hr/>

_See [Flye's documentation](https://github.com/mikolmogorov/Flye/blob/flye/docs/USAGE.md#output) for details._

**If your graph is in DOT format:**
  - We will visualize the edge-paths described in the `.txt` file.

**If your graph is in GFA format:**
  - The contigs in the GFA file should correspond to collapsed edge-paths in the `.txt` file, so we can't really visualize these edge-paths.

  - However, we will extract contig information from the `.txt` file (e.g. coverage) and show it in the interface as node data.

**If your graph is not in DOT or GFA format:**
  - We will ignore the `.txt` file. Flye should only generate DOT or GFA files, so like... where did you even get this data from :skull:

<hr/>
</details>

<details>
  <summary><strong><code>P</code>-lines in GFA 1 files, or <code>O</code>-lines in GFA 2 files (<code>-g</code>)</strong></summary>


<hr/>

**For GFA 1 paths ([`P`-lines](https://gfa-spec.github.io/GFA-spec/GFA1.html#p-path-line)):**
  - We will visualize these node-paths.

**For GFA 2 paths ([`O`-lines](https://gfa-spec.github.io/GFA-spec/GFA2.html#group)):**
  - We will show all of the nodes on these paths, "expanding" edges and recursive patterns accordingly.

  - For more details, see "How do you handle `O`-lines in GFA 2 files?" in the FAQs below.

<hr/>
</details>

## Structural patterns

### Types of patterns

MetagenomeScope detects and highlights five types of structural patterns on the graph:

<img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/patterns.png" alt="Screenshot of MetagenomeScope's interface showing examples of the patterns it supports." />

#### 1. Bubbles (and bulges)

**Bubbles** ([Miller _et al._, 2010](https://pmc.ncbi.nlm.nih.gov/articles/PMC2874646/); [Nijkamp _et al._, 2013](https://pmc.ncbi.nlm.nih.gov/articles/PMC3916741/)) follow a diverge-converge pattern. They generally indicate variation -- either real (e.g. an alternate path is caused by a SNP) or erroneous (e.g. an alternate path is caused by a sequencing error). We identify bubbles using a modified version of the algorithm given in [Onodera _et al._, 2013](https://link.springer.com/chapter/10.1007/978-3-642-40453-5_26).

Similarly, **bulges** ([Pevzner _et al._, 2004](https://pmc.ncbi.nlm.nih.gov/articles/PMC515325/); [Vasilinetc _et al._, 2015](https://academic.oup.com/bioinformatics/article/31/20/3262/195494)) are pairs of nodes where there exist multiple parallel edges from one node to another.

Bulges can typically be interpreted the same way as bubbles -- you generally see bulges in "edge-centric" (e.g. de Bruijn) graphs, and bubbles in "node-centric" (e.g. overlap) graphs. So, we label both bubbles and bulges identically.
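
To make the definition of a bulge concrete, here is a minimal sketch that finds node pairs joined by two or more parallel edges. It is only an illustration of the definition, not the decomposition code MetagenomeScope uses; the function name `find_bulges` and the representation of edges as `(source, target)` tuples are assumptions made for this example.

```python
from collections import Counter


def find_bulges(edges):
    """Return {(source, target): count} for pairs joined by 2+ parallel edges."""
    counts = Counter(edges)
    return {pair: n for pair, n in counts.items() if n > 1}


# A small multigraph: two edges from 1 to 2, and three from 3 to 4
edges = [("1", "2"), ("1", "2"), ("2", "3"), ("3", "4"), ("3", "4"), ("3", "4")]
print(find_bulges(edges))  # {('1', '2'): 2, ('3', '4'): 3}
```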

#### 2. Frayed ropes

**Frayed ropes** ([Miller _et al._, 2010](https://pmc.ncbi.nlm.nih.gov/articles/PMC2874646/)) follow a converge-diverge pattern; they have the opposite structure of bubbles. They generally indicate interspersed repeats in the middle region.

#### 3 and 4. Chains and cyclic chains

**Chains** are just non-branching paths of at least two nodes. **Cyclic chains** are chains where the end node has an outgoing edge to the start node.
Cyclic chains represent a simpler form of what are known in edge-centric graphs as _whirl_ structures ([Pevzner _et al._, 2004](https://pmc.ncbi.nlm.nih.gov/articles/PMC515325/)).

#### 5. Bipartites

**Bipartites** are regions of the graph that can be partitioned into two layers of nodes (let's call them _Left_ and _Right_), such that all of the nodes in _Left_ have outgoing edges to all of the nodes in _Right_. We require that both _Left_ and _Right_ contain at least two nodes each. (Such a pattern is essentially a stricter version of a [complete bipartite graph](https://en.wikipedia.org/wiki/Complete_bipartite_graph).)

Surprisingly, bipartites pop up a lot in certain assembly graphs! These are less well-documented in the literature than the above types of patterns, but our suspicion is that these are another indication (like frayed ropes) of repeats -- and that a lot of these patterns in succession might indicate things like strain heterogeneity. See Figure 5 of [Li _et al._, 2012](https://academic.oup.com/bfg/article/11/1/25/191455) for an example of how a bipartite (or, viewed another way, a frayed rope) could be caused by a repeat.
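
The conditions stated above can be sketched as a simple check. This is a minimal illustration rather than MetagenomeScope's detection code: the function name `is_bipartite_pattern` is hypothetical, edges are assumed to be `(source, target)` tuples, and only the conditions listed in this section are tested (any further restrictions applied during decomposition are not modeled).

```python
def is_bipartite_pattern(edges, left, right):
    """Check that every node in `left` has an edge to every node in `right`."""
    left, right = set(left), set(right)
    # Both layers need at least two nodes, and the layers must not overlap
    if len(left) < 2 or len(right) < 2 or left & right:
        return False
    edge_set = set(edges)
    return all((u, v) in edge_set for u in left for v in right)


edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]
print(is_bipartite_pattern(edges, {"A", "B"}, {"C", "D"}))       # True
print(is_bipartite_pattern(edges[:-1], {"A", "B"}, {"C", "D"}))  # False
```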

### Boundary node splitting

Sometimes, it is best to consider a node as the child of two patterns.
A common example of this is a _bubble chain_ ([Dabbaghie _et al._, 2022](https://pmc.ncbi.nlm.nih.gov/articles/PMC9438957/)), where multiple bubbles occur one after another.
In a bubble chain, the "end node" of one bubble is also the "start node" of another bubble!

To accommodate these kinds of cases, MetagenomeScope **splits the boundary nodes** of a pattern.
Splitting a node `A` transforms it into two nodes: `A-L` and `A-R`, which are connected by a single "fake edge" `A-L -> A-R`.
Because this allows a node to be in two patterns simultaneously, this makes it possible for us to identify a much richer set of patterns and describe the graph structure more accurately.
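
The splitting operation can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name `split_node` is hypothetical, and it assumes that the original node's incoming edges are rerouted to `A-L` and its outgoing edges leave from `A-R`.

```python
def split_node(edges, node):
    """Split `node` into `node-L` and `node-R`, joined by one fake edge."""
    left, right = f"{node}-L", f"{node}-R"
    new_edges = []
    for src, tgt in edges:
        # Assumption: outgoing edges now leave from the right half...
        new_src = right if src == node else src
        # ...and incoming edges now arrive at the left half
        new_tgt = left if tgt == node else tgt
        new_edges.append((new_src, new_tgt))
    # The "fake edge" connecting the two halves
    new_edges.append((left, right))
    return new_edges


# Node A ends one bubble (X, Y converge on it) and starts another (P, Q)
edges = [("X", "A"), ("Y", "A"), ("A", "P"), ("A", "Q")]
print(split_node(edges, "A"))
# [('X', 'A-L'), ('Y', 'A-L'), ('A-R', 'P'), ('A-R', 'Q'), ('A-L', 'A-R')]
```

After the split, `A-L` can serve as the end node of the first bubble and `A-R` as the start node of the second.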

"Split nodes" and "fake edges" are drawn with distinct visual styles, in order to make them clearer -- split nodes are drawn in a way that looks like the node has been split in half, and fake edges are drawn as thick dashed lines.

<table>
  <tbody>
    <tr align="center">
      <td><b>Split nodes in a node-centric graph</b><br/><a href="https://github.com/marbl/MetaCarvel/">MetaCarvel</a> stool metagenome scaffold graph (the "large graph" shown below), component #17</td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/aug1_splitnode_cc17.png" alt="Example split nodes in a MetaCarvel graph." /></td>
    </tr>
    <tr align="center">
      <td><b>Split nodes in an edge-centric graph</b><br/><a href="https://github.com/AntonBankevich/LJA/">jumboDBG</a> de Bruijn graph of human chromosome 15 (available as <tt>chr15_full.gv</tt> in <a href="https://github.com/marbl/MetagenomeScope/tree/main/metagenomescope/tests/input"><tt>metagenomescope/tests/input/</tt></a>)</td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/chr15_full_splitnode.png" alt="Example split nodes in a jumboDBG graph." /></td>
    </tr>
  </tbody>
</table>

Note that node splitting is not always necessary -- as the figures above show, sometimes a boundary node of a pattern doesn't need to be the
boundary node of any other pattern.
To limit the number of split nodes we need to show in the visualization,
we detect and remove unnecessary split nodes after finishing the decomposition procedure.

## Example datasets

Here are three graphs of various sizes, each produced by a different assembly program.

### 1. Small graph: Flye (DOT file; 61 nodes; 122 edges) -- _S. cerevisiae_ (yeast)

This data is from [AGB's GitHub repository](https://github.com/almiheenko/AGB/tree/master/test_data/flye_yeast).

```bash
wget https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/metagenomescope/tests/input/flye_yeast.gv
wget https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/metagenomescope/tests/input/flye_yeast_assembly_info.txt

mgsc -g flye_yeast.gv -i flye_yeast_assembly_info.txt
```

<table>
  <tbody>
    <tr align="center">
      <td><b>Entire graph</b></td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/flye_yeast_all.png" alt="Entire yeast assembly graph." /></td>
    </tr>
    <tr align="center">
      <td><b>Zoomed in on <tt>scaffold_34</tt> in component #1</b></td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/flye_yeast.png" alt="Yeast assembly graph, zoomed in on a scaffold path." /></td>
    </tr>
  </tbody>
</table>

> [!TIP]
> #### Label styles
> The "Labels" section of the interface has some settings that help make labels prettier. I produced the zoomed-in screenshot above using the `Offset`, `Outline`, and `Rotate` edge label settings.

> [!NOTE]
> #### Drawing edge-centric graphs
> We draw DOT files from Flye and LJA using the typical conventions for drawing de Bruijn graphs -- with nodes represented as circles, and edges given labels with their length and coverage. This resembles the styles from various papers that show visualizations of these kinds of graphs, including [Pevzner _et al._, 2004](https://pmc.ncbi.nlm.nih.gov/articles/PMC515325/); [Mikheenko & Kolmogorov 2019](https://academic.oup.com/bioinformatics/article/35/18/3476/5306331); and the DOT outputs of [Flye](https://github.com/mikolmogorov/Flye) and [LJA](https://github.com/AntonBankevich/LJA).

### 2. Medium graph: Velvet (LastGraph file; 558 nodes; 664 edges) -- _E. coli_

This is an example graph from [Bandage](http://rrwick.github.io/Bandage/).

```bash
wget https://github.com/rrwick/Bandage/raw/refs/heads/gh-pages/samples/E_coli_LastGraph.zip
unzip E_coli_LastGraph.zip

mgsc -g E_coli_LastGraph
```

<table>
  <tbody>
    <tr align="center">
      <td><b>Hierarchical layout with <a href="https://graphviz.org/docs/layouts/dot/"><i>dot</i></a></b></td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/ecoli_dot.png" alt="E. coli assembly graph drawn using a hierarchical layout algorithm, dot." /></td>
    </tr>
    <tr align="center">
      <td><b>Force-directed layout with <a href="https://graphviz.org/docs/layouts/sfdp/"><i>sfdp</i></a></b><br/>(not showing patterns, and using an <a href="https://graphviz.org/docs/attrs/overlap_scaling/">overlap scaling factor</a> of <tt>-15</tt>)</td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/ecoli_sfdp_osfminus15.png" alt="E. coli assembly graph drawn using a force-directed layout algorithm, sfdp." /></td>
    </tr>
  </tbody>
</table>

> [!NOTE]
> #### Reverse-complementary nodes and edges
> As discussed in the FAQs below on "Reverse-complementary sequences," we represent pairs of nodes {`X`, `-X`} and pairs of edges {`A -> B`, `-B -> -A`} separately in the graph. This makes it easier to lay out the graph nicely.
>
> ##### Drawing "nonredundant" parts of the graph
> In graphs where such pairs of nodes / edges exist, there will be an additional drawing option available (under the "Draw"
> section of the UI) named **`Entire graph (nonredundant)`**.
>
> This drawing option will detect "redundant" pairs of connected components that are perfectly
> reverse-complementary to each other (e.g. one component looks like `A -> -B -> C` and the other looks like
> `-C -> B -> -A`), and only draw one of these components (we select the component with more forward-orientation nodes or edges).
>
> This drawing option will also draw components that have no perfect reverse-complement component -- for example, those that are "strand-mixed"
> and contain both node `X` and `-X` (e.g. component #1 in the above screenshots).
>
> You can think of this drawing method as kind of a mix of [the "single" and "double" modes](https://github.com/rrwick/Bandage/wiki/Single-vs-double-node-style)
> in Bandage. For pairs of redundant components, we only need to draw one of them, and for all other components we draw the entire thing.
>
> ##### Drawing the entire graph, including pairs of redundant components
> If you want to see _everything_, just select the **`Entire graph (all components)`** drawing option!

### 3. Large graph: MetaCarvel (GML file; 28,064 nodes; 21,769 edges) -- stool metagenome

This is a scaffold graph created by [MetaCarvel](https://github.com/marbl/MetaCarvel/).

[Here is a Zenodo record for these files](https://zenodo.org/records/18316065);
they are derived from [`SRS049959`](https://www.ncbi.nlm.nih.gov/bioproject/?term=SRS049959).
Note that this graph is fairly old (it dates back to August 2017!); MetaCarvel has been updated a decent amount since then.

```bash
wget https://zenodo.org/records/18316065/files/august1.gml
wget https://zenodo.org/records/18316065/files/scaffolds_august1_fixed.agp

# Use --verbose to show more information in the terminal about how long each step takes
mgsc -g august1.gml -a scaffolds_august1_fixed.agp --verbose
```

<table>
  <tbody>
    <tr align="center">
      <td><b>Entire graph</b><br/>(on my 2018 laptop this takes about 2.5 minutes to lay out and draw; see tip below)</td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/aug1_all.png" alt="Entire stool metagenome scaffold graph." /></td>
    </tr>
    <tr align="center">
      <td><b>Zoomed in on <tt>scaffold_1486</tt> in component #65</b></td>
    </tr>
    <tr>
      <td><img src="https://raw.githubusercontent.com/marbl/MetagenomeScope/refs/heads/main/docs/res/screenshots/aug1_scaffold.png" alt="Stool metagenome scaffold graph, zoomed in on a scaffold path." /></td>
    </tr>
  </tbody>
</table>

> [!TIP]
> #### dang that's a big graph
> ##### Understanding component sizes
> You can get a sense for the sizes of this graph's connected components by clicking the "Graph info" button in the left sidebar, then examining the charts in the "Components" tab.
>
> This shows that the largest connected component in this graph (i.e. the one with size rank #1) contains over six thousand nodes. You can certainly draw this component in MetagenomeScope, but it will take a few seconds to lay out and draw. The resulting interface may also become a bit sluggish due to the size of the graph. (You could even draw the entire graph if you wanted to, as shown above! But it will be even more sluggish.)
>
> ##### Drawing smaller component(s)
> If you would like to examine the smaller parts of this graph, you can start by drawing component #2 -- or even a range of components, for example #2 - 10. (Try pasting `2-10` into the "Component(s), by size rank" input to draw all of these components at once!)
>
> ##### Drawing subregions of components
> You can also draw only a subregion of a larger component, using functionality inspired by [Bandage](https://rrwick.github.io/Bandage/). In the "Draw" section, change the "Component(s), by size rank" dropdown to the "Around certain node(s)" option, and then type in `k99_38`. Try increasing the distance and redrawing to see more and more of component #1!

### Hodgepodge of other test datasets

See the [`metagenomescope/tests/input/`](https://github.com/marbl/MetagenomeScope/tree/main/metagenomescope/tests/input)
directory.

## FAQs

### Reverse-complementary sequences

<!-- use of <strong> here was stolen from strainflye's readme, which in turn is
based on https://codedragontech.com/createwithcodedragon/how-to-style-html-details-and-summary-tags/ -->
<details>
  <summary><strong>FAQ: How do you handle reverse-complementary nodes/edges?</strong></summary>

<hr/>

The answer to this depends on the filetype of the graph you are using.

##### "Explicit" graph filetypes (FASTG, DOT, GML)

When MetagenomeScope reads in FASTG, DOT, and GML files,
it assumes that _these files explicitly describe all of the nodes and edges in the graph_.
So, let's say you give MetagenomeScope the following [LJA](https://github.com/AntonBankevich/LJA)-style DOT file:

```dot
digraph g {
  1 -> 2 [label="edge1 A99(2.4)"];
}
```

We will interpret this as a graph with **two nodes** (`1`, `2`) and **one edge**
(`1 -> 2`).

##### "Implicit" graph filetypes (GFA, LastGraph)

However, for GFA and LastGraph files, MetagenomeScope cannot make the
assumption that these files explicitly describe all of the nodes and edges in
the graph. In these files, each declaration of a node / edge
(in GFA parlance, "segment" / "link"; in LastGraph parlance, "node"
/ "arc") also declares this node / edge's reverse complement.

So, let's say you give MetagenomeScope the following GFA file (based on
[this example](https://github.com/sjackman/gfalint/blob/master/examples/sample1.gfa)):

```gfa
H	VN:Z:1.0
S	1	CGATGCAA
S	2	TGCAAAGTAC
L	1	+	2	+	5M
```

We will interpret this as a graph with **four nodes** (`1`, `-1`, `2`, `-2`)
and **two edges** (`1 -> 2`, `-2 -> -1`). The presence of node `X`
["implies"](https://github.com/bcgsc/abyss/wiki/ABySS-File-Formats#reverse-complement)
the existence of the reverse complement node `-X`, and the presence of edge
`X -> Y` "implies" the existence of the reverse complement edge `-Y -> -X`.
Interpreting the graph file in this way is analogous to
[how "double mode" works in Bandage](https://github.com/rrwick/Bandage/wiki/Single-vs-double-node-style).
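
The "implies" rule above can be sketched in a few lines. This is a minimal illustration for GFA 1 `S`- and `L`-lines only, not MetagenomeScope's parser; the function names `negate` and `expand_gfa1` are hypothetical, node IDs are assumed not to start with `-` already, and edges are stored in a set (so parallel edges collapse).

```python
def negate(node_id):
    """Return the ID of a node's reverse complement (X <-> -X)."""
    return node_id[1:] if node_id.startswith("-") else "-" + node_id


def expand_gfa1(gfa_text):
    """Return (nodes, edges), including the implied reverse complements."""
    nodes, edges = set(), set()
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # Each segment declares both X and -X
            nodes.update({fields[1], negate(fields[1])})
        elif fields[0] == "L":
            src = fields[1] if fields[2] == "+" else negate(fields[1])
            tgt = fields[3] if fields[4] == "+" else negate(fields[3])
            # Each link X -> Y also declares -Y -> -X
            edges.add((src, tgt))
            edges.add((negate(tgt), negate(src)))
    return nodes, edges


gfa = "H\tVN:Z:1.0\nS\t1\tCGATGCAA\nS\t2\tTGCAAAGTAC\nL\t1\t+\t2\t+\t5M\n"
nodes, edges = expand_gfa1(gfa)
print(sorted(nodes))  # ['-1', '-2', '1', '2']
print(sorted(edges))  # [('-2', '-1'), ('1', '2')]
```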

##### Based on the FASTG specification, shouldn't FASTG be an "implicit" instead of an "explicit" filetype?

It's complicated. The way I interpret the FASTG specification, each declaration
of an edge sequence implicitly also declares this edge sequence's reverse complement; however,
this is not the case for "adjacencies" between edge sequences.

In any case, the "dialect" of FASTG files produced by SPAdes and MEGAHIT lists edge sequences
and their reverse complements (as well as adjacencies between edge sequences and their reverse complements)
separately. Because of this, we consider FASTG to be an "explicit" filetype.
(See [pyfastg's documentation](https://github.com/fedarko/pyfastg#about-reverse-complements)
for details on how we handle reverse complements in FASTG files.)
<hr/>
</details>

<details>
  <summary><strong>FAQ: Why does my graph have node <code>X</code> and <code>-X</code> in the same component?</strong></summary>

<hr/>

One common reason this happens is the presence of [palindromic](https://en.wikipedia.org/wiki/Palindromic_sequence) sequences:
these can cause both a sequence and its reverse-complement to be connected to
each other.

This often occurs with the big ("hairball") component in an assembly graph.
<hr/>
</details>

<details>
  <summary><strong>FAQ: What happens if an edge is its own reverse complement?</strong></summary>

<hr/>

(This assumes that you have read the FAQ above on "How do you handle reverse-complementary nodes/edges?")

This can happen if an edge exists from `X -> -X` or from `-X -> X` in an
"implicit" graph file (GFA / LastGraph). Consider
[this GFA file](https://github.com/sjackman/assembly-graph/blob/master/loop.gfa):

```gfa
H	VN:Z:1.0
S	1	AAA
S	2	ACG
S	3	CAT
S	4	TTT
L	1	+	1	+	2M
L	2	+	2	-	2M
L	3	-	3	+	2M
L	4	-	4	-	2M
```

Since this GFA file contains four "link" lines, we might think at first that the corresponding graph
contains 4 × 2 = 8 edges. However, the graph only contains **6 unique
edges**. This is because the reverse complement of `2 -> -2` is itself:
we know from above that `X -> Y` implies `-Y -> -X`, but
`-(-2) -> -(2)` is equal to `2 -> -2`! The same goes for `-3 -> 3`:
`-(3) -> -(-3)` is equal to `-3 -> 3`.
Both of these edges "imply" themselves as their own reverse complements!
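
The count of six can be checked with a short sketch. It is only an illustration of the arithmetic above: the helper names `negate` and `reverse_complement` are hypothetical, and the four links from the GFA file are written out by hand as `(source, target)` tuples.

```python
def negate(node_id):
    """Return the ID of a node's reverse complement (X <-> -X)."""
    return node_id[1:] if node_id.startswith("-") else "-" + node_id


def reverse_complement(edge):
    """X -> Y implies -Y -> -X."""
    src, tgt = edge
    return (negate(tgt), negate(src))


# The four L-lines from the GFA file above, as oriented edges
links = [("1", "1"), ("2", "-2"), ("-3", "3"), ("-4", "-4")]

unique_edges = set()
for edge in links:
    unique_edges.add(edge)
    unique_edges.add(reverse_complement(edge))

# Edges that are their own reverse complement
self_implying = [e for e in links if reverse_complement(e) == e]

print(len(unique_edges))  # 6
print(self_implying)      # [('2', '-2'), ('-3', '3')]
```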

How do we handle this situation? As of writing,
when MetagenomeScope visualizes these graphs it will only draw one copy
of these "self-implying" edges. This matches
[the original visualization of this graph](https://github.com/sjackman/assembly-graph/blob/master/loop.gv.png), and also matches Bandage's visualization of this GFA file.

Notably, since we assume that "explicit" graph files (FASTG / DOT / GML)
explicitly define all of the nodes and edges in their graph, MetagenomeScope doesn't do anything
special for this case for these files. (If your DOT file describes one edge
from `X -> -X`, then that's fine; if it describes two or more edges from `X -> -X`,
then that's also fine, and we'll visualize all of them.)
<hr/>
</details>

### Graph structure

<details>
  <summary><strong>FAQ: What do you mean by a component's "size rank"?</strong></summary>

<hr/>

Given a graph with _N_ connected components: we sort these components by the number of
nodes they contain, from high to low. We then assign each of these components a
**size rank**, a number from 1 to _N_: the component with size rank #1 corresponds
to the largest component, and the component with size rank #_N_ corresponds to the
smallest component.

Often, we only care about looking at individual components in a graph -- laying out
and drawing the entire graph is not always a good idea when the graph is massive.
Component size ranks are a nice way of formalizing this.

Some details about component size ranks, if you are interested:

- The numbers shown in the treemap (accessible in the "Graph info" dialog)
  correspond exactly to component size ranks. So, the rectangle labelled
  #1 in the treemap corresponds to the largest component, the rectangle labelled
  #2 corresponds to the second-largest component, etc.

- The exact component sorting functionality accounts for ties by using four different sorting
  criteria, in the following order. Ties at one level cause later levels to be considered for
  breaking ties.
  - the number of "full" nodes in the component (treating a pair of split nodes 40-L → 40-R as a
    single node)
  - the number of "total" nodes in the component (treating a pair of split nodes 40-L → 40-R as
    two nodes)
  - the number of "total" edges in the component (including both real edges and "fake" edges
    between pairs of split nodes like 40-L → 40-R)
  - the number of patterns in the component

<hr/>
</details>

<details>
  <summary><strong>FAQ: Can my graphs have parallel edges?</strong></summary>

<hr/>

Yes! MetagenomeScope supports
[multigraphs](https://en.wikipedia.org/wiki/Multigraph). In general:
if your assembly graph file describes more than one edge from `X -> Y`, then
MetagenomeScope can visualize all of these "parallel" edges. (This is mostly
useful when visualizing de Bruijn graphs.)

The exact behavior of how we handle parallel edges is controlled by the
`--rmdup` command-line parameter; see below for details.

##### `--rmdup gfaonly`

There are a lot of GFA files floating around out there that declare both
`A -> B` and `-B -> -A` on separate lines. This describes a graph where every
single edge has a duplicate parallel edge!

Bandage and Gfapy, among other tools, silently ignore these edges. To match
this behavior, the default value of `--rmdup` (`gfaonly`) means that -- _for
GFA files only_ -- MetagenomeScope will detect and remove parallel edges.
[Currently](https://github.com/marbl/MetagenomeScope/issues/430), the choice of
which edge(s) are removed is arbitrary.

After removing parallel edges, MetagenomeScope will log information about the
number of removed edges on the command line.
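
As a rough sketch of why these files produce duplicates, and of what removing them looks like: the code below assumes that each GFA link implies its reverse complement, and it keeps the first copy of each edge it sees. The function names are hypothetical, and keeping the first copy is just one arbitrary choice; it is not necessarily what MetagenomeScope does.

```python
def negate(name):
    """Reverse-complement name of a node (assumes a '-' prefix convention)."""
    return name[1:] if name.startswith("-") else "-" + name


def implied_edges(links):
    """Expand each declared link X -> Y into itself plus -Y -> -X."""
    edges = []
    for source, target in links:
        edges.append((source, target))
        rc = (negate(target), negate(source))
        if rc != (source, target):  # skip edges that are their own reverse complement
            edges.append(rc)
    return edges


def remove_parallel_edges(edges):
    """Keep one copy of each (source, target) pair; report how many were dropped."""
    seen = set()
    kept = []
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            kept.append(edge)
    return kept, len(edges) - len(kept)


# A file that declares both A -> B and -B -> -A on separate lines
edges = implied_edges([("A", "B"), ("-B", "-A")])
kept, num_removed = remove_parallel_edges(edges)
print(kept, num_removed)
```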

##### `--rmdup n`

If you are using a GFA file where parallel edges have meaning, you can specify
`--rmdup n` to tell MetagenomeScope to **not** remove parallel edges.

##### `--rmdup y`

By default, MetagenomeScope will not remove parallel edges if the input graph
was not a GFA file.

However, if you would like to force it to remove parallel edges, then you can
specify `--rmdup y`.
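
Putting the three settings together, the decision can be summarized as below. This is a sketch of the behavior described above, not MetagenomeScope's actual code: the function name and the `filetype` strings are hypothetical.

```python
def should_remove_parallel_edges(rmdup, filetype):
    """Decide whether to remove parallel edges.

    rmdup: one of "gfaonly" (the default), "n", or "y".
    filetype: hypothetical lowercase label such as "gfa", "dot", or "gml".
    """
    if rmdup == "y":
        return True
    if rmdup == "n":
        return False
    if rmdup == "gfaonly":
        return filetype == "gfa"
    raise ValueError(f"Unrecognized --rmdup value: {rmdup!r}")


for setting in ("gfaonly", "n", "y"):
    print(setting, should_remove_parallel_edges(setting, "gfa"),
          should_remove_parallel_edges(setting, "dot"))
```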

##### Parallel edges in FASTG files?

Notably, parallel edges are not supported right now for FASTG files. I don't think
I've ever seen any FASTG files that have parallel edges, so I don't think this
is a big priority, but I guess
[please let me know if you would like us to add support for it](https://github.com/fedarko/pyfastg/issues/8).

<hr/>
</details>

<details>
  <summary><strong>FAQ: What filetype should I use for de Bruijn graphs?</strong></summary>

<hr/>

If you are visualizing output from LJA or Flye, you _may_ want to use a DOT file instead of a GFA / FASTG file as input.

This is because GFA and FASTG [are not ideal](https://github.com/AntonBankevich/LJA/blob/main/docs/jumbodbg_manual.md#output-of-de-bruijn-graph-construction) for representing graphs in which sequences are stored on edges rather than nodes (i.e. de Bruijn / repeat graphs). The DOT files output by Flye and LJA should contain the _original_ structure of these graphs, in which edges and nodes in the visualization actually correspond to edges and nodes in the original graph, respectively. The GFA / FASTG files usually represent altered versions in which nodes and edges have been swapped. This is not always an ideal representation, especially if you are doing something where you really care about the structure of the original graph.

That being said, please note that -- if you are using an assembler that outputs graphs in different
filetypes -- these files may have additional differences beyond the usual filetype differences.
For example, [Flye's GFA and DOT files can have slightly different coverages](https://github.com/mikolmogorov/Flye/issues/597),
since Flye produces them at different times in its pipeline.
<hr/>
</details>

<details>
  <summary><strong>FAQ: How do you handle <code>E</code>-lines in GFA 2 files?</strong></summary>

<hr/>

We only visualize `E`-line edges that are classified as
"[dovetails](https://gfa-spec.github.io/GFA-spec/GFA2.html#edge)." That is, edges
that connect the ends of two nodes -- for example:

```
     | |
------->
     ------->
```

Note that our rules for classifying dovetail edges are currently somewhat
stricter than those outlined in the GFA 2 specification. See
[this issue](https://github.com/GFA-spec/GFA-spec/issues/133) for details.

<hr/>
</details>

### Paths

<details>
  <summary><strong>FAQ: How do you handle <code>O</code>-lines in GFA 2 files?</strong></summary>

<hr/>

I'm so glad you asked! Although I am doubtful that anybody is actually asking this question, so maybe you are
not a real person.

ANYWAY so okay here's the deal. In GFA 1 files, paths (`P`-lines) are relatively simple -- they can only
contain segments. So `P`-lines are refreshingly easy to reason about.

In GFA 2 files, however, paths (`O`-lines) can contain other things besides segments -- they can also contain
edges, or even other paths! There are some interesting considerations this flexibility brings up.

##### Paths that contain other paths

There is no guarantee that `O`-lines are given in any particular order, so you can have dreadful
situations where a path with ID `B` says that it contains a path with ID `A` (which is
defined in a later line in the file).

We handle this by -- after scanning through the entire GFA file -- creating a directed graph, where each
edge `A -> B` indicates that path `B` contains path `A`. We then use NetworkX to find a
[topological ordering](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.topological_sort.html)
of the nodes (paths) in this graph, and record paths (in terms of just their child segments
and nothing else) in this order. Using the topological ordering ensures that, when it is time
to record a path, we already know the exact contents of the paths that it contains -- so we can
safely "expand" the child paths.

If you really wanted to make our jobs hard, you could create a GFA file with **cycles**: where path
`B` contains `A` which contains `C` which contains `B` (or something like that).
If we detect this kind of situation, we will raise an error (because like how should we even handle this...?)
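
Here is a minimal sketch of this approach. To keep it dependency-free, it swaps in the standard library's `graphlib.TopologicalSorter` (Python 3.9+) in place of NetworkX. It also ignores orientations and assumes that any child ID matching a path ID refers to that path. The `expand_paths` function is a hypothetical name, not MetagenomeScope's API.

```python
from graphlib import CycleError, TopologicalSorter


def expand_paths(paths):
    """Rewrite each path in terms of just its segments.

    paths: dict mapping path ID -> list of child IDs (segments or other paths).
    """
    # For each path, the set of child paths that must be recorded before it.
    # (This corresponds to edges "child -> parent" in the graph described above.)
    deps = {
        pid: {child for child in children if child in paths}
        for pid, children in paths.items()
    }
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        raise ValueError(f"Paths contain each other cyclically: {err.args[1]}") from err

    expanded = {}
    for pid in order:
        segments = []
        for child in paths[pid]:
            if child in paths:
                # Already recorded, thanks to the topological ordering
                segments.extend(expanded[child])
            else:
                segments.append(child)
        expanded[pid] = segments
    return expanded


# Path B refers to path A, even though A is defined "later"
print(expand_paths({"B": ["A", "s3"], "A": ["s1", "s2"]}))
```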

##### Paths that explicitly contain edges

In GFA 2, edges can optionally have IDs. You can refer to these edge IDs on an `O`-line in GFA 2.

[Currently](https://github.com/marbl/MetagenomeScope/issues/424), MetagenomeScope assumes that all
paths loaded from GFA files will only contain nodes -- not
a mixture of nodes and edges. Thus, when MetagenomeScope notices that a GFA 2 path contains an edge, it
converts this edge into the 2-tuple (source, target) in the path.

Later, we "expand" these 2-tuples (turning the path into just a basic list of segment IDs)
according to the following logic.

1. If "source" is not already given as the previous entry in the path, then we add it to the path.
2. If "target" is not already given as the next entry in the path (as a segment), then we add it to the path.
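
The two rules above can be sketched as follows. This is one reading of the logic rather than MetagenomeScope's actual implementation: it assumes a path is a list whose entries are either segment IDs (strings) or `(source, target)` tuples, and the function name is hypothetical.

```python
def expand_edge_tuples(path):
    """Turn a mixed path of segment IDs and (source, target) tuples into segment IDs."""
    expanded = []
    for i, entry in enumerate(path):
        if isinstance(entry, tuple):
            source, target = entry
            # Rule 1: add the source unless it is already the previous entry
            if not expanded or expanded[-1] != source:
                expanded.append(source)
            # Rule 2: add the target unless the next entry is that same segment
            next_entry = path[i + 1] if i + 1 < len(path) else None
            if next_entry != target:
                expanded.append(target)
        else:
            expanded.append(entry)
    return expanded


print(expand_edge_tuples(["1", ("1", "2"), "2", "3"]))
print(expand_edge_tuples([("1", "2"), ("2", "3")]))
```

Both calls describe the same walk through segments `1`, `2`, and `3`, just with different amounts of redundancy in the original path.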

I think this should mostly match how other GFA 2 parsers (e.g. Gfapy) handle these kinds of mixed paths.

Note that if your GFA 2 file describes a multigraph (i.e. it has parallel edges, and
you specified edge-paths in order to disambiguate which specific edges the path traverses) then this process
will inherently cause some ambiguity.
If you have strong opinions about this, please feel free to file an issue so we can discuss.

<hr/>
</details>

<details>
  <summary><strong>FAQ: My graph is a DOT file from LJA that does not have edge IDs. Can I still create a "paths" file for it?</strong></summary>

<hr/>

Yes!

Some background: in some older LJA graphs, edges do not have explicitly set IDs.
MetagenomeScope will detect this, and automatically create edge IDs
in the format `SOURCE → TARGET (FIRST NT)`. (That is, using the literal right arrow
Unicode symbol -- i.e. `U+2192`.)
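
For example, if you are generating a paths file with a script, you could build IDs in this format with something like the sketch below. The function name and example values are hypothetical, and it assumes `FIRST NT` is a single nucleotide character that you already have on hand.

```python
RIGHT_ARROW = "\u2192"  # the literal right arrow, U+2192


def make_edge_id(source, target, first_nt):
    """Build an edge ID in the format SOURCE → TARGET (FIRST NT)."""
    return f"{source} {RIGHT_ARROW} {target} ({first_nt})"


# Hypothetical node names and nucleotide, just to show the format
print(make_edge_id("1", "-2", "A"))
```

Make sure that your file is saved with an encoding that preserves this character (e.g. UTF-8).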

So, if you want to specify paths through these graphs that do not have
edge IDs, then: you can prepare your AGP (`-a`) or TSV (`-t`) file as normal,
but just refer to edges' IDs in this format. (Make sure to label the orientation
of each of these edge IDs as `+`, even if it contains negative-strand node(s).)

Here is an example of how you could do this:

- [DOT graph file without edge IDs](https://github.com/marbl/MetagenomeScope/blob/main/metagenomescope/tests/input/chr15_subgraph_noids.gv)

- [AGP file](https://github.com/marbl/MetagenomeScope/blob/main/metagenomescope/tests/input/chr15_subgraph_noids.agp)

- [TSV paths file](https://github.com/marbl/MetagenomeScope/blob/main/metagenomescope/tests/input/chr15_subgraph_noids.paths.tsv)

<hr/>
</details>

### Patterns

<details>
  <summary><strong>FAQ: How can I run the pattern decomposition process programmatically?</strong></summary>

<hr/>

Creating a `metagenomescope.graph.AssemblyGraph` object will automatically run the decomposition process:

```python
>>> from metagenomescope.graph import AssemblyGraph
>>> ag = AssemblyGraph("graph.gfa")  # replace with your graph's filepath
```

At this point:

- The "decomposed graph" (where patterns are collapsed into nodes) is represented by `ag.decomposed_graph` (a [NetworkX `MultiDiGraph`](https://networkx.org/documentation/stable/reference/classes/multidigraph.html)).

- The "true graph" (i.e. with all patterns fully uncollapsed, revealing all "original" nodes and edges) is represented by `ag.graph` (also a [NetworkX `MultiDiGraph`](https://networkx.org/documentation/stable/reference/classes/multidigraph.html)).
  - Note that this graph will still include split nodes and fake edges, if any remain after the decomposition process.

- All nodes, edges, and patterns will have unique integer IDs. These IDs can be used to look up information about nodes, edges, and patterns in the `ag.nodeid2obj`, `ag.edgeid2obj`, and `ag.pattid2obj` dictionaries, respectively.

Some examples of analyzing the decomposition results:

```python
>>> from metagenomescope.graph import AssemblyGraph
>>> ag = AssemblyGraph("metagenomescope/tests/input/E_coli_LastGraph")
>>> # Inspect nodes, edges, and patterns
>>> ag.nodeid2obj
{0: Node 0 (name: 1),
 1: Node 1 (name: -1),
 2: Node 2 (name: 2),
 ...}
>>> ag.edgeid2obj
{558: Edge 558 (orig: 0 -> 244; new: 0 -> 244; dec: 0 -> 1421),
 559: Edge 559 (orig: 1 -> 342; new: 1 -> 342; dec: 1527 -> 342),
 560: Edge 560 (orig: 2 -> 477; new: 2 -> 477; dec: 2 -> 477),
 ...}
>>> ag.pattid2obj
{1222: bubble1222 containing nodes [33, 283, 395, 39] from [33] to [39],
 1227: bubble1227 containing nodes [34, 76, 382, 303] from [34] to [76],
 1232: bubble1232 containing nodes [40, 43, 35, 501] from [35] to [43],
 ...}
>>> # Go through just the bubble patterns
>>> ag.bubbles
[bubble1222 containing nodes [33, 283, 395, 39] from [33] to [39],
 bubble1227 containing nodes [34, 76, 382, 303] from [34] to [76],
 bubble1232 containing nodes [40, 43, 35, 501] from [35] to [43],
 ...]
>>> # Look up a node by name (if a node was split, this will list both halves)
>>> ag.nodename2objs
defaultdict(<class 'list'>,
            {'1': [Node 0 (name: 1)],
             '-1': [Node 1 (name: -1)],
             '2': [Node 2 (name: 2)],
             ...
             '40-R': [Node 78 (name: 40-R)],
             '40': [Node 78 (name: 40-R), Node 1259 (name: 40-L)],
             '40-L': [Node 1259 (name: 40-L)],
             ...})
>>> # Examine split nodes
>>> for n in ag.nodeid2obj.values():
...     if n.split is not None:
...         print(n)
Node 32 (name: 17-L)
Node 33 (name: -17-R)
Node 34 (name: 18-R)
...
>>> # Distinguish fake from real edges
>>> for e in ag.edgeid2obj.values():
...     print(e, e.is_fake)
Edge 558 (orig: 0 -> 244; new: 0 -> 244; dec: 0 -> 1421) False
Edge 559 (orig: 1 -> 342; new: 1 -> 342; dec: 1527 -> 342) False
...
Edge*1634 (orig: 348 -> 1633; new: 348 -> 1633; dec: 1628 -> 1666) True
Edge*1639 (orig: 1638 -> 451; new: 1638 -> 451; dec: 1671 -> 1635) True
```

This interface should remain relatively stable, although I may change things slightly as development continues. If you have any questions, please reach out.

<hr/>
</details>

### Performance

<details>
  <summary><strong>FAQ: What's the biggest possible graph I can visualize?</strong></summary>

<hr/>

We're still figuring that out. There are a few bottlenecks:

1. Processing the graph.

    - Because we ([currently](https://github.com/marbl/MetagenomeScope/issues/423)) store the entire graph in memory, massive graphs -- with millions of nodes / edges -- can become impractical to load on low-memory systems.

2. Laying out the graph.

    - We usually only lay out one component at a time, so generally the problem comes with laying out the large "hairball" component(s) of the graph, if any.

    - When you get to the order of, say, thousands of nodes, laying out a component will probably become somewhat slow (especially if you select the `Lay out patterns recursively` option in the draw options dialog).

    - To my understanding, a big factor here is the ratio of nodes to edges: when there are many more edges than nodes in a component (indicating a very densely connected structure), Graphviz has to do a lot of work to position things properly.

3. Drawing the graph's elements.

    - Cytoscape.js has a lot of optimizations built-in, but I think there are some inherent limitations of drawing using an HTML canvas.

    - With graphs containing thousands of nodes, interaction (e.g. zooming, panning) starts to feel a bit sluggish.

See the "Large graph" section under "Example datasets" above for some tips for working with large graphs.

<hr/>
</details>

## Known issues

- **Edge flattening:** In certain cases, we may be unable to draw an edge with complex control points.
  [Usually](https://github.com/marbl/MetagenomeScope/issues/360) this happens when Cytoscape.js does not
  accept the control points Graphviz produced for an edge, but
  [sometimes](https://github.com/marbl/MetagenomeScope/issues/394) Graphviz will be unable to create
  control points for an edge in the first place. (Or
  [sometimes](https://github.com/marbl/MetagenomeScope/issues/406) an edge will get routed into the
  middle of nowhere...)

  In any case: MetagenomeScope will detect these kinds of edges and "flatten" them into
  [simple Bezier edges](https://js.cytoscape.org/#style/bezier-edges)
  (usually straight lines). This way, we can at least draw _something_ for each edge in the graph.

## Development documentation

See [`CONTRIBUTING.md`](https://github.com/marbl/MetagenomeScope/blob/main/CONTRIBUTING.md).

## Changelog

See [`CHANGELOG.md`](https://github.com/marbl/MetagenomeScope/blob/main/CHANGELOG.md).

## License

MetagenomeScope is licensed under the
[GNU GPL, version 3](https://www.gnu.org/copyleft/gpl.html).

MetagenomeScope's code is distributed with
[Bootstrap](https://getbootstrap.com/),
[Bootstrap Icons](https://icons.getbootstrap.com/),
[Cytoscape.js](https://js.cytoscape.org/),
[layout-base](https://github.com/iVis-at-Bilkent/layout-base),
[cose-base](https://github.com/iVis-at-Bilkent/cose-base),
[cytoscape-fcose](https://github.com/iVis-at-Bilkent/cytoscape.js-fcose),
[dagre](https://github.com/dagrejs/dagre),
[cytoscape-dagre](https://github.com/cytoscape/cytoscape.js-dagre),
and
[cytoscape-svg](https://github.com/kinimesi/cytoscape-svg).
Please see the [`metagenomescope/assets/vendor/licenses/`](https://github.com/marbl/MetagenomeScope/tree/main/metagenomescope/assets/vendor/licenses/) directory for copies of these tools' licenses.

## Acknowledgements

Thanks to various people in the Pop, Knight, and Pevzner Labs over the years for their kind feedback and helpful suggestions.

Thanks also to the developers of the many excellent open-source software packages used by MetagenomeScope. In particular,
[Graphviz](https://graphviz.org/) (graph layout), [Cytoscape.js](https://js.cytoscape.org/) (interactive graph drawing), and
[Dash](https://dash.plotly.com/) (application framework) have been extremely helpful tools throughout the development of
this project.

Some of MetagenomeScope's software tests use data from other places. Please see the
[`metagenomescope/tests/input/`](https://github.com/marbl/MetagenomeScope/tree/main/metagenomescope/tests/input)
directory's README for a list of acknowledgements.

## Contact

Please [open a GitHub issue](https://github.com/marbl/MetagenomeScope/issues) if you have any questions or suggestions.
