tnseq_stats

This command is useful for calculating some metrics on input wig (or combined_wig files) for assessement of data quality.

Usage:

> usage: python3 src/transit.py tnseq_stats <file.wig>+ [-o <output_file>]
         python src/transit.py tnseq_stats -c <combined_wig> [-o <output_file>]

It generates a table (tab-separated text file that can be opened in Excel) with the following statistics in it:

Column Header

Column Definition

Comments

Filename

Name of sample (wig file)

Density

Fraction of sites with insertions.

“Well-saturated” Himar1 datasets have >30% saturation. Below this, statistical methods may have trouble.

mean_ct

Average read-count over all TA sites.

NZMean

Average read-count, excluding empty sites.

A value between 30-200 is usually good for Himar1 datasets. Too high or too low can indicate problems.

NZMedian

Median read-count, excluding empty sites.

As read-counts can often have spikes, median serves as a good robust estimate.

max_ct

Largest read-count at any TA site

Useful to determine whether there are outliers/spikes, which may indicate sequencing issues.

total_cts

Sum of total read-counts in the sample.

Indicates how much sequencing material was obtained. Typically >1M reads is desired for Himar1 datasets.

skewness

3rd-order moment of read-count distribution.

Sharp peak? Large skew may indicate issues with a dataset. Typically a skew < 50 is desired. May be higher when

kurtosis

4th-order moment of read-counts distribution

Lop-sided peak?

PTI

Pickand Tail Index

Another statistical measure that indicates presence of individual TA sites with high outlier counts. PTI>1.0 is not good.

Here is an example:

> python3 src/transit.py tnseq_stats -c src/pytransit/data/cholesterol_glycerol_combined.dat
dataset density mean_ct NZmean  NZmedian        max_ct  total_cts       skewness        kurtosis        pickands_tail_index
src/pytransit/data/cholesterol_H37Rv_rep1.wig   0.439   139.6   317.6   147     125355.5        10414005        54.8    4237.7  0.973
src/pytransit/data/cholesterol_H37Rv_rep2.wig   0.439   171.4   390.5   148     704662.8        12786637        105.8   14216.2 1.529
src/pytransit/data/cholesterol_H37Rv_rep3.wig   0.359   173.8   484.2   171     292294.8        12968502        42.2    2328.0  1.584
src/pytransit/data/glycerol_H37Rv_rep1.wig      0.419   123.3   294.5   160     8813.3  9195672 4.0     33.0    0.184
src/pytransit/data/glycerol_H37Rv_rep2.wig      0.516   123.8   240.1   127     8542.5  9235984 4.0     33.5    0.152

In this example, you can see the 5 samples have saturations in the range of 35.9-51.6% (which is descent). The NZMeans are 123-139, but this is post-normalization. (TTR normalization had already been applied to this combined_wig file, so the means can be expected to be scaled to around 100.) If you want to see the NZmeans for the raw data, re-generate the combined_wig file using ‘-n nonorm’, to skip the automatic normalization step. These samples also exhibit skewness that is on the high side (33-105.8). This is probably related to the fact that some individual TA sites have very high counts. For example, rep2 of cholesterol has a max count of 704662 at a single TA site, representing over 5% of the 12.78M reads (total insertion counts). TTR is supposed to be robust by ignoring the top 5% of most abundant sites, but still the rest of the distribution of counts could be skewed. This sample also has a high Pickands’ tail index (1.53, which is above 1.0). While, we currently don’t have recommendations for hard cutoffs to use for identifying bad samples (that might need to be re-sequenced), I would say that skew>30 and/or PTI>1.0 suggest that samples are noisy or lower-quality. See Quality Control <transit_quality_control> for more discussion of assessing quality of TnSeq datasets. However, doing resampling on this data still yielded insights into many genes required for cholesterol metabolism in M. tuberculosis. [REF]