tnseq_stats¶
This command is useful for calculating some metrics on input wig (or combined_wig files) for assessement of data quality.
Usage:¶
> usage: python3 src/transit.py tnseq_stats <file.wig>+ [-o <output_file>]
python src/transit.py tnseq_stats -c <combined_wig> [-o <output_file>]
It generates a table (tab-separated text file that can be opened in Excel) with the following statistics in it:
Column Header |
Column Definition |
Comments |
---|---|---|
Filename |
Name of sample (wig file) |
|
Density |
Fraction of sites with insertions. |
“Well-saturated” Himar1 datasets have >30% saturation. Below this, statistical methods may have trouble. |
mean_ct |
Average read-count over all TA sites. |
|
NZMean |
Average read-count, excluding empty sites. |
A value between 30-200 is usually good for Himar1 datasets. Too high or too low can indicate problems. |
NZMedian |
Median read-count, excluding empty sites. |
As read-counts can often have spikes, median serves as a good robust estimate. |
max_ct |
Largest read-count at any TA site |
Useful to determine whether there are outliers/spikes, which may indicate sequencing issues. |
total_cts |
Sum of total read-counts in the sample. |
Indicates how much sequencing material was obtained. Typically >1M reads is desired for Himar1 datasets. |
skewness |
3rd-order moment of read-count distribution. |
Sharp peak? Large skew may indicate issues with a dataset. Typically a skew < 50 is desired. May be higher when |
kurtosis |
4th-order moment of read-counts distribution |
Lop-sided peak? |
PTI |
Pickand Tail Index |
Another statistical measure that indicates presence of individual TA sites with high outlier counts. PTI>1.0 is not good. |
Here is an example:
> python3 src/transit.py tnseq_stats -c src/pytransit/data/cholesterol_glycerol_combined.dat
dataset density mean_ct NZmean NZmedian max_ct total_cts skewness kurtosis pickands_tail_index
src/pytransit/data/cholesterol_H37Rv_rep1.wig 0.439 139.6 317.6 147 125355.5 10414005 54.8 4237.7 0.973
src/pytransit/data/cholesterol_H37Rv_rep2.wig 0.439 171.4 390.5 148 704662.8 12786637 105.8 14216.2 1.529
src/pytransit/data/cholesterol_H37Rv_rep3.wig 0.359 173.8 484.2 171 292294.8 12968502 42.2 2328.0 1.584
src/pytransit/data/glycerol_H37Rv_rep1.wig 0.419 123.3 294.5 160 8813.3 9195672 4.0 33.0 0.184
src/pytransit/data/glycerol_H37Rv_rep2.wig 0.516 123.8 240.1 127 8542.5 9235984 4.0 33.5 0.152
In this example, you can see the 5 samples have saturations in the range of 35.9-51.6% (which is descent). The NZMeans are 123-139, but this is post-normalization. (TTR normalization had already been applied to this combined_wig file, so the means can be expected to be scaled to around 100.) If you want to see the NZmeans for the raw data, re-generate the combined_wig file using ‘-n nonorm’, to skip the automatic normalization step. These samples also exhibit skewness that is on the high side (33-105.8). This is probably related to the fact that some individual TA sites have very high counts. For example, rep2 of cholesterol has a max count of 704662 at a single TA site, representing over 5% of the 12.78M reads (total insertion counts). TTR is supposed to be robust by ignoring the top 5% of most abundant sites, but still the rest of the distribution of counts could be skewed. This sample also has a high Pickands’ tail index (1.53, which is above 1.0). While, we currently don’t have recommendations for hard cutoffs to use for identifying bad samples (that might need to be re-sequenced), I would say that skew>30 and/or PTI>1.0 suggest that samples are noisy or lower-quality. See Quality Control <transit_quality_control> for more discussion of assessing quality of TnSeq datasets. However, doing resampling on this data still yielded insights into many genes required for cholesterol metabolism in M. tuberculosis. [REF]