Contents:
A library for analysing codon usage bias with the quasispecies model.
A Class for fitness functions!
Calculates the number of transitions, transversions and unchanged nucleotides when one codon mutates into another. It is implemented in C with scipy.weave. A pure Python version for easier readability is available as py_codon_mut_dist.
Parameters:
    a : string with codon 1
    b : string with codon 2
Returns:
    List with three doubles `results`, with `results[0]` containing the number of unchanged nucleotides, `results[1]` the number of transitions and `results[2]` the number of transversions.
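The counting described above can be sketched in a few lines of plain Python. This is only an illustration, not the package's py_codon_mut_dist: the function name `count_codon_changes` is hypothetical, and the sketch assumes the usual definition of a transition as a purine-to-purine (A, G) or pyrimidine-to-pyrimidine (C, T) change.

```python
# Purines and pyrimidines; a transition stays within one of these groups.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def count_codon_changes(a, b):
    """Return [unchanged, transitions, transversions] for codons a and b.

    Hypothetical helper for illustration; RNA input (U) is mapped to T.
    """
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    if len(a) != 3 or len(b) != 3:
        raise ValueError("codons must have exactly three nucleotides")
    unchanged = transitions = transversions = 0.0
    for x, y in zip(a, b):
        if x == y:
            unchanged += 1
        elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return [unchanged, transitions, transversions]


print(count_codon_changes("AAA", "AAG"))  # one A->G transition
print(count_codon_changes("AAA", "AAT"))  # one A->T transversion
```

The three entries always sum to 3, which gives a cheap sanity check on any implementation.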
Calculates the relative codon frequency for each gene in the codon histogram.
Tested against a run on testdata2.ffn and hand-calculated values: for amino acid A (codons 52 - 55) the result should be 0.24, 0.3066, 0.12, 0.33.
>>> calculate_RF_dic( (make_codon_histogram_dic( load_fasta("testdata2.ffn")))[0] ) # doctest: +NORMALIZE_WHITESPACE
{'fid|18348942|locus|VBIEscCol44059_0001|': array([ 0.375 , 0.625 , 0.14102564, 0.11538462, 0.15555556,
0.26666667, 0.13333333, 0.2 , 0.57894737, 0.42105263,
0. , 0. , 0.25 , 0.75 , 1. ,
1. , 0.07692308, 0.14102564, 0.03846154, 0.48717949,
0.12 , 0.16 , 0.04 , 0.68 , 0.58333333,
0.41666667, 0.36363636, 0.63636364, 0.3902439 , 0.43902439,
0.04878049, 0.09756098, 0.68421053, 0.31578947, 0. ,
1. , 0.14285714, 0.60714286, 0. , 0.25 ,
0.58823529, 0.41176471, 0.65517241, 0.34482759, 0.11111111,
0.13333333, 0. , 0.02439024, 0.29508197, 0.2295082 ,
0.09836066, 0.37704918, 0.24 , 0.30666667, 0.12 ,
0.33333333, 0.675 , 0.325 , 0.72340426, 0.27659574,
0.35714286, 0.35714286, 0.16071429, 0.125 ]), 'fid|129049020348348942|locus|VBIEscCol44059_0001|': array([ 0.375 , 0.625 , 0.14102564, 0.11538462, 0.15555556,
0.26666667, 0.13333333, 0.2 , 0.57894737, 0.42105263,
0. , 0. , 0.25 , 0.75 , 1. ,
1. , 0.07692308, 0.14102564, 0.03846154, 0.48717949,
0.12 , 0.16 , 0.04 , 0.68 , 0.58333333,
0.41666667, 0.36363636, 0.63636364, 0.3902439 , 0.43902439,
0.04878049, 0.09756098, 0.68421053, 0.31578947, 0. ,
1. , 0.14285714, 0.60714286, 0. , 0.25 ,
0.58823529, 0.41176471, 0.65517241, 0.34482759, 0.11111111,
0.13333333, 0. , 0.02439024, 0.29508197, 0.2295082 ,
0.09836066, 0.37704918, 0.24 , 0.30666667, 0.12 ,
0.33333333, 0.675 , 0.325 , 0.72340426, 0.27659574,
0.35714286, 0.35714286, 0.16071429, 0.125 ])}
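The hand calculation for alanine can be reproduced with a short sketch. It assumes the relative frequency of a codon is its count divided by the total count of all codons coding for the same amino acid. The helper name `relative_frequencies` is hypothetical. The counts for GCU, GCC, GCA and GCG (18, 23, 9, 25) are taken from the codon count table shown for this gene elsewhere in this reference. Returning zeros for an amino acid that never occurs is an assumption of the sketch.

```python
def relative_frequencies(counts):
    """Relative frequency of each synonymous codon of one amino acid.

    `counts` holds the raw counts of the synonymous codons. Returning
    zeros when the amino acid does not occur is an assumption here.
    """
    total = float(sum(counts))
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Alanine codons GCU, GCC, GCA, GCG (positions 52 - 55 in the codon order)
alanine_counts = [18, 23, 9, 25]
rf = relative_frequencies(alanine_counts)
print([round(x, 4) for x in rf])  # [0.24, 0.3067, 0.12, 0.3333]
```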
Returns codon_rscu (the relative synonymous codon usage) for each gene in the codon histogram.
Tested like calculate_rf and checked against genomes.urv.es/optimizer.
>>> calculate_RSCU_dic( (make_codon_histogram_dic( load_fasta("testdata2.ffn")))[0] )
{'fid|18348942|locus|VBIEscCol44059_0001|': array([ 0.75 , 1.25 , 0.84615385, 0.69230769, 0.93333333,
1.6 , 0.8 , 1.2 , 1.15789474, 0.84210526,
0. , 0. , 0.5 , 1.5 , 3. ,
1. , 0.46153846, 0.84615385, 0.23076923, 2.92307692,
0.48 , 0.64 , 0.16 , 2.72 , 1.16666667,
0.83333333, 0.72727273, 1.27272727, 2.34146341, 2.63414634,
0.29268293, 0.58536585, 2.05263158, 0.94736842, 0. ,
1. , 0.57142857, 2.42857143, 0. , 1. ,
1.17647059, 0.82352941, 1.31034483, 0.68965517, 0.66666667,
0.8 , 0. , 0.14634146, 1.18032787, 0.91803279,
0.39344262, 1.50819672, 0.96 , 1.22666667, 0.48 ,
1.33333333, 1.35 , 0.65 , 1.44680851, 0.55319149,
1.42857143, 1.42857143, 0.64285714, 0.5 ]), 'fid|129049020348348942|locus|VBIEscCol44059_0001|': array([ 0.75 , 1.25 , 0.84615385, 0.69230769, 0.93333333,
1.6 , 0.8 , 1.2 , 1.15789474, 0.84210526,
0. , 0. , 0.5 , 1.5 , 3. ,
1. , 0.46153846, 0.84615385, 0.23076923, 2.92307692,
0.48 , 0.64 , 0.16 , 2.72 , 1.16666667,
0.83333333, 0.72727273, 1.27272727, 2.34146341, 2.63414634,
0.29268293, 0.58536585, 2.05263158, 0.94736842, 0. ,
1. , 0.57142857, 2.42857143, 0. , 1. ,
1.17647059, 0.82352941, 1.31034483, 0.68965517, 0.66666667,
0.8 , 0. , 0.14634146, 1.18032787, 0.91803279,
0.39344262, 1.50819672, 0.96 , 1.22666667, 0.48 ,
1.33333333, 1.35 , 0.65 , 1.44680851, 0.55319149,
1.42857143, 1.42857143, 0.64285714, 0.5 ])}
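The RSCU values above are consistent with the relative frequency multiplied by the number of synonymous codons, so a codon used exactly as often as its synonyms gets the value 1. The sketch below assumes that definition. The helper name `rscu` is hypothetical. The leucine counts (UUA, UUG, CUU, CUC, CUA, CUG) come from the codon count table shown for this gene elsewhere in this reference.

```python
def rscu(counts):
    """Relative synonymous codon usage for the codons of one amino acid.

    RSCU_i = n * count_i / sum(counts), where n is the number of
    synonymous codons. Zeros for an absent amino acid are an assumption.
    """
    total = float(sum(counts))
    if total == 0:
        return [0.0 for _ in counts]
    n = len(counts)
    return [n * c / total for c in counts]


# Leucine codons UUA, UUG, CUU, CUC, CUA, CUG
leucine_counts = [11, 9, 6, 11, 3, 38]
print([round(x, 4) for x in rscu(leucine_counts)])
```

For leucine the last entry (CUG) comes out near 2.92, matching the value at the corresponding position of the array above.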
Different organisms use different genetic codes. Given code_name, which must be a key of genetic_codes, the global dictionaries for translating codons into amino acids are reset to the requested genetic code:
genetic_codes['The Standard Code'] = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['The Vertebrate Mitochondrial Code'] = 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG'
genetic_codes['The Yeast Mitochondrial Code'] = 'FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['The Mold, Protozoan, and Coelenterate Mitochondrial Code'] = 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['The Invertebrate Mitochondrial Code'] = 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG'
genetic_codes['The Ciliate, Dasycladacean and Hexamita Nuclear Code'] = 'FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['The Echinoderm and Flatworm Mitochondrial Code'] = 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG'
genetic_codes['The Euplotid Nuclear Code'] = 'FFLLSSSSYY**CCCWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['The Bacterial, Archaeal and Plant Plastid Code'] = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['The Alternative Yeast Nuclear Code'] = 'FFLLSSSSYY**CC*WLLLSPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['The Ascidian Mitochondrial Code'] = 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSGGVVVVAAAADDEEGGGG'
genetic_codes['The Alternative Flatworm Mitochondrial Code'] = 'FFLLSSSSYYY*CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG'
genetic_codes['The Blepharisma Nuclear Code'] = 'FFLLSSSSYY*QCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['Chlorophycean Mitochondrial Code'] = 'FFLLSSSSYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['Trematode Mitochondrial Code'] = 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNNKSSSSVVVVAAAADDEEGGGG'
genetic_codes['Scenedesmus obliquus mitochondrial Code'] = 'FFLLSS*SYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['Thraustochytrium Mitochondrial Code'] = 'FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
genetic_codes['Pterobranchia mitochondrial code'] = 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSSKVVVVAAAADDEEGGGG'
Examples
After changing the genetic code to something other than the standard code, the global amino_acids variable should have changed to the new code.
>>> _=change_amino_acid_code('The Mold, Protozoan, and Coelenterate Mitochondrial Code')
>>> amino_acids
'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
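To make the 64-letter code strings easier to read, here is a minimal sketch of turning one into a codon-to-amino-acid dictionary. It assumes the letters follow the NCBI ordering with bases T, C, A, G and the first base varying slowest, which agrees with the codon order visible in the histograms of this reference. The function name `build_codon_table` is hypothetical and not part of the package.

```python
from itertools import product

BASES = "TCAG"  # assumed ordering: TTT, TTC, TTA, TTG, TCT, ...


def build_codon_table(code_string):
    """Map each of the 64 codons to the amino acid letter in code_string."""
    if len(code_string) != 64:
        raise ValueError("a genetic code string needs exactly 64 letters")
    codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
    return dict(zip(codons, code_string))


standard = build_codon_table(
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
vertebrate_mito = build_codon_table(
    "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG")

print(standard["TGA"], vertebrate_mito["TGA"])  # stop codon vs. tryptophan
```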
Calculates the number of transitions, transversions and unchanged nucleotides when one codon mutates into another. It is implemented in C with scipy.weave. A pure Python version for easier readability is available as py_codon_mut_dist.
Parameters:
    a : string with codon 1
    b : string with codon 2
Returns:
    List with three doubles `results`, with `results[0]` containing the number of unchanged nucleotides, `results[1]` the number of transitions and `results[2]` the number of transversions.
The steady state is the eigenvector belonging to the largest eigenvalue of the evolution matrix for a specific amino acid
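As an illustration of this definition, the dominant eigenvector of a small non-negative matrix can be found by power iteration. This is a sketch of the general technique, not necessarily the eigen-solver the package uses. The function name `steady_state` and the 2 x 2 example matrix are assumptions made for the example, and the method presumes a unique largest eigenvalue.

```python
def steady_state(matrix, iterations=1000, tol=1e-12):
    """Dominant eigenvector of a non-negative square matrix, normalised
    so that its entries sum to one (power iteration)."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        # One matrix-vector product, then renormalise to unit sum.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [x / total for x in w]
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v


# Hypothetical 2 x 2 evolution matrix (columns sum to one)
example = [[0.9, 0.2],
           [0.1, 0.8]]
print(steady_state(example))  # close to [2/3, 1/3]
```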
Loads a list of highly expressed genes (HEG) and returns an index that is 1 for genes that are highly expressed and 0 otherwise. The expected format is the one used by ecai/heg; the function tries to find each id in the description of the histogram keys.
Loads fitness functions from file. Either a config from file or a list of filenames.
Loads the fitness matrix from file. Either a config from file or a list of filenames.
Tests whether the string codon contains an ambiguous letter that is found in ambig_fasta_chars.
Examples
The table of codons, codons should not contain any ambigous codons >>> map(is_ambig_codon,codons) [False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
However, the string ‘aay’ should >>> is_ambig_codon(‘aay’) True
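A check of this kind can be sketched as follows. The actual contents of ambig_fasta_chars are not shown in this reference, so the sketch assumes the IUPAC nucleotide ambiguity codes. Both `AMBIGUOUS_LETTERS` and `contains_ambiguous_letter` are hypothetical names.

```python
# IUPAC nucleotide ambiguity codes (assumed stand-in for ambig_fasta_chars)
AMBIGUOUS_LETTERS = frozenset("RYSWKMBDHVN")


def contains_ambiguous_letter(codon):
    """True if any letter of the codon is an IUPAC ambiguity code."""
    return any(letter in AMBIGUOUS_LETTERS for letter in codon.upper())


print(contains_ambiguous_letter("aay"))  # True: y stands for c or t
print(contains_ambiguous_letter("aat"))  # False
```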
Loads a fasta file given its filename. Returns a generator object that yields the genes.
Parameters:
    filename : str
Returns:
    fasta : Seq
Examples
If everything is right, a generator object is returned
>>> load_fasta("testdata.ffn")
<generator object parse at ...>
The file must exist
>>> load_fasta("yikes")
Open: yikes failed
[Errno 2] No such file or directory: 'yikes'
However, we use Biopython and the fasta file’s syntax is not checked here!
>>> load_fasta("testdata_fail.ffn")
<generator object parse at ...>
Loads fasta from a URL.
Parameters:
    url : str
Returns:
    fasta : Bio.SeqIO
Loads a genbank file given its filename. Returns a generator object that yields the genes.
Parameters:
    filename : str
Returns:
    genes : Seq
Loads a nucleotide file containing a plaintext sequence, given its filename. Returns a list of genes.
Returns codon_hist, aa_hist for one gene.
Parameters:
    gene : SeqIO
Returns:
    codon_hist : dict with each gene.id as key and a num_codons x 1 np.array for each codon
    aa_hist : dict with each gene.id as key and a num_aa x 1 np.array for each amino acid
Examples
The histogram should reproduce what www.kazusa.org.jp/codon/cgi-bin/countcodon.cgi computes for the gene in testdata.ffn:
UUU 12.6( 9) UCU 9.8( 7) UAU 15.4( 11) UGU 4.2( 3)
UUC 21.1( 15) UCC 16.9( 12) UAC 11.2( 8) UGC 12.6( 9)
UUA 15.4( 11) UCA 8.4( 6) UAA 0.0( 0) UGA 1.4( 1)
UUG 12.6( 9) UCG 12.6( 9) UAG 0.0( 0) UGG 5.6( 4)
CUU 8.4( 6) CCU 4.2( 3) CAU 9.8( 7) CGU 22.5( 16)
CUC 15.4( 11) CCC 5.6( 4) CAC 7.0( 5) CGC 25.3( 18)
CUA 4.2( 3) CCA 1.4( 1) CAA 11.2( 8) CGA 2.8( 2)
CUG 53.4( 38) CCG 23.9( 17) CAG 19.7( 14) CGG 5.6( 4)
AUU 36.5( 26) ACU 5.6( 4) AAU 28.1( 20) AGU 7.0( 5)
AUC 16.9( 12) ACC 23.9( 17) AAC 19.7( 14) AGC 8.4( 6)
AUA 0.0( 0) ACA 0.0( 0) AAA 26.7( 19) AGA 0.0( 0)
AUG 29.5( 21) ACG 9.8( 7) AAG 14.0( 10) AGG 1.4( 1)
GUU 25.3( 18) GCU 25.3( 18) GAU 37.9( 27) GGU 28.1( 20)
GUC 19.7( 14) GCC 32.3( 23) GAC 18.3( 13) GGC 28.1( 20)
GUA 8.4( 6) GCA 12.6( 9) GAA 47.8( 34) GGA 12.6( 9)
GUG 32.3( 23) GCG 35.1( 25) GAG 18.3( 13) GGG 9.8( 7)
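The values to compare are those in the first field of each entry (frequency per thousand, so 12.6 corresponds to a histogram value of about 0.0126). Remember, the order of codons in this package is given by the variable codons.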
>>> make_codon_histogram( load_fasta("testdata.ffn").next() )
([0.012640449438202247, 0.021067415730337078, 0.015449438202247191, 0.012640449438202247, 0.0098314606741573031, 0.016853932584269662, 0.0084269662921348312, 0.012640449438202247, 0.015449438202247191, 0.011235955056179775, 0.0, 0.0, 0.0042134831460674156, 0.012640449438202247, 0.0014044943820224719, 0.0056179775280898875, 0.0084269662921348312, 0.015449438202247191, 0.0042134831460674156, 0.053370786516853931, 0.0042134831460674156, 0.0056179775280898875, 0.0014044943820224719, 0.023876404494382022, 0.0098314606741573031, 0.0070224719101123594, 0.011235955056179775, 0.019662921348314606, 0.02247191011235955, 0.025280898876404494, 0.0028089887640449437, 0.0056179775280898875, 0.036516853932584269, 0.016853932584269662, 0.0, 0.029494382022471909, 0.0056179775280898875, 0.023876404494382022, 0.0, 0.0098314606741573031, 0.028089887640449437, 0.019662921348314606, 0.026685393258426966, 0.014044943820224719, 0.0070224719101123594, 0.0084269662921348312, 0.0, 0.0014044943820224719, 0.025280898876404494, 0.019662921348314606, 0.0084269662921348312, 0.032303370786516857, 0.025280898876404494, 0.032303370786516857, 0.012640449438202247, 0.0351123595505618, 0.037921348314606744, 0.018258426966292134, 0.047752808988764044, 0.018258426966292134, 0.028089887640449437, 0.028089887640449437, 0.012640449438202247, 0.0098314606741573031], [0.033707865168539325, 0.10955056179775281, 0.063202247191011238, 0.026685393258426966, 0.0014044943820224719, 0.016853932584269662, 0.0056179775280898875, 0.0351123595505618, 0.016853932584269662, 0.030898876404494381, 0.05758426966292135, 0.053370786516853931, 0.029494382022471909, 0.039325842696629212, 0.047752808988764044, 0.040730337078651688, 0.085674157303370788, 0.10533707865168539, 0.056179775280898875, 0.066011235955056174, 0.078651685393258425])
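For readers who want to see how such a codon histogram comes about, here is a minimal sketch. It assumes the codon order T, C, A, G with the first base varying slowest, which matches the order of the values above, and normalises by the total number of counted codons. The name `codon_frequencies` is hypothetical, and silently skipping codons that contain other letters is an assumption, not documented package behaviour.

```python
from itertools import product

# Assumed codon order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(triplet) for triplet in product("TCAG", repeat=3)]


def codon_frequencies(sequence):
    """Return a list of 64 relative codon frequencies for one sequence."""
    seq = sequence.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        codon = seq[start:start + 3]
        if codon in counts:  # skip codons with ambiguous letters
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[codon] / float(total) for codon in CODONS]


hist = codon_frequencies("ATGTTTTTTTAA")  # ATG TTT TTT TAA
print(hist[CODONS.index("TTT")])  # 0.5
```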
Returns codon_hist for all genes in a fasta file as a dictionary with the fasta identifiers as keys.
The histogram should reproduce what www.kazusa.org.jp/codon/cgi-bin/countcodon.cgi computes for the first gene in testdata2.ffn:
UUU 12.6( 9) UCU 9.8( 7) UAU 15.4( 11) UGU 4.2( 3)
UUC 21.1( 15) UCC 16.9( 12) UAC 11.2( 8) UGC 12.6( 9)
UUA 15.4( 11) UCA 8.4( 6) UAA 0.0( 0) UGA 1.4( 1)
UUG 12.6( 9) UCG 12.6( 9) UAG 0.0( 0) UGG 5.6( 4)
CUU 8.4( 6) CCU 4.2( 3) CAU 9.8( 7) CGU 22.5( 16)
CUC 15.4( 11) CCC 5.6( 4) CAC 7.0( 5) CGC 25.3( 18)
CUA 4.2( 3) CCA 1.4( 1) CAA 11.2( 8) CGA 2.8( 2)
CUG 53.4( 38) CCG 23.9( 17) CAG 19.7( 14) CGG 5.6( 4)
AUU 36.5( 26) ACU 5.6( 4) AAU 28.1( 20) AGU 7.0( 5)
AUC 16.9( 12) ACC 23.9( 17) AAC 19.7( 14) AGC 8.4( 6)
AUA 0.0( 0) ACA 0.0( 0) AAA 26.7( 19) AGA 0.0( 0)
AUG 29.5( 21) ACG 9.8( 7) AAG 14.0( 10) AGG 1.4( 1)
GUU 25.3( 18) GCU 25.3( 18) GAU 37.9( 27) GGU 28.1( 20)
GUC 19.7( 14) GCC 32.3( 23) GAC 18.3( 13) GGC 28.1( 20)
GUA 8.4( 6) GCA 12.6( 9) GAA 47.8( 34) GGA 12.6( 9)
GUG 32.3( 23) GCG 35.1( 25) GAG 18.3( 13) GGG 9.8( 7)
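The values to compare are those in the first field of each entry (frequency per thousand, so 12.6 corresponds to a histogram value of about 0.0126). Remember, the order of codons in this package is given by the variable codons.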
>>> make_codon_histogram_dic( load_fasta("testdata2.ffn") )[0]
{'fid|18348942|locus|VBIEscCol44059_0001|': [0.012640449438202247, 0.021067415730337078, 0.015449438202247191, 0.012640449438202247, 0.0098314606741573031, 0.016853932584269662, 0.0084269662921348312, 0.012640449438202247, 0.015449438202247191, 0.011235955056179775, 0.0, 0.0, 0.0042134831460674156, 0.012640449438202247, 0.0014044943820224719, 0.0056179775280898875, 0.0084269662921348312, 0.015449438202247191, 0.0042134831460674156, 0.053370786516853931, 0.0042134831460674156, 0.0056179775280898875, 0.0014044943820224719, 0.023876404494382022, 0.0098314606741573031, 0.0070224719101123594, 0.011235955056179775, 0.019662921348314606, 0.02247191011235955, 0.025280898876404494, 0.0028089887640449437, 0.0056179775280898875, 0.036516853932584269, 0.016853932584269662, 0.0, 0.029494382022471909, 0.0056179775280898875, 0.023876404494382022, 0.0, 0.0098314606741573031, 0.028089887640449437, 0.019662921348314606, 0.026685393258426966, 0.014044943820224719, 0.0070224719101123594, 0.0084269662921348312, 0.0, 0.0014044943820224719, 0.025280898876404494, 0.019662921348314606, 0.0084269662921348312, 0.032303370786516857, 0.025280898876404494, 0.032303370786516857, 0.012640449438202247, 0.0351123595505618, 0.037921348314606744, 0.018258426966292134, 0.047752808988764044, 0.018258426966292134, 0.028089887640449437, 0.028089887640449437, 0.012640449438202247, 0.0098314606741573031], 'fid|129049020348348942|locus|VBIEscCol44059_0001|': [0.012640449438202247, 0.021067415730337078, 0.015449438202247191, 0.012640449438202247, 0.0098314606741573031, 0.016853932584269662, 0.0084269662921348312, 0.012640449438202247, 0.015449438202247191, 0.011235955056179775, 0.0, 0.0, 0.0042134831460674156, 0.012640449438202247, 0.0014044943820224719, 0.0056179775280898875, 0.0084269662921348312, 0.015449438202247191, 0.0042134831460674156, 0.053370786516853931, 0.0042134831460674156, 0.0056179775280898875, 0.0014044943820224719, 0.023876404494382022, 0.0098314606741573031, 0.0070224719101123594, 
0.011235955056179775, 0.019662921348314606, 0.02247191011235955, 0.025280898876404494, 0.0028089887640449437, 0.0056179775280898875, 0.036516853932584269, 0.016853932584269662, 0.0, 0.029494382022471909, 0.0056179775280898875, 0.023876404494382022, 0.0, 0.0098314606741573031, 0.028089887640449437, 0.019662921348314606, 0.026685393258426966, 0.014044943820224719, 0.0070224719101123594, 0.0084269662921348312, 0.0, 0.0014044943820224719, 0.025280898876404494, 0.019662921348314606, 0.0084269662921348312, 0.032303370786516857, 0.025280898876404494, 0.032303370786516857, 0.012640449438202247, 0.0351123595505618, 0.037921348314606744, 0.018258426966292134, 0.047752808988764044, 0.018258426966292134, 0.028089887640449437, 0.028089887640449437, 0.012640449438202247, 0.0098314606741573031]}
Makes a codon and amino acid histogram from a file parser object, but this time combines all genes so that only one histogram is returned.
Returns:
    codon_hist : dict['combined genome'] :: np.array x 64
Given the mutation matrix, the fitness functions, the fitness matrices (the amino-acid identity matrix) and the selection strength, this builds the evolution matrix.
Oops, not correct! Read models_of_dna_evolution on the wiki.
When the r codons of a gene are put into n bins of different codons, the probability of finding a given codon with k occurrences in the gene is

.. math::

    p_k = \binom{r}{k} \frac{(n-1)^{r-k}}{n^r}

Hence, for n = 64, the probability of finding at least one occurrence is

.. math::

    p_{k \geq 1} = 1 - p_0 = \left(\frac{63}{64}\right)^r \left( \left(\frac{64}{63}\right)^r - 1 \right) = 1 - \left(\frac{63}{64}\right)^r

If we want to be reasonably sure that our gene contains a given codon at least once, we have to solve the inequality p_{k \geq 1} > p-value. A p-value of 0.95 gives r > ln(0.05) / ln(63/64), which is about 190.2, so we have to use a gene with at least 191 codons, or roughly 570 nucleotides.
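The inequality can be checked numerically. The sketch below assumes each codon position falls into one of n = 64 equally likely bins, so the probability that a given codon occurs at least once among r codons is 1 - (63/64)^r. The function names are hypothetical.

```python
import math


def prob_at_least_one(r, n=64):
    """Probability that a given codon occurs at least once among r codons,
    assuming n equally likely codons."""
    return 1.0 - ((n - 1.0) / n) ** r


def min_codons(p, n=64):
    """Smallest r with prob_at_least_one(r, n) above the threshold p."""
    return int(math.ceil(math.log(1.0 - p) / math.log((n - 1.0) / n)))


r = min_codons(0.95)
print(r, 3 * r)  # codons and the corresponding number of nucleotides
print(prob_at_least_one(r - 1), prob_at_least_one(r))
```

With a threshold of 0.95 this yields 191 codons (573 nucleotides); with one codon fewer the probability stays just below the threshold.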
given the steady state and the target sequence, let us optimize!