Loading iTP-Seq data
Automatic loading from a directory
The easiest way to create a DataSet is to name the files consistently
(see Naming conventions).
The parsing step creates four files for each input FASTQ file:
- the inverse-toeprint sequences as nucleotides (<file_prefix>.nuc.itp.txt)
- the inverse-toeprint sequences as amino acids (<file_prefix>.aa.itp.txt)
- metadata as JSON (<file_prefix>.itp.json)
- a log file (<file_prefix>.itp.log)
All the files share the same prefix and the JSON files are used to identify the replicates.
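Since the JSON files identify the replicates, the prefixes present in a directory can be listed by looking for files ending in .itp.json. The following is a minimal sketch using only the standard library; find_prefixes is a hypothetical helper written for illustration, not part of itpseq, and it does not claim to reproduce how DataSet scans a directory.

```python
from pathlib import Path

JSON_SUFFIX = ".itp.json"


def find_prefixes(directory):
    """Return the sorted file prefixes that have an .itp.json metadata file.

    Hypothetical helper for illustration only (not an itpseq function).
    """
    return sorted(
        path.name[: -len(JSON_SUFFIX)]
        for path in Path(directory).glob("*" + JSON_SUFFIX)
    )


# List the prefixes found in the current directory
print(find_prefixes("."))
```

Each returned prefix can then be checked against the other expected output files before loading the data.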
Default behavior
By default, DataSet expects a prefix in the XXX_YYYDD format: XXX
(alphanumeric) is assigned to the “lib_type” key, YYY (letters) to the “sample”
key, and DD (digits) to the “replicate” key. For example, nnn15_noa1 is read as
lib_type nnn15, sample noa, replicate 1.
Therefore a directory containing 3 “noa” and 3 “tcx” replicates would look like:
nnn15_noa1.aa.itp.txt nnn15_noa3.aa.itp.txt nnn15_tcx2.aa.itp.txt
nnn15_noa1.assembled.fastq nnn15_noa3.assembled.fastq nnn15_tcx2.assembled.fastq
nnn15_noa1.itp.json nnn15_noa3.itp.json nnn15_tcx2.itp.json
nnn15_noa1.itp.log nnn15_noa3.itp.log nnn15_tcx2.itp.log
nnn15_noa1.nuc.itp.txt nnn15_noa3.nuc.itp.txt nnn15_tcx2.nuc.itp.txt
nnn15_noa2.aa.itp.txt nnn15_tcx1.aa.itp.txt nnn15_tcx3.aa.itp.txt
nnn15_noa2.assembled.fastq nnn15_tcx1.assembled.fastq nnn15_tcx3.assembled.fastq
nnn15_noa2.itp.json nnn15_tcx1.itp.json nnn15_tcx3.itp.json
nnn15_noa2.itp.log nnn15_tcx1.itp.log nnn15_tcx3.itp.log
nnn15_noa2.nuc.itp.txt nnn15_tcx1.nuc.itp.txt nnn15_tcx3.nuc.itp.txt
Loading this directory automatically assigns the 6 Replicates to 2 Samples (“noa” and “tcx”, 3 Replicates each). In addition, a sample named “noa” is automatically assigned as the reference for the other samples that share the same keys (other than “sample”):
In [1]: from itpseq import DataSet
In [2]: data = DataSet('.') # current directory containing the data files
In [3]: data
Out[3]:
DataSet(data_path='.',
file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)',
samples=[Sample(nnn15.noa:[1, 2, 3]),
Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)],
)
In [4]: data.samples
Out[4]:
{'nnn15.noa': Sample(nnn15.noa:[1, 2, 3]),
'nnn15.tcx': Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)}
In [5]: data.replicates
Out[5]:
{'nnn15.noa.1': Replicate(nnn15.noa.1),
'nnn15.noa.2': Replicate(nnn15.noa.2),
'nnn15.noa.3': Replicate(nnn15.noa.3),
'nnn15.tcx.1': Replicate(nnn15.tcx.1),
'nnn15.tcx.2': Replicate(nnn15.tcx.2),
'nnn15.tcx.3': Replicate(nnn15.tcx.3)}
This detection relies on the default regular expression file_pattern:
(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+).
The lib_type and sample keys are automatically used to group the
Replicates into a Sample and to create the Sample name.
It is possible to specify the keys to use to group the Replicates:
In [6]: DataSet('.', keys=['sample']) # ignoring "lib_type"
Out[6]:
DataSet(data_path='.',
file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)',
samples=[Sample(noa:[1, 2, 3]),
Sample(tcx:[1, 2, 3], ref: noa)],
)
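To see how the default pattern splits a prefix and how the choice of keys changes the sample names, the grouping can be mimicked with the standard re module. This is only a sketch: parse_prefix and group_replicates are hypothetical helpers, the use of re.match is an assumption, and the code is not the library's implementation. Only the pattern and the dot-joined sample names are taken from the outputs shown above.

```python
import re
from collections import defaultdict

# Default pattern, as printed in the DataSet repr
DEFAULT_PATTERN = r"(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)"


def parse_prefix(prefix, pattern=DEFAULT_PATTERN):
    """Return the named groups found in a prefix, or None if it does not match."""
    match = re.match(pattern, prefix)
    return match.groupdict() if match else None


def group_replicates(prefixes, keys=("lib_type", "sample")):
    """Group replicate numbers under a name built by joining the chosen keys with '.'."""
    groups = defaultdict(list)
    for prefix in prefixes:
        fields = parse_prefix(prefix)
        if fields is None:
            continue  # ignore prefixes that do not follow the naming scheme
        name = ".".join(fields[key] for key in keys)
        groups[name].append(int(fields["replicate"]))
    return dict(groups)


prefixes = ["nnn15_noa1", "nnn15_noa2", "nnn15_tcx1"]
print(parse_prefix("nnn15_noa1"))
# {'lib_type': 'nnn15', 'sample': 'noa', 'replicate': '1'}
print(group_replicates(prefixes))
# {'nnn15.noa': [1, 2], 'nnn15.tcx': [1]}
print(group_replicates(prefixes, keys=("sample",)))
# {'noa': [1, 2], 'tcx': [1]}
```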
Custom prefix and keys
Let’s imagine a dataset with two drugs (drugA and drugB), one control (noa), and three drug concentrations (10, 20, and 30 µM):
drugA1_10µM.itp.json drugA3_20µM.itp.json drugB2_30µM.itp.json
drugA1_20µM.itp.json drugA3_30µM.itp.json drugB3_10µM.itp.json
drugA1_30µM.itp.json drugB1_10µM.itp.json drugB3_20µM.itp.json
drugA2_10µM.itp.json drugB1_20µM.itp.json drugB3_30µM.itp.json
drugA2_20µM.itp.json drugB1_30µM.itp.json noa1.itp.json
drugA2_30µM.itp.json drugB2_10µM.itp.json noa2.itp.json
drugA3_10µM.itp.json drugB2_20µM.itp.json noa3.itp.json
The different parts of the filename can be defined through file_pattern:
(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?
- (?P<sample>[^_]+): matches the sample name (anything but _)
- (?P<replicate>\d+): matches the digits defining the replicate number
- (_(?P<concentration>\d+µM))?: optionally matches _ followed by a concentration
In [7]: from itpseq import DataSet
In [8]: data = DataSet('.', file_pattern=r'(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?')
In [9]: data
Out[9]:
DataSet(data_path='.',
file_pattern='(?P<sample>[^_]+)(?P<replicate>\\d+)(_(?P<concentration>\\d+µM))?',
samples=[Sample(drugA.10µM:[1, 2, 3], ref: noa),
Sample(drugA.20µM:[1, 2, 3], ref: noa),
Sample(drugA.30µM:[1, 2, 3], ref: noa),
Sample(drugB.10µM:[1, 2, 3], ref: noa),
Sample(drugB.20µM:[1, 2, 3], ref: noa),
Sample(drugB.30µM:[1, 2, 3], ref: noa),
Sample(noa:[1, 2, 3])],
)
In [10]: data.samples
Out[10]:
{'drugA.10µM': Sample(drugA.10µM:[1, 2, 3], ref: noa),
'drugA.20µM': Sample(drugA.20µM:[1, 2, 3], ref: noa),
'drugA.30µM': Sample(drugA.30µM:[1, 2, 3], ref: noa),
'drugB.10µM': Sample(drugB.10µM:[1, 2, 3], ref: noa),
'drugB.20µM': Sample(drugB.20µM:[1, 2, 3], ref: noa),
'drugB.30µM': Sample(drugB.30µM:[1, 2, 3], ref: noa),
'noa': Sample(noa:[1, 2, 3])}
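Before passing a custom pattern to DataSet, it can help to test it on a few file prefixes with the standard re module. The sketch below does this for the pattern used above; parse_custom is a hypothetical helper, and applying the pattern with re.match is an assumption about how the library uses it.

```python
import re

# Custom pattern from the example above, written as a raw string
CUSTOM_PATTERN = r"(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?"


def parse_custom(prefix):
    """Return the named groups of CUSTOM_PATTERN for a prefix, or None if no match."""
    match = re.match(CUSTOM_PATTERN, prefix)
    return match.groupdict() if match else None


print(parse_custom("drugA1_10µM"))
# {'sample': 'drugA', 'replicate': '1', 'concentration': '10µM'}
print(parse_custom("noa3"))
# {'sample': 'noa', 'replicate': '3', 'concentration': None}
print(parse_custom("drugA12"))
# {'sample': 'drugA1', 'replicate': '2', 'concentration': None}
```

The last call illustrates a caveat of this pattern with Python's re: because [^_]+ is greedy and also accepts digits, a two-digit replicate number is split so that only its last digit is captured as the replicate. Excluding digits from the sample name, as the default pattern does with [^_\d]+, avoids this when sample names contain no digits.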
It is also possible to define which key values identify the reference
samples. For instance, using ref_labels={'sample': 'drugA'} defines
drugA as the reference for the samples that share the same values for the other keys.
In [11]: data = DataSet('.',
....: file_pattern=r'(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?',
....: ref_labels={'sample': 'drugA'},
....: )
....:
Multiple references for {}: [drugA.30µM, drugA.20µM, drugA.10µM]
In [12]: data
Out[12]:
DataSet(data_path='.',
file_pattern='(?P<sample>[^_]+)(?P<replicate>\\d+)(_(?P<concentration>\\d+µM))?',
samples=[Sample(drugA.10µM:[1, 2, 3]),
Sample(drugA.20µM:[1, 2, 3]),
Sample(drugA.30µM:[1, 2, 3]),
Sample(drugB.10µM:[1, 2, 3], ref: drugA.10µM),
Sample(drugB.20µM:[1, 2, 3], ref: drugA.20µM),
Sample(drugB.30µM:[1, 2, 3], ref: drugA.30µM),
Sample(noa:[1, 2, 3])],
)
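The reference assignments shown in the two outputs above are consistent with a simple matching rule: a sample takes as reference the single reference sample that agrees with it on every other key, a key without a value never conflicts, and no reference is assigned when several candidates remain. The sketch below implements that rule in plain Python. It is a guess inferred from the printed outputs, not the library's actual algorithm, and all function names are hypothetical.

```python
def is_reference(fields, ref_labels):
    """True if a sample's keys match every entry of ref_labels."""
    return all(fields.get(key) == value for key, value in ref_labels.items())


def compatible(fields, ref_fields, ref_labels):
    """Compare the keys not listed in ref_labels; a missing (None) value never conflicts."""
    for key, value in fields.items():
        if key in ref_labels or value is None:
            continue
        ref_value = ref_fields.get(key)
        if ref_value is not None and ref_value != value:
            return False
    return True


def assign_references(samples, ref_labels):
    """Map each sample name to its reference name, or None if there is none or several."""
    refs = {name: f for name, f in samples.items() if is_reference(f, ref_labels)}
    result = {}
    for name, fields in samples.items():
        if name in refs:
            result[name] = None  # a reference sample gets no reference itself
            continue
        candidates = [n for n, f in refs.items() if compatible(fields, f, ref_labels)]
        result[name] = candidates[0] if len(candidates) == 1 else None
    return result


# Samples of the example dataset, described by their parsed keys
samples = {
    f"{drug}.{conc}": {"sample": drug, "concentration": conc}
    for drug in ("drugA", "drugB")
    for conc in ("10µM", "20µM", "30µM")
}
samples["noa"] = {"sample": "noa", "concentration": None}

print(assign_references(samples, {"sample": "drugA"}))
print(assign_references(samples, {"sample": "noa"}))
```

Under this rule, noa has no concentration, so all three drugA samples qualify as its reference and none is assigned, which matches the "Multiple references" message printed by the DataSet call above.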
Manual loading from a directory
It is also possible to create Replicate, Sample, and
DataSet objects manually.
In [13]: from itpseq import DataSet, Sample, Replicate
In [14]: R1 = Replicate(replicate='1', file_prefix='nnn15_tcx1') # relative to current directory
In [15]: R2 = Replicate(replicate='2', file_prefix='nnn15_tcx2')
In [16]: R3 = Replicate(replicate='3', file_prefix='nnn15_tcx3')
In [17]: N1 = Replicate(replicate='1', file_prefix='nnn15_noa1')
In [18]: N2 = Replicate(replicate='2', file_prefix='nnn15_noa2')
In [19]: N3 = Replicate(replicate='3', file_prefix='nnn15_noa3')
In [20]: S = Sample(replicates=[R1, R2, R3],
....: name='tcx',
....: reference=Sample(replicates=[N1, N2, N3], name='noa'),
....: )
....:
In [21]: S
Out[21]: Sample(tcx:[1, 2, 3], ref: noa)
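Alternatively, a DataSet can be created from a dictionary mapping sample names to lists of replicates, with ref_mapping assigning the reference of each sample: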
In [22]: data = DataSet({'tcx': [{'file_prefix': 'nnn15_tcx1'},
....: {'file_prefix': 'nnn15_tcx2'},
....: {'file_prefix': 'nnn15_tcx3'}
....: ],
....: 'noa': [{'file_prefix': 'nnn15_noa1'},
....: {'file_prefix': 'nnn15_noa2', 'replicate': 'custom_name'},
....: {'file_prefix': 'nnn15_noa3'}
....: ]},
....: ref_mapping={'tcx': 'noa'})
....:
Creating temporary cache directory: "/tmp/tmp_nmngpft"
In [23]: data
Out[23]:
DataSet(samples=[Sample(tcx:[rep1, rep2, rep3], ref: noa),
Sample(noa:[rep1, custom_name, rep3])],
)
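When there are many replicates, the dictionary passed to DataSet can be generated rather than typed by hand. The following sketch builds a dictionary of the same shape from a list of file prefixes; build_sample_dict and its default pattern are hypothetical and only illustrate the structure used in the example above.

```python
import re
from collections import defaultdict


def build_sample_dict(prefixes, pattern=r"[^_]+_(?P<sample>[^_\d]+)\d+"):
    """Group file prefixes by sample name into {'sample': [{'file_prefix': ...}, ...]}.

    Hypothetical helper: the pattern only needs a named group called 'sample'.
    """
    samples = defaultdict(list)
    for prefix in sorted(prefixes):
        match = re.match(pattern, prefix)
        if match:
            samples[match["sample"]].append({"file_prefix": prefix})
    return dict(samples)


sample_dict = build_sample_dict(["nnn15_tcx2", "nnn15_noa1", "nnn15_tcx1"])
print(sample_dict)
# {'noa': [{'file_prefix': 'nnn15_noa1'}],
#  'tcx': [{'file_prefix': 'nnn15_tcx1'}, {'file_prefix': 'nnn15_tcx2'}]}
```

The resulting dictionary could then be passed as the first argument of DataSet together with ref_mapping, as in the session above.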