Loading iTP-Seq data#

Automatic loading from a directory#

The easiest way to create a DataSet is to name the files consistently (see Naming conventions).

The parsing step creates four files for each input fastq file:

  • the inverse-toeprint sequences as nucleotides (<file_prefix>.nuc.itp.txt)

  • the inverse-toeprint sequences as amino-acids (<file_prefix>.aa.itp.txt)

  • metadata as JSON (<file_prefix>.itp.json)

  • a log file (<file_prefix>.itp.log)

All the files share the same prefix and the JSON files are used to identify the replicates.
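As a quick sanity check, the expected file names can be derived from a prefix. The sketch below is plain Python and not part of itpseq: `expected_outputs` and `missing_outputs` are hypothetical helpers, and the suffixes are assumed from the directory listing shown on this page.

```python
from pathlib import Path

# Suffixes written by the parsing step (assumed from the directory listing
# on this page; not read from itpseq itself).
SUFFIXES = (".nuc.itp.txt", ".aa.itp.txt", ".itp.json", ".itp.log")


def expected_outputs(file_prefix, directory="."):
    """Return the paths the parsing step should produce for a prefix."""
    return [Path(directory) / f"{file_prefix}{suffix}" for suffix in SUFFIXES]


def missing_outputs(file_prefix, directory="."):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(file_prefix, directory) if not p.exists()]


for path in expected_outputs("nnn15_noa1"):
    print(path.name)
```

Calling `missing_outputs("nnn15_noa1")` in the data directory should return an empty list when the parsing step completed for that prefix.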

Default behavior#

By default, DataSet expects a prefix in the XXX_YYYDD format. XXX (alphanumeric) is assigned to the “lib_type” key, YYY (letters) to the “sample” key, and DD (digits) to the “replicate” key. For example, nnn15_noa1 is parsed as lib_type nnn15, sample noa, and replicate 1.

Therefore a directory containing 3 “noa” and 3 “tcx” replicates would look like:

nnn15_noa1.aa.itp.txt       nnn15_noa3.aa.itp.txt       nnn15_tcx2.aa.itp.txt
nnn15_noa1.assembled.fastq  nnn15_noa3.assembled.fastq  nnn15_tcx2.assembled.fastq
nnn15_noa1.itp.json         nnn15_noa3.itp.json         nnn15_tcx2.itp.json
nnn15_noa1.itp.log          nnn15_noa3.itp.log          nnn15_tcx2.itp.log
nnn15_noa1.nuc.itp.txt      nnn15_noa3.nuc.itp.txt      nnn15_tcx2.nuc.itp.txt
nnn15_noa2.aa.itp.txt       nnn15_tcx1.aa.itp.txt       nnn15_tcx3.aa.itp.txt
nnn15_noa2.assembled.fastq  nnn15_tcx1.assembled.fastq  nnn15_tcx3.assembled.fastq
nnn15_noa2.itp.json         nnn15_tcx1.itp.json         nnn15_tcx3.itp.json
nnn15_noa2.itp.log          nnn15_tcx1.itp.log          nnn15_tcx3.itp.log
nnn15_noa2.nuc.itp.txt      nnn15_tcx1.nuc.itp.txt      nnn15_tcx3.nuc.itp.txt

Loading this directory will automatically assign the 6 Replicates to 2 Samples (“noa” and “tcx”), with 3 Replicates each. In addition, if a sample is named “noa”, it is automatically assigned as the reference for the other samples that share the same keys (other than “sample”):

In [1]: from itpseq import DataSet

In [2]: data = DataSet('.') # current directory containing the data files

In [3]: data
Out[3]: 
DataSet(data_path='.',
        file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)',
        samples=[Sample(nnn15.noa:[1, 2, 3]),
                 Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)],
        )

In [4]: data.samples
Out[4]: 
{'nnn15.noa': Sample(nnn15.noa:[1, 2, 3]),
 'nnn15.tcx': Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)}

In [5]: data.replicates
Out[5]: 
{'nnn15.noa.1': Replicate(nnn15.noa.1),
 'nnn15.noa.2': Replicate(nnn15.noa.2),
 'nnn15.noa.3': Replicate(nnn15.noa.3),
 'nnn15.tcx.1': Replicate(nnn15.tcx.1),
 'nnn15.tcx.2': Replicate(nnn15.tcx.2),
 'nnn15.tcx.3': Replicate(nnn15.tcx.3)}

This detection relies on the default regular expression file_pattern: (?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+).
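To see what this pattern captures, it can be tried directly with Python's `re` module. This is a minimal sketch: `parse_prefix` is a hypothetical helper, and using `re.match` is an assumption, since how itpseq applies the pattern internally is not described here.

```python
import re

# Default pattern from this page, written as a raw string.
DEFAULT_PATTERN = r"(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)"


def parse_prefix(prefix, pattern=DEFAULT_PATTERN):
    """Return the named groups captured from a file prefix, or None."""
    m = re.match(pattern, prefix)
    return m.groupdict() if m else None


print(parse_prefix("nnn15_noa1"))
# → {'lib_type': 'nnn15', 'sample': 'noa', 'replicate': '1'}
print(parse_prefix("nnn15_tcx3"))
# → {'lib_type': 'nnn15', 'sample': 'tcx', 'replicate': '3'}
print(parse_prefix("noa1"))  # no "<lib_type>_" part, so the pattern fails
# → None
```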

The lib_type and sample keys are automatically used to group the Replicates into a Sample and to create the Sample name.

It is possible to specify the keys to use to group the Replicates:

In [6]: DataSet('.', keys=['sample'])  # ignoring "lib_type"
Out[6]: 
DataSet(data_path='.',
        file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)',
        samples=[Sample(noa:[1, 2, 3]),
                 Sample(tcx:[1, 2, 3], ref: noa)],
        )
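The effect of the keys on grouping can be illustrated without itpseq. The sketch below only mimics the behavior shown in the outputs above: `group_replicates` is a hypothetical function, and joining the key values with a dot is an assumption based on the displayed sample names.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)")


def group_replicates(prefixes, keys=("lib_type", "sample")):
    """Group replicate numbers by the values of the selected keys."""
    groups = defaultdict(list)
    for prefix in sorted(prefixes):
        m = PATTERN.match(prefix)
        if m is None:
            continue  # skip names that do not follow the convention
        fields = m.groupdict()
        name = ".".join(fields[k] for k in keys)
        groups[name].append(int(fields["replicate"]))
    return dict(groups)


prefixes = [f"nnn15_{s}{i}" for s in ("noa", "tcx") for i in (1, 2, 3)]

print(group_replicates(prefixes))
# → {'nnn15.noa': [1, 2, 3], 'nnn15.tcx': [1, 2, 3]}
print(group_replicates(prefixes, keys=("sample",)))
# → {'noa': [1, 2, 3], 'tcx': [1, 2, 3]}
```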

Custom prefix and keys#

Consider a dataset with two drugs (drugA and drugB), one control (noa), and three drug concentrations (10, 20, and 30 µM):

drugA1_10µM.itp.json  drugA3_20µM.itp.json  drugB2_30µM.itp.json
drugA1_20µM.itp.json  drugA3_30µM.itp.json  drugB3_10µM.itp.json
drugA1_30µM.itp.json  drugB1_10µM.itp.json  drugB3_20µM.itp.json
drugA2_10µM.itp.json  drugB1_20µM.itp.json  drugB3_30µM.itp.json
drugA2_20µM.itp.json  drugB1_30µM.itp.json  noa1.itp.json
drugA2_30µM.itp.json  drugB2_10µM.itp.json  noa2.itp.json
drugA3_10µM.itp.json  drugB2_20µM.itp.json  noa3.itp.json

The different parts of the filename can be defined through file_pattern:

(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?

  • (?P<sample>[^_]+): match the sample name (anything but _)

  • (?P<replicate>\d+): match digits defining the replicate number

  • (_(?P<concentration>\d+µM))?: optionally match _ followed by a concentration
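The pattern can be checked against a few prefixes with Python's `re` module before passing it to DataSet. This is a sketch only: `parse` is a hypothetical helper, `re.match` is an assumption about how the pattern is applied, and `drugA12_10µM` is a made-up prefix used to show a pitfall.

```python
import re

CUSTOM_PATTERN = r"(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?"


def parse(prefix, pattern=CUSTOM_PATTERN):
    """Return the named groups captured from a prefix, or None."""
    m = re.match(pattern, prefix)
    return m.groupdict() if m else None


print(parse("drugA1_10µM"))
# → {'sample': 'drugA', 'replicate': '1', 'concentration': '10µM'}
print(parse("noa2"))
# → {'sample': 'noa', 'replicate': '2', 'concentration': None}

# Pitfall: [^_]+ is greedy and also matches digits, so a two-digit
# replicate number is split between "sample" and "replicate".
print(parse("drugA12_10µM"))
# → {'sample': 'drugA1', 'replicate': '2', 'concentration': '10µM'}
```

If sample names never contain digits, using `[^_\d]+` for the sample group (as in the default pattern) avoids this split.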

In [7]: from itpseq import DataSet

In [8]: data = DataSet('.', file_pattern=r'(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?')

In [9]: data
Out[9]: 
DataSet(data_path='.',
        file_pattern='(?P<sample>[^_]+)(?P<replicate>\\d+)(_(?P<concentration>\\d+µM))?',
        samples=[Sample(drugA.10µM:[1, 2, 3], ref: noa),
                 Sample(drugA.20µM:[1, 2, 3], ref: noa),
                 Sample(drugA.30µM:[1, 2, 3], ref: noa),
                 Sample(drugB.10µM:[1, 2, 3], ref: noa),
                 Sample(drugB.20µM:[1, 2, 3], ref: noa),
                 Sample(drugB.30µM:[1, 2, 3], ref: noa),
                 Sample(noa:[1, 2, 3])],
        )

In [10]: data.samples
Out[10]: 
{'drugA.10µM': Sample(drugA.10µM:[1, 2, 3], ref: noa),
 'drugA.20µM': Sample(drugA.20µM:[1, 2, 3], ref: noa),
 'drugA.30µM': Sample(drugA.30µM:[1, 2, 3], ref: noa),
 'drugB.10µM': Sample(drugB.10µM:[1, 2, 3], ref: noa),
 'drugB.20µM': Sample(drugB.20µM:[1, 2, 3], ref: noa),
 'drugB.30µM': Sample(drugB.30µM:[1, 2, 3], ref: noa),
 'noa': Sample(noa:[1, 2, 3])}

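It is also possible to define the keys that will be used to assign the reference. For instance, ref_labels={'sample': 'drugA'} defines drugA as the reference for the samples that share the same values for the other keys. In this example noa has no concentration, so all three drugA samples are candidate references for it; this is reported with a “Multiple references” message and noa is left without a reference.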

In [11]: data = DataSet('.',
   ....:                file_pattern=r'(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?',
   ....:                ref_labels={'sample': 'drugA'},
   ....:                )
   ....: 
Multiple references for {}: [drugA.30µM, drugA.20µM, drugA.10µM]

In [12]: data
Out[12]: 
DataSet(data_path='.',
        file_pattern='(?P<sample>[^_]+)(?P<replicate>\\d+)(_(?P<concentration>\\d+µM))?',
        samples=[Sample(drugA.10µM:[1, 2, 3]),
                 Sample(drugA.20µM:[1, 2, 3]),
                 Sample(drugA.30µM:[1, 2, 3]),
                 Sample(drugB.10µM:[1, 2, 3], ref: drugA.10µM),
                 Sample(drugB.20µM:[1, 2, 3], ref: drugA.20µM),
                 Sample(drugB.30µM:[1, 2, 3], ref: drugA.30µM),
                 Sample(noa:[1, 2, 3])],
        )
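The matching rule can be sketched in plain Python. This is not itpseq's implementation: `assign_references` and the rule it uses (a reference is a candidate when every key defined by both samples, other than the ref_labels keys, has the same value) are assumptions inferred from the two outputs shown on this page.

```python
def sample_name(fields):
    """Join the non-empty key values with dots, e.g. 'drugA.10µM'."""
    return ".".join(v for v in fields.values() if v is not None)


def compatible(sample, ref, ignored):
    """True when all keys defined by both samples (except ignored) agree."""
    for key, value in sample.items():
        if key in ignored or value is None:
            continue
        if ref.get(key) is not None and ref[key] != value:
            return False
    return True


def assign_references(samples, ref_labels):
    """Map each sample name to its reference name, or None."""
    def is_ref(s):
        return all(s.get(k) == v for k, v in ref_labels.items())

    references = [s for s in samples if is_ref(s)]
    mapping = {}
    for s in samples:
        if is_ref(s):
            mapping[sample_name(s)] = None  # a reference has no reference
            continue
        candidates = [r for r in references if compatible(s, r, ref_labels)]
        if len(candidates) > 1:
            names = [sample_name(c) for c in candidates]
            print(f"Multiple references for {sample_name(s)}: {names}")
        # Assign a reference only when it is unambiguous.
        unique = len(candidates) == 1
        mapping[sample_name(s)] = sample_name(candidates[0]) if unique else None
    return mapping


samples = [{"sample": drug, "concentration": conc}
           for drug in ("drugA", "drugB")
           for conc in ("10µM", "20µM", "30µM")]
samples.append({"sample": "noa", "concentration": None})

refs = assign_references(samples, ref_labels={"sample": "drugA"})
print(refs["drugB.20µM"])  # → drugA.20µM
print(refs["noa"])         # → None (ambiguous, three candidates)
```

With ref_labels={'sample': 'noa'}, the same rule assigns noa to every drug sample, which matches the default behavior shown earlier.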

Manual loading from a directory#

It is also possible to create Replicate, Sample, and DataSet objects manually.

In [13]: from itpseq import DataSet, Sample, Replicate

In [14]: R1 = Replicate(replicate='1', file_prefix='nnn15_tcx1') # relative to current directory

In [15]: R2 = Replicate(replicate='2', file_prefix='nnn15_tcx2')

In [16]: R3 = Replicate(replicate='3', file_prefix='nnn15_tcx3')

In [17]: N1 = Replicate(replicate='1', file_prefix='nnn15_noa1')

In [18]: N2 = Replicate(replicate='2', file_prefix='nnn15_noa2')

In [19]: N3 = Replicate(replicate='3', file_prefix='nnn15_noa3')

In [20]: S = Sample(replicates=[R1, R2, R3],
   ....:            name='tcx',
   ....:            reference=Sample(replicates=[N1, N2, N3], name='noa'),
   ....:           )
   ....: 

In [21]: S
Out[21]: Sample(tcx:[1, 2, 3], ref: noa)

Alternatively, a DataSet can be created from a dictionary that maps each sample name to a list of replicates:

In [22]: data = DataSet({'tcx': [{'file_prefix': 'nnn15_tcx1'},
   ....:                         {'file_prefix': 'nnn15_tcx2'},
   ....:                         {'file_prefix': 'nnn15_tcx3'}
   ....:                        ],
   ....:                 'noa': [{'file_prefix': 'nnn15_noa1'},
   ....:                         {'file_prefix': 'nnn15_noa2', 'replicate': 'custom_name'},
   ....:                         {'file_prefix': 'nnn15_noa3'}
   ....:                        ]},
   ....:                ref_mapping={'tcx': 'noa'})
   ....: 
Creating temporary cache directory: "/tmp/tmp_nmngpft"

In [23]: data
Out[23]: 
DataSet(samples=[Sample(tcx:[rep1, rep2, rep3], ref: noa),
                 Sample(noa:[rep1, custom_name, rep3])],
        )