Tutorial 5: Parallelization

[1]:
import homelette as hm

import time

Introduction

Welcome to the fifth tutorial on homelette. This tutorial is about parallelization in homelette. When modelling hundreds or thousands of models, some steps can be sped up significantly by distributing the workload over multiple processes running in parallel (given appropriate hardware).

In homelette, both model generation and model evaluation can be parallelized.

Alignment and Task setup

For this tutorial, we are using the same alignment as in Tutorial 1. As in the previous tutorials, the alignment is imported and annotated, and a Task object is set up.

[2]:
# read in the alignment
aln = hm.Alignment('data/single/aln_1.fasta_aln')

# annotate the alignment
aln.get_sequence('ARAF').annotate(
    seq_type = 'sequence')
aln.get_sequence('3NY5').annotate(
    seq_type = 'structure',
    pdb_code = '3NY5',
    begin_res = '1',
    begin_chain = 'A',
    end_res = '81',
    end_chain = 'A')

# initialize task object
t = hm.Task(
    task_name = 'Tutorial5',
    target = 'ARAF',
    alignment = aln,
    overwrite = True)

Parallel model generation

For parallel model generation, homelette relies on the parallelization methods implemented in the underlying packages, if they are available. Model generation with modeller can be parallelized, and homelette exposes this through a simple handler [1,2].

All pre-implemented, modeller-based routines accept the argument n_threads, which controls parallelization. The default, n_threads = 1, does not activate parallelization; any value > 1 distributes the workload over the requested number of threads using the modeller.parallel submodule.

[3]:
# use only 1 thread to generate 20 models
start = time.time()
t.execute_routine(
    tag = '1_thread',
    routine = hm.routines.Routine_automodel_default,
    templates = ['3NY5'],
    template_location = './data/single/',
    n_models = 20)
print('Elapsed time: ' + str(time.time() - start))
Elapsed time: 32.36234951019287
[4]:
# use 4 threads to generate 20 models faster
start = time.time()
t.execute_routine(
    tag = '4_threads',
    routine = hm.routines.Routine_automodel_default,
    templates = ['3NY5'],
    template_location = './data/single/',
    n_models = 20,
    n_threads = 4)
print('Elapsed time: ' + str(time.time() - start))
Elapsed time: 9.40773320198059

Using multiple threads can significantly speed up model generation, especially if a large number of models are generated.

Note

Please be aware that the modeller.parallel submodule uses the Python module pickle, which requires that objects to be pickled are defined in a separate file. In practical terms, if you want to use parallelization in modeller with a custom object (e.g. a custom routine, see Tutorial 4), that object must be imported from a separate file. We therefore recommend saving custom routines and evaluations in a separate file and importing them from there.

The following code blocks show how custom building blocks can be placed in an external file (data/extension.py) and then imported for modelling and analysis.

[5]:
# import from custom file
from data.extension import Custom_Routine, Custom_Evaluation

?Custom_Routine
Init signature: Custom_Routine()
Docstring:      Custom routine waiting to be implemented.
File:           /home/data/extension.py
Type:           type
Subclasses:

[6]:
!cat data/extension.py
'''
Examples of custom objects for homelette in an external file.
'''


class Custom_Routine():
    '''
    Custom routine waiting to be implemented.
    '''
    def __init__(self):
        print('TODO: implement this')


class Custom_Evaluation():
    '''
    Custom evaluation waiting to be implemented.
    '''
    def __init__(self):
        print('TODO: implement this')

Alternatively, you could use the /homelette/extension/ folder in which extensions are stored. See the section on extensions in our documentation for more details.
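As a rough sketch of this alternative (assuming the extension folder is importable as the homelette.extension subpackage; my_extension, My_Routine and My_Evaluation are placeholder names, not part of homelette):

# hypothetical: my_extension.py has been placed in the homelette/extension/
# folder and defines the custom building blocks (placeholder names)
from homelette.extension.my_extension import My_Routine, My_Evaluation

Once imported, these custom objects can be used with modeller parallelization in the same way as the objects imported from data/extension.py above.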

Parallel model evaluation

homelette can also use parallelization to speed up model evaluation. Internally, this is achieved with concurrent.futures.ThreadPoolExecutor.

To use parallelization when performing evaluations, use the n_threads argument of Task.evaluate_models.

[7]:
# use 1 thread for model evaluation
start = time.time()
t.evaluate_models(hm.evaluation.Evaluation_mol_probity, n_threads=1)
print('Elapsed time: ' + str(time.time() - start))
Elapsed time: 303.9152281284332
[8]:
# use 4 threads for model evaluation
start = time.time()
t.evaluate_models(hm.evaluation.Evaluation_mol_probity, n_threads=4)
print('Elapsed time: ' + str(time.time() - start))
Elapsed time: 86.51389622688293

For some evaluation schemes, using parallelization can lead to a significant speedup.

Note

Please be advised that for some (very fast) evaluation methods, the overhead of setting up parallel execution may outweigh the speedup gained from parallelization. Test your use case on your system in a small setting and use at your own discretion.

[9]:
# use 1 thread for model evaluation
start = time.time()
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads=1)
print('Elapsed time: ' + str(time.time() - start))
Elapsed time: 8.079042911529541
[10]:
# use 4 threads for model evaluation
start = time.time()
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads=4)
print('Elapsed time: ' + str(time.time() - start))
Elapsed time: 13.420895338058472

Note

When creating and using custom evaluation metrics, please make sure to avoid race conditions. Task.evaluate_models implements some protection against race conditions, but it is not bulletproof. If you need to create temporary files, give them model-specific names (e.g. by including the model name in the file name), as illustrated below. Defining custom evaluations in a separate file is not necessary, since parallelization of evaluation methods does not rely on pickle.
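The following sketch illustrates this naming pattern in isolation. The helper is purely illustrative and not part of homelette; see Tutorial 4 for the interface that custom evaluations are expected to implement.

import os

def evaluate_with_temp_file(model_name, temp_dir='.'):
    '''
    Illustrative helper (not part of homelette): temporary files carry the
    model name, so parallel evaluations never write to the same file.
    '''
    # model-specific file name avoids clashes between parallel workers
    temp_file = os.path.join(temp_dir, f'temp_{model_name}.out')
    with open(temp_file, 'w') as file_handler:
        # in a real evaluation, the output of an external tool would go here
        file_handler.write(f'placeholder output for {model_name}\n')
    # read the results back and clean up
    with open(temp_file, 'r') as file_handler:
        result = file_handler.read()
    os.remove(temp_file)
    return result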

Note

If a custom evaluation metric is very memory-demanding, running it in parallel can easily overwhelm the system. Again, we encourage you to test your use case on your system in a small setting.

Further reading

Congratulations on completing Tutorial 5 about parallelization in homelette. Please note that there are other tutorials that will teach you more about how to use homelette:

  • Tutorial 1: Learn about the basics of homelette.

  • Tutorial 2: Learn more about already implemented routines for homology modelling.

  • Tutorial 3: Learn about the evaluation metrics available with homelette.

  • Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.

  • Tutorial 6: Learn about modelling protein complexes.

  • Tutorial 7: Learn about assembling custom pipelines.

  • Tutorial 8: Learn about automated template identification, alignment generation and template processing.

References

[1] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626

[2] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3

Session Info

[11]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
data                NA
homelette           1.3
session_info        1.0.0
-----
PIL                         7.0.0
altmod                      NA
anyio                       NA
attr                        19.3.0
babel                       2.9.1
backcall                    0.2.0
certifi                     2021.10.08
chardet                     3.0.4
charset_normalizer          2.0.8
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.7.3
debugpy                     1.5.1
decorator                   4.4.2
entrypoints                 0.3
idna                        3.3
importlib_resources         NA
ipykernel                   6.5.1
ipython_genutils            0.2.0
jedi                        0.18.1
jinja2                      3.0.3
json5                       NA
jsonschema                  4.2.1
jupyter_server              1.12.1
jupyterlab_server           2.8.2
kiwisolver                  1.0.1
markupsafe                  2.0.1
matplotlib                  3.1.2
modeller                    10.1
mpl_toolkits                NA
nbclassic                   NA
nbformat                    5.1.3
numpy                       1.17.4
ost                         2.2.0
packaging                   20.3
pandas                      0.25.3
parso                       0.8.2
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
prometheus_client           NA
promod3                     3.2.0
prompt_toolkit              3.0.23
ptyprocess                  0.7.0
pvectorc                    NA
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.6.0
pydevd_concurrency_analyser NA
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.10.0
pyparsing                   2.4.6
pyrsistent                  NA
pytz                        2019.3
qmean                       NA
requests                    2.26.0
send2trash                  NA
sitecustomize               NA
six                         1.14.0
sniffio                     1.2.0
storemagic                  NA
swig_runtime_data4          NA
terminado                   0.12.1
tornado                     6.1
traitlets                   5.1.1
urllib3                     1.26.7
wcwidth                     NA
websocket                   1.2.1
zipp                        NA
zmq                         22.3.0
-----
IPython             7.30.0
jupyter_client      7.1.0
jupyter_core        4.9.1
jupyterlab          3.2.4
notebook            6.4.6
-----
Python 3.8.10 (default, Jun  2 2021, 10:49:15) [GCC 9.4.0]
Linux-4.15.0-162-generic-x86_64-with-glibc2.29
-----
Session information updated at 2021-11-29 19:07