Copyright 2014 Sebastian Raschka
smilite is a Python 3 module to download and analyze SMILE strings (Simplified Molecular-Input Line-entry System) of chemical compounds from ZINC (a free database of commercially-available compounds for virtual screening, http://zinc.docking.org)
You can use the following command to install PyPrind:
pip install smilite
or
easy_install smilite
Alternatively, you download the package manually from the Python Package Index https://pypi.python.org/pypi/smilite, unzip it, navigate into the package, and use the command:
python3 setup.py install
After you installed the smilite module, you can import it in Python via import smilite
.
The current functions include:
def get_zinc_smile(zinc_id): Gets the corresponding SMILE string for a ZINC ID query from the ZINC online database. Requires an internet connection. Keyword arguments: zinc_id (str): A valid ZINC ID, e.g. 'ZINC00029323' Returns the SMILE string for the corresponding ZINC ID. E.g., 'COc1cccc(c1)NC(=O)c2cccnc2'
def generate_zincid_smile_csv(zincid_list, out_file): Generates a CSV file of ZINC_ID,SMILE_string entries by querying the ZINC online database. Keyword arguments: zincid_list (str): Path to a UTF-8 or ASCII formatted file that contains 1 ZINC_ID per row. E.g., ZINC0000123456 ZINC0000234567 [...] out_file (str): Path to a new output CSV file that will be written. print_prgress_bar (bool): Prints a progress bar to the screen if True.
def check_duplicate_smiles(zincid_list, out_file, compare_simplified_smiles=False, print_progress_bar=False): Scans a ZINC_ID,SMILE_string CSV file for duplicate SMILE strings. Keyword arguments: zincid_list (str): Path to a UTF-8 or ASCII formatted file that contains 1 ZINC_ID per row. E.g., ZINC12345678,Cc1ccc(cc1C)OCCOc2c(cc(cc2I)/C=N/n3cnnc3)OC ZINC01234567,C[C@H]1CCCC[NH+]1CC#CC(c2ccccc2)(c3ccccc3)O [...] out_file (str): Path to a new output CSV file that will be written. compare_simplified_smiles (bool): If true, SMILE strings will be simplified for the comparison.
def simplify_smile(smile_str): Simplifies a SMILE string by removing hydrogen atoms (H), chiral specifications ('@'), charges (+ / -), '#'-characters, and square brackets ('[', ']'). Keyword Arguments: smile_str (str): A smile string, e.g., CC@HNCCS(=O)(=O)[O-]) Returns a simplified SMILE string, e.g., CC(CCC(=O)NCCS(=O)(=O)O)
If you downloaded the smilite package from https://pypi.python.org/pypi/smilite or https://github.com/rasbt/smilite, you can use the command line scripts I provide in the scripts/
dir.
Generates a ZINC_ID,SMILE_STR csv file from a input file of ZINC IDs. The input file should consist of 1 columns with 1 ZINC ID per row.
Usage:
[shell]>> python3 gen_zincid_smile_csv.py in.csv out.csv
Example:
[shell]>> python3 gen_zincid_smile_csv.py ../examples/zinc_ids.csv ../examples/zid_smiles.csv
Screen Output:
Downloading SMILES 0% 100% [########## ] | ETA[sec]: 106.525
Input example file format:
zinc_ids.csv
Output example file format:
zid_smiles.csv
Compares SMILE strings in a 2 column CSV file (ZINC_ID,SMILE_string) to identify duplicates. Generates a new CSV file with ZINC IDs of identified duplicates listed in a 3rd-nth column(s).
Usage:
[shell]>> python3 comp_smile_strings.py in.csv out.csv [simplify]
Example 1:
[shell]>> python3 gen_zincid_smile_csv.py ../examples/zinc_ids.csv ../examples/zid_smiles.csv
Input example file format:
zid_smiles.csv
Output example file format 1:
comp_smiles.csv
Where
- 1st column: ZINC ID
- 2nd column: SMILE string
- 3rd column: number of duplicates
- 4th-nth column: ZINC IDs of duplicates
Example 2:
[shell]>> python3 comp_smile_strings.py ../examples/zid_smiles.csv ../examples/comp_simple_smiles.csv simplify
Output example file format 2:
comp_simple_smiles.csv
If you have any questions or comments about PyPrind, please feel free to contact me via
eMail: se.raschka@gmail.com
or Twitter: @rasbt
VERSION 1.1.1
VERSION 1.1.0
generate_zincid_smile_csv()
function