src package
Submodules
src.columns module
The columns that will be extracted from the PDB metadata. We find these to be the most relevant as they capture the essence of structures in the PDB database.
- Columns:
rcsb_id: The unique identifier for the structure in the PDB database.
exptl_method: The method used to determine the structure.
release_date: The date the structure was released.
polymer_entity_instance_count: The number of polymer entity instances (chains) in the structure.
polymer_entity_count_RNA: The number of RNA polymer entities in the structure.
resolution: The resolution of the structure.
selected_polymer_entity_types: The types of polymer entities in the structure.
pdbx_keywords: Keywords associated with the structure.
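A minimal sketch of how these column names could be held in code and used to trim a metadata record. The list name COLUMNS follows the variable mentioned elsewhere in this reference; the helper select_columns and the sample record are hypothetical and only serve as illustration.

```python
# Column names as listed above. The name COLUMNS is taken from this reference;
# the exact definition inside src.columns may differ.
COLUMNS = [
    "rcsb_id",
    "exptl_method",
    "release_date",
    "polymer_entity_instance_count",
    "polymer_entity_count_RNA",
    "resolution",
    "selected_polymer_entity_types",
    "pdbx_keywords",
]


def select_columns(record: dict) -> dict:
    """Keep only the documented columns; missing values become None (hypothetical helper)."""
    return {name: record.get(name) for name in COLUMNS}


# Made-up record for illustration only.
raw = {"rcsb_id": "XXXX", "resolution": 2.5, "unrelated_field": "ignored"}
row = select_columns(raw)
print(row)
```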
src.dataset_creator module
- class src.dataset_creator.DatasetCreator(dataset_name: str, rfam_cm_path: str, structure_determination_methodology: str | None = 'experimental', rcsb_entity_polymer_type: str | None = 'RNA', dstart: int | None = 0, dend: int | None = 10000, download_all: bool | None = False, exptl_method: List[str] | None = ['X-RAY DIFFRACTION'], resolution: float | None = 3.6, year_range: List[int] | int | None = 2024, polymer_entity_instance_count: int | None = None, polymer_entity_count_RNA: int | None = None, selected_polymer_entity_types: List[str] | None = None, pdbx_keywords: List[str] | None = ['RNA'], polymer_type: str | None = 'polyribonucleotide', sequence_length: int | None = 40, sequence_identity: float | None = 50.0, auto_download: bool | None = True, alignment_tool: str | None = 'clustal', e_value_cmscan: float | None = 0.0001, save: bool | None = False)
Bases:
object
Creates a dataset of RNA structures from the RCSB PDB database. The class combines the DatasetDownload, StructureLevelFilter and PolymerLevelFilter classes to download the data and apply filters at the structure and polymer level; every parameter can be set by the user. Before use, download the Rfam.cm file from https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz, unzip it, and run Infernal's cmpress on it. The desired DATA_PATH can be set as an environment variable. The first time the class is run it creates a folder for PDB files; alternatively, you can create a directory called 'pdb_files' one level below the current directory yourself.
- Parameters:
dataset_name (str) – Name of the dataset.
rfam_cm_path (str) – Path to the Rfam.cm file.
structure_determination_methodology (str) – Structure determination methodology.
rcsb_entity_polymer_type (str) – Polymer type.
dstart (int) – Start index for pagination.
dend (int) – End index for pagination.
download_all (bool) – Download all entries.
exptl_method (list) – Experimental method.
resolution (float) – Resolution.
year_range (int or list) – Year range.
polymer_entity_instance_count (int) – Polymer entity instance count.
polymer_entity_count_RNA (int) – Polymer entity count RNA.
selected_polymer_entity_types (list) – Selected polymer entity types.
pdbx_keywords (list) – PDBx keywords.
polymer_type (str) – Polymer type.
sequence_length (int) – Sequence length.
sequence_identity (float) – Sequence identity.
auto_download (bool) – Auto download PDB file.
alignment_tool (str) – Alignment tool.
e_value_cmscan (float) – E-value for CMScan.
save (bool) – Save filtered data, if set to False, it will not save data at each filter level.
- dataset_name
Name of the dataset.
- Type:
str
- structure_determination_methodology
Structure determination methodology.
- Type:
str
- rcsb_entity_polymer_type
Polymer type. One of ['Protein', 'DNA', 'RNA', 'NA-hybrid', 'Other']; a list cannot be passed.
- Type:
str
- dstart
Start index for pagination.
- Type:
int
- dend
End index for pagination.
- Type:
int
- download_all
Download all entries.
- Type:
bool
- exptl_method
Experimental method, for example ['X-RAY DIFFRACTION', 'SOLUTION NMR', 'ELECTRON MICROSCOPY', 'SOLID-STATE NMR']; a list can be passed. Further details can be found at www.wwpdb.org or www.rcsb.org.
- Type:
list
- resolution
Resolution.
- Type:
float
- year_range
Year range.
- Type:
int or list
- polymer_entity_instance_count
Polymer entity instance count.
- Type:
int
- polymer_entity_count_RNA
Polymer entity count RNA.
- Type:
int
- selected_polymer_entity_types
Selected polymer entity types.
- Type:
list
- pdbx_keywords
PDBx keywords, for example ['RNA', 'DNA/RNA', 'RIBOSOME', 'RIBOZYME']; a list can be passed. Further details can be found at https://www.wwpdb.org, at https://www.rcsb.org, or in the raw data.
- Type:
list
- polymer_type
Polymer type.
- Type:
str
- sequence_length
Sequence length.
- Type:
int
- sequence_identity
Sequence identity.
- Type:
float
- auto_download
Auto download PDB file.
- Type:
bool
- alignment_tool
Alignment tool.
- Type:
str
- rfam_cm_path
Path to the Rfam.cm file.
- Type:
str
- e_value_cmscan
E-value for CMScan.
- Type:
float
- save
Save filtered data.
- Type:
bool
- dataset_files
Path to the dataset folder.
- Type:
str
- final_fasta_path
Path to the final FASTA file.
- Type:
str
- cmscan_path
Path to the CMScan output file.
- Type:
str
- tblout
Path to the tblout file.
- Type:
str
- clean_tblout_path
Path to the clean tblout file.
- Type:
str
- final_fams_path
Path to the final families file.
- Type:
str
- final_pdb_list_path
Path to the final PDB list file.
- Type:
str
- final_chain_ids_path
Path to the final chain IDs file.
- Type:
str
- cmd
cmscan command.
- Type:
str
- data
DatasetDownload component.
- Type:
DatasetDownload
- structure_filter
StructureLevelFilter component.
- Type:
StructureLevelFilter
- polymer_level_filter
PolymerLevelFilter component.
- Type:
PolymerLevelFilter
- save_filtered_data()
Save PDB list and DataFrame for each level of filters.
- save_pdb_list()
Save list of PDBs to a file.
- save_dataframe()
Save DataFrame to a file.
- get_without_filter_df()
Get dataframe without applying any filters.
- get_structure_filtered_df()
Get dataframe after applying structure-level filters.
- get_polymer_filtered_df()
Get dataframe after applying polymer-level filters.
- get_final_df()
Get final dataframe after applying all filters.
- create_final_fasta_file()
Create final FASTA file.
- run_cmscan()
Run CMScan.
- get_final_families()
Get final families.
- get_final_pdb_list()
Get final PDB list.
NOTE: When you use different parameters, choose an explanatory dataset_name, because internally all files share the same names and would otherwise be hard to tell apart.
Example
import os
from pathlib import Path
dataset_name = 'test'
os.environ['DATA_PATH'] = str(Path(__file__).resolve().parents[1] / f'data/{dataset_name}')
DATA_PATH = os.environ.get('DATA_PATH')
dc = DatasetCreator(your_params_here)
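The Rfam preparation mentioned in the class description and the later cmscan step can be sketched as plain command lists. This is only an illustration: the file names are hypothetical, and the flags the package actually stores in its cmd attribute are not documented here, so the cmscan options below (-E for the E-value threshold, --tblout for the tabular output) are an assumption.

```python
import shlex


def rfam_prepare_commands(rfam_gz: str) -> list:
    """Commands to unpack Rfam.cm.gz and prepare it for cmscan with cmpress."""
    if not rfam_gz.endswith(".gz"):
        raise ValueError("expected a path ending in .gz")
    rfam_cm = rfam_gz[:-3]
    return [
        ["gunzip", "-k", rfam_gz],  # -k keeps the downloaded archive
        ["cmpress", rfam_cm],       # writes the binary index files next to Rfam.cm
    ]


def cmscan_command(rfam_cm: str, fasta: str, tblout: str, e_value: float = 1e-4) -> list:
    """Assumed cmscan invocation: E-value threshold plus tabular output file."""
    return ["cmscan", "-E", str(e_value), "--tblout", tblout, rfam_cm, fasta]


prep = rfam_prepare_commands("Rfam.cm.gz")
scan = cmscan_command("Rfam.cm", "final.fasta", "out.tblout")
for command in prep + [scan]:
    print(shlex.join(command))
```

The commands are only built and printed; passing each list to subprocess.run(command, check=True) would execute them, provided Infernal is installed.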
- apply_filters(df: DataFrame, df_polymer_filtered_list: list) DataFrame
Apply all filters.
- apply_polymer_filters(pdb_list: list) Tuple[DataFrame, list]
Apply polymer-level filters.
- apply_structure_filters(df: DataFrame) DataFrame
Apply structure-level filters.
- create_final_fasta_file(df) None
Create final FASTA file.
- run() None
Run CMScan.
- save_dataframe(df: DataFrame, filename: str) None
Save DataFrame to a file.
- save_pdb_list(df: DataFrame, filename: str) None
Save list of PDBs to a file.
- src.dataset_creator.main()
src.dataset_download module
- class src.dataset_download.DatasetDownload(structure_determination_methodology: str = 'experimental', rcsb_entity_polymer_type: str = 'RNA', dstart: int = 0, dend: int = 25, download_all: bool = False)
Bases:
object
Download RNA dataset from RCSB PDB database. It is used to download the dataset from the RCSB PDB database. It uses the search API to get the list of PDB IDs and then uses the GraphQL API to get data for each PDB ID. The data is then converted into a DataFrame with the columns specified in the COLUMNS variable. The dataframe can be accessed using the df attribute and is also saved as a CSV file.
- Parameters:
structure_determination_methodology (str) – Structure determination methodology.
rcsb_entity_polymer_type (str) – Polymer type. Possible values are "Protein", "DNA", "RNA", "NA-hybrid" and "Other". If "Protein" or any other value is used, the tool will likely work but cmscan will fail, as it is designed to work with RNA.
dstart (int) – Start index for pagination.
dend (int) – End index for pagination.
download_all (bool) – Download all entries.
- SEARCH_API_BASE_URI
Base URI for search API.
- Type:
str
- DATA_API_BASE_URI_GRAPHQL
Base URI for data API.
- Type:
str
- dstart
Start index for pagination.
- Type:
int
- dend
End index for pagination.
- Type:
int
- structure_determination_methodology
Structure determination methodology.
- Type:
str
- rcsb_entity_polymer_type
Polymer type.
- Type:
str
- data_path
Path to save the data.
- Type:
str
- df
Dataframe of the dataset.
- Type:
pd.DataFrame
- get_search_api_query()
Get search API query.
- get_graphql_query()
Get GraphQL query.
- get_pdb_list()
Get list of PDB IDs.
- get_data_for_each_pdb()
Get data for each PDB ID.
- get_data_as_df()
Get data as DataFrame.
- save_data_as_csv()
Save data as CSV.
NOTE: If you wish to supply multiple values for structure_determination_methodology or rcsb_entity_polymer_type, you would have to change the search query to include multiple values.
- DATA_API_BASE_URI_GRAPHQL = 'https://data.rcsb.org/graphql'
- SEARCH_API_BASE_URI = 'https://search.rcsb.org/rcsbsearch/v2/query'
- get_data_as_df()
This method converts the data for each PDB ID into a DataFrame.
- Returns:
Dataframe of the whole dataset obtained from the RCSB PDB database using the search and GraphQL APIs.
- Return type:
pd.DataFrame
- get_data_for_each_pdb()
This method fetches data for each PDB ID using the GraphQL API.
- Returns:
combined data for each PDB ID.
- Return type:
json
- get_graphql_query(pdb_ids: list)
Get GraphQL query, this is used to get data for each PDB ID.
- Parameters:
pdb_ids (list) – List of PDB IDs.
- Returns:
GraphQL query.
- Return type:
dict
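A possible shape for such a query is sketched below. The mapping from the documented columns to RCSB Data API fields (for example resolution to rcsb_entry_info.resolution_combined) is an assumption, and build_graphql_query is a hypothetical stand-in rather than the package's own method.

```python
import json


def build_graphql_query(pdb_ids: list) -> dict:
    """Return a GraphQL payload requesting the documented columns for several entries."""
    # A JSON list of strings is also valid GraphQL list syntax.
    ids = json.dumps(pdb_ids)
    query = f"""
    {{
      entries(entry_ids: {ids}) {{
        rcsb_id
        exptl {{ method }}
        rcsb_accession_info {{ initial_release_date }}
        rcsb_entry_info {{
          polymer_entity_instance_count
          polymer_entity_count_RNA
          resolution_combined
          selected_polymer_entity_types
        }}
        struct_keywords {{ pdbx_keywords }}
      }}
    }}
    """
    return {"query": query}


# Placeholder IDs for illustration only.
payload = build_graphql_query(["1ABC", "2XYZ"])
print(payload["query"])
```

The payload could then be posted as JSON to the DATA_API_BASE_URI_GRAPHQL endpoint listed above.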
- get_pdb_list()
This method fetches the list of PDB IDs from the RCSB PDB database using the search API.
- Returns:
List of PDB IDs.
- Return type:
list
- get_search_api_query()
Get search API query, this is used to search for entries in the RCSB PDB database. It filters entries based on structure determination methodology and polymer type. This query is used to get the list of PDB IDs.
- Returns:
Search API query.
- Return type:
dict
NOTE: If you wish to supply multiple values for structure_determination_methodology or rcsb_entity_polymer_type, you would have to change the search query to include multiple values.
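As a rough sketch of what the note describes, a search query could switch from the exact_match operator to the in operator when several polymer types are given. The attribute names, the operator choice and the translation of dstart/dend into start/rows are assumptions about the RCSB Search API usage, not the package's actual query.

```python
def build_search_query(methodology="experimental", polymer_types=("RNA",), dstart=0, dend=25):
    """Hypothetical query builder accepting one or several polymer types."""

    def terminal(attribute, operator, value):
        return {
            "type": "terminal",
            "service": "text",
            "parameters": {"attribute": attribute, "operator": operator, "value": value},
        }

    polymer_attribute = "entity_poly.rcsb_entity_polymer_type"
    if len(polymer_types) == 1:
        polymer_node = terminal(polymer_attribute, "exact_match", polymer_types[0])
    else:
        # Several values: match any of them.
        polymer_node = terminal(polymer_attribute, "in", list(polymer_types))

    methodology_node = terminal(
        "rcsb_entry_info.structure_determination_methodology", "exact_match", methodology
    )
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [methodology_node, polymer_node],
        },
        "return_type": "entry",
        # Assumption: dend is an end index, so the page size is dend - dstart.
        "request_options": {"paginate": {"start": dstart, "rows": dend - dstart}},
    }


single = build_search_query()
multi = build_search_query(polymer_types=("RNA", "NA-hybrid"))
```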
- save_data_as_csv(path: str | PathLike)
Save data as CSV.
- Parameters:
path (str, os.PathLike) – Path to save the data.
- Returns:
None
src.pdb_filter module
- class src.pdb_filter.PDBFilter(pdb_parser: MMCIF2Dict, pdb_id: str, polymer_type: str, sequence_length: int, auto_download: bool = False)
Bases:
object
This is used to apply filters on PDB files. It analyses the PDB files at the chain and polymer level. It checks if each chain in the PDB file is of the given polymer type and has the required sequence length.
- Parameters:
pdb_parser (Bio.PDB.MMCIFParser) – PDB parser.
pdb_id (str) – PDB ID.
polymer_type (str) – Polymer type, common possible values are ‘polyribonucleotide’, ‘polydeoxyribonucleotide’, ‘polypeptide’ etc.
sequence_length (int) – Sequence length.
auto_download (bool) – Auto download PDB file if not found.
- pdb_id
PDB ID.
- Type:
str
- polymer_type
Polymer type.
- Type:
str
- sequence_length
Sequence length.
- Type:
int
- auto_download
Auto download PDB file.
- Type:
bool
- pdb_parser
PDB parser.
- Type:
Bio.PDB.MMCIFParser
- pdb_file
Path to the PDB file.
- Type:
str
- structure_dict
Structure dictionary.
- Type:
dict
- _check_and_auto_download()
Check if the PDB file exists and auto download if required.
- check_polymer_type()
Check if the PDB file satisfies the given criteria.
- all_characters_are_n(s)
Check if all characters in the string are ‘N’. This is helpful to remove chains with all ‘N’ characters as they can’t be processed by Clustal Omega.
- Parameters:
s (str) – Input string.
- Returns:
True if all characters are ‘N’, False otherwise.
- Return type:
bool
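A check of this kind might look like the following sketch. How the package treats an empty string is not documented; returning False for it is an assumption made here.

```python
def all_characters_are_n(s: str) -> bool:
    """True if the sequence consists only of 'N' characters."""
    # Assumption: an empty string is not treated as an all-N sequence.
    return bool(s) and all(ch == "N" for ch in s)


# Made-up sequences: drop chains that carry no sequence information.
sequences = ["NNNN", "GGNNCU", "ACGU"]
usable = [seq for seq in sequences if not all_characters_are_n(seq)]
print(usable)
```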
- check_polymer_type()
Check if the PDB file contains the given polymer type and has the required sequence length.
- Returns:
List of tuples containing PDB ID, chain ID and corresponding sequence.
- Return type:
list
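The chain-level check can be illustrated on a plain dictionary shaped like the output of Bio.PDB.MMCIF2Dict, where every key maps to a list of strings. The function chains_matching is hypothetical, the example data is made up, and treating sequence_length as a minimum length is an assumption.

```python
def chains_matching(structure_dict, pdb_id, polymer_type, sequence_length):
    """Return (pdb_id, chain_id, sequence) tuples for chains of the requested type."""
    results = []
    rows = zip(
        structure_dict["_entity_poly.type"],
        structure_dict["_entity_poly.pdbx_strand_id"],
        structure_dict["_entity_poly.pdbx_seq_one_letter_code_can"],
    )
    for entity_type, strand_ids, sequence in rows:
        # mmCIF sequences may be wrapped over several lines.
        sequence = "".join(sequence.split())
        if entity_type != polymer_type:
            continue
        # Assumption: sequence_length is a lower bound.
        if len(sequence) < sequence_length:
            continue
        for chain_id in strand_ids.split(","):
            results.append((pdb_id, chain_id.strip(), sequence))
    return results


example = {
    "_entity_poly.type": ["polyribonucleotide", "polypeptide(L)"],
    "_entity_poly.pdbx_strand_id": ["A,B", "C"],
    "_entity_poly.pdbx_seq_one_letter_code_can": ["GGCU\nAGCC", "MKV"],
}
hits = chains_matching(example, "XXXX", "polyribonucleotide", 8)
print(hits)
```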
src.polymer_level_filter module
- class src.polymer_level_filter.PolymerLevelFilter(polymer_type: str, sequence_length: int, sequence_identity: float, auto_download: bool = False, alignment_tool: str = 'clustal')
Bases:
object
This is used to apply PDBFilter for multiple PDB files. It analyses the PDB files at the chain and polymer level. It also calculates the sequence identity between the sequences of the PDB files. You can use two alignment tools: clustal and emboss. They should be installed on your system. We recommend using clustal as it is faster.
This gives the final dataframe after applying all the filters.
- Parameters:
polymer_type (str) – Polymer type.
sequence_length (int) – Sequence length.
sequence_identity (float) – Sequence identity.
auto_download (bool) – Auto download PDB file.
alignment_tool (str) – Alignment tool. Default is clustal.
- polymer_type
Polymer type.
- Type:
str
- sequence_length
Sequence length.
- Type:
int
- sequence_identity
Sequence identity.
- Type:
float
- auto_download
Auto download PDB file.
- Type:
bool
- alignment_tool
Alignment tool.
- Type:
str
- pdb_parser
PDB parser.
- Type:
Bio.PDB.MMCIFParser
- DATA_PATH
Data path.
- Type:
str
- sequence_identity_mat_file
Sequence identity matrix file.
- Type:
str
- sequence_identity_mat_path
Sequence identity matrix path.
- Type:
str
- apply_filters_on_pdb_id()
Apply filters on PDB ID.
- apply_filters_on_list()
Apply filters on list of PDB IDs.
- create_combined_fasta_file()
Create combined fasta file.
- create_sequence_identity_mat_emboss()
Create sequence identity matrix using emboss.
- get_sequence_identity_df_emboss()
Get sequence identity dataframe using emboss.
- create_sequence_identity_mat_clustal()
Create sequence identity matrix using clustal.
- get_sequence_identity_df_clustal()
Get sequence identity dataframe using clustal.
- apply_filter_on_df()
Apply filter on dataframe.
- create_final_fasta_file()
Create final fasta file.
NOTE: When applying the similarity cutoff, an error is raised if the df contains a resolution column with no values.
- apply_filter_on_df(df: DataFrame, data_list: list)
Apply filter on dataframe.
- Parameters:
df (pd.DataFrame) – Dataframe.
data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
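One way a sequence identity cutoff could be applied is a greedy pass that prefers better-resolved chains. This is a sketch under stated assumptions, not the package's algorithm: the greedy strategy, the tie to resolution and the function names are all hypothetical. In this sketch a missing resolution (None) would make the sort fail, which is one way the error described in the note above could arise.

```python
def pair_identity(identity, a, b):
    """Look up a percent identity regardless of key order; unknown pairs count as 0."""
    return identity.get((a, b), identity.get((b, a), 0.0))


def filter_by_identity(entries, identity, cutoff):
    """entries: list of (chain_key, resolution); identity: {(key_a, key_b): percent}."""
    kept = []
    # Visit chains from best (lowest) to worst resolution.
    for key, _resolution in sorted(entries, key=lambda entry: entry[1]):
        if all(pair_identity(identity, key, other) < cutoff for other in kept):
            kept.append(key)
    return kept


# Made-up data for illustration.
entries = [("1AAA_A", 2.0), ("2BBB_A", 1.5), ("3CCC_A", 3.0)]
identity = {
    ("1AAA_A", "2BBB_A"): 90.0,
    ("1AAA_A", "3CCC_A"): 20.0,
    ("2BBB_A", "3CCC_A"): 25.0,
}
kept = filter_by_identity(entries, identity, cutoff=50.0)
print(kept)
```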
- apply_filters_on_list(pdb_list: list)
Using this method, you can apply filters on a list of PDB IDs.
- Parameters:
pdb_list (list) – List of PDB IDs.
- Returns:
List of tuples containing PDB ID, chain ID, and sequence.
- Return type:
list
- apply_filters_on_pdb_id(pdb_id: str)
Using this method, you can apply filters on a single PDB ID.
- Parameters:
pdb_id (str) – PDB ID.
- Returns:
List of tuples containing PDB ID, chain ID, and sequence.
- Return type:
list
- create_combined_fasta_file(data_list: list)
Create a combined fasta file from the data list.
- Parameters:
data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.
- Returns:
None
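A combined FASTA file of this kind can be written in a few lines. The header format {pdb_id}_{chain} follows the convention described for src.utils; the function name, the output file name and the sample data are hypothetical.

```python
import tempfile
from pathlib import Path


def write_combined_fasta(data_list, fasta_path):
    """Write one FASTA record per (pdb_id, chain_id, sequence) tuple."""
    with open(fasta_path, "w") as handle:
        for pdb_id, chain_id, sequence in data_list:
            handle.write(f">{pdb_id}_{chain_id}\n{sequence}\n")


# Made-up chains for illustration.
data_list = [("1AAA", "A", "GGCUAGCC"), ("1AAA", "B", "ACGU")]
out = Path(tempfile.mkdtemp()) / "combined.fasta"
write_combined_fasta(data_list, out)
print(out.read_text())
```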
- create_final_fasta_file(df_final: DataFrame)
Create final fasta file.
- Parameters:
df_final (pd.DataFrame) – Final dataframe.
- Returns:
None
- create_sequence_identity_mat_clustal(combined_fasta_file: str)
Create sequence identity matrix using clustal.
- Parameters:
combined_fasta_file (str) – Combined fasta file.
- Returns:
None
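With Clustal Omega, a percent identity matrix can be requested through the --distmat-out, --full and --percent-id options. The sketch below builds such a command and parses a matrix file, assuming the usual layout of a sequence count on the first line followed by one named row per sequence. Whether the package uses these exact flags or this parsing is an assumption; file names and sample values are made up.

```python
def clustalo_command(combined_fasta_file: str, matrix_path: str) -> list:
    """Assumed Clustal Omega call that writes a full percent identity matrix."""
    return [
        "clustalo",
        "-i", combined_fasta_file,
        f"--distmat-out={matrix_path}",
        "--full",        # compute the full pairwise matrix
        "--percent-id",  # report percent identities instead of distances
        "--force",       # overwrite an existing matrix file
    ]


def parse_identity_matrix(text: str):
    """Parse 'count, then name followed by values' into (names, rows)."""
    lines = [line for line in text.splitlines() if line.strip()]
    count = int(lines[0])
    names, rows = [], []
    for line in lines[1:count + 1]:
        parts = line.split()
        names.append(parts[0])
        rows.append([float(value) for value in parts[1:]])
    return names, rows


command = clustalo_command("combined.fasta", "identity.mat")
sample = "2\n1AAA_A 100.000000 42.500000\n2BBB_A 42.500000 100.000000\n"
names, rows = parse_identity_matrix(sample)
```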
- create_sequence_identity_mat_emboss(data_list: list)
Create sequence identity matrix using emboss.
- Parameters:
data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.
- Returns:
Sequence identity matrix.
- Return type:
pd.DataFrame
- get_sequence_identity_df_clustal(data_list: list)
Get sequence identity dataframe using clustal.
- Parameters:
data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.
- Returns:
Sequence identity dataframe.
- Return type:
pd.DataFrame
- get_sequence_identity_df_emboss(data_list: list)
Get sequence identity dataframe using emboss.
- Parameters:
data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.
- Returns:
Sequence identity dataframe.
- Return type:
pd.DataFrame
src.structure_filter module
- class src.structure_filter.StructureLevelFilter(exptl_method: List[str], resolution: float, year_range: List[int], polymer_entity_instance_count: int, polymer_entity_count_RNA: int, selected_polymer_entity_types: List[str], pdbx_keywords: List[str])
Bases:
object
Filter the dataset at the structure level. Using this class we can filter the dataset based on structure level attributes. It works on the columns specified in the COLUMNS variable and filters the dataset based on the given criteria.
- Parameters:
exptl_method (list) – Experimental method.
resolution (float) – Resolution.
year_range (int or list) – Year range.
polymer_entity_instance_count (int) – Polymer entity instance count.
polymer_entity_count_RNA (int) – Polymer entity count RNA.
selected_polymer_entity_types (list) – Selected polymer entity types.
pdbx_keywords (list) – PDBx keywords.
- exptl_method
Experimental method.
- Type:
list
- resolution
Resolution.
- Type:
float
- year_range
Year range.
- Type:
int or list
- polymer_entity_instance_count
Polymer entity instance count.
- Type:
int
- polymer_entity_count_RNA
Polymer entity count RNA.
- Type:
int
- selected_polymer_entity_types
Selected polymer entity types.
- Type:
list
- pdbx_keywords
PDBx keywords.
- Type:
list
- _check_dataframe()
Check if the dataframe columns match the expected columns.
- apply_exptl_method_filter()
Apply filter based on experimental method.
- apply_resolution_filter()
Apply filter based on resolution.
- apply_year_range_filter()
Apply filter based on year range.
- apply_polymer_entity_instance_count_filter()
Apply filter based on polymer entity instance count.
- apply_polymer_entity_count_RNA_filter()
Apply filter based on polymer entity count RNA.
- apply_selected_polymer_entity_types_filter()
Apply filter based on selected polymer entity types.
- apply_pdbx_keywords_filter()
Apply filter based on PDBx keywords.
- apply_filters()
Apply all the filters.
- apply_exptl_method_filter(df: DataFrame)
Apply filter based on experimental method.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
- apply_filters(df: DataFrame)
Apply all the filters.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
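The combined effect of the structure-level filters can be illustrated on plain dictionaries standing in for DataFrame rows. This is a simplified sketch covering three of the filters; treating None as "skip this filter", treating resolution as an upper bound and dropping rows without a resolution value are assumptions, and the rows are made up.

```python
def apply_structure_filters(rows, exptl_method=None, resolution=None, pdbx_keywords=None):
    """Keep rows that pass every filter that is not None (hypothetical helper)."""
    kept = []
    for row in rows:
        if exptl_method is not None and row["exptl_method"] not in exptl_method:
            continue
        if resolution is not None:
            # Assumption: resolution is an upper bound; rows without a value are dropped.
            if row["resolution"] is None or row["resolution"] > resolution:
                continue
        if pdbx_keywords is not None and row["pdbx_keywords"] not in pdbx_keywords:
            continue
        kept.append(row)
    return kept


rows = [
    {"rcsb_id": "1AAA", "exptl_method": "X-RAY DIFFRACTION", "resolution": 2.1, "pdbx_keywords": "RNA"},
    {"rcsb_id": "2BBB", "exptl_method": "SOLUTION NMR", "resolution": None, "pdbx_keywords": "RNA"},
    {"rcsb_id": "3CCC", "exptl_method": "X-RAY DIFFRACTION", "resolution": 4.0, "pdbx_keywords": "RIBOSOME"},
]
strict = apply_structure_filters(
    rows, exptl_method=["X-RAY DIFFRACTION"], resolution=3.6, pdbx_keywords=["RNA"]
)
by_method = apply_structure_filters(rows, exptl_method=["X-RAY DIFFRACTION"])
```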
- apply_pdbx_keywords_filter(df: DataFrame)
Apply filter based on PDBx keywords.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
- apply_polymer_entity_count_RNA_filter(df: DataFrame)
Apply filter based on polymer entity count RNA.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
- apply_polymer_entity_instance_count_filter(df: DataFrame)
Apply filter based on polymer entity instance count.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
- apply_resolution_filter(df: DataFrame)
Apply filter based on resolution.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
- apply_selected_polymer_entity_types_filter(df: DataFrame)
Apply filter based on selected polymer entity types.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
- apply_year_range_filter(df: DataFrame)
Apply filter based on year range. It can be a single year or a range of two years.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
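The "single year or range of two years" behaviour could be handled as in the sketch below. Interpreting a single year as "released in exactly that year", treating the range as inclusive and reading the year from an ISO-formatted date string are assumptions; the helper name and dates are made up.

```python
def in_year_range(release_date: str, year_range) -> bool:
    """Check an ISO date string such as '2020-05-13T00:00:00Z' against year_range."""
    year = int(release_date[:4])
    if isinstance(year_range, int):
        # Assumption: a single year keeps only entries released in that year.
        first = last = year_range
    else:
        first, last = year_range
    return first <= year <= last


dates = ["2019-11-02T00:00:00Z", "2020-05-13T00:00:00Z", "2022-01-01T00:00:00Z"]
only_2020 = [d for d in dates if in_year_range(d, 2020)]
from_2020_to_2024 = [d for d in dates if in_year_range(d, [2020, 2024])]
```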
src.utils module
- src.utils.contacts_from_pdb(structure: Structure, chain: str, distance_threshold: float, sequence_length: int)
- src.utils.download_fasta_file(pdb_id, output_dir)
This is used to download a fasta file.
- Parameters:
pdb_id (str) – The PDB ID.
output_dir (str) – The output directory.
- src.utils.download_fasta_files(pdb_list: list, output_dir: str)
This is used to download a list of fasta files.
- Parameters:
pdb_list (list) – The list of PDB IDs.
output_dir (str) – The output directory.
- src.utils.download_pdb_file(pdb_id, output_dir)
This is used to download a PDB file.
- Parameters:
pdb_id (str) – The PDB ID.
output_dir (str) – The output directory.
- src.utils.download_pdb_files(pdb_list: list, output_dir: str)
This is used to download a list of PDB files.
- Parameters:
pdb_list (list) – The list of PDB IDs.
output_dir (str) – The output directory.
- src.utils.extract_sequence_from_combined_fasta(combined_fasta_file: str, output_dir: str, pdb_id: str)
Takes a combined fasta file as input and extracts the sequence of a given pdb_id. If the fasta header is not of the format {pdb_id}_{chain}, it will throw an error.
- Parameters:
combined_fasta_file (str) – The combined fasta file.
output_dir (str) – The output directory.
pdb_id (str) – The pdb_id.
- Returns:
Saves the sequence to a file with name {output_dir}/{pdb_id}_{chain}.fa
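The extraction step might be sketched as follows. Both functions are hypothetical stand-ins rather than the package's implementations; the error type raised for a malformed header and the sample file are assumptions.

```python
import tempfile
from pathlib import Path


def read_fasta_records(path) -> dict:
    """Map each FASTA header (without '>') to its sequence."""
    records, header = {}, None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def extract_sequence(combined_fasta_file, output_dir, pdb_id) -> list:
    """Write {output_dir}/{pdb_id}_{chain}.fa for every chain of pdb_id."""
    written = []
    for header, sequence in read_fasta_records(combined_fasta_file).items():
        if "_" not in header:
            raise ValueError(f"header {header!r} is not of the form pdb_id_chain")
        record_id, chain = header.split("_", 1)
        if record_id == pdb_id:
            target = Path(output_dir) / f"{pdb_id}_{chain}.fa"
            target.write_text(f">{header}\n{sequence}\n")
            written.append(target)
    return written


# Made-up combined file for illustration.
workdir = Path(tempfile.mkdtemp())
combined = workdir / "combined.fasta"
combined.write_text(">1AAA_A\nGGCU\nAGCC\n>1AAA_B\nACGU\n>2BBB_A\nUUUU\n")
files = extract_sequence(combined, workdir, "1AAA")
```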
- src.utils.extract_sequences_for_pdb_ids(combined_fasta_file: str, output_dir: str, pdb_ids: list)
Takes a combined fasta file as input and extracts the sequences of the given pdb_ids. If a fasta header is not of the format {pdb_id}_{chain}, it will throw an error.
- Parameters:
combined_fasta_file (str) – The combined fasta file.
output_dir (str) – The output directory.
pdb_ids (list) – The list of pdb_ids.
- Returns:
Saves the sequence to a file with name {output_dir}/{pdb_id}_{chain}.fa
- src.utils.generate_contact_map_from_mmcif_file(mmcif_file: str, output_dir: str, chain: str, seq_len: int, distance_cutoff: float = 8.0, save: bool = True, width: int = 0)
It generates a contact map from an mmCIF file. It uses the MMCIFParser from the Bio.PDB module to parse the mmCIF file.
- src.utils.generate_contact_map_from_pdb_file(pdb_file: str, output_dir: str, chain: str, seq_len: int, distance_cutoff: float = 8.0, save: bool = True, width: int = 0)
It generates a contact map from a pdb file. It uses the PDBParser from the Bio.PDB module to parse the pdb file.
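The idea behind a distance-based contact map can be shown without any structure parsing. The sketch assumes one representative coordinate per residue and interprets width as "ignore pairs fewer than width positions apart along the sequence", so that the default of 0 masks nothing. Both points are assumptions, and the package's choice of atoms and masking rule may differ.

```python
import math


def contact_map(coords, distance_cutoff=8.0, width=0):
    """Return an N x N list of 0/1 values; 1 marks residues within distance_cutoff."""
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: pairs closer than `width` positions in sequence are masked.
            if abs(i - j) < width:
                continue
            if math.dist(coords[i], coords[j]) <= distance_cutoff:
                contacts[i][j] = 1
    return contacts


# Made-up coordinates (in angstroms) along a line.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
cm = contact_map(coords, distance_cutoff=8.0, width=2)
for row in cm:
    print(row)
```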
- src.utils.get_final_fam_pdb_chain_csv(clean_tblout_path)
- src.utils.parse_fasta(fasta_file)
- src.utils.remove_backbone_contacts(contacts: ndarray, width: int = 0)
- src.utils.save_sequences(final_fasta, output_dir)
- src.utils.write_sequences_to_files(sequences, output_dir)