src package

Submodules

src.columns module

The columns extracted from the PDB metadata. We consider these the most relevant because they capture the essential properties of each structure in the PDB.

Columns:

rcsb_id: The unique identifier for the structure in the PDB database.

exptl_method: The method used to determine the structure.

release_date: The date the structure was released.

polymer_entity_instance_count: The number of polymer entity instances (chains) in the structure.

polymer_entity_count_RNA: The number of RNA polymer entities in the structure.

resolution: The resolution of the structure.

selected_polymer_entity_types: The types of polymer entities in the structure.

pdbx_keywords: Keywords associated with the structure.
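
As a rough illustration, the column names listed above could be held in a plain Python list and used to check a record. The COLUMNS name is mentioned elsewhere in this documentation; the missing_columns helper and the sample record are hypothetical, and the actual definition in src.columns may differ.

```python
# Column names as listed above; the real definition in src.columns may differ.
COLUMNS = [
    "rcsb_id",
    "exptl_method",
    "release_date",
    "polymer_entity_instance_count",
    "polymer_entity_count_RNA",
    "resolution",
    "selected_polymer_entity_types",
    "pdbx_keywords",
]


def missing_columns(record: dict) -> list:
    """Return the expected column names absent from a record (hypothetical helper)."""
    return [name for name in COLUMNS if name not in record]


# Made-up record for illustration only; not real PDB metadata.
example = {"rcsb_id": "XXXX", "resolution": 2.5}
print(missing_columns(example))
```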

src.dataset_creator module

class src.dataset_creator.DatasetCreator(dataset_name: str, rfam_cm_path: str, structure_determination_methodology: str | None = 'experimental', rcsb_entity_polymer_type: str | None = 'RNA', dstart: int | None = 0, dend: int | None = 10000, download_all: bool | None = False, exptl_method: List[str] | None = ['X-RAY DIFFRACTION'], resolution: float | None = 3.6, year_range: List[int] | int | None = 2024, polymer_entity_instance_count: int | None = None, polymer_entity_count_RNA: int | None = None, selected_polymer_entity_types: List[str] | None = None, pdbx_keywords: List[str] | None = ['RNA'], polymer_type: str | None = 'polyribonucleotide', sequence_length: int | None = 40, sequence_identity: float | None = 50.0, auto_download: bool | None = True, alignment_tool: str | None = 'clustal', e_value_cmscan: float | None = 0.0001, save: bool | None = False)

Bases: object

Creates a dataset of RNA structures from the RCSB PDB database. It uses the DatasetDownload, StructureLevelFilter and PolymerLevelFilter classes to download the dataset and apply filters at the structure and polymer levels; all parameters can be controlled by the user. Before running, download the Rfam.cm file from https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz, unzip it, and prepare it with Infernal's cmpress. The desired DATA_PATH can be set as an environment variable. When you run this class for the first time, it creates a folder for PDB files; alternatively, you can create a directory called ‘pdb_files’ one level below the current directory.

Parameters:
  • dataset_name (str) – Name of the dataset.

  • rfam_cm_path (str) – Path to the Rfam.cm file.

  • structure_determination_methodology (str) – Structure determination methodology.

  • rcsb_entity_polymer_type (str) – Polymer type.

  • dstart (int) – Start index for pagination.

  • dend (int) – End index for pagination.

  • download_all (bool) – Download all entries.

  • exptl_method (list) – Experimental method.

  • resolution (float) – Resolution.

  • year_range (int or list) – Year range.

  • polymer_entity_instance_count (int) – Polymer entity instance count.

  • polymer_entity_count_RNA (int) – Polymer entity count RNA.

  • selected_polymer_entity_types (list) – Selected polymer entity types.

  • pdbx_keywords (list) – PDBx keywords.

  • polymer_type (str) – Polymer type.

  • sequence_length (int) – Sequence length.

  • sequence_identity (float) – Sequence identity.

  • auto_download (bool) – Auto download PDB file.

  • alignment_tool (str) – Alignment tool.

  • e_value_cmscan (float) – E-value for CMScan.

  • save (bool) – Save filtered data. If set to False, data is not saved at each filter level.

dataset_name

Name of the dataset.

Type:

str

structure_determination_methodology

Structure determination methodology.

Type:

str

rcsb_entity_polymer_type

Polymer type. One of ‘Protein’, ‘DNA’, ‘RNA’, ‘NA-hybrid’, ‘Other’. A list cannot be passed.

Type:

str

dstart

Start index for pagination.

Type:

int

dend

End index for pagination.

Type:

int

download_all

Download all entries.

Type:

bool

exptl_method

Experimental method. Possible values include ‘X-RAY DIFFRACTION’, ‘SOLUTION NMR’, ‘ELECTRON MICROSCOPY’, ‘SOLID-STATE NMR’. A list can be passed. Further details can be found at www.wwpdb.org or www.rcsb.org.

Type:

list

resolution

Resolution.

Type:

float

year_range

Year range.

Type:

int or list

polymer_entity_instance_count

Polymer entity instance count.

Type:

int

polymer_entity_count_RNA

Polymer entity count RNA.

Type:

int

selected_polymer_entity_types

Selected polymer entity types.

Type:

list

pdbx_keywords

PDBx keywords. Possible values include ‘RNA’, ‘DNA/RNA’, ‘RIBOSOME’, ‘RIBOZYME’. A list can be passed. Further details can be found at https://www.wwpdb.org, at https://www.rcsb.org, or in the raw data.

Type:

list

polymer_type

Polymer type.

Type:

str

sequence_length

Sequence length.

Type:

int

sequence_identity

Sequence identity.

Type:

float

auto_download

Auto download PDB file.

Type:

bool

alignment_tool

Alignment tool.

Type:

str

rfam_cm_path

Path to the Rfam.cm file.

Type:

str

e_value_cmscan

E-value for CMScan.

Type:

float

save

Save filtered data.

Type:

bool

dataset_files

Path to the dataset folder.

Type:

str

final_fasta_path

Path to the final FASTA file.

Type:

str

cmscan_path

Path to the CMScan output file.

Type:

str

tblout

Path to the tblout file.

Type:

str

clean_tblout_path

Path to the clean tblout file.

Type:

str

final_fams_path

Path to the final families file.

Type:

str

final_pdb_list_path

Path to the final PDB list file.

Type:

str

final_chain_ids_path

Path to the final chain IDs file.

Type:

str

cmd

cmscan command.

Type:

str

data

DatasetDownload component.

Type:

DatasetDownload

structure_filter

StructureLevelFilter component.

Type:

StructureLevelFilter

polymer_level_filter

PolymerLevelFilter component.

Type:

PolymerLevelFilter

save_filtered_data()

Save PDB list and DataFrame for each level of filters.

save_pdb_list()

Save list of PDBs to a file.

save_dataframe()

Save DataFrame to a file.

get_without_filter_df()

Get dataframe without applying any filters.

get_structure_filtered_df()

Get dataframe after applying structure-level filters.

get_polymer_filtered_df()

Get dataframe after applying polymer-level filters.

get_final_df()

Get final dataframe after applying all filters.

create_final_fasta_file()

Create final FASTA file.

run_cmscan()

Run CMScan.

get_final_families()

Get final families.

get_final_pdb_list()

Get final PDB list.

NOTE: When you use different parameters, choose a descriptive dataset_name, because internally all output files share the same names and would otherwise be hard to tell apart.

Example

    import os
    from pathlib import Path
    from src.dataset_creator import DatasetCreator

    dataset_name = 'test'
    os.environ['DATA_PATH'] = str(Path(__file__).resolve().parents[1] / f'data/{dataset_name}')
    DATA_PATH = os.environ.get('DATA_PATH')
    dc = DatasetCreator(your_params_here)

apply_filters(df: DataFrame, df_polymer_filtered_list: list) DataFrame

Apply all filters.

apply_polymer_filters(pdb_list: list) Tuple[DataFrame, list]

Apply polymer-level filters.

apply_structure_filters(df: DataFrame) DataFrame

Apply structure-level filters.

create_final_fasta_file(df) None

Create final FASTA file.

run() None

Run CMScan.
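
A minimal sketch of how a cmscan command line might be assembled from rfam_cm_path, e_value_cmscan, the FASTA path and the tblout path. The --tblout and -E options belong to Infernal's cmscan; the helper name, the file names and the overall option set are assumptions, and the command actually stored in cmd may differ.

```python
import shlex


def build_cmscan_command(rfam_cm_path, fasta_path, tblout_path, e_value=0.0001):
    """Assemble a cmscan invocation as an argument list (hypothetical helper)."""
    return [
        "cmscan",
        "--tblout", str(tblout_path),  # tabular per-hit output file
        "-E", str(e_value),            # report hits with E-value <= e_value
        str(rfam_cm_path),             # CM database, prepared with cmpress beforehand
        str(fasta_path),               # query sequences
    ]


# File names below are placeholders, not the names used by DatasetCreator.
cmd = build_cmscan_command("Rfam.cm", "final.fasta", "cmscan.tblout")
print(shlex.join(cmd))
```

Keeping the command as a list lets it be passed directly to subprocess.run(cmd, check=True) without shell quoting concerns.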

save_dataframe(df: DataFrame, filename: str) None

Save DataFrame to a file.

save_pdb_list(df: DataFrame, filename: str) None

Save list of PDBs to a file.

src.dataset_creator.main()

src.dataset_download module

class src.dataset_download.DatasetDownload(structure_determination_methodology: str = 'experimental', rcsb_entity_polymer_type: str = 'RNA', dstart: int = 0, dend: int = 25, download_all: bool = False)

Bases: object

Downloads an RNA dataset from the RCSB PDB database. It uses the search API to get the list of PDB IDs and then uses the GraphQL API to get data for each PDB ID. The data is then converted into a DataFrame with the columns specified in the COLUMNS variable. The DataFrame can be accessed through the df attribute and is also saved as a CSV file.

Parameters:
  • structure_determination_methodology (str) – Structure determination methodology.

  • rcsb_entity_polymer_type (str) – Polymer type. Possible values are “Protein”, “DNA”, “RNA”, “NA-hybrid”, “Other”. If “Protein” or any other value is used, the tool will likely work, but cmscan will fail as it is designed to work with RNA.

  • dstart (int) – Start index for pagination.

  • dend (int) – End index for pagination.

  • download_all (bool) – Download all entries.

SEARCH_API_BASE_URI

Base URI for search API.

Type:

str

DATA_API_BASE_URI_GRAPHQL

Base URI for data API.

Type:

str

dstart

Start index for pagination.

Type:

int

dend

End index for pagination.

Type:

int

structure_determination_methodology

Structure determination methodology.

Type:

str

rcsb_entity_polymer_type

Polymer type.

Type:

str

data_path

Path to save the data.

Type:

str

df

Dataframe of the dataset.

Type:

pd.DataFrame

get_search_api_query()

Get search API query.

get_graphql_query()

Get GraphQL query.

get_pdb_list()

Get list of PDB IDs.

get_data_for_each_pdb()

Get data for each PDB ID.

get_data_as_df()

Get data as DataFrame.

save_data_as_csv()

Save data as CSV.

NOTE: If you wish to supply multiple values for structure_determination_methodology or rcsb_entity_polymer_type, you would have to change the search query to include multiple values.

DATA_API_BASE_URI_GRAPHQL = 'https://data.rcsb.org/graphql'

SEARCH_API_BASE_URI = 'https://search.rcsb.org/rcsbsearch/v2/query'

get_data_as_df()

This method converts the data for each PDB ID into a DataFrame.

Returns:

Dataframe of the whole dataset obtained from the RCSB PDB database using the search and GraphQL APIs.

Return type:

pd.DataFrame

get_data_for_each_pdb()

This method fetches data for each PDB ID using the GraphQL API.

Returns:

combined data for each PDB ID.

Return type:

json

get_graphql_query(pdb_ids: list)

Get GraphQL query, this is used to get data for each PDB ID.

Parameters:

pdb_ids (list) – List of PDB IDs.

Returns:

GraphQL query.

Return type:

dict
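
The following sketch shows one way a GraphQL request body for a list of PDB IDs could be built. The helper name and the field selection are assumptions chosen to resemble the columns documented for src.columns; they should be verified against the RCSB schema, and the query produced by get_graphql_query may look different.

```python
import json


def build_graphql_query(pdb_ids):
    """Return a request body for the GraphQL endpoint (hypothetical helper).

    The selected fields are an assumption; check them against the RCSB schema.
    """
    ids = json.dumps(list(pdb_ids))  # renders ["1ABC", "2XYZ"] with double quotes
    query = (
        "{ entries(entry_ids: " + ids + ") {"
        " rcsb_id"
        " exptl { method }"
        " rcsb_accession_info { initial_release_date }"
        " rcsb_entry_info {"
        " deposited_polymer_entity_instance_count"
        " polymer_entity_count_RNA"
        " resolution_combined"
        " selected_polymer_entity_types"
        " }"
        " struct_keywords { pdbx_keywords }"
        " } }"
    )
    return {"query": query}


# Placeholder IDs for illustration only.
body = build_graphql_query(["1ABC", "2XYZ"])
print(body["query"])
```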

get_pdb_list()

This method fetches the list of PDB IDs from the RCSB PDB database using the search API.

Returns:

List of PDB IDs.

Return type:

list

get_search_api_query()

Get search API query, this is used to search for entries in the RCSB PDB database. It filters entries based on structure determination methodology and polymer type. This query is used to get the list of PDB IDs.

Returns:

Search API query.

Return type:

dict

NOTE: If you wish to supply multiple values for structure_determination_methodology or rcsb_entity_polymer_type, you would have to change the search query to include multiple values.
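
To make the note above concrete, here is a hedged sketch of a search request body that accepts either a single value or a list of values for each criterion. The helper name, the attribute names, the "in" operator for lists and the start/rows pagination arguments are assumptions about the RCSB Search API rather than a copy of the query this class builds.

```python
def build_search_query(methodology="experimental", polymer_type="RNA",
                       start=0, rows=25, return_all=False):
    """Build a Search API request body (hypothetical helper; attribute names assumed)."""

    def terminal(attribute, value):
        # A list of values switches to the "in" operator so several values can match.
        operator = "in" if isinstance(value, (list, tuple)) else "exact_match"
        return {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": attribute,
                "operator": operator,
                "value": list(value) if operator == "in" else value,
            },
        }

    if return_all:
        options = {"return_all_hits": True}
    else:
        options = {"paginate": {"start": start, "rows": rows}}
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                terminal("rcsb_entry_info.structure_determination_methodology", methodology),
                terminal("entity_poly.rcsb_entity_polymer_type", polymer_type),
            ],
        },
        "request_options": options,
        "return_type": "entry",
    }


single = build_search_query()
multi = build_search_query(polymer_type=["RNA", "NA-hybrid"])
```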

save_data_as_csv(path: str | PathLike)

Save data as CSV.

Parameters:

path (str, os.PathLike) – Path to save the data.

Returns:

None

src.pdb_filter module

class src.pdb_filter.PDBFilter(pdb_parser: MMCIF2Dict, pdb_id: str, polymer_type: str, sequence_length: int, auto_download: bool = False)

Bases: object

This is used to apply filters on PDB files. It analyses the PDB files at the chain and polymer level. It checks if each chain in the PDB file is of the given polymer type and has the required sequence length.

Parameters:
  • pdb_parser (Bio.PDB.MMCIFParser) – PDB parser.

  • pdb_id (str) – PDB ID.

  • polymer_type (str) – Polymer type, common possible values are ‘polyribonucleotide’, ‘polydeoxyribonucleotide’, ‘polypeptide’ etc.

  • sequence_length (int) – Sequence length.

  • auto_download (bool) – Auto download PDB file if not found.

pdb_id

PDB ID.

Type:

str

polymer_type

Polymer type.

Type:

str

sequence_length

Sequence length.

Type:

int

auto_download

Auto download PDB file.

Type:

bool

pdb_parser

PDB parser.

Type:

Bio.PDB.MMCIFParser

pdb_file

Path to the PDB file.

Type:

str

structure_dict

Structure dictionary.

Type:

dict

_check_and_auto_download()

Check if the PDB file exists and auto download if required.

check_polymer_type()

Check if the PDB file satisfies the given criteria.

all_characters_are_n(s)

Check if all characters in the string are ‘N’. This is helpful to remove chains with all ‘N’ characters as they can’t be processed by Clustal Omega.

Parameters:

s (str) – Input string.

Returns:

True if all characters are ‘N’, False otherwise.

Return type:

bool
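
A minimal sketch of such a check, written as a standalone function. Treating the empty string as False and matching only upper-case ‘N’ are assumptions of this sketch.

```python
def all_characters_are_n(s: str) -> bool:
    """Return True when s is non-empty and consists only of 'N' characters."""
    # Treating the empty string as False is an assumption of this sketch.
    return bool(s) and set(s) == {"N"}


print(all_characters_are_n("NNNN"))
print(all_characters_are_n("NNAN"))
```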

check_polymer_type()

Check if the PDB file contains the given polymer type and has the required sequence length.

Returns:

List of tuples containing PDB ID, chain ID and corresponding sequence.

Return type:

list
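
The selection described above can be sketched on plain tuples, without parsing any file. The select_chains helper and its input layout are hypothetical, and treating sequence_length as a minimum length is an assumption; the class may apply a different comparison.

```python
def select_chains(pdb_id, chains, polymer_type, sequence_length):
    """Keep chains of the requested polymer type (hypothetical helper).

    `chains` is an iterable of (chain_id, chain_polymer_type, sequence) tuples.
    ASSUMPTION: sequence_length is treated as a minimum length here.
    """
    selected = []
    for chain_id, chain_type, sequence in chains:
        if chain_type != polymer_type:
            continue  # wrong polymer type
        if len(sequence) < sequence_length:
            continue  # too short under the minimum-length assumption
        if set(sequence) == {"N"}:
            continue  # all-'N' chains cannot be aligned
        selected.append((pdb_id, chain_id, sequence))
    return selected


# Made-up chains for illustration only.
chains = [
    ("A", "polyribonucleotide", "ACGU" * 12),
    ("B", "polypeptide(L)", "M" * 60),
    ("C", "polyribonucleotide", "ACGU"),
    ("D", "polyribonucleotide", "N" * 50),
]
result = select_chains("XXXX", chains, "polyribonucleotide", 40)
```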

src.polymer_level_filter module

class src.polymer_level_filter.PolymerLevelFilter(polymer_type: str, sequence_length: int, sequence_identity: float, auto_download: bool = False, alignment_tool: str = 'clustal')

Bases: object

This is used to apply PDBFilter for multiple PDB files. It analyses the PDB files at the chain and polymer level. It also calculates the sequence identity between the sequences of the PDB files. You can use two alignment tools: clustal and emboss. They should be installed on your system. We recommend using clustal as it is faster.

This gives the final dataframe after applying all the filters.
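
To illustrate the idea of a sequence identity cutoff, the sketch below computes percent identity between two already-aligned sequences and then removes redundant entries greedily, preferring better (lower) resolution. It does not call Clustal or EMBOSS; the function names, the gap handling and the greedy strategy are assumptions, not the method this class implements.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def remove_redundant(entries, cutoff):
    """Greedy filter: visit entries from best to worst resolution and keep an
    entry only if it is below `cutoff` percent identity to every kept entry.

    `entries` is a list of (name, resolution, aligned_sequence) tuples.
    """
    kept = []
    for name, resolution, seq in sorted(entries, key=lambda e: e[1]):
        if all(percent_identity(seq, k[2]) < cutoff for k in kept):
            kept.append((name, resolution, seq))
    return [name for name, _, _ in kept]


# Made-up aligned sequences and resolutions for illustration only.
entries = [
    ("a_A", 2.0, "ACGUACGU"),
    ("b_A", 3.0, "ACGUACGA"),
    ("c_A", 2.5, "UUUU--GG"),
]
print(remove_redundant(entries, 50.0))
```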

Parameters:
  • polymer_type (str) – Polymer type.

  • sequence_length (int) – Sequence length.

  • sequence_identity (float) – Sequence identity.

  • auto_download (bool) – Auto download PDB file.

  • alignment_tool (str) – Alignment tool. Default is clustal.

polymer_type

Polymer type.

Type:

str

sequence_length

Sequence length.

Type:

int

sequence_identity

Sequence identity.

Type:

float

auto_download

Auto download PDB file.

Type:

bool

alignment_tool

Alignment tool.

Type:

str

pdb_parser

PDB parser.

Type:

Bio.PDB.MMCIFParser

DATA_PATH

Data path.

Type:

str

sequence_identity_mat_file

Sequence identity matrix file.

Type:

str

sequence_identity_mat_path

Sequence identity matrix path.

Type:

str

apply_filters_on_pdb_id()

Apply filters on PDB ID.

apply_filters_on_list()

Apply filters on list of PDB IDs.

create_combined_fasta_file()

Create combined fasta file.

create_sequence_identity_mat_emboss()

Create sequence identity matrix using emboss.

get_sequence_identity_df_emboss()

Get sequence identity dataframe using emboss.

create_sequence_identity_mat_clustal()

Create sequence identity matrix using clustal.

get_sequence_identity_df_clustal()

Get sequence identity dataframe using clustal.

apply_filter_on_df()

Apply filter on dataframe.

create_final_fasta_file()

Create final fasta file.

NOTE: When applying the similarity cutoff, an error is raised if the DataFrame contains a resolution column with no values.

apply_filter_on_df(df: DataFrame, data_list: list)

Apply filter on dataframe.

Parameters:
  • df (pd.DataFrame) – Dataframe.

  • data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

apply_filters_on_list(pdb_list: list)

Using this method, you can apply filters on a list of PDB IDs.

Parameters:

pdb_list (list) – List of PDB IDs.

Returns:

List of tuples containing PDB ID, chain ID, and sequence.

Return type:

list

apply_filters_on_pdb_id(pdb_id: str)

Using this method, you can apply filters on a single PDB ID.

Parameters:

pdb_id (str) – PDB ID.

Returns:

List of tuples containing PDB ID, chain ID, and sequence.

Return type:

list

create_combined_fasta_file(data_list: list)

Create a combined fasta file from the data list.

Parameters:

data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.

Returns:

None
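
A small sketch of how such tuples could be rendered as FASTA text. The {pdb_id}_{chain} header format follows the description in src.utils; the helper name, the line wrapping and returning a string instead of writing a file are assumptions.

```python
def format_combined_fasta(data_list, line_width=80):
    """Render (pdb_id, chain_id, sequence) tuples as FASTA text (hypothetical helper)."""
    lines = []
    for pdb_id, chain_id, sequence in data_list:
        lines.append(f">{pdb_id}_{chain_id}")
        # Wrap long sequences so each line holds at most line_width characters.
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return "\n".join(lines) + "\n"


# Placeholder IDs and sequences for illustration only.
text = format_combined_fasta([("1ABC", "A", "ACGU"), ("1ABC", "B", "GGCC")])
print(text, end="")
```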

create_final_fasta_file(df_final: DataFrame)

Create final fasta file.

Parameters:

df_final (pd.DataFrame) – Final dataframe.

Returns:

None

create_sequence_identity_mat_clustal(combined_fasta_file: str)

Create sequence identity matrix using clustal.

Parameters:

combined_fasta_file (str) – Combined fasta file.

Returns:

None

create_sequence_identity_mat_emboss(data_list: list)

Create sequence identity matrix using emboss.

Parameters:

data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.

Returns:

Sequence identity matrix.

Return type:

pd.DataFrame

get_sequence_identity_df_clustal(data_list: list)

Get sequence identity dataframe using clustal.

Parameters:

data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.

Returns:

Sequence identity dataframe.

Return type:

pd.DataFrame

get_sequence_identity_df_emboss(data_list: list)

Get sequence identity dataframe using emboss.

Parameters:

data_list (list) – List of tuples containing PDB ID, chain ID, and sequence.

Returns:

Sequence identity dataframe.

Return type:

pd.DataFrame

src.structure_filter module

class src.structure_filter.StructureLevelFilter(exptl_method: List[str], resolution: float, year_range: List[int], polymer_entity_instance_count: int, polymer_entity_count_RNA: int, selected_polymer_entity_types: List[str], pdbx_keywords: List[str])

Bases: object

Filter the dataset at the structure level. Using this class we can filter the dataset based on structure level attributes. It works on the columns specified in the COLUMNS variable and filters the dataset based on the given criteria.
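
The filtering idea can be sketched on a list of dictionaries instead of a DataFrame. The filter_records helper is hypothetical, and its semantics (resolution as an upper bound, list criteria as membership tests, None meaning the criterion is skipped) are assumptions rather than documented behaviour.

```python
def filter_records(records, exptl_method=None, resolution=None, pdbx_keywords=None):
    """Filter a list of record dicts; a criterion left as None is skipped.

    ASSUMPTIONS: resolution is an upper bound, and list-valued criteria
    match when the record's value is one of the listed values.
    """
    result = list(records)
    if exptl_method is not None:
        result = [r for r in result if r["exptl_method"] in exptl_method]
    if resolution is not None:
        # Records without a resolution value are dropped rather than compared.
        result = [r for r in result
                  if r["resolution"] is not None and r["resolution"] <= resolution]
    if pdbx_keywords is not None:
        result = [r for r in result if r["pdbx_keywords"] in pdbx_keywords]
    return result


# Made-up records for illustration only.
records = [
    {"rcsb_id": "AAAA", "exptl_method": "X-RAY DIFFRACTION", "resolution": 2.1, "pdbx_keywords": "RNA"},
    {"rcsb_id": "BBBB", "exptl_method": "SOLUTION NMR", "resolution": None, "pdbx_keywords": "RNA"},
    {"rcsb_id": "CCCC", "exptl_method": "X-RAY DIFFRACTION", "resolution": 4.0, "pdbx_keywords": "RIBOSOME"},
]
kept = filter_records(records, exptl_method=["X-RAY DIFFRACTION"], resolution=3.6)
```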

Parameters:
  • exptl_method (list) – Experimental method.

  • resolution (float) – Resolution.

  • year_range (int or list) – Year range.

  • polymer_entity_instance_count (int) – Polymer entity instance count.

  • polymer_entity_count_RNA (int) – Polymer entity count RNA.

  • selected_polymer_entity_types (list) – Selected polymer entity types.

  • pdbx_keywords (list) – PDBx keywords.

exptl_method

Experimental method.

Type:

list

resolution

Resolution.

Type:

float

year_range

Year range.

Type:

int or list

polymer_entity_instance_count

Polymer entity instance count.

Type:

int

polymer_entity_count_RNA

Polymer entity count RNA.

Type:

int

selected_polymer_entity_types

Selected polymer entity types.

Type:

list

pdbx_keywords

PDBx keywords.

Type:

list

_check_dataframe()

Check if the dataframe columns match the expected columns.

apply_exptl_method_filter()

Apply filter based on experimental method.

apply_resolution_filter()

Apply filter based on resolution.

apply_year_range_filter()

Apply filter based on year range.

apply_polymer_entity_instance_count_filter()

Apply filter based on polymer entity instance count.

apply_polymer_entity_count_RNA_filter()

Apply filter based on polymer entity count RNA.

apply_selected_polymer_entity_types_filter()

Apply filter based on selected polymer entity types.

apply_pdbx_keywords_filter()

Apply filter based on PDBx keywords.

apply_filters()

Apply all the filters.

apply_exptl_method_filter(df: DataFrame)

Apply filter based on experimental method.

Parameters:

df (pd.DataFrame) – Input dataframe.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

apply_filters(df: DataFrame)

Apply all the filters.

Parameters:

df (pd.DataFrame) – Input dataframe.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

apply_pdbx_keywords_filter(df: DataFrame)

Apply filter based on PDBx keywords.

Parameters:

df (pd.DataFrame) – Input dataframe.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

apply_polymer_entity_count_RNA_filter(df: DataFrame)

Apply filter based on polymer entity count RNA.

Parameters:

df (pd.DataFrame) – Input dataframe.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

apply_polymer_entity_instance_count_filter(df: DataFrame)

Apply filter based on polymer entity instance count.

Parameters:

df (pd.DataFrame) – Input dataframe.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

apply_resolution_filter(df: DataFrame)

Apply filter based on resolution.

Parameters:

df (pd.DataFrame) – Input dataframe.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

apply_selected_polymer_entity_types_filter(df: DataFrame)

Apply filter based on selected polymer entity types.

Parameters:

df (pd.DataFrame) – Input dataframe.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

apply_year_range_filter(df: DataFrame)

Apply filter based on year range. It can be a single year or a range of two years.

Parameters:

df (pd.DataFrame) – Input dataframe.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame
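
A hedged sketch of the two accepted forms of year_range, again on plain dictionaries. It assumes that a single year keeps only that year, that a two-item list is an inclusive range, and that release_date begins with a four-digit year; the actual method may interpret these differently.

```python
def filter_by_year(records, year_range):
    """Keep records whose release year matches year_range (hypothetical helper).

    ASSUMPTIONS: a single int keeps only that year, a two-item list is an
    inclusive [first, last] range, and release_date starts with a 4-digit year.
    """
    if isinstance(year_range, int):
        first = last = year_range
    else:
        first, last = sorted(year_range)
    return [r for r in records if first <= int(str(r["release_date"])[:4]) <= last]


# Made-up records for illustration only.
records = [
    {"rcsb_id": "AAAA", "release_date": "2019-05-01"},
    {"rcsb_id": "BBBB", "release_date": "2022-11-30"},
    {"rcsb_id": "CCCC", "release_date": "2024-01-15"},
]
only_2022 = filter_by_year(records, 2022)
from_2020_to_2024 = filter_by_year(records, [2020, 2024])
```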

src.utils module

src.utils.contacts_from_pdb(structure: Structure, chain: str, distance_threshold: float, sequence_length: int)

src.utils.download_fasta_file(pdb_id, output_dir)

This is used to download a fasta file.

Parameters:
  • pdb_id (str) – The PDB ID.

  • output_dir (str) – The output directory.

src.utils.download_fasta_files(pdb_list: list, output_dir: str)

This is used to download a list of fasta files.

Parameters:
  • pdb_list (list) – The list of PDB IDs.

  • output_dir (str) – The output directory.

src.utils.download_pdb_file(pdb_id, output_dir)

This is used to download a PDB file.

Parameters:
  • pdb_id (str) – The PDB ID.

  • output_dir (str) – The output directory.

src.utils.download_pdb_files(pdb_list: list, output_dir: str)

This is used to download a list of PDB files.

Parameters:
  • pdb_list (list) – The list of PDB IDs.

  • output_dir (str) – The output directory.

src.utils.extract_sequence_from_combined_fasta(combined_fasta_file: str, output_dir: str, pdb_id: str)

Takes a combined FASTA file and extracts the sequence for a given pdb_id. If a FASTA header is not of the format {pdb_id}_{chain}, an error is raised.

Parameters:
  • combined_fasta_file (str) – The combined fasta file.

  • output_dir (str) – The output directory.

  • pdb_id (str) – The pdb_id.

Returns:

Saves the sequence to a file with name {output_dir}/{pdb_id}_{chain}.fa
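
The extraction step can be sketched on FASTA text held in memory. The extract_sequences helper is hypothetical; it returns a dictionary instead of writing {pdb_id}_{chain}.fa files, and the exact header check used by the library is an assumption.

```python
def extract_sequences(fasta_text, pdb_id):
    """Return {header: sequence} for records whose header starts with pdb_id."""
    records = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            if "_" not in header:
                raise ValueError(f"header {header!r} is not of the form pdb_id_chain")
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return {h: s for h, s in records.items() if h.split("_", 1)[0] == pdb_id}


# Placeholder IDs and sequences for illustration only.
combined = ">1ABC_A\nACGU\nACGU\n>1ABC_B\nGGCC\n>2XYZ_A\nUUUU\n"
found = extract_sequences(combined, "1ABC")
```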

src.utils.extract_sequences_for_pdb_ids(combined_fasta_file: str, output_dir: str, pdb_ids: list)

Takes a combined FASTA file and extracts the sequences for each of the given pdb_ids. If a FASTA header is not of the format {pdb_id}_{chain}, an error is raised.

Parameters:
  • combined_fasta_file (str) – The combined fasta file.

  • output_dir (str) – The output directory.

  • pdb_ids (list) – The list of pdb_ids.

Returns:

Saves the sequence to a file with name {output_dir}/{pdb_id}_{chain}.fa

src.utils.generate_contact_map_from_mmcif_file(mmcif_file: str, output_dir: str, chain: str, seq_len: int, distance_cutoff: float = 8.0, save: bool = True, width: int = 0)

It generates a contact map from an mmCIF file. It uses the MMCIFParser from the Bio.PDB module to parse the mmCIF file.

src.utils.generate_contact_map_from_pdb_file(pdb_file: str, output_dir: str, chain: str, seq_len: int, distance_cutoff: float = 8.0, save: bool = True, width: int = 0)

It generates a contact map from a pdb file. It uses the PDBParser from the Bio.PDB module to parse the pdb file.
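
As a rough sketch of what a distance-based contact map involves, the function below works on one representative coordinate per residue and needs no structure file. The helper name, the choice of coordinates and the meaning given to width (contacts with sequence separation smaller than width are dropped, so 0 drops nothing) are assumptions and may not match these functions.

```python
import math


def contact_map(coords, distance_cutoff=8.0, width=0):
    """Return an n x n 0/1 matrix of contacts between residue coordinates.

    ASSUMPTION: contacts between residues with |i - j| < width are removed,
    so width=0 removes nothing and width=1 removes only the diagonal.
    """
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if abs(i - j) < width:
                continue  # too close along the backbone
            if math.dist(coords[i], coords[j]) <= distance_cutoff:
                contacts[i][j] = 1
    return contacts


# Made-up coordinates (in angstroms) for illustration only.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
cm = contact_map(coords)
banded = contact_map(coords, width=2)
```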

src.utils.get_final_fam_pdb_chain_csv(clean_tblout_path)

src.utils.parse_fasta(fasta_file)

src.utils.remove_backbone_contacts(contacts: ndarray, width: int = 0)

src.utils.save_sequences(final_fasta, output_dir)

src.utils.write_sequences_to_files(sequences, output_dir)

Module contents