The DataBase

pySciSci provides a standardized interface for working with several of the major datasets in the Science of Science, including OpenAlex, the Microsoft Academic Graph (MAG), Web of Science (WoS), DBLP, and PubMed.

The storage and processing frameworks are highly generalizable and can be extended to other databases not mentioned here. Please contribute an interface to your data on the GitHub project page!

Each dataset is accessed as a customized variant of the BibDataBase class, a container of pandas DataFrames that handles all data loading and pre-processing.

Currently, we provide direct data download only for DBLP, PubMed, and OpenAlex. All other data must be downloaded manually from the data provider before processing.

To facilitate data movement and lower the memory overhead when complete tables are not required, we pre-process the data tables into smaller chunks. When loading a table into memory, the user can either load the full table by referencing the table name as a database property, or specify filters to load only a subset of the data.
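To illustrate the loading pattern (the function and chunk layout below are hypothetical stand-ins, not pySciSci's actual internals), filtering each chunk before concatenation keeps the memory footprint proportional to the requested subset rather than the full table:

```python
import pandas as pd

# Hypothetical sketch: a table is stored as several smaller chunks, which
# are read one at a time and filtered before concatenation, so unneeded
# rows never accumulate in memory.
def load_table(chunks, filter_dict=None):
    """chunks: iterable of DataFrames standing in for the on-disk pieces."""
    kept = []
    for chunk in chunks:
        if filter_dict is not None:
            for col, values in filter_dict.items():
                chunk = chunk[chunk[col].isin(values)]
        kept.append(chunk)
    return pd.concat(kept, ignore_index=True)

chunk1 = pd.DataFrame({"PublicationId": [1, 2], "Year": [1999, 2005]})
chunk2 = pd.DataFrame({"PublicationId": [3, 4], "Year": [2010, 2015]})
subset = load_table([chunk1, chunk2], filter_dict={"Year": [2005, 2010]})
```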

Basic Data WorkFlow

Every dataset in pySciSci is first pre-processed into a standardized tabular format based around DataFrame objects.

First usage only:
  • Download Data

  • Preprocess Data

All other usages:
  • Apply Data Filter

  • Load only the DataFrame you need

DataSet Examples

DataFrames

Each dataset is partitioned into several DataFrames containing information for different bibliometric objects that are accessed as properties of the BibDataBase:
  • pub: The DataFrame keeping publication information, including publication date, journal, title, etc. Each PubId occurs only once. Columns depend on the specific datasource.

  • author: The DataFrame keeping author names and personal information. Each AuthorId occurs only once. Columns depend on the specific datasource.

  • affiliation: The DataFrame keeping affiliation names and websites. Each AffiliationId occurs only once. Columns depend on the specific datasource.

  • journal: The DataFrame keeping journal names and websites. Each JournalId occurs only once. Columns depend on the specific datasource.

  • fieldinfo: The DataFrame keeping field names and levels. Each FieldId occurs only once. Columns depend on the specific datasource.

There are also DataFrames which contain edge lists linking the different bibliometric data objects:
  • pub2ref: The DataFrame linking publications to their references (or citations).

  • pub2field: The DataFrame linking publications to their fields.

  • paa: The DataFrame linking publications to authors to affiliations. Columns depend on the specific datasource.

Finally, two processed DataFrames are created which contain the most common citation counts for future reference:
  • impact: The DataFrame linking publications to their citation counts. Columns depend on the specific datasource and processing options.

  • pub2refnoself: The DataFrame linking publications to their references where all self-citations are removed.
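Because the edge-list DataFrames share identifier columns with the object DataFrames, they can be linked with standard pandas merges. A minimal sketch with toy data, using the pub2ref column names documented for this class:

```python
import pandas as pd

pub = pd.DataFrame({"PublicationId": [1, 2, 3], "Year": [2000, 2005, 2010]})
pub2ref = pd.DataFrame({"CitingPublicationId": [2, 3, 3],
                        "CitedPublicationId": [1, 1, 2]})

# Attach the citing publication's year to each citation edge...
citing_years = pub.rename(columns={"PublicationId": "CitingPublicationId",
                                   "Year": "CitingYear"})
edges = pub2ref.merge(citing_years, on="CitingPublicationId")

# ...then count the citations received by each cited publication.
counts = edges.groupby("CitedPublicationId").size()
```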

Filters

Some datasets contain a wide range of publication types, from many time periods, in many fields, spanning many different topics. Often, it is useful to focus on only a subset of the available data. For example, the MAG contains many different document types, including journal publications, books, patents, and others.

Filters can be applied to the BibDataBase to ensure only a desired subset of the data is loaded into memory.

There are four default filters provided by pySciSci:
  • YearFilter

  • DocTypeFilter

  • FieldFilter

  • JournalFilter
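A filter is essentially a predicate over one publication column. A minimal stand-in for the YearFilter idea (the class name and call signature here are hypothetical; see the pySciSci filter classes for the real interface):

```python
# Hypothetical sketch of a year-range filter: keep only publications whose
# Year falls inside a closed interval.
class YearRangeFilter:
    def __init__(self, year_min, year_max):
        self.year_min = year_min
        self.year_max = year_max

    def __call__(self, record):
        return self.year_min <= record["Year"] <= self.year_max

records = [{"PublicationId": 1, "Year": 1998},
           {"PublicationId": 2, "Year": 2007}]
keep = YearRangeFilter(2000, 2010)
kept = [r for r in records if keep(r)]
```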

Property Dictionaries

Two property dictionaries are created for quick reference without the need to load the complete publication dataframe:
  • pub2year: mapping between the PubId and PubYear

  • pub2doctype: mapping between the PubId and DocType (when available)
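Both dictionaries are plain Python mappings, so a Year or DocType lookup never requires loading the full publication DataFrame. A toy construction (the record layout is illustrative, not the on-disk format):

```python
# (PubId, Year, DocType) records; DocType may be missing for some sources.
records = [(1, 2000, "Journal"), (2, 2005, "Conference"), (3, 2005, None)]

pub2year = {pid: year for pid, year, _ in records}
pub2doctype = {pid: dt for pid, _, dt in records if dt is not None}
```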

BibDataBase

class pyscisci.database.BibDataBase(path2database='', database_extension='csv.gz', keep_in_memory=False, global_filter=None, show_progress=True)

Base class for all bibliometric database interfaces.

The BibDataBase provides a parsimonious structure for each of the specific data sources (MAG, WOS, etc.).

There are four primary types of functions:
  1. Parsing Functions (data source specific) that parse the raw data files

  2. Loading Functions that load DataFrames

  3. Processing Functions that calculate advanced data types from the raw files

  4. Analysis Functions that calculate Science of Science metrics.


Parameters:
  • path2database (str) – The path to the database files

  • keep_in_memory (bool, default False) – Flag to keep database files in memory once loaded

  • global_filter (dict or Filter) – Set of conditions to apply globally.

  • show_progress (bool, default True) – Flag to show progress of loading, processing, and calculations.


property affiliation

The DataFrame keeping affiliation information. Columns may depend on the specific datasource.

Notes

columns: ‘AffiliationId’, ‘NumberPublications’, ‘NumberCitations’, ‘FullName’, ‘GridId’, ‘OfficialPage’, ‘WikiPage’, ‘Latitude’, ‘Longitude’


property author

The DataFrame keeping author information. Columns may depend on the specific datasource.

Notes

columns: ‘AuthorId’, ‘LastKnownAffiliationId’, ‘NumberPublications’, ‘NumberCitations’, ‘FullName’, ‘LastName’, ‘FirstName’, ‘MiddleName’


property author2pub

The DataFrame keeping all publication-author relationships. Columns may depend on the specific datasource.

Notes

columns: ‘PublicationId’, ‘AuthorId’


compute_impact(preprocess=True, citation_horizons=[5, 10], noselfcite=True)
Calculate several of the common citation indices.
  • ‘Ctotal’ : The total number of citations.

  • ‘Ck’ : The total number of citations within the first k years of publication, for each k value specified by citation_horizons.

  • ‘Ctotal_noself’ : The total number of citations with self-citations removed.

  • ‘Ck_noself’ : The total number of citations within the first k years of publication with self-citations removed, for each k value specified by citation_horizons.

Parameters:
  • preprocess (bool, default True, Optional) – If True then the impact measures are saved in preprocessed files.

  • citation_horizons (list, default [5,10], Optional) – The citation time horizons, in years after publication, at which the Ck counts are computed.

  • noselfcite (bool, default True, Optional) – If True then the no-self-citation pub2ref files are also processed.

Returns:

The impact DataFrame with columns ‘PublicationId’ and ‘Year’, plus the citation columns.

Return type:

DataFrame
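For illustration, the core of these indices can be sketched in plain pandas (toy data; the real method also handles year normalization and the no-self-citation variants):

```python
import pandas as pd

pub2year = {1: 2000, 2: 2002, 3: 2009}
pub2ref = pd.DataFrame({"CitingPublicationId": [2, 3, 3],
                        "CitedPublicationId": [1, 1, 2]})

imp = pub2ref.copy()
imp["CitedYear"] = imp["CitedPublicationId"].map(pub2year)
imp["CitingYear"] = imp["CitingPublicationId"].map(pub2year)

# Ctotal: all citations ever received.
ctotal = imp.groupby("CitedPublicationId").size().rename("Ctotal")

# C5: citations arriving within the first 5 years after publication.
within5 = imp[imp["CitingYear"] - imp["CitedYear"] <= 5]
c5 = within5.groupby("CitedPublicationId").size().rename("C5")
```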


property fieldinfo

The DataFrame keeping field names and levels. Columns may depend on the specific datasource.

Notes

columns: ‘FieldId’, ‘FieldLevel’, ‘NumberPublications’, ‘FieldName’


get_author_publications(author_ids, paa=None, include_pub_info=None, include_impact=False)

Produce a DataFrame with the author(s) publication information.

Parameters:
  • author_ids (int, str, or list) – An AuthorId or list of AuthorIds

  • paa (None or DataFrame, default None, optional) – A pre-loaded paa DataFrame; the default None indicates the paa DataFrame should be loaded from the database.

  • include_pub_info (bool or DataFrame, default None, optional) – If a DataFrame, it is used as the publication information. If True, the publication information is loaded from the database.

  • include_impact (bool or DataFrame, default None, optional) – If a DataFrame, it is used as the impact or reference information. If True, the impact information is loaded from the database.

Returns:

author_paa – The author(s) publication information.

Return type:

DataFrame

get_journal_publications(journal_ids, pub=None, include_pub_info=None, include_impact=False)

Produce a DataFrame with the journal(s) publication information.

Parameters:
  • journal_ids (int, str, or list) – A JournalId or list of JournalIds

  • pub (None or DataFrame, default None, optional) – A pre-loaded pub DataFrame; the default None indicates the pub DataFrame should be loaded from the database.

  • include_pub_info (bool or DataFrame, default None, optional) – If a DataFrame, it is used as the publication information. If True, the publication information is loaded from the database.

  • include_impact (bool or DataFrame, default None, optional) – If a DataFrame, it is used as the impact or reference information. If True, the impact information is loaded from the database.

Returns:

journal_pubs – The journal(s) publication information.

Return type:

DataFrame

get_publication_info(pub_ids, pub=True, columns=None)

Produce a DataFrame with the publication information.

Parameters:
  • pub_ids (int, str, or list) – A PublicationId or list of PublicationIds

  • pub (bool or DataFrame, default True, optional) – A pre-loaded pub DataFrame, or a bool indicating whether the publication DataFrame should be loaded.

  • columns (list, default None, optional) – A list of specific columns from the pub DataFrame to return.

Returns:

pub_info – The publication information.

Return type:

DataFrame

property journal

The DataFrame keeping journal information. Columns may depend on the specific datasource.

Notes

columns: ‘JournalId’, ‘FullName’, ‘Issn’, ‘Publisher’, ‘Webpage’


load_affiliations(preprocess=True, columns=None, filter_dict=None, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the Affiliation DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default None, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) – If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

Affiliation DataFrame.

Return type:

DataFrame
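The filter_dict, dropna, and duplicate_* arguments mirror plain pandas operations. A sketch of their combined effect on a toy affiliation table (not the package's internal code):

```python
import pandas as pd

aff = pd.DataFrame({"AffiliationId": [10, 10, 11, 12],
                    "FullName": ["A", "A-new", "B", None]})

# filter_dict keeps only rows whose column value appears in the given list.
subset = aff[aff["AffiliationId"].isin([10, 11])]
# dropna removes rows with missing values in the listed columns.
subset = subset.dropna(subset=["FullName"])
# duplicate_subset / duplicate_keep mirror pandas drop_duplicates.
subset = subset.drop_duplicates(subset=["AffiliationId"], keep="last")
```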


load_authors(preprocess=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, process_name=True, show_progress=True)

Load the Author DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default {}, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) – If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

  • process_name (bool, default True, Optional) – If True, then when processing the raw file, the package NameParser will be used to split author FullNames.

Returns:

Author DataFrame.

Return type:

DataFrame


load_fieldinfo(preprocess=True, columns=None, filter_dict={}, show_progress=False)

Load the Field Information DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default {}, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.


Returns:

FieldInformation DataFrame.

Return type:

DataFrame


load_impact(preprocess=True, include_yearnormed=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the precomputed impact DataFrame from a preprocessed directory.

Parameters:
  • preprocess (bool, default True) – Attempt to load from the preprocessed directory.

  • include_yearnormed (bool, default True) – Normalize all columns by yearly average.

  • columns (list, default None) – Load only this subset of columns

  • filter_dict (dict, default {}, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) – If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

Impact DataFrame.

Return type:

DataFrame


load_journals(preprocess=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the Journal DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default {}, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) – If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

Journal DataFrame.

Return type:

DataFrame


load_pub2field(preprocess=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the Pub2Field DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default {}, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) – If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

Pub2Field DataFrame.

Return type:

DataFrame


load_publicationauthoraffiliation(preprocess=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the PublicationAuthorAffiliation DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default {}, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) – If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

PublicationAuthorAffiliation DataFrame.

Return type:

DataFrame


load_publications(preprocess=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the Publication DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default {}, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) – If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

Publication DataFrame.

Return type:

DataFrame


load_references(preprocess=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', noselfcite=False, dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the Pub2Ref DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default {}, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) – If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

  • noselfcite (bool, default False, Optional) – If True, then the preprocessed pub2ref files with self-citations removed will be used.

Returns:

Pub2Ref DataFrame.

Return type:

DataFrame


property paa

The DataFrame keeping all publication-author-affiliation relationships. Columns may depend on the specific datasource.

Notes

columns: ‘PublicationId’, ‘AuthorId’, ‘AffiliationId’, ‘AuthorSequence’, ‘OrigAuthorName’, ‘OrigAffiliationName’


precompute_teamsize(save2pubdf=True, show_progress=False)

Calculate the teamsize of publications, defined as the total number of Authors on the publication.

Parameters:
  • save2pubdf (bool, default True, Optional) – If True the results are appended to the preprocessed publication DataFrames.

  • show_progress (bool, default False) – If True, display a progress bar for the count.

Returns:

TeamSize DataFrame with 2 columns: ‘PublicationId’, ‘TeamSize’

Return type:

DataFrame
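The underlying computation is a simple group-by over the paa edge list; a toy sketch counting distinct AuthorIds per publication (illustrative, not the package's internal code):

```python
import pandas as pd

paa = pd.DataFrame({"PublicationId": [1, 1, 1, 2],
                    "AuthorId": [100, 101, 102, 100]})

# Team size: the number of distinct authors on each publication.
teamsize = (paa.groupby("PublicationId")["AuthorId"]
               .nunique()
               .rename("TeamSize")
               .reset_index())
```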


property pub

The DataFrame keeping publication information. Columns may depend on the specific datasource.

Notes

columns: ‘PublicationId’, ‘Year’, ‘JournalId’, ‘FamilyId’, ‘Doi’, ‘Title’, ‘Date’, ‘Volume’, ‘Issue’, ‘DocType’


property pub2doctype

A dictionary mapping PublicationId to Document Type.

Notes

doctype mapping: ‘Journal’: ‘j’, ‘Book’: ‘b’, ‘BookChapter’: ‘bc’, ‘Conference’: ‘c’, ‘Dataset’: ‘d’, ‘Patent’: ‘p’, ‘Repository’: ‘r’


property pub2field

The DataFrame keeping all publication field relationships. Columns may depend on the specific datasource.

Notes

columns: ‘PublicationId’, ‘FieldId’


property pub2ref

The DataFrame keeping citing and cited PublicationId.

Notes

columns: ‘CitingPublicationId’, ‘CitedPublicationId’


property pub2refnoself

The DataFrame keeping citing and cited PublicationId after filtering out the self-citations.

Notes

columns: CitingPublicationId, CitedPublicationId


property pub2year

A dictionary mapping PublicationId to Year.

publicationid_list()

A list of all PublicationIds.


read_data_file(fname, key='')

Read the DataFrame from a file.

Parameters:
  • fname (str) – The filename

  • key (str) – For hdf files, the table key.

Returns:

df – The DataFrame.

Return type:

DataFrame

remove_selfcitations(preprocess=True, show_progress=False)

Process the pub2ref DataFrame and remove all citation relationships that share an Author.

Parameters:

preprocess (bool, default True, Optional) – If True then the new preprocessed DataFrames are saved as pub2refnoself

Returns:

Pub2Ref DataFrame with 2 columns: ‘CitingPublicationId’, ‘CitedPublicationId’

Return type:

DataFrame
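The self-citation test can be sketched as a pair of merges through an author table: an edge is a self-citation when the citing and cited publications share at least one AuthorId. A toy version (not the package's internal code):

```python
import pandas as pd

pub2author = pd.DataFrame({"PublicationId": [1, 2, 3],
                           "AuthorId": [100, 100, 200]})
pub2ref = pd.DataFrame({"CitingPublicationId": [2, 3],
                        "CitedPublicationId": [1, 1]})

# Attach the citing publication's authors, then keep the rows where the
# same author also appears on the cited publication.
citing = pub2ref.merge(pub2author, left_on="CitingPublicationId",
                       right_on="PublicationId")
shared = citing.merge(pub2author,
                      left_on=["CitedPublicationId", "AuthorId"],
                      right_on=["PublicationId", "AuthorId"])
selfcites = set(zip(shared["CitingPublicationId"],
                    shared["CitedPublicationId"]))

# Drop the self-citation edges from the citation table.
keep = [(c, d) not in selfcites
        for c, d in zip(pub2ref["CitingPublicationId"],
                        pub2ref["CitedPublicationId"])]
pub2ref_noself = pub2ref[keep]
```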

save_data_file(df, fname, key='')

Save the DataFrame to a file.

Parameters:
  • df (DataFrame) – A pandas DataFrame

  • fname (str) – The filename

  • key (str) – For hdf files, the table key.


set_global_filters(global_filter)

Set of filtering conditions that restrict the global set of publications loaded from the DataBase.

Allowable global filters are:

‘Year’, ‘DocType’, ‘FieldId’


set_new_data_path(dataframe_name='', new_path='')

Override path to the dataframe collection.

Parameters:
  • dataframe_name (str) – The dataframe name to override. E.g. ‘author’, ‘pub’, ‘paa’, ‘pub2field’, etc.

  • new_path (str) – The new dataframe path.

setup_dask_client(n_workers=None, threads_per_worker=None)

Initialize a Dask client

Parameters:
  • n_workers (int, default None) – The number of workers to use.

  • threads_per_worker (int, default None) – The number of threads per worker to use.

OpenAlex

For initial example, see Getting Started With OpenAlex.

class pyscisci.datasource.OpenAlex.OpenAlex(path2database='', database_extension='csv.gz', keep_in_memory=False, global_filter=None, enable_dask=False, show_progress=True)

Base class for the OpenAlex interface.

This is an extension of ‘BibDataBase’ with processing functions specific to OpenAlex. See ‘BibDataBase’ in database.py for details of the general functions.

download_from_source(aws_access_key_id='', aws_secret_access_key='', specific_update='', aws_bucket='openalex', dataframe_list='all', rewrite_existing=False, edit_works=True, show_progress=True)

Download the OpenAlex files from Amazon Web Services. The files will be saved to the path specified by path2database into RawXML.

Parameters:
  • aws_access_key_id (str) – The access key for AWS (not required)

  • aws_secret_access_key (str) – The secret access key for AWS (not required)

  • specific_update (str) – Download only a specific update date, specified in Y-M-D format, e.g. 2022-01-01. If empty the full data is downloaded.

  • dataframe_list (list) –

    The data types to download and save from OpenAlex.

    ’all’, ‘affiliations’, ‘authors’, ‘publications’, ‘references’, ‘publicationauthoraffiliation’, ‘fields’, ‘abstracts’

  • rewrite_existing (bool, default False) – If True, overwrite existing files or if False, only download any missing files.

  • edit_works (bool, default True) – If True, edit the works to remove pieces of data. If False, force keeping all entries from the works file.

  • show_progress (bool, default True) – Show progress with processing of the data.

parse_affiliations(preprocess=True, specific_update='', show_progress=True)

Parse the OpenAlex Affiliation raw data.

Parameters:
  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • specific_update (str) – Parse only a specific update date, specified in Y-M-D format, e.g. 2022-01-01. If empty the full data is parsed.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Affiliation DataFrame.

Return type:

DataFrame

parse_authors(preprocess=True, specific_update='', process_name=True, show_progress=True)

Parse the OpenAlex Author raw data.

Parameters:
  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • specific_update (str) – Parse only a specific update date, specified in Y-M-D format, e.g. 2022-01-01. If empty the full data is parsed.

  • process_name (bool, default True) –

    If True, then when processing the raw file, the package NameParser will be used to split author FullNames.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Author DataFrame.

Return type:

DataFrame

parse_concepts(preprocess=True, specific_update='', show_progress=True)

Parse the OpenAlex Concepts raw data.

Parameters:
  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • specific_update (str) – Parse only a specific update date, specified in Y-M-D format, e.g. 2022-01-01. If empty the full data is parsed.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

FieldInfo DataFrame.

Return type:

DataFrame

parse_publications(preprocess=True, specific_update='', preprocess_dicts=True, dataframe_list=['publications', 'references', 'publicationauthoraffiliation', 'fields'], show_progress=True)

Parse the OpenAlex Works raw data.

Parameters:
  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • specific_update (str) – Parse only a specific update date, specified in Y-M-D format, e.g. 2022-01-01. If empty the full data is parsed.

  • preprocess_dicts (bool, default True) – Save the processed Year and DocType data as dictionaries.

  • dataframe_list (list) –

    The data types to download and save from OpenAlex.

    ’all’, ‘publications’, ‘references’, ‘publicationauthoraffiliation’, ‘fields’, ‘abstracts’

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Publication DataFrame.

Return type:

DataFrame

parse_venues(preprocess=True, specific_update='', show_progress=True)

Parse the OpenAlex Venues raw data.

Parameters:
  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • specific_update (str) – Parse only a specific update date, specified in Y-M-D format, e.g. 2022-01-01. If empty the full data is parsed.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Venue DataFrame.

Return type:

DataFrame

preprocess(dataframe_list=None, show_progress=True)

Bulk preprocess the OpenAlex raw data.

Parameters:
  • dataframe_list (list, default None) – The list of DataFrames to preprocess. If None, all OpenAlex DataFrames are preprocessed.

  • show_progress (bool, default True) – Show progress with processing of the data.

Microsoft Academic Graph (MAG)

For initial example, see Getting Started With MAG.

class pyscisci.datasource.MAG.MAG(path2database='', database_extension='csv.gz', keep_in_memory=False, global_filter=None, enable_dask=False, show_progress=True)

Base class for Microsoft Academic Graph interface.

This is an extension of ‘BibDataBase’ with processing functions specific to the MAG. See ‘BibDataBase’ in database.py for details of non-MAG specific functions.

The MAG raw data comes structured into three folders: mag, advanced, and nlp.

parse_affiliations(preprocess=True, show_progress=True)

Parse the MAG Affiliation raw data.

Parameters:
  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Affiliation DataFrame.

Return type:

DataFrame

parse_authors(preprocess=False, process_name=True, num_file_lines=5000000, show_progress=True)

Parse the MAG Author raw data.

Parameters:
  • preprocess (bool, default False) – Save the processed data in new DataFrames.

  • process_name (bool, default True) –

    If True, then when processing the raw file, the package NameParser will be used to split author FullNames.

  • num_file_lines (int, default 5*10**6) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Author DataFrame.

Return type:

DataFrame

parse_fields(preprocess=False, num_file_lines=5000000, show_progress=True)

Parse the MAG Paper Field raw data.

Parameters:
  • preprocess (bool, default False) – Save the processed data in new DataFrames.

  • num_file_lines (int, default 5*10**6) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Pub2Field DataFrame.

Return type:

DataFrame

parse_publicationauthoraffiliation(preprocess=False, num_file_lines=5000000, show_progress=True)

Parse the MAG PublicationAuthorAffiliation raw data.

Parameters:
  • preprocess (bool, default False) – Save the processed data in new DataFrames.

  • num_file_lines (int, default 5*10**6) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

PublicationAuthorAffiliation DataFrame.

Return type:

DataFrame

parse_publications(preprocess=True, num_file_lines=2000000, preprocess_dicts=True, show_progress=True)

Parse the MAG Publication and Journal raw data.

Parameters:
  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • num_file_lines (int, default 2*10**6) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • preprocess_dicts (bool, default True) – Save the processed Year and DocType data as dictionaries.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Publication DataFrame.

Return type:

DataFrame

parse_references(preprocess=False, num_file_lines=10000000, show_progress=True)

Parse the MAG References raw data.

Parameters:
  • preprocess (bool, default False) – Save the processed data in new DataFrames.

  • num_file_lines (int, default 10**7) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Pub2Ref DataFrame.

Return type:

DataFrame
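Each of the parse_* methods above shards its output by num_file_lines: a table with N rows is written out as ceil(N / num_file_lines) smaller DataFrames. A pure-pandas sketch of that chunking logic (the helper name is illustrative, not part of the pySciSci API):

```python
import pandas as pd

def chunk_dataframe(df, num_file_lines):
    """Split df into successive chunks of at most num_file_lines rows,
    mirroring how pySciSci shards processed tables on disk."""
    return [df.iloc[start:start + num_file_lines]
            for start in range(0, df.shape[0], num_file_lines)]

# A toy table of 10 rows split with num_file_lines=4 yields chunks of 4, 4, 2 rows.
toy = pd.DataFrame({'PublicationId': range(10)})
chunks = chunk_dataframe(toy, num_file_lines=4)
```

Keeping each shard small lets a later load step skip files that a filter rules out, which is the source of the lower memory overhead described above.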

preprocess(dflist=None, combine_conferences=True, show_progress=True)

Bulk preprocess the MAG raw data.

Parameters:
  • dflist (list, default None) – The list of DataFrames to preprocess. If None, all MAG DataFrames are preprocessed.

  • combine_conferences (bool, default True) – If True, combine the conference series into the Journal DataFrame.

  • show_progress (bool, default True) – Show progress with processing of the data.
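Assuming pySciSci is installed and the raw MAG files are already on disk, the one-time bulk preprocessing step can be sketched as below. The MAG import path follows the pattern of the other datasource classes documented here, and the database path is a placeholder:

```python
def preprocess_mag(path2database):
    # Hypothetical one-time setup: requires pyscisci and the raw MAG dump
    # already downloaded under path2database.
    from pyscisci.datasource.MAG import MAG  # deferred so the sketch stays self-contained
    mymag = MAG(path2database=path2database, keep_in_memory=False)
    # Preprocess every MAG DataFrame and fold conference series into the Journal table.
    mymag.preprocess(dflist=None, combine_conferences=True)
    return mymag
```

On later sessions only the relevant load_* call (or DataFrame property) is needed; preprocess is not re-run.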

Web of Science (WoS)

For an initial example, see Getting Started With WoS.

class pyscisci.datasource.WOS.WOS(path2database='', database_extension='csv.gz', keep_in_memory=False, global_filter=None, enable_dask=False, show_progress=True)

Base class for Web of Science interface.

preprocess(xml_directory='RawXML', name_space=None, process_name=True, num_file_lines=1000000, show_progress=True)

Bulk preprocess of the Web of Science raw data.

Parameters:
  • xml_directory (str, default 'RawXML') – The subdirectory containing the raw WOS xml files.

  • name_space (str, default None) – The link to an xml namespace file. Originally ‘http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord’, but this link is now broken and Clarivate has not replaced the namespace.

  • process_name (bool, default True) –

    If True, then when processing the raw file, the package NameParser will be used to split author FullNames.

  • num_file_lines (int, default 10**6) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • show_progress (bool, default True) – Show progress with processing of the data.
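Given the raw WOS xml files in the RawXML subdirectory, bulk preprocessing can be sketched as below (the database path is a placeholder; requires pySciSci and the raw data):

```python
def preprocess_wos(path2database):
    # Hypothetical one-time setup: the raw WOS xml files must already sit
    # in the RawXML subdirectory of path2database.
    from pyscisci.datasource.WOS import WOS  # deferred so the sketch stays self-contained
    mywos = WOS(path2database=path2database, keep_in_memory=False)
    # name_space=None because the original Clarivate namespace link is broken.
    mywos.preprocess(xml_directory='RawXML', name_space=None, process_name=True)
    return mywos
```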

DBLP Computer Science Bibliography (DBLP)

For an initial example, see Getting Started With DBLP.

class pyscisci.datasource.DBLP.DBLP(path2database='', database_extension='csv.gz', keep_in_memory=False, global_filter=None, enable_dask=False, show_progress=True)

Base class for DBLP interface.

The DBLP comes as a single xml file. It can be downloaded from [DBLP](https://dblp.uni-trier.de/) via download_from_source.

There is no citation information!

property author2pub

The DataFrame keeping all publication-author relationships. Columns may depend on the specific datasource.

Notes

columns: ‘PublicationId’, ‘AuthorId’, ‘AuthorOrder’

download_from_source(source_url='https://dblp.uni-trier.de/xml/', xml_file_name='dblp.xml.gz', dtd_file_name='dblp.dtd', show_progress=True)
Download the DBLP raw xml file and the dtd formatting information from [DBLP](https://dblp.uni-trier.de/).
  1. dblp.xml.gz - the compressed xml file

  2. dblp.dtd - the dtd containing xml syntax

The files will be saved to the path specified by path2database.

Parameters:
  • source_url (str, default ‘https://dblp.uni-trier.de/xml/’) – The base url from which to download.

  • xml_file_name (str, default 'dblp.xml.gz') – The xml file name.

  • dtd_file_name (str, default 'dblp.dtd') – The dtd file name.

  • show_progress (bool, default True) – Show progress with processing of the data.
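Because DBLP is one of the two sources with direct data access, download and preprocessing can be combined into a single first-run workflow, sketched below (the database path is a placeholder; requires pySciSci, network access, and disk space for dblp.xml.gz):

```python
def build_dblp(path2database):
    # Hypothetical end-to-end first run: fetch the raw dump, then preprocess it.
    from pyscisci.datasource.DBLP import DBLP  # deferred so the sketch stays self-contained
    mydblp = DBLP(path2database=path2database, keep_in_memory=False)
    mydblp.download_from_source()          # saves dblp.xml.gz and dblp.dtd to path2database
    mydblp.preprocess(xml_file_name='dblp.xml.gz', process_name=True)
    return mydblp
```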

load_journals(preprocess=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the Journal DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default None, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) –

    If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

Journal DataFrame.

Return type:

DataFrame
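The filtering arguments shared by the load_* methods map onto standard pandas operations. A pure-pandas sketch of what filter_dict, duplicate_subset/duplicate_keep, and dropna each do to a loaded table (the toy table and its columns are illustrative):

```python
import pandas as pd

journal = pd.DataFrame({
    'JournalId': [1, 1, 2, 3],
    'FullName': ['Nature', 'Nature', 'Science', None],
})

# filter_dict={'JournalId': [1, 2]} keeps only rows whose column value is in the list.
filtered = journal[journal['JournalId'].isin([1, 2])]

# duplicate_subset=['JournalId'] with duplicate_keep='last' keeps the last of each duplicate.
deduped = journal.drop_duplicates(subset=['JournalId'], keep='last')

# dropna=['FullName'] drops rows with NaN in the listed columns.
complete = journal.dropna(subset=['FullName'])
```

Because the processed tables are sharded on disk, applying filter_dict at load time avoids ever materializing the rows that are filtered away.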


load_publicationauthor(preprocess=True, columns=None, filter_dict=None, duplicate_subset=None, duplicate_keep='last', dropna=None, show_progress=False)

Load the PublicationAuthor DataFrame from a preprocessed directory. For DBLP, you must run preprocess before the dataframe is available for use.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default None, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) –

    If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

PublicationAuthor DataFrame.

Return type:

DataFrame

preprocess(xml_file_name='dblp.xml.gz', process_name=True, num_file_lines=1000000, show_progress=True)

Bulk preprocess of the DBLP raw data.

Parameters:
  • process_name (bool, default True) –

    If True, then when processing the raw file, the package NameParser will be used to split author FullNames.

  • xml_file_name (str, default 'dblp.xml.gz') – The xml file name.

  • num_file_lines (int, default 10**6) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • show_progress (bool, default True) – Show progress with processing of the data.

American Physical Society (APS)

For an initial example, see Getting Started With APS.

class pyscisci.datasource.APS.APS(path2database='', database_extension='csv.gz', keep_in_memory=False, global_filter=None, enable_dask=False, show_progress=True)

Base class for APS interface.

The APS comes as a single xml file.

You must request access to the data through their website: https://journals.aps.org/datasets

load_journals(preprocess=True, columns=None, filter_dict={}, duplicate_subset=None, duplicate_keep='last', dropna=None, prefunc2apply=None, postfunc2apply=None, show_progress=False)

Load the Journal DataFrame from a preprocessed directory, or parse from the raw files.

Parameters:
  • preprocess (bool, default True, Optional) – Attempt to load from the preprocessed directory.

  • columns (list, default None, Optional) – Load only this subset of columns

  • filter_dict (dict, default None, Optional) – Dictionary of format {“ColumnName”:”ListofValues”} where “ColumnName” is a data column and “ListofValues” is a sorted list of valid values. A DataFrame only containing rows that appear in “ListofValues” will be returned.

  • duplicate_subset (list, default None, Optional) – Drop any duplicate entries as specified by this subset of columns

  • duplicate_keep (str, default 'last', Optional) –

    If duplicates are being dropped, keep the ‘first’ or ‘last’ (see pandas.DataFrame.drop_duplicates)

  • dropna (list, default None, Optional) – Drop any NaN entries as specified by this subset of columns

Returns:

Journal DataFrame.

Return type:

DataFrame


preprocess(archive_year=2019, pubid2int=False, metadata_archive=None, citation_archive=None, show_progress=True)

Bulk preprocess the APS raw data.
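Once access has been granted and the archives are on disk, APS preprocessing can be sketched as below (the database path is a placeholder; archive_year=2019 is the documented default, and the archive keyword arguments are left at their defaults here):

```python
def preprocess_aps(path2database):
    # Hypothetical one-time setup: requires pyscisci and the APS data archives,
    # which must be requested via https://journals.aps.org/datasets.
    from pyscisci.datasource.APS import APS  # deferred so the sketch stays self-contained
    myaps = APS(path2database=path2database, keep_in_memory=False)
    myaps.preprocess(archive_year=2019)
    return myaps
```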

PubMed

For an initial example, see Getting Started With PubMed.

class pyscisci.datasource.PubMed.PubMed(path2database='', database_extension='csv.gz', keep_in_memory=False, global_filter=None, enable_dask=False, show_progress=True)

Base class for PubMed Medline interface.

Notes

  • PubMed comes as >1000 compressed XML files.

  • The PMID is renamed PublicationId to be consistent with the rest of pySciSci.

  • PubMed does not disambiguate authors.

download_from_source(source_url='ftp.ncbi.nlm.nih.gov', dtd_url='https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd', rewrite_existing=False, show_progress=True)
Download the PubMed raw xml files and the dtd formatting information from [PubMed](https://www.nlm.nih.gov/databases/download/pubmed_medline.html).
  1. pubmed/baseline - the directory containing the baseline compressed xml files

  2. pubmed_190101.dtd - the dtd containing xml syntax

The files will be saved under the RawXML subdirectory of the path specified by path2database.

Parameters:
  • source_url (str, default 'ftp.ncbi.nlm.nih.gov') – The base url for the ftp server from which to download.

  • dtd_url (str, default 'https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd') – The url for the dtd file.

  • rewrite_existing (bool, default False) – If True, overwrite existing files or if False, only download any missing files.

  • show_progress (bool, default True) – Show progress with processing of the data.
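As with DBLP, PubMed offers direct data access, so the first-run download can be sketched as below (the database path is a placeholder; requires pySciSci, a connection to the NCBI ftp server, and substantial disk space for the >1000 compressed XML files):

```python
def download_pubmed(path2database):
    # Hypothetical fetch of the PubMed baseline; files land under
    # path2database/RawXML, and existing files are kept (not re-downloaded).
    from pyscisci.datasource.PubMed import PubMed  # deferred so the sketch stays self-contained
    mypubmed = PubMed(path2database=path2database, keep_in_memory=False)
    mypubmed.download_from_source(rewrite_existing=False)
    return mypubmed
```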

parse_fields(preprocess=True, num_file_lines=10000000, rewrite_existing=False, xml_directory='RawXML')

Parse the PubMed field (mesh term) raw data.

Parameters:
  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • num_file_lines (int, default 10**7) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • rewrite_existing (bool, default False) – If True, overwrite existing processed files.

  • xml_directory (str, default 'RawXML') – The subdirectory containing the raw PubMed xml files.

Returns:

Publication-Term ID DataFrame and Term ID - Term DataFrame

Return type:

DataFrame

parse_publicationauthoraffiliation(xml_directory='RawXML', preprocess=True, num_file_lines=10000000, rewrite_existing=False)

Parse the PubMed publication-author raw data.

Parameters:
  • xml_directory (str, default 'RawXML') – The subdirectory containing the raw PubMed xml files.

  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • num_file_lines (int, default 10**7) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • rewrite_existing (bool, default False) – If True, overwrite existing processed files.

Returns:

Publication-Author DataFrame.

Return type:

DataFrame

parse_publications(xml_directory='RawXML', preprocess=True, num_file_lines=10000000, rewrite_existing=False)

Parse the PubMed publication raw data.

Parameters:
  • xml_directory (str, default 'RawXML') – The subdirectory containing the raw PubMed xml files.

  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • num_file_lines (int, default 10**7) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • rewrite_existing (bool, default False) – If True, overwrite existing processed files.

Returns:

Publication metadata DataFrame.

Return type:

DataFrame

parse_references(xml_directory='RawXML', preprocess=True, num_file_lines=10000000, rewrite_existing=False, show_progress=True)

Parse the PubMed References raw data.

Parameters:
  • xml_directory (str, default 'RawXML') – The subdirectory containing the raw PubMed xml files.

  • preprocess (bool, default True) – Save the processed data in new DataFrames.

  • num_file_lines (int, default 10**7) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • rewrite_existing (bool, default False) – If True, overwrite existing processed files.

  • show_progress (bool, default True) – Show progress with processing of the data.

Returns:

Citations DataFrame.

Return type:

DataFrame

preprocess(xml_directory='RawXML', process_name=True, num_file_lines=1000000, show_progress=True, rewrite_existing=False)

Bulk preprocess of the PubMed raw data.

Parameters:
  • process_name (bool, default True) –

    If True, then when processing the raw file, the package NameParser will be used to split author FullNames.

  • num_file_lines (int, default 10**6) – The processed data will be saved into smaller DataFrames, each with num_file_lines rows.

  • show_progress (bool, default True) – Show progress with processing of the data.

  • rewrite_existing (bool, default False) – If True, rewrite existing files in the data directory.
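After download_from_source has populated RawXML, bulk preprocessing and a first look at the publication table can be sketched as below (the database path is a placeholder; the pub property is the standard BibDataBase accessor described in the DataFrames overview):

```python
def preprocess_pubmed(path2database):
    # Hypothetical one-time setup after the raw xml files are in RawXML.
    from pyscisci.datasource.PubMed import PubMed  # deferred so the sketch stays self-contained
    mypubmed = PubMed(path2database=path2database, keep_in_memory=False)
    mypubmed.preprocess(xml_directory='RawXML', process_name=True)
    return mypubmed.pub  # the Publication DataFrame, loaded from the processed files
```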

Custom DB

For an initial example, see Getting Started With Custom DB.

class pyscisci.datasource.CustomDB.CustomDB(path2database='', database_extension='csv.gz', keep_in_memory=False, global_filter=None, enable_dask=False, show_progress=True)

Base class for creating a CustomDB.

set_new_data_paths(new_path_dict)

Override path to the dataframe collections based on a new custom hierarchy.

Parameters:

new_path_dict (dict) – A dictionary where each key is the name of a DataFrame to override (e.g. ‘author’, ‘pub’, ‘paa’, ‘pub2field’) and each value is the new path for that DataFrame.
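A CustomDB layout override can be sketched as below; the database path, the dictionary keys, and the directory names are illustrative placeholders following the docstring above:

```python
def configure_customdb(path2database):
    # Hypothetical CustomDB setup pointing two DataFrame collections at a
    # non-default directory hierarchy.
    from pyscisci.datasource.CustomDB import CustomDB  # deferred so the sketch stays self-contained
    mydb = CustomDB(path2database=path2database, keep_in_memory=False)
    mydb.set_new_data_paths({
        'author': 'my_author_tables',   # assumed keys, per the docstring above
        'pub': 'my_publication_tables',
    })
    return mydb
```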