scitex_scholar.core

class scitex_scholar.core.Paper(*args, **kwargs)[source]

Bases: BaseModel

Complete paper with metadata and container.

model_dump(**kwargs)[source]

Custom serialization to ensure all nested models use aliases.

Return type:

Dict[str, Any]

classmethod from_dict(data)[source]

Create from dictionary (for loading from JSON).

Uses Pydantic’s model_validate, which handles:

  • Type validation

  • Type coercion (e.g., “2024” -> 2024)

  • Field aliases (e.g., “2025” -> y2025)

Return type:

Paper

to_dict()[source]

Convert to dictionary for JSON serialization.

Alias for model_dump() for backward compatibility.

Return type:

Dict[str, Any]

detect_open_access(use_unpaywall=False, update_metadata=True)[source]

Detect open access status for this paper.

Uses identifiers (DOI, arXiv ID, PMCID) and known OA sources to determine if the paper is freely available.

Parameters:
  • use_unpaywall (bool) – If True, query Unpaywall API for uncertain cases

  • update_metadata (bool) – If True, update self.metadata.access with results

Return type:

OAResult

Returns:

OAResult with detection results

property is_open_access: bool

Check if paper is open access (quick check without API calls).

class scitex_scholar.core.Papers(papers=None, project=None, config=None)[source]

Bases: object

A simple collection of Paper objects.

This is a minimal collection class. Most business logic (loading, saving, enrichment, etc.) is handled by Scholar.

Methods have been reduced from 39 to ~15 for simplicity. Complex operations should use Scholar or utility functions.

__init__(papers=None, project=None, config=None)[source]

Initialize Papers collection.

__len__()[source]

Number of papers in collection.

Return type:

int

__iter__()[source]

Iterate over papers.

Return type:

Iterator[Paper]

__getitem__(index)[source]

Get paper(s) by index or slice.

Parameters:

index (Union[int, slice]) – Integer index or slice

Return type:

Union[Paper, Papers]

Returns:

Single Paper if integer index, Papers collection if slice
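The integer-versus-slice behaviour can be sketched with a stand-in class. `MiniPapers` below is hypothetical and only mirrors the documented return types (plain strings stand in for Paper objects); it is not the library's implementation.

```python
from typing import Iterator, List, Optional, Union


class MiniPapers:
    """Hypothetical stand-in for Papers; items are plain strings here."""

    def __init__(self, papers: Optional[List[str]] = None):
        self._papers = list(papers or [])

    def __len__(self) -> int:
        return len(self._papers)

    def __iter__(self) -> Iterator[str]:
        return iter(self._papers)

    def __getitem__(self, index: Union[int, slice]):
        # A slice yields a new collection; an integer yields a single item.
        if isinstance(index, slice):
            return MiniPapers(self._papers[index])
        return self._papers[index]


papers = MiniPapers(["paper-a", "paper-b", "paper-c"])
first = papers[0]   # single item
head = papers[:2]   # new MiniPapers holding the first two items
```

Returning a new collection from a slice keeps collection methods available on the result, which is what makes slicing and chaining compose.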

__repr__()[source]

String representation.

Return type:

str

__str__()[source]

Human-readable string.

Return type:

str

__dir__()[source]

Custom dir for better discoverability.

Return type:

List[str]

property papers: List[Paper]

Get the underlying papers list.

append(paper)[source]

Add a paper to the collection.

Parameters:

paper (Paper) – Paper to add

Return type:

None

extend(papers)[source]

Add multiple papers to the collection.

Parameters:

papers (Union[List[Paper], Papers]) – List of papers or another Papers collection

Return type:

None

to_list()[source]

Get papers as a list.

Return type:

List[Paper]

Returns:

List of Paper objects

filter(condition=None, year_min=None, year_max=None, has_doi=None, has_abstract=None, has_pdf=None, min_citations=None, max_citations=None, min_impact_factor=None, max_impact_factor=None, journal=None, author=None, keyword=None, publisher=None, **kwargs)[source]

Filter papers by condition or criteria.

Parameters:
  • condition (Optional[Callable[[Paper], bool]]) – Function that takes a Paper and returns bool.

  • year_min (Optional[int]) – Minimum year.

  • year_max (Optional[int]) – Maximum year.

  • has_doi (Optional[bool]) – Filter papers with/without DOI.

  • has_abstract (Optional[bool]) – Filter papers with/without abstract.

  • has_pdf (Optional[bool]) – Filter papers with/without PDF URL.

  • min_citations (Optional[int]) – Minimum citation count.

  • max_citations (Optional[int]) – Maximum citation count.

  • min_impact_factor (Optional[float]) – Minimum journal impact factor.

  • max_impact_factor (Optional[float]) – Maximum journal impact factor.

  • journal (Optional[str]) – Journal name (partial match).

  • author (Optional[str]) – Author name (partial match).

  • keyword (Optional[str]) – Keyword (searches in keywords, title, abstract).

  • publisher (Optional[str]) – Publisher name (partial match).

  • **kwargs – Additional keyword arguments for backward compatibility.

Returns:

New Papers collection with filtered papers.

Return type:

Papers

Examples

Filter using a lambda condition:

high_impact = papers.filter(lambda p: p.journal_impact_factor and p.journal_impact_factor > 10)
highly_cited = papers.filter(lambda p: p.citation_count and p.citation_count > 500)
recent = papers.filter(lambda p: p.year and p.year >= 2020)

Filter using built-in parameters:

high_impact_v2 = papers.filter(min_impact_factor=10.0)
highly_cited_v2 = papers.filter(min_citations=500)
recent_v2 = papers.filter(year_min=2020)

Combine multiple parameters:

filtered = papers.filter(
    min_impact_factor=5.0,
    min_citations=100,
    year_min=2015,
    year_max=2023,
    journal="Nature",
    has_doi=True,
)

Chain filters for AND logic:

elite_recent = papers.filter(min_impact_factor=10).filter(year_min=2020)

sort_by(*criteria, reverse=False, **kwargs)[source]

Sort papers by criteria.

Parameters:
  • *criteria – Field names (as strings) or lambda functions to sort by.

  • reverse (bool) – Sort in descending order (default: False).

  • **kwargs – Additional options.

Returns:

New sorted Papers collection.

Return type:

Papers

Notes

Available Paper fields for sorting:

  • title – Paper title

  • year – Publication year

  • citation_count – Number of citations

  • journal_impact_factor – Journal impact factor

  • journal – Journal name

  • publisher – Publisher name

  • doi – Digital Object Identifier

  • created_at – When record was created

  • updated_at – When record was last updated

Examples

Sort by a single field:

by_year = papers.sort_by('year')
by_citations_desc = papers.sort_by('citation_count', reverse=True)

Sort by multiple fields (primary, secondary, etc.):

by_year_then_citations = papers.sort_by('year', 'citation_count')

Sort using a lambda function:

by_citations = papers.sort_by(lambda p: p.citation_count or 0, reverse=True)
by_year_safe = papers.sort_by(lambda p: p.year if p.year else 9999)

Sort by a computed value:

# max(..., 1) avoids division by zero for papers published in 2024
by_citation_per_year = papers.sort_by(
    lambda p: (p.citation_count or 0) / max(2024 - p.year, 1) if p.year else 0,
    reverse=True,
)

classmethod from_bibtex(bibtex_input)[source]

Load papers from BibTeX.

DEPRECATED: Use Scholar.from_bibtex() instead. This method is kept for backward compatibility.

Parameters:

bibtex_input (Union[str, Path]) – Path to BibTeX file or BibTeX string

Return type:

Papers

Returns:

Papers collection

classmethod _from_bibtex_file(file_path)[source]

Load papers from BibTeX file.

Parameters:

file_path (Union[str, Path]) – Path to BibTeX file

Return type:

Papers

Returns:

Papers collection

classmethod _from_bibtex_text(bibtex_content)[source]

Load papers from BibTeX text.

Parameters:

bibtex_content (str) – BibTeX content as string

Return type:

Papers

Returns:

Papers collection

static _bibtex_entry_to_paper(entry)[source]

Convert BibTeX entry to Paper object.

Parameters:

entry (Dict[str, Any]) – BibTeX entry dictionary

Return type:

Paper

Returns:

Paper object

save(output_path, format='auto', **kwargs)[source]

Save papers to file.

DEPRECATED: Use Scholar.save_papers() or Scholar.export_bibtex() instead. This method is kept for backward compatibility.

Parameters:
  • output_path (Union[str, Path]) – Path to save file

  • format (Optional[str]) – Output format (auto, bibtex, json, csv)

  • **kwargs – Additional options

Return type:

None

to_dict()[source]

Convert to dictionary.

DEPRECATED: Use papers_utils.papers_to_dict() for new code.

Return type:

List[Dict[str, Any]]

Returns:

Dictionary representation

to_dataframe()[source]

Convert to pandas DataFrame.

DEPRECATED: Use papers_utils.papers_to_dataframe() for new code.

Return type:

Any

Returns:

DataFrame with papers data

summary()[source]

Get summary statistics.

DEPRECATED: Use papers_utils.papers_statistics() for new code.

Return type:

Dict[str, Any]

Returns:

Dictionary with statistics

class scitex_scholar.core.Scholar(config=None, project=None, project_description=None, browser_mode=None)[source]

Bases: EnricherMixin, URLFindingMixin, PDFDownloadMixin, LoaderMixin, SearchMixin, SaverMixin, ProjectHandlerMixin, LibraryHandlerMixin, PipelineMixin, ServiceMixin

Main interface for SciTeX Scholar - scientific literature management made simple.

By default, papers are automatically enriched with:

  • Journal impact factors from impact_factor package (2024 JCR data)

  • Citation counts from Semantic Scholar (via DOI/title matching)

Examples

Basic search with automatic enrichment:

scholar = Scholar()
papers = scholar.search("deep learning neuroscience")
# Papers now have impact_factor and citation_count populated
papers.save("my_pac.bib")

Disable automatic enrichment if needed:

config = ScholarConfig(enable_auto_enrich=False)
scholar = Scholar(config=config)

Search a specific source:

papers = scholar.search("transformer models", sources='arxiv')

Advanced workflow:

papers = (
    scholar.search("transformer models", year_min=2020)
           .filter(min_citations=50)
           .sort_by("journal_impact_factor")
)
papers.save("transformers.bib")  # save() returns None, so call it separately

Local library:

scholar._index_local_pdfs("./my_papers")
local_papers = scholar.search_local("attention mechanism")

property name

Class name for logging.

__init__(config=None, project=None, project_description=None, browser_mode=None)[source]

Initialize Scholar with configuration.

Parameters:
  • config (Union[ScholarConfig, str, Path, None]) –

    One of:

    • ScholarConfig instance

    • Path to YAML config file (str or Path)

    • None (uses ScholarConfig.load() to find config)

  • project (Optional[str]) – Default project name for operations.

  • project_description (Optional[str]) – Optional description for the project.

  • browser_mode (Optional[str]) – Browser mode ('stealth', 'interactive', 'manual').

class scitex_scholar.core.OAStatus(value)[source]

Bases: Enum

Open Access status categories (aligned with Unpaywall).

GOLD = 'gold'
GREEN = 'green'
HYBRID = 'hybrid'
BRONZE = 'bronze'
CLOSED = 'closed'
UNKNOWN = 'unknown'

class scitex_scholar.core.OAResult(is_open_access, status, oa_url=None, source=None, license=None, confidence=1.0)[source]

Bases: object

Result of open access detection.

is_open_access: bool
status: OAStatus
oa_url: str | None = None
source: str | None = None
license: str | None = None
confidence: float = 1.0
__init__(is_open_access, status, oa_url=None, source=None, license=None, confidence=1.0)
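
A short sketch of how these two types fit together. The definitions below are local stand-ins that copy the documented members and defaults. They assume OAResult behaves like a dataclass, which the field list suggests but the source does not state. Real code would import both names from scitex_scholar.core instead.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OAStatus(Enum):
    # Stand-in copy of the documented members.
    GOLD = "gold"
    GREEN = "green"
    HYBRID = "hybrid"
    BRONZE = "bronze"
    CLOSED = "closed"
    UNKNOWN = "unknown"


@dataclass
class OAResult:
    # Stand-in copy of the documented fields and defaults.
    is_open_access: bool
    status: OAStatus
    oa_url: Optional[str] = None
    source: Optional[str] = None
    license: Optional[str] = None
    confidence: float = 1.0


# Enum members can be looked up from the raw string value,
# e.g. a status string returned by an external service.
status = OAStatus("green")

result = OAResult(is_open_access=True, status=status, source="arxiv")
```

Only is_open_access and status are required; the remaining fields fall back to their defaults.
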
scitex_scholar.core.detect_oa_from_identifiers(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None)[source]

Detect open access status from paper identifiers without API calls.

This is fast but may miss some OA papers (e.g., hybrid articles). For comprehensive detection, use check_oa_status_async() with Unpaywall.

Return type:

OAResult

Returns:

OAResult with detection results

scitex_scholar.core.check_oa_status(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None, use_unpaywall=False)[source]

Synchronous wrapper for OA detection.

By default only uses local detection (no API calls). Set use_unpaywall=True to use Unpaywall API (requires event loop).

Return type:

OAResult
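One common way to build a synchronous wrapper of this kind is to answer from local rules when possible and to run the coroutine with asyncio.run only when a remote lookup is requested. The sketch below uses hypothetical stand-ins (`_local_check`, `_remote_check`, `check_sync`) that return plain strings; it does not reproduce the library's detection logic.

```python
import asyncio
from typing import Optional


def _local_check(doi: Optional[str], arxiv_id: Optional[str]) -> str:
    # Hypothetical local rule: an arXiv ID is treated as open access.
    return "green" if arxiv_id else "unknown"


async def _remote_check(doi: Optional[str]) -> str:
    # Placeholder for a network call; always answers "closed" here.
    await asyncio.sleep(0)
    return "closed"


def check_sync(doi=None, arxiv_id=None, use_remote=False) -> str:
    status = _local_check(doi, arxiv_id)
    if status != "unknown" or not use_remote:
        return status  # local detection only, no event loop needed
    # asyncio.run raises RuntimeError if a loop is already running
    # in this thread, so this path suits plain scripts.
    return asyncio.run(_remote_check(doi))


local_only = check_sync(doi="10.1000/example")
with_remote = check_sync(doi="10.1000/example", use_remote=True)
from_arxiv = check_sync(arxiv_id="2101.00001")
```

Inside an already running event loop (for example a notebook), awaiting the async variant directly is usually the safer choice.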

async scitex_scholar.core.check_oa_status_async(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None, use_unpaywall=True, unpaywall_email=None)[source]

Comprehensive open access detection.

First tries fast local detection, then falls back to Unpaywall API if the status is uncertain.

Parameters:
  • doi (Optional[str]) – Paper DOI

  • arxiv_id (Optional[str]) – arXiv identifier

  • pmcid (Optional[str]) – PubMed Central ID

  • source (Optional[str]) – Source database

  • journal (Optional[str]) – Journal name

  • is_open_access_flag (Optional[bool]) – Pre-existing OA flag

  • use_unpaywall (bool) – Whether to query Unpaywall for uncertain cases

  • unpaywall_email (str) – Email for Unpaywall API

Return type:

OAResult

Returns:

OAResult with best available OA information

scitex_scholar.core.is_open_access_source(source)[source]

Check if source is a known open access repository.

Sources are loaded from config/default.yaml → OPENACCESS_SOURCES

Return type:

bool

scitex_scholar.core.is_open_access_journal(journal_name, use_cache=True)[source]

Check if journal is a known open access journal.

Uses a three-tier lookup:

  1. Fast check against config/default.yaml → OPENACCESS_JOURNALS (pattern matching)

  2. Comprehensive check against cached OpenAlex OA sources (exact match, 62K+ journals)

  3. Journal normalizer check (handles abbreviations, variants, historical names)

Parameters:
  • journal_name (str) – Journal name to check

  • use_cache (bool) – Whether to use OpenAlex cache (default True)

Return type:

bool

Returns:

True if journal is known to be Open Access
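The tiered lookup can be pictured as a chain of checks that stops at the first positive answer. In this sketch all three tiers are hypothetical in-memory stand-ins (`CONFIG_PATTERNS`, `CACHED_NAMES`, `ALIASES`); the real tiers read from config/default.yaml, the OpenAlex cache, and the journal normalizer. Skipping tiers 2 and 3 when use_cache is False is an assumption of the sketch.

```python
CONFIG_PATTERNS = ["plos", "frontiers in"]       # tier 1: substring patterns
CACHED_NAMES = {"elife", "scientific reports"}   # tier 2: exact names
ALIASES = {"sci rep": "scientific reports"}      # tier 3: variant -> canonical


def is_oa_journal_sketch(journal_name: str, use_cache: bool = True) -> bool:
    name = journal_name.strip().lower()
    # Tier 1: cheap pattern match against a short configured list.
    if any(pattern in name for pattern in CONFIG_PATTERNS):
        return True
    if not use_cache:
        # Assumption: without the cache, only tier 1 is consulted.
        return False
    # Tier 2: exact match against the larger cached set.
    if name in CACHED_NAMES:
        return True
    # Tier 3: resolve abbreviations or variants, then re-check.
    canonical = ALIASES.get(name)
    return canonical is not None and canonical in CACHED_NAMES
```

Ordering the tiers from cheapest to most expensive means most lookups finish without touching the larger data sets.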

scitex_scholar.core.is_arxiv_id(identifier)[source]

Check if identifier looks like an arXiv ID.

Return type:

bool
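The source does not show the pattern used. As an illustration only, the sketch below matches the two public arXiv identifier schemes (new style `YYMM.NNNNN` and old style `archive/YYMMNNN`); `looks_like_arxiv_id` is a hypothetical helper, not the library function.

```python
import re

# New-style IDs: 4 digits, a dot, 4 or 5 digits, optional version suffix.
_NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style IDs: archive name (optionally with a subject class),
# a slash, 7 digits, optional version suffix.
_OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def looks_like_arxiv_id(identifier: str) -> bool:
    text = identifier.strip()
    # Accept an optional "arXiv:" prefix, case-insensitively.
    if text.lower().startswith("arxiv:"):
        text = text[len("arxiv:"):]
    return bool(_NEW_STYLE.match(text) or _OLD_STYLE.match(text))
```

A check like this only tests the shape of the string; it does not confirm that the identifier exists on arXiv.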

class scitex_scholar.core.OASourcesCache(cache_dir=None)[source]

Bases: object

Manages cached Open Access sources from OpenAlex.

Features:

  • Lazy loading on first access

  • 1-day TTL with automatic refresh

  • Thread-safe singleton pattern

  • Fallback to config YAML if API fails

  • Journal name normalization via ISSN-L

  • Handles abbreviations, variants, and historical names

__init__(cache_dir=None)[source]

classmethod get_instance(cache_dir=None)[source]

Get singleton instance.

Return type:

OASourcesCache
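A thread-safe singleton accessor is often written with a class-level lock and a double check. The sketch below shows that general pattern on a hypothetical `CacheSketch` class; whether OASourcesCache uses exactly this approach is not stated in the source.

```python
import threading
from typing import Optional


class CacheSketch:
    """Hypothetical stand-in showing a lock-guarded singleton accessor."""

    _instance: Optional["CacheSketch"] = None
    _lock = threading.Lock()

    def __init__(self, cache_dir: Optional[str] = None):
        self.cache_dir = cache_dir

    @classmethod
    def get_instance(cls, cache_dir: Optional[str] = None) -> "CacheSketch":
        # Double-checked locking: skip the lock once the instance exists.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(cache_dir)
        return cls._instance


a = CacheSketch.get_instance("/tmp/oa-cache")
b = CacheSketch.get_instance()  # in this sketch, cache_dir is ignored after creation
```

In this sketch the first caller decides the cache directory; later arguments have no effect.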

_is_cache_valid()[source]

Check if cache exists and is within TTL.

Return type:

bool
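A TTL check of this kind usually compares the age of the cache file with a fixed limit. The sketch below assumes the age is taken from the file's modification time; the library may instead store a timestamp inside the cache file. `cache_is_valid` and `TTL_SECONDS` are hypothetical names.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional

TTL_SECONDS = 24 * 60 * 60  # the 1-day TTL mentioned in the class description


def cache_is_valid(cache_file: Path, now: Optional[float] = None) -> bool:
    # A missing file means there is nothing to reuse.
    if not cache_file.exists():
        return False
    now = time.time() if now is None else now
    age = now - cache_file.stat().st_mtime
    return age < TTL_SECONDS


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "oa_sources.json"
    missing = cache_is_valid(path)
    path.write_text("{}")
    fresh = cache_is_valid(path)
    # Pretend two days have passed since the file was written.
    stale = cache_is_valid(path, now=path.stat().st_mtime + 2 * TTL_SECONDS)
```

Passing `now` explicitly keeps the check easy to test without waiting for real time to pass.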

_load_from_cache()[source]

Load cached data from file.

Return type:

bool

_save_to_cache()[source]

Save current data to cache file.

Return type:

None

async _fetch_oa_sources_async(max_pages=100)[source]

Fetch OA sources from OpenAlex API.

Parameters:

max_pages (int) – Maximum pages to fetch (200 sources per page)

Return type:

None

_fetch_oa_sources_sync(max_pages=100)[source]

Synchronous wrapper for fetching OA sources.

Return type:

None

ensure_loaded(force_refresh=False)[source]

Ensure cache is loaded, fetching from API if needed.

Parameters:

force_refresh (bool) – Force refresh even if cache is valid

Return type:

None

is_oa_source(source_name)[source]

Check if a source/journal name is in the OA list.

Parameters:

source_name (str) – Journal or source name to check

Return type:

bool

Returns:

True if source is known to be Open Access

is_oa_issn(issn)[source]

Check if an ISSN belongs to an OA journal.

Parameters:

issn (str) – ISSN to check

Return type:

bool

Returns:

True if ISSN belongs to an OA journal

property source_count: int

Get number of cached OA sources.

property cache_age_hours: float

Get cache age in hours.

scitex_scholar.core.get_oa_cache(cache_dir=None)[source]

Get the OA sources cache singleton.

Return type:

OASourcesCache

scitex_scholar.core.is_oa_journal_cached(journal_name)[source]

Check if journal is OA using cached OpenAlex data.

Return type:

bool

scitex_scholar.core.refresh_oa_cache()[source]

Force refresh the OA sources cache.

Return type:

None

class scitex_scholar.core.JournalNormalizer(cache_dir=None)[source]

Bases: object

Journal name normalizer using ISSN-L as unique identifier.

Handles:

  • Full names ↔ abbreviations

  • Name variants (spelling, punctuation, capitalization)

  • Historical/former names

  • Publisher variations

Data is cached locally with daily refresh from OpenAlex.

__init__(cache_dir=None)[source]

classmethod get_instance(cache_dir=None)[source]

Get singleton instance.

Return type:

JournalNormalizer

_is_cache_valid()[source]

Check if cache exists and is within TTL.

Return type:

bool

_load_from_cache()[source]

Load cached data from file.

Return type:

bool

_save_to_cache()[source]

Save current data to cache file.

Return type:

None

_add_journal(source_data)[source]

Add a journal to the normalizer from OpenAlex source data.

Parameters:

source_data (Dict[str, Any]) – OpenAlex source object with display_name, issn_l, etc.

Return type:

None

async _fetch_journals_async(max_pages=500, filter_oa_only=False)[source]

Fetch journal data from OpenAlex API.

Parameters:
  • max_pages (int) – Maximum pages to fetch (200 per page)

  • filter_oa_only (bool) – If True, only fetch OA journals

Return type:

None

_fetch_journals_sync(max_pages=500, filter_oa_only=False)[source]

Synchronous wrapper for fetching journals (handles nested event loops).

Return type:

None

ensure_loaded(force_refresh=False, max_pages=500)[source]

Ensure cache is loaded, fetching from API if needed.

Parameters:
  • force_refresh (bool) – Force refresh even if cache is valid

  • max_pages (int) – Max pages to fetch if refreshing

Return type:

None

get_issn_l(journal_name)[source]

Get ISSN-L for a journal name.

Parameters:

journal_name (str) – Any journal name variant, abbreviation, or ISSN

Return type:

Optional[str]

Returns:

ISSN-L if found, None otherwise

normalize(journal_name)[source]

Normalize journal name to canonical form.

Parameters:

journal_name (str) – Any journal name variant

Return type:

Optional[str]

Returns:

Canonical journal name, or original if not found

get_abbreviation(journal_name)[source]

Get abbreviated title for a journal.

Parameters:

journal_name (str) – Any journal name variant

Return type:

Optional[str]

Returns:

Abbreviated title if available

get_journal_info(journal_name)[source]

Get full journal metadata.

Parameters:

journal_name (str) – Any journal name variant

Return type:

Optional[Dict[str, Any]]

Returns:

Dict with canonical_name, abbreviated_title, alternate_titles, issns, is_oa, publisher

is_same_journal(name1, name2)[source]

Check if two names refer to the same journal.

Parameters:
  • name1 (str) – First journal name

  • name2 (str) – Second journal name

Return type:

bool

Returns:

True if both names resolve to the same ISSN-L
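Comparing journals through a shared identifier can be sketched as follows. The lookup table, the placeholder ISSN-L values, and the helper names (`_key`, `issn_l_for`, `same_journal`) are all hypothetical; the real normalizer builds its data from OpenAlex.

```python
import re
from typing import Dict, Optional

# Hypothetical lookup table: normalized name -> ISSN-L.
# The ISSN-L values are placeholders, not real identifiers.
NAME_TO_ISSN_L: Dict[str, str] = {
    "journal of examples": "0000-0001",
    "j examples": "0000-0001",
    "annals of samples": "0000-0002",
}


def _key(name: str) -> str:
    # Lowercase, drop punctuation, collapse whitespace.
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def issn_l_for(name: str) -> Optional[str]:
    return NAME_TO_ISSN_L.get(_key(name))


def same_journal(name1: str, name2: str) -> bool:
    first, second = issn_l_for(name1), issn_l_for(name2)
    # Sketch decision: unknown names never compare equal, even to themselves.
    return first is not None and first == second
```

Resolving both names to one identifier before comparing avoids pairwise string matching between every known variant.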

is_open_access(journal_name)[source]

Check if journal is Open Access.

Parameters:

journal_name (str) – Any journal name variant

Return type:

bool

Returns:

True if journal is OA

search(query, limit=10)[source]

Search for journals by name (prefix/substring match).

Parameters:
  • query (str) – Search query

  • limit (int) – Maximum results

Return type:

List[Dict[str, Any]]

Returns:

List of matching journal info dicts
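A prefix/substring search with a result limit might look like the sketch below. The in-memory `JOURNALS` list and the choice to rank prefix matches ahead of substring matches are assumptions made for illustration; the source does not describe the ordering.

```python
from typing import Dict, List

# Hypothetical journal records; only canonical_name is used here.
JOURNALS: List[Dict[str, str]] = [
    {"canonical_name": "Annals of Samples"},
    {"canonical_name": "Journal of Examples"},
    {"canonical_name": "Sample Letters"},
]


def search_sketch(query: str, limit: int = 10) -> List[Dict[str, str]]:
    q = query.strip().lower()
    if not q:
        return []
    prefix, substring = [], []
    for journal in JOURNALS:
        name = journal["canonical_name"].lower()
        if name.startswith(q):
            prefix.append(journal)      # strongest match: name begins with query
        elif q in name:
            substring.append(journal)   # weaker match: query appears later
    return (prefix + substring)[:limit]


names = [j["canonical_name"] for j in search_sketch("sample")]
top_one = [j["canonical_name"] for j in search_sketch("sample", limit=1)]
```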

property journal_count: int

Get number of cached journals.

property cache_age_hours: float

Get cache age in hours.

scitex_scholar.core.get_journal_normalizer(cache_dir=None)[source]

Get the journal normalizer singleton.

Return type:

JournalNormalizer

scitex_scholar.core.normalize_journal_name(name)[source]

Normalize journal name to canonical form.

Return type:

Optional[str]

scitex_scholar.core.get_journal_issn_l(name)[source]

Get ISSN-L for a journal name.

Return type:

Optional[str]

scitex_scholar.core.is_same_journal(name1, name2)[source]

Check if two names refer to the same journal.

Return type:

bool

scitex_scholar.core.refresh_journal_cache()[source]

Force refresh the journal normalizer cache.

Return type:

None