scitex_scholar.config

class scitex_scholar.config.ScholarConfig(config_path=None, scholar_dir=None)

Bases: object

__init__(config_path=None, scholar_dir=None)

Initialize ScholarConfig.

Parameters:
  • config_path (Union[str, Path, None]) – Path to a custom YAML config file.

  • scholar_dir (Union[str, Path, None]) – Direct path to the scholar directory (e.g., /data/users/alice/.scitex). This bypasses the SCITEX_DIR environment variable for thread-safe multi-user usage. Use it in Django or other multi-user environments to avoid race conditions.
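
The race condition mentioned for scholar_dir can be illustrated with a minimal sketch. It does not use ScholarConfig itself: resolve_scholar_dir and handle_request are hypothetical helpers, and the "~/.scitex" fallback is an assumption made only for this example. The point is that each worker passes its own path explicitly instead of mutating the process-wide SCITEX_DIR variable.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def resolve_scholar_dir(scholar_dir=None):
    """Hypothetical helper: an explicit path wins over the SCITEX_DIR env var."""
    if scholar_dir is not None:
        return Path(scholar_dir)
    # The fallback reads process-wide state shared by every thread.
    # The "~/.scitex" default is an assumption for this sketch.
    return Path(os.environ.get("SCITEX_DIR", "~/.scitex")).expanduser()


def handle_request(username):
    # Each request passes its own directory; no global state is mutated.
    return resolve_scholar_dir(f"/data/users/{username}/.scitex")


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_request, ["alice", "bob", "carol"]))
```

Setting os.environ["SCITEX_DIR"] per request instead would let one thread overwrite the value another thread is about to read, which is the situation the scholar_dir parameter is described as avoiding.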

__getattr__(name)

Delegate all get_* methods to path_manager.

__dir__()

Include path_manager’s get_* methods in dir() output.
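
The delegation pattern described for __getattr__ and __dir__ can be sketched as follows. This is an illustration of the general technique, not the library's implementation: PathManagerStub, ConfigSketch, and get_cache_dir are hypothetical names introduced for the example.

```python
class PathManagerStub:
    """Hypothetical stand-in for the real path manager."""

    def __init__(self, root):
        self.root = root

    def get_cache_dir(self):
        return f"{self.root}/cache"


class ConfigSketch:
    """Hypothetical config object that forwards get_* calls."""

    def __init__(self, root):
        self.path_manager = PathManagerStub(root)

    def __getattr__(self, name):
        # Called only when normal attribute lookup has already failed.
        if name.startswith("get_"):
            return getattr(self.path_manager, name)
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}"
        )

    def __dir__(self):
        # Advertise delegated get_* methods so tab completion can find them.
        own = set(super().__dir__())
        delegated = {n for n in dir(self.path_manager) if n.startswith("get_")}
        return sorted(own | delegated)


cfg = ConfigSketch("/tmp/scholar")
cache_dir = cfg.get_cache_dir()
```

Restricting delegation to names that start with get_ keeps typos on other attributes raising AttributeError as usual.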

resolve(key, direct_val=None, default=None, type=<class 'str'>, mask=None)

Resolve a configuration value with precedence: direct → config → env → default.
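
The precedence order can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the body of resolve(): the function name resolve_value, the cast parameter, the returned (value, source) tuple, and the SCITEX_SCHOLAR_ environment prefix are all hypothetical.

```python
import os


def resolve_value(key, direct_val=None, config=None, default=None,
                  cast=str, env_prefix="SCITEX_SCHOLAR_"):
    """Hypothetical sketch of direct -> config -> env -> default resolution.

    Returns a (value, source) tuple so the caller can see which level won.
    """
    config = config or {}
    if direct_val is not None:
        return cast(direct_val), "direct"
    if config.get(key) is not None:
        return cast(config[key]), "config"
    # The env var naming scheme below is an assumption for this sketch.
    env_val = os.environ.get(env_prefix + key.upper())
    if env_val is not None:
        return cast(env_val), "env"
    return (cast(default) if default is not None else None), "default"


from_direct = resolve_value("timeout", direct_val=5, config={"timeout": 10}, cast=int)
from_config = resolve_value("timeout", config={"timeout": 10}, cast=int)
from_default = resolve_value("unset_key_for_sketch", default=60, cast=int)
```

Recording the winning source alongside the value is one way to support a report of how each setting was resolved.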

get(key)

Get value from config dict only

print()

Print how each configuration value was resolved.

clear_log()

Clear resolution log

load_yaml(path)

Return type:

dict

classmethod load(path=None)

property paths

Access to path manager for organized directory structure

class scitex_scholar.config.PublisherRules(config=None)

Bases: object

Access publisher-specific PDF extraction rules from config.

__init__(config=None)

get_config_for_url(url)

Get publisher-specific config for a URL.

Return type:

Dict

merge_with_config(url, base_deny_selectors=None, base_deny_classes=None, base_deny_text_patterns=None)

Merge publisher-specific config with base deny patterns.

Return type:

Dict
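
One plausible reading of merging publisher-specific config with base deny patterns is an order-preserving union of each list. The sketch below assumes that reading; merge_deny_lists and the dictionary keys deny_selectors, deny_classes, and deny_text_patterns are hypothetical and may not match the keys the library actually uses.

```python
def merge_deny_lists(publisher_cfg, base_deny_selectors=None,
                     base_deny_classes=None, base_deny_text_patterns=None):
    """Hypothetical sketch: combine base deny lists with publisher extras."""

    def union(base, extra):
        # dict.fromkeys keeps first-seen order while dropping duplicates.
        return list(dict.fromkeys([*(base or []), *(extra or [])]))

    return {
        "deny_selectors": union(base_deny_selectors,
                                publisher_cfg.get("deny_selectors")),
        "deny_classes": union(base_deny_classes,
                              publisher_cfg.get("deny_classes")),
        "deny_text_patterns": union(base_deny_text_patterns,
                                    publisher_cfg.get("deny_text_patterns")),
    }


merged = merge_deny_lists(
    {"deny_selectors": ["#related", ".ads"], "deny_classes": ["supplementary"]},
    base_deny_selectors=[".ads", "footer"],
    base_deny_text_patterns=["cited by"],
)
```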

is_valid_pdf_url(page_url, pdf_url)

Check if PDF URL is valid based on publisher rules.

Return type:

bool

filter_pdf_urls(page_url, pdf_urls)

Filter PDF URLs based on publisher-specific rules.

Return type:

List[str]
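
Validating and filtering PDF URLs by publisher can be sketched with a small rule table keyed by domain. Everything here is an assumption made for illustration: the RULES structure, the allow_patterns and deny_patterns keys, the example-publisher.org domain, and the function names are hypothetical and do not reflect the rule format stored in the real config.

```python
import re
from urllib.parse import urlparse

# Hypothetical rule table; the real rule format is not documented here.
RULES = {
    "example-publisher.org": {
        "allow_patterns": [r"/pdf/"],
        "deny_patterns": [r"supplement", r"/toc\.pdf$"],
    },
}


def rules_for(page_url):
    """Return the rules whose domain matches the page's host, else {}."""
    host = urlparse(page_url).hostname or ""
    for domain, rules in RULES.items():
        if host == domain or host.endswith("." + domain):
            return rules
    return {}


def is_valid(page_url, pdf_url):
    rules = rules_for(page_url)
    # A deny match rejects the URL outright.
    if any(re.search(p, pdf_url) for p in rules.get("deny_patterns", [])):
        return False
    # With no allow list, anything not denied is accepted.
    allow = rules.get("allow_patterns", [])
    return not allow or any(re.search(p, pdf_url) for p in allow)


def filter_urls(page_url, pdf_urls):
    return [u for u in pdf_urls if is_valid(page_url, u)]


page = "https://www.example-publisher.org/article/123"
candidates = [
    "https://www.example-publisher.org/pdf/123.pdf",
    "https://www.example-publisher.org/pdf/123-supplement.pdf",
    "https://www.example-publisher.org/full/123.pdf",
]
kept = filter_urls(page, candidates)
```

In this sketch a page from an unknown publisher has no rules, so every candidate URL passes through unchanged.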