scitex_browser.remote

Remote browser components (ZenRows, CAPTCHA handling).

class scitex_browser.remote.ZenRowsAPIBrowser(api_key=None, proxy_country='au', enable_antibot=True, premium_proxy=True)[source]

Bases: object

Browser-like interface using ZenRows API for page rendering.

This provides a simpler, more reliable alternative to WebSocket-based browser connections. It’s especially good for:

  • Taking screenshots

  • Handling CAPTCHAs automatically

  • Getting rendered HTML content

  • Bypassing anti-bot measures

__init__(api_key=None, proxy_country='au', enable_antibot=True, premium_proxy=True)[source]

Initialize ZenRows API browser.

Parameters:
  • api_key (Optional[str]) – ZenRows API key. If None, the key is read from the environment.

  • proxy_country (str) – Country code for proxy

  • enable_antibot (bool) – Enable anti-bot bypass features

  • premium_proxy (bool) – Use premium residential proxies

async navigate_and_screenshot_async(url, screenshot_path=None, wait_ms=5000, js_instructions=None, return_html=False)[source]

Navigate to URL and optionally take screenshot.

Parameters:
  • url (str) – Target URL

  • screenshot_path (Optional[str]) – Path to save screenshot (None to skip)

  • wait_ms (int) – Additional wait time in milliseconds

  • js_instructions (Optional[List[Dict]]) – Custom JavaScript instructions

  • return_html (bool) – Whether to return rendered HTML

Return type:

Dict[str, Any]

Returns:

Dict with results including screenshot info and HTML
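The following is a minimal usage sketch for navigate_and_screenshot_async, using only the keyword names from the signature above. The helper name capture_page, the example URL, and the output file name are illustrative. The keys of the returned dict are not documented here, so the sketch returns the dict unchanged rather than guessing at them.

```python
import asyncio
from typing import Any, Dict


async def capture_page(browser: Any, url: str, screenshot_path: str) -> Dict[str, Any]:
    """Render `url`, save a screenshot, and also request the rendered HTML."""
    # Keyword names follow the documented signature.
    result = await browser.navigate_and_screenshot_async(
        url,
        screenshot_path=screenshot_path,
        wait_ms=8000,      # give slow pages extra time to settle
        return_html=True,
    )
    # The result keys are undocumented here, so hand the dict back as-is.
    return result


async def main() -> None:
    # Imported lazily so capture_page can be exercised without the package.
    from scitex_browser.remote import ZenRowsAPIBrowser

    # api_key is left as None, so the key is taken from the environment.
    browser = ZenRowsAPIBrowser(proxy_country="au")
    result = await capture_page(browser, "https://example.org", "example.png")
    print(sorted(result))  # inspect which keys came back
```

With a valid ZenRows key configured, asyncio.run(main()) would perform the request.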

async get_pdf_url_async(doi, use_openurl=True)[source]

Try to get PDF URL for a DOI.

Parameters:
  • doi (str) – DOI to resolve

  • use_openurl (bool) – Whether to try OpenURL resolver first

Return type:

Optional[str]

Returns:

PDF URL if found, None otherwise
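Because get_pdf_url_async returns None when no PDF is found, callers need to handle both outcomes. The sketch below shows one way to separate resolved DOIs from unresolved ones. The helper name resolve_pdf_urls and its return shape are illustrative and not part of the library.

```python
import asyncio
from typing import Any, Dict, List, Tuple


async def resolve_pdf_urls(browser: Any, dois: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Return ({doi: pdf_url}, [dois with no PDF found])."""
    found: Dict[str, str] = {}
    missing: List[str] = []
    for doi in dois:
        # use_openurl=True is the documented default, spelled out for clarity.
        url = await browser.get_pdf_url_async(doi, use_openurl=True)
        if url is None:
            missing.append(doi)   # documented "not found" result
        else:
            found[doi] = url
    return found, missing
```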

async batch_screenshot_async(urls, output_dir, max_concurrent=3)[source]

Take screenshots of multiple URLs concurrently.

Parameters:
  • urls (List[str]) – List of URLs to screenshot

  • output_dir (str) – Directory to save screenshots

  • max_concurrent (int) – Max concurrent requests

Return type:

List[Dict[str, Any]]

Returns:

List of results for each URL
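A common way to cap the number of in-flight requests, as max_concurrent does, is an asyncio.Semaphore. The sketch below illustrates that general pattern with a caller-supplied worker coroutine. It is not taken from the library's source, and the names bounded_gather and worker, as well as the keys of the error dict, are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def bounded_gather(
    urls: List[str],
    worker: Callable[[str], Awaitable[Dict[str, Any]]],
    max_concurrent: int = 3,
) -> List[Dict[str, Any]]:
    """Run `worker` on every URL with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(url: str) -> Dict[str, Any]:
        async with semaphore:  # blocks while max_concurrent workers are active
            try:
                return await worker(url)
            except Exception as exc:
                # One failing URL should not abort the whole batch.
                return {"url": url, "success": False, "error": str(exc)}

    # gather preserves input order, so results[i] corresponds to urls[i].
    return await asyncio.gather(*(run_one(u) for u in urls))
```

Recording failures as per-URL result entries lets the caller retry only the URLs that failed.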

class scitex_browser.remote.ZenRowsRemoteScholarBrowserManager(auth_manager=None, zenrows_api_key=None, proxy_country=None, **kwargs)[source]

Bases: object

Manages a connection to the remote ZenRows Scraping Browser service.

__init__(auth_manager=None, zenrows_api_key=None, proxy_country=None, **kwargs)[source]

Initialize ZenRows browser manager.

Parameters:
  • auth_manager – Authentication manager for cookie injection.

  • zenrows_api_key (Optional[str]) – ZenRows API key.

  • proxy_country (Optional[str]) – Country code for proxy routing (e.g., 'au', 'us'). Note: Country routing may only work with certain endpoints.

  • **kwargs – Additional arguments (ignored, for compatibility).

async get_browser_async()[source]

Connect to the ZenRows Scraping Browser.

Return type:

Browser

async get_authenticated_browser_and_context_async()[source]

Get browser context with authentication cookies pre-loaded.

Return type:

tuple[Browser, BrowserContext]

async new_page(context=None)[source]

Create a new page in the ZenRows browser.

Return type:

Any

async close()[source]

Close the ZenRows browser connection.

async take_screenshot_reliable_async(url, output_path, use_api=True, wait_ms=5000)[source]

Take a screenshot with automatic CAPTCHA handling.

This method provides reliable screenshot capture by:

  1. Using the API approach by default (more reliable)

  2. Falling back to WebSocket browser if needed

  3. Automatically handling CAPTCHAs via ZenRows

Parameters:
  • url (str) – URL to screenshot

  • output_path (str) – Path to save screenshot

  • use_api (bool) – Use API browser (recommended) vs WebSocket

  • wait_ms (int) – Additional wait time in milliseconds

Return type:

Dict[str, Any]

Returns:

Dict with success status and details
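The API-first, WebSocket-fallback strategy described for take_screenshot_reliable_async can be sketched generically as below. This is an illustration under assumptions, not the library's implementation. The two capture callables, the function name, and the keys of the returned dict ("success", "method", "path", "errors") are all hypothetical, as is the choice to skip the API attempt entirely when use_api is False.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Capture = Callable[[str, str], Awaitable[None]]  # (url, output_path) -> None


async def screenshot_with_fallback(
    url: str,
    output_path: str,
    via_api: Capture,
    via_websocket: Capture,
    use_api: bool = True,
) -> Dict[str, Any]:
    """Try the API capture first (if enabled), then the WebSocket capture."""
    attempts = [("websocket", via_websocket)]
    if use_api:
        attempts.insert(0, ("api", via_api))

    errors: Dict[str, str] = {}
    for name, capture in attempts:
        try:
            await capture(url, output_path)
            return {"success": True, "method": name, "path": output_path}
        except Exception as exc:
            errors[name] = str(exc)  # remember why this route failed
    return {"success": False, "errors": errors}
```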

async navigate_and_extract_async(url, extract_pdf_url=True, take_screenshot=False, screenshot_path=None)[source]

Navigate to URL and extract information.

This combines navigation, screenshot, and data extraction. Uses the API approach for better reliability.

Parameters:
  • url (str) – Target URL

  • extract_pdf_url (bool) – Try to find PDF URL

  • take_screenshot (bool) – Whether to capture screenshot

  • screenshot_path (Optional[str]) – Where to save screenshot

Return type:

Dict[str, Any]

Returns:

Dict with extracted data

async __aenter__()[source]

Async context manager entry.

async __aexit__(exc_type, exc_val, exc_tb)[source]

Async context manager exit.
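Since the manager defines __aenter__ and __aexit__, it can be used with async with. The sketch below assumes, without confirmation from this page, that leaving the block releases the browser connection in the same way close() does. The helper name screenshot_in_session is illustrative. It refers to the manager object directly instead of the value bound by "as", because the return value of __aenter__ is not documented here.

```python
import asyncio
from typing import Any, Dict


async def screenshot_in_session(manager: Any, url: str, output_path: str) -> Dict[str, Any]:
    """Take one screenshot inside the manager's async context."""
    async with manager:
        # __aexit__ runs when this block ends, even if the call below raises.
        return await manager.take_screenshot_reliable_async(
            url, output_path, use_api=True, wait_ms=5000
        )
```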

class scitex_browser.remote.CaptchaHandler(api_key=None)[source]

Bases: object

Handles CAPTCHA solving using 2Captcha service.

__init__(api_key=None)[source]

Initialize with 2Captcha API key.

async handle_page_async(page)[source]

Check and handle captcha on the current page.

Return type:

bool

Returns:

True if captcha was found and solved, False otherwise
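A typical place to call handle_page_async is right after navigation. The sketch below assumes that page is a Playwright Page (its goto and wait_for_load_state methods are used). The helper name goto_and_clear_captcha and the decision to wait for network idle after a solve are illustrative choices.

```python
import asyncio
from typing import Any


async def goto_and_clear_captcha(page: Any, handler: Any, url: str) -> bool:
    """Navigate to `url`, then let the handler deal with any CAPTCHA."""
    await page.goto(url)
    # True means a CAPTCHA was found and solved; False covers every other case.
    solved = await handler.handle_page_async(page)
    if solved:
        # After a solve the site usually loads the real page, so let it settle.
        await page.wait_for_load_state("networkidle")
    return solved
```

A False result does not distinguish "no CAPTCHA present" from "CAPTCHA present but not solved", so callers may still want to check the page content afterwards.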

async _detect_captcha_async(page)[source]

Detect if page has a captcha.

Return type:

bool

async _is_cloudflare_challenge_async(page)[source]

Check if this is a Cloudflare challenge.

Return type:

bool

async _solve_cloudflare_challenge_async(page)[source]

Handle Cloudflare challenge/turnstile.

Return type:

bool

async _has_recaptcha_async(page)[source]

Check if page has reCAPTCHA.

Return type:

bool

async _solve_recaptcha_async(page)[source]

Solve reCAPTCHA v2.

Return type:

bool

async _has_hcaptcha_async(page)[source]

Check if page has hCaptcha.

Return type:

bool

async _solve_hcaptcha_async(page)[source]

Solve hCaptcha.

Return type:

bool

async _extract_turnstile_key_async(page)[source]

Extract Cloudflare Turnstile site key.

Return type:

Optional[str]

async _submit_recaptcha_async(page_url, site_key)[source]

Submit reCAPTCHA to 2Captcha.

Return type:

Optional[str]

async _submit_hcaptcha_async(page_url, site_key)[source]

Submit hCaptcha to 2Captcha.

Return type:

Optional[str]

async _submit_turnstile_async(page_url, site_key)[source]

Submit Turnstile to 2Captcha.

Return type:

Optional[str]

async _submit_captcha_async(params)[source]

Submit captcha to 2Captcha and get task ID.

Return type:

Optional[str]

async _get_captcha_result_async(task_id)[source]

Poll 2Captcha for result.

Return type:

Optional[str]
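Polling for a result generally follows a fetch, check, sleep loop. The sketch below shows that pattern against the plain-text replies of the 2Captcha legacy API ("CAPCHA_NOT_READY" while pending, "OK|<token>" on success). It is an assumption that this class uses that reply format. The fetch callable, the function name, the interval, and the attempt limit are hypothetical and stand in for the real HTTP request.

```python
import asyncio
from typing import Awaitable, Callable, Optional

NOT_READY = "CAPCHA_NOT_READY"  # spelled this way by the service


async def poll_for_result(
    fetch: Callable[[str], Awaitable[str]],
    task_id: str,
    interval_s: float = 5.0,
    max_attempts: int = 24,
) -> Optional[str]:
    """Poll until a token arrives, an error is returned, or attempts run out."""
    for _ in range(max_attempts):
        reply = await fetch(task_id)
        if reply == NOT_READY:
            await asyncio.sleep(interval_s)  # still being solved, try again later
            continue
        if reply.startswith("OK|"):
            return reply.split("|", 1)[1]    # the solved token
        return None                          # any other reply is an error code
    return None                              # gave up after max_attempts
```

Bounding the number of attempts keeps a stuck task from blocking the caller indefinitely.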