Metadata-Version: 2.4
Name: dataimpulse-scraper
Version: 1.0.1
Summary: MCP server for web scraping via DataImpulse residential proxies with per-request country targeting
Project-URL: Homepage, https://github.com/sonnysangha/dataimpulse-scraper
Project-URL: Repository, https://github.com/sonnysangha/dataimpulse-scraper
Author-email: Sonny Sangha <sonny.sangha@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: dataimpulse,mcp,proxy,residential-proxy,scraping,web-scraping
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Requires-Dist: crawl4ai>=0.7
Requires-Dist: fastmcp>=3.0
Requires-Dist: httpx>=0.27
Requires-Dist: python-dotenv>=1.0
Description-Content-Type: text/markdown

# DataImpulse Scraper (MCP)

Give your AI agent — **Claude Code**, **Cursor**, **Codex**, Claude Desktop, … — the power to scrape the web from **any country**, through **[DataImpulse](https://dataimpulse.com/?utm_source=youtube&utm_medium=video&utm_campaign=SonnySangha) residential proxies**, with full JavaScript rendering to clean, LLM‑ready markdown.

The only thing you configure is your **DataImpulse login + password**. No clone, no Python setup, no `.env`.

> **🟢 [Get your DataImpulse residential proxy plan →](https://dataimpulse.com/?utm_source=youtube&utm_medium=video&utm_campaign=SonnySangha)** — residential proxies from **$1/GB**, pay‑as‑you‑go, no expiry.

---

## Install

Grab your credentials first: **[open your DataImpulse dashboard →](https://dataimpulse.com/?utm_source=youtube&utm_medium=video&utm_campaign=SonnySangha)** → **Residential Proxy → Proxy Access** → copy your **Login** and **Password**. Then pick a lane:

### ⚡ Option 1 — one command (macOS / Linux)

```bash
curl -LsSf https://raw.githubusercontent.com/sonnysangha/dataimpulse-scraper/main/install.sh | sh
```

That's the whole install. The script sets up `uv` if you don't have it, asks for your DataImpulse login + password, **verifies them with a live proxy check**, and auto‑configures whichever of **Claude Code, Cursor, and Claude Desktop** you have installed. (Non‑interactive? Run a downloaded copy of the script with the credentials preset: `DI_USER=... DI_PASS=... sh install.sh`)

### 🖱️ Option 2 — one click (Cursor)

[![Install MCP Server](https://cursor.com/deeplink/mcp-install-dark.svg)](https://cursor.com/en/install-mcp?name=dataimpulse-scraper&config=eyJjb21tYW5kIjoidXZ4IiwiYXJncyI6WyJkYXRhaW1wdWxzZS1zY3JhcGVyIl0sImVudiI6eyJESV9VU0VSIjoiWU9VUl9EQVRBSU1QVUxTRV9MT0dJTiIsIkRJX1BBU1MiOiJZT1VSX0RBVEFJTVBVTFNFX1BBU1NXT1JEIn19)

Click, approve, then replace the two `YOUR_DATAIMPULSE_*` placeholders with your login and password. Done. (Needs [`uv`](https://docs.astral.sh/uv/) installed — see Option 4, step 1.)

### 📦 Option 3 — double‑click (Claude Desktop)

Download **[`dataimpulse-scraper.mcpb`](https://github.com/sonnysangha/dataimpulse-scraper/releases/latest)**, double‑click it (or drag it onto Claude Desktop), and type your login + password into the form it shows you — the password is stored in your OS keychain, not a file.

### 🛠️ Option 4 — manual (any client, incl. Windows)

**1. Install `uv`** — the tiny, fast runner that launches the server and auto‑downloads everything else:

```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

**2. Add the server to your AI app:**

**Claude Code** — one command:

```bash
claude mcp add dataimpulse-scraper \
  --env DI_USER=YOUR_LOGIN \
  --env DI_PASS=YOUR_PASSWORD \
  -- uvx dataimpulse-scraper
```

**Cursor / Claude Desktop / Codex / anything else** — paste into your MCP config (e.g. `.cursor/mcp.json`):

```json
{
  "mcpServers": {
    "dataimpulse-scraper": {
      "command": "uvx",
      "args": ["dataimpulse-scraper"],
      "env": {
        "DI_USER": "your_dataimpulse_login",
        "DI_PASS": "your_dataimpulse_password"
      }
    }
  }
}
```
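
If you would rather script this step than edit the file by hand, the merge could be sketched in Python along these lines. This is a minimal sketch and not part of the package: `add_server` is a hypothetical helper, and the config path is whatever your client uses (`.cursor/mcp.json` is only the example given above). It keeps any servers already present in the file.

```python
import json
from pathlib import Path


def add_server(config_path, login, password):
    """Merge a dataimpulse-scraper entry into an MCP config file (hypothetical helper)."""
    path = Path(config_path)
    # Start from the existing config so other servers are preserved.
    text = path.read_text() if path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    # Same entry as the JSON snippet above, with your credentials filled in.
    servers["dataimpulse-scraper"] = {
        "command": "uvx",
        "args": ["dataimpulse-scraper"],
        "env": {"DI_USER": login, "DI_PASS": password},
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_server(".cursor/mcp.json", "your_dataimpulse_login", "your_dataimpulse_password")` would write the same entry shown above. The resulting file contains your proxy password, so keep it out of version control.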

### Then: restart your AI app and ask

MCP servers load at startup (in Claude Code, `/mcp` should show **dataimpulse-scraper** connected). Then just ask:

> 🏠 _"Read https://www.zillow.com/homes/for_sale/ and list the first 5 homes."_
>
> 🌏 _"Check our exit IP from Japan."_
>
> 🆚 _"Compare what's trending on Reddit in the US, UK, and Japan."_

That's it. 🎉 The **first** `read_page` downloads a headless browser (~1 min, one‑time, auto‑cached); every later call reuses the cached browser and skips the download.

📓 More copy‑paste recipes with real output in **[EXAMPLES.md](EXAMPLES.md)**.

---

## The tools your agent gets

| Tool                                   | What it does                                                     | When to use                                |
| -------------------------------------- | ---------------------------------------------------------------- | ------------------------------------------ |
| `read_page(url, country)`              | **Primary.** Real browser → clean markdown (Crawl4AI)            | Any page, especially JS‑heavy / SPAs       |
| `read_page_from_regions(url, regions)` | The same page from many countries at once → `{region: markdown}` | Compare prices / stock / content by region |
| `fetch_html(url, country)`             | Raw HTML / JSON, fast (no browser)                               | Static pages, APIs                         |
| `check_proxy(country)`                 | Exit IP + geolocation                                            | Prove the proxy / geo works                |

Country codes are 2‑letter ISO (`us`, `de`, `jp`, `gb`, …). Every request routes through a **fresh residential IP** in that country — just ask in plain language ("read this as a German visitor") and the agent picks the right tool.
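
The table above describes `read_page_from_regions` as returning a `{region: markdown}` mapping. As a rough sketch of what you might do with that result, the snippet below groups regions that received the same markdown, so regional differences stand out. `group_identical_regions` is a hypothetical helper and the sample data is made up for illustration. Real pages often differ in small ways (timestamps, ads), so exact matching is a simplification.

```python
from collections import defaultdict


def group_identical_regions(pages):
    """Group region codes whose markdown is identical after trimming whitespace."""
    groups = defaultdict(list)
    for region, markdown in pages.items():
        # Trim so a trailing newline does not count as a regional difference.
        groups[markdown.strip()].append(region)
    # Sorted groups of region codes, largest group first.
    return sorted((sorted(r) for r in groups.values()), key=lambda g: (-len(g), g))


# Made-up sample in the {region: markdown} shape described above.
sample = {
    "us": "# Widget\nPrice: $10\n",
    "gb": "# Widget\nPrice: £9\n",
    "de": "# Widget\nPrice: $10",
}
print(group_identical_regions(sample))  # [['de', 'us'], ['gb']]
```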

## What works vs what needs a login

`read_page` fetches **public** pages like a real browser. It does **not** log in or bypass auth walls.

| ✅ Works (public)                                           | ❌ Needs a login                         |
| ----------------------------------------------------------- | ---------------------------------------- |
| Reddit, YouTube, Bluesky, Mastodon, news, e‑commerce, SERPs | X/Twitter, Instagram, Facebook, LinkedIn |

A residential proxy beats **IP‑based** blocking and geo‑walls — not authentication. For login‑gated platforms, use their official API.
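
One way to apply this rule before spending proxy bandwidth is to check a URL's host against the login‑gated platforms in the table. The sketch below is illustrative only: `needs_login` and `LOGIN_GATED` are hypothetical names, not part of the package, and the domain list is an incomplete assumption based on the platforms named above.

```python
from urllib.parse import urlparse

# Domains for the login-gated platforms in the table above (illustrative, incomplete).
LOGIN_GATED = {"x.com", "twitter.com", "instagram.com", "facebook.com", "linkedin.com"}


def needs_login(url):
    """Return True if the URL's host is a login-gated domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in LOGIN_GATED)
```

For example, `needs_login("https://www.linkedin.com/in/someone")` returns `True`, while a public Reddit URL returns `False`.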

## Notes

- **Country targeting only** — city/state/ZIP filters bill at **2×**, so stay country‑level.
- Each call gets a **fresh** residential IP; change `country` to target a different region or to get around IP‑based rate limits.
- If the headless browser ever fails to auto‑install: `uvx --from dataimpulse-scraper playwright install chromium`
- Respect each site's Terms of Service and `robots.txt`. Scrape public data responsibly.
- **Security:** credentials live in your MCP client's `env` block. Never commit proxy passwords — rotate them in the dashboard if exposed.
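
The security note above can be illustrated with a small sketch of reading credentials from the environment. This shows the general pattern, not the package's actual code: `load_credentials` is a hypothetical helper, and only the `DI_USER` / `DI_PASS` variable names come from this README. It fails early if either value is missing, and the error names the variable without echoing any credential value.

```python
import os


def load_credentials(env=None):
    """Return (login, password) from DI_USER / DI_PASS, or raise if either is unset."""
    env = os.environ if env is None else env
    # Treat blank values the same as missing ones.
    missing = [name for name in ("DI_USER", "DI_PASS") if not env.get(name, "").strip()]
    if missing:
        # Name the missing variables only; never print credential values.
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env["DI_USER"], env["DI_PASS"]
```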

---

## Local development (running from this folder)

Until the package is on PyPI — or when hacking on your own fork — run it straight from a checkout:

```bash
git clone https://github.com/sonnysangha/dataimpulse-scraper
cd dataimpulse-scraper
uv sync                                   # creates .venv, installs deps
cp .env.example .env                      # local dev only — add DI_USER / DI_PASS
uv run dataimpulse-scraper --selftest us  # prints a US exit IP → you're live
```

Then point your MCP config at the checkout instead of PyPI:

```json
{
  "mcpServers": {
    "dataimpulse-scraper": {
      "command": "uvx",
      "args": [
        "--from",
        "/absolute/path/to/dataimpulse-scraper",
        "dataimpulse-scraper"
      ],
      "env": { "DI_USER": "your_login", "DI_PASS": "your_password" }
    }
  }
}
```

`.env` is **only** for local development (it's gitignored). End users never create one.

**Maintainer?** Publishing to PyPI (GitHub Action, trusted publishing, release loop) is covered step‑by‑step in **[PUBLISHING.md](PUBLISHING.md)**.

---

Built with [DataImpulse](https://dataimpulse.com/?utm_source=youtube&utm_medium=video&utm_campaign=SonnySangha) residential proxies. **[Get a plan — $1/GB →](https://dataimpulse.com/?utm_source=youtube&utm_medium=video&utm_campaign=SonnySangha)**
