Metadata-Version: 2.4
Name: ansferatu
Version: 0.1.0
Summary: Multifunctional tool for HTTP reconnaissance, web crawling and web directory bruteforce.
Author: frostbits-security
License: MIT License
        
        Copyright (c) 2022 frostbits-security
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/frostbits-security/ansferatu
Project-URL: Repository, https://github.com/frostbits-security/ansferatu
Keywords: crawler,spider,bruteforce,reconnaissance,security,web
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Security
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: urllib3
Requires-Dist: beautifulsoup4
Requires-Dist: simhash
Requires-Dist: tldextract
Requires-Dist: validators
Requires-Dist: psutil
Provides-Extra: headless
Requires-Dist: playwright; extra == "headless"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

Multifunctional tool for HTTP reconnaissance, web crawling, and web directory brute force.
Based on [PSpider](https://github.com/xianhu/PSpider).

Killer features:
1. Fast multi-URL crawling  
2. Fast multi-URL directory brute force  
3. Finds new domains without DNS brute force (for example, https://mail.ru --> 105 domains of *.mail.ru)
4. To Do: dynamic dictionary creation for brute force
5. To Do: deduplication based on Simhash
6. Headless browsing and form filling as an additional option
7. To Do: proper output to JSONL + HTML reports
8. To Do: collect query parameters (for GET and POST)
9. To Do: better deduplication based on page hash




### Installation

Ansferatu is a regular Python package. It requires Python 3.8+.

**From PyPI:**
```bash
pip3 install ansferatu
```

**From source / GitHub:**
```bash
pip3 install git+https://github.com/frostbits-security/ansferatu.git
# or, from a local checkout:
pip3 install .
```

**Headless / form-filling support (optional).** The `--headless` and
`--fill-forms` modes rely on [Playwright](https://playwright.dev/python/).
Install the optional extra and download the Chromium runtime:
```bash
pip3 install 'ansferatu[headless]'
playwright install chromium
```

Installing the package exposes an `ansferatu` console command (equivalent to
`python3 -m ansferatu`).

### How to run 

After installation, run via the `ansferatu` command:
```bash
ansferatu crawl --url https://mail.ru -o ./results/ --limit 1
```

#### Use as a library

The package can be imported into other Python tools:
```python
from ansferatu import common_crawler, common_brute_from_file

common_crawler(
    url_list=["https://example.com"],
    scope=["example.com"],
    exclude_codes_list=[403, 404, 401],
    visit_count_limit=10,
    max_deep=2,
    threads=10,
    output_file="results.jsonl",
)
```
For lower-level control, build the spider directly:
```python
from ansferatu.spider import WebSpider, TaskFetch
```

#### Docker

Build the Docker image:
```bash
docker build -t ansferatu .
```

Run the container (the image's entrypoint is the `ansferatu` command):
```bash
docker run --rm -it -v /tmp/ansferatu_out:/ansferatu/results ansferatu \
  crawl --url https://mail.ru -o /ansferatu/results/ --limit 1
```

#### Modes

**crawl** - crawls websites. The main parameter is `visit_count_limit`.
```bash
ansferatu crawl --url https://deti.mail.ru -o /home/sabotaged/BB/mail.ru/
```

**crawl --headless** - same crawl but with Playwright headless extraction for qualifying pages.
Requires the headless extra: `pip install 'ansferatu[headless]' && playwright install chromium`.
```bash
ansferatu crawl --headless --url https://example.com -o ./results/
```

**crawl --fill-forms** - extends headless crawl with form detection and interaction.
Detects `<form>` elements on pages, fills fields with smart defaults (email, password, search, etc.),
submits forms and clicks buttons, then captures the resulting POST responses and new URLs.
Implies `--headless`.
```bash
ansferatu crawl --fill-forms --url https://example.com -o ./results/
```

**brute** - classic web directory brute force. Requires a wordlist.  
```bash
ansferatu brute --url https://news.mail.ru -w ./wordlists/fuzz_big.txt -o /home/sabotaged/BB/mail.ru/
```

#### Modes task flow (queues and owners)

**crawl** puts start tasks into `QueueFetch`, then the queues are filled and drained by the workers shown below:
```mermaid
flowchart LR
  start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
  qf --> fetchers[Fetchers<br/>multi-threading]
  fetchers -->|TaskExtract| qe[QueueExtract<br/>priority keys deep url content]
  fetchers -->|TaskHTMLHandle| qh[QueueHTMLHandle<br/>priority keys deep url content]
  qe --> extractor[Extractor]
  extractor -->|TaskFetch| qf
  qh --> html[HTML Handler]
  html -->|TaskSave if item| qs[QueueSave<br/>priority keys deep url item]
  qs --> saver[Saver]

  proxieser[Proxieser] -.->|optional| qp[QueueProxies]
  qp -.->|optional| fetchers
```
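
The crawl flow above can be approximated in a few lines of plain Python. This is a single-threaded sketch, not the package's implementation: `FAKE_SITE`, `crawl`, and the tuple layouts are hypothetical, the network is replaced by an in-memory dict, and the priority is assumed to be the crawl depth.

```python
import queue

# In-memory stand-in for the network: URL -> (page text, outgoing links).
FAKE_SITE = {
    "https://example.com/": ("home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("page a", ["https://example.com/"]),
    "https://example.com/b": ("page b", []),
}


def crawl(start_url, max_deep=2):
    q_fetch = queue.PriorityQueue()  # (deep, url); lower depth is served first
    q_extract = queue.Queue()        # (deep, url, links)
    q_handle = queue.Queue()         # (deep, url, text)
    q_save = queue.Queue()           # (deep, url, item)
    seen = {start_url}
    saved = []
    q_fetch.put((0, start_url))

    while not q_fetch.empty():
        # Fetcher: one fetch feeds both the extract and the handle queues.
        deep, url = q_fetch.get()
        text, links = FAKE_SITE.get(url, ("", []))
        q_extract.put((deep, url, links))
        q_handle.put((deep, url, text))

        # Extractor: unseen links within the depth limit go back to the fetch queue.
        deep, url, links = q_extract.get()
        for link in links:
            if link not in seen and deep + 1 <= max_deep:
                seen.add(link)
                q_fetch.put((deep + 1, link))

        # HTML handler: emit a save task only when there is an item.
        deep, url, text = q_handle.get()
        if text:
            q_save.put((deep, url, {"url": url, "words": len(text.split())}))

        # Saver: drain everything that is ready to be stored.
        while not q_save.empty():
            saved.append(q_save.get()[2])
    return saved
```

In the real tool each stage runs in its own worker, so the stages overlap instead of running one after another as they do here.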

**crawl --headless** extends the regular crawl with a Playwright-based headless browser pipeline.
Qualifying pages (decided by `HeadlessCandidate`) are routed to a single-threaded headless
engine instead of the normal Extractor + HTML Handler path. The headless engine intercepts
CDP network events to discover URLs and captures the fully-rendered page for the HTML Handler.

```mermaid
flowchart LR
  start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
  qf --> fetchers[Fetchers<br/>multi-thread]

  fetchers -->|HeadlessCandidate?| decision{is<br/>candidate?}

  decision -->|No| qe[QueueExtract]
  decision -->|No| qh[QueueHTMLHandle]
  decision -->|Yes| qhl[QueueHeadless<br/>dedup: VisitLimit]

  qhl --> headless[HeadlessThread<br/>single thread<br/>Playwright + CDP]

  headless -->|intercepted URLs<br/>TaskFetch| qf
  headless -->|normalized page<br/>TaskHTMLHandle| qh

  qe --> extractor[Extractor]
  extractor -->|TaskFetch| qf

  qh --> html[HTML Handler<br/>_normalize_content]
  html -->|TaskSave| qs[QueueSave]
  qs --> saver[Saver]
```

Key points:
- **HeadlessCandidate** decides which fetched pages qualify. Currently: root/index-like URLs
  (`is_absolute`) and HTML responses with status 200/301/302.
- **HeadlessExtractor** (Playwright) uses lazy browser init on the worker thread to avoid
  thread-affinity issues. It hooks `page.on("request")` to capture all network URLs,
  then returns both discovered `TaskFetch` items and a `TaskHTMLHandle` with a normalized
  dict (`status_code`, `url`, `html_text`, `headers`, `title`, etc.).
- **CommonHTMLHandler** accepts both `requests.Response` objects (regular path) and the
  normalized dict (headless path) via `_normalize_content()`.
- **Deduplication**: `VisitLimit.check_headless_visited()` prevents the same URL from being
  sent to headless twice. `UrlFilter` continues to deduplicate the fetch queue as usual.
- When a fetched URL qualifies for headless, it skips the regular Extractor and HTML Handler;
  only the headless pipeline processes it.
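
A rough sketch of the candidate rule described above. The function names, the list of index file names, and the reading that all three conditions must hold together are assumptions; the real `HeadlessCandidate` may decide differently.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration only.
INDEX_NAMES = {"", "index.html", "index.htm", "index.php"}
ALLOWED_STATUS = {200, 301, 302}


def is_root_like(url: str) -> bool:
    """True for URLs such as https://host/ or https://host/index.html."""
    path = urlparse(url).path
    return path.strip("/").lower() in INDEX_NAMES


def is_headless_candidate(url: str, status_code: int, content_type: str) -> bool:
    """Assumed rule: root-like URL, HTML response, and status 200/301/302."""
    is_html = content_type.lower().startswith("text/html")
    return is_root_like(url) and is_html and status_code in ALLOWED_STATUS
```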

**crawl --fill-forms** extends the headless pipeline with a two-phase form interaction system.
Phase 1 (cheap): `HeadlessExtractor` calls `FormDetector.detect(page)` on the already-loaded page
to produce universal form descriptors. Phase 2 (expensive, deferred): `HeadlessFormInteractor`
picks up form tasks from a dedicated queue, opens the page in a separate browser, fills fields
via `FormFiller`, submits, and captures results.

```mermaid
flowchart LR
  start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
  qf --> fetchers[Fetchers<br/>multi-thread]

  fetchers -->|HeadlessCandidate?| decision{is<br/>candidate?}

  decision -->|No| qe[QueueExtract]
  decision -->|No| qh[QueueHTMLHandle]
  decision -->|Yes| qhl[QueueHeadless<br/>dedup: VisitLimit]

  qhl --> headless[HeadlessThread<br/>single thread<br/>Playwright + CDP]

  headless -->|intercepted URLs<br/>TaskFetch| qf
  headless -->|normalized page<br/>TaskHTMLHandle| qh
  headless -->|form descriptors<br/>TaskFormInteract| qfi[QueueFormInteract]

  qfi --> forminteract[FormInteractThread<br/>single thread<br/>separate Playwright browser]
  forminteract -->|POST response URLs<br/>TaskFetch| qf
  forminteract -->|POST response page<br/>TaskHTMLHandle| qh

  qe --> extractor[Extractor]
  extractor -->|TaskFetch| qf

  qh --> html[HTML Handler<br/>_normalize_content]
  html -->|TaskSave| qs[QueueSave]
  qs --> saver[Saver]
```

Key points for form interaction:
- **FormDetector** scans the already-loaded page DOM for `<form>` elements. Pure detection,
  no extra navigation (~50ms overhead). Returns universal form descriptors.
- **Form descriptor schema**: `{form_selector, action, method, fields[], buttons[], page_url}`.
  Designed to be self-contained so `HeadlessFormInteractor` needs no extra DOM inspection.
- **FormFiller** maps input types/names to smart defaults (email, password, search, etc.).
  Supports custom value overrides via dict.
- **HeadlessFormInteractor** runs in a dedicated thread with its own Playwright browser.
  It navigates to the page, fills fields, submits/clicks, and captures network traffic +
  the resulting page data. Results flow back through the normal URL_FETCH and HTM_HANDLE queues.
- **Budget cap**: `FormDetector.max_forms_per_page` (default 5) and
  `HeadlessFormInteractor.max_interactions_per_page` prevent runaway on form-heavy pages.
- The form interaction pipeline is fully independent from the headless extraction pipeline —
  separate queue, separate thread, separate browser instance.
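
The "smart defaults with overrides" idea behind `FormFiller` can be sketched as a small lookup. Everything here is hypothetical: the default values, the `pick_value` helper, and the assumption that each entry in `fields[]` carries `name` and `type` keys.

```python
from typing import Dict, Optional

# Assumed defaults; the values used by the real FormFiller are not documented here.
DEFAULTS_BY_TYPE = {
    "email": "test@example.com",
    "password": "Passw0rd!",
    "search": "test",
    "tel": "5550100",
    "number": "1",
    "url": "https://example.com",
}
FALLBACK = "test"


def pick_value(field: Dict[str, str], overrides: Optional[Dict[str, str]] = None) -> str:
    """Choose a value for one form field: override, then type, then name hint."""
    overrides = overrides or {}
    name = field.get("name", "").lower()
    ftype = field.get("type", "text").lower()
    if name in overrides:
        return overrides[name]           # caller-supplied value wins
    if ftype in DEFAULTS_BY_TYPE:
        return DEFAULTS_BY_TYPE[ftype]   # match on the input type
    for key, value in DEFAULTS_BY_TYPE.items():
        if key in name:                  # fall back to a hint in the field name
            return value
    return FALLBACK
```

For example, `pick_value({"name": "q", "type": "search"}, {"q": "admin"})` returns the override instead of the search default.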

**brute** skips extraction and only handles and saves the results of fetches:
```mermaid
flowchart LR
  start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
  qf --> fetchers[Fetchers<br/>multi-threading]
  fetchers -->|TaskHTMLHandle| qh[QueueHTMLHandle<br/>priority keys deep url content]
  qh --> html[HTML Handler]
  html -->|TaskSave if item| qs[QueueSave<br/>priority keys deep url item]
  qs --> saver[Saver]

  proxieser[Proxieser] -.->|optional| qp[QueueProxies]
  qp -.->|optional| fetchers
```

#### How to change settings
In addition to command-line arguments, ansferatu reads a settings file that controls:
  - blacklisted extensions for requests
  - blacklisted extensions for parsing
  - number of HTTP request workers
  - number of CPU-bound workers
  - HTTP `error_limit`
  - limit on requests to a single host
  - HTTP request headers
  - content types ignored in the report
  - deduplication mode
  
The default file is stored in `modules\settings\default_config.yaml`.

To change settings, copy `modules\settings\default_config.yaml` to `modules\settings\config.yaml` and edit `config.yaml`.
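
The copy-then-edit convention suggests a simple lookup order: use `config.yaml` when it exists, otherwise fall back to the default file. The sketch below only illustrates that order; `pick_config` is a hypothetical helper and the actual loading code in ansferatu may work differently.

```python
from pathlib import Path


def pick_config(settings_dir: Path) -> Path:
    """Prefer the user's config.yaml, fall back to default_config.yaml."""
    custom = settings_dir / "config.yaml"
    if custom.is_file():
        return custom
    return settings_dir / "default_config.yaml"
```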

#### How we avoid loops

`checkRecursion()` checks whether the crawler has started requesting the same path over and over, for example `/blog/article/blog/article/...`. This sometimes happens because URL extraction is imperfect.

`check_limits()` checks how many times we have accessed a parent directory.  
How it works, using http://www.example.com/blog/articles/my_article_1.php as an example:  
1. We check how many times we have visited http://www.example.com/blog/articles/.
2. If the count exceeds `crawl_limit`, we mark this path as `over_limit_pages`.
3. We add +1 to the crawl limit for the upper path (http://www.example.com/blog/).
4. Go to step 1 (if the upper path also contains a large number of URLs, this loop is avoided too).

Step by step, if all limits are exceeded, visiting the website is eventually banned.
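
Both checks can be sketched roughly as follows. This is not the code behind `checkRecursion()` or `check_limits()`: `looks_recursive`, `parent_dir`, and `ParentLimiter` are hypothetical names, the repeat threshold is made up, and step 3 is read here as "charge one visit to the parent directory", which is an assumption.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def looks_recursive(url, max_repeats=2):
    """Flag paths in which any segment occurs more than max_repeats times."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(count > max_repeats for count in Counter(segments).values())


def parent_dir(url):
    """Return the parent directory URL, or None for the site root."""
    parts = urlparse(url)
    path = parts.path.rstrip("/")
    if not path:
        return None
    return f"{parts.scheme}://{parts.netloc}{path.rsplit('/', 1)[0]}/"


class ParentLimiter:
    def __init__(self, crawl_limit):
        self.crawl_limit = crawl_limit
        self.visits = defaultdict(int)   # directory URL -> visit count
        self.over_limit = set()          # banned directories

    def allow(self, url):
        directory = parent_dir(url)
        if directory is None:
            return True                  # the site root itself
        # A banned ancestor bans everything below it.
        ancestor = directory
        while ancestor is not None:
            if ancestor in self.over_limit:
                return False
            ancestor = parent_dir(ancestor)
        self.visits[directory] += 1
        allowed = True
        # Over the limit: ban the directory and charge its parent, repeating upward.
        while directory is not None and self.visits[directory] > self.crawl_limit:
            self.over_limit.add(directory)
            allowed = False
            directory = parent_dir(directory)
            if directory is not None:
                self.visits[directory] += 1
        return allowed
```

Once the root directory itself ends up in `over_limit`, every later URL on that host is refused, which matches the "ban the website" outcome described above.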

#### How retries work
There are two types of error limit:
1. A limit on retrying a URL
2. A limit on adding the same URL to the queue

The retries limit should be lower than the error limit.

When a URL returns a connection error, we retry it until the retries limit is reached, then leave the URL alone for a while.
We then continue adding URLs to the queue (it may start answering after a while); if it is still unavailable, we ban it. If the URL does answer, we reset the count.
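
One way to picture the two limits is a per-URL error counter with three outcomes. This is a simplified sketch under assumptions: `ErrorTracker`, its method names, the default limits, and the use of a single shared counter are all hypothetical and not taken from the package.

```python
from collections import defaultdict


class ErrorTracker:
    def __init__(self, retries_limit=2, error_limit=5):
        if retries_limit >= error_limit:
            raise ValueError("retries_limit must be lower than error_limit")
        self.retries_limit = retries_limit
        self.error_limit = error_limit
        self.errors = defaultdict(int)   # url -> consecutive connection errors
        self.banned = set()

    def on_error(self, url):
        """Return 'retry', 'defer' (wait until the URL is queued again), or 'ban'."""
        self.errors[url] += 1
        count = self.errors[url]
        if count >= self.error_limit:
            self.banned.add(url)
            return "ban"
        if count <= self.retries_limit:
            return "retry"
        return "defer"

    def on_success(self, url):
        """A successful answer resets the error count."""
        self.errors.pop(url, None)
```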

#### Wappalyzer role

Wappalyzer works with the app.json file. This file contains a database of regular expressions for searching the server response (cookies, headers, scripts, text in HTML, etc.).


The idea is to use Wappalyzer's regex engine to search for “bad places”:

- All inputs 
```
<input type="email">
<input type="password">
<input type="search">
<input type="submit">
```
- SSRF 
```
formcontrolname="url"
```
- Submit buttons
```
<button class="aa" type="submit">Search</button>
```
- File uploads
```
<input type="file">
```

Wappalyzer could be used as a simple vulnerability scanner:
1. Send a specific request.
2. Run a regexp search on the server's answer.
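
The regexp-search step can be illustrated with the standard `re` module. The patterns below are written for this example only and are not taken from app.json; `PATTERNS` and `find_interesting` are hypothetical names.

```python
import re

# Illustrative patterns only; not the contents of app.json.
PATTERNS = {
    "input": re.compile(r'<input\b[^>]*\btype="(?:email|password|search|submit)"', re.I),
    "file_upload": re.compile(r'<input\b[^>]*\btype="file"', re.I),
    "ssrf_hint": re.compile(r'formcontrolname="url"', re.I),
    "submit_button": re.compile(r'<button\b[^>]*\btype="submit"', re.I),
}


def find_interesting(html):
    """Return {pattern name: number of matches} for patterns that matched."""
    hits = {}
    for name, pattern in PATTERNS.items():
        count = len(pattern.findall(html))
        if count:
            hits[name] = count
    return hits
```

Regular expressions over raw HTML are approximate (attribute order and quoting vary), so results like these are hints for manual review, not confirmed findings.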


#### Deduplication
- Content length + word_count
- Content length prediction (not fully tested)
- To Do: Similarity check
   - Check changes in HTML (search for new functions)
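
The "content length + word_count" mode can be pictured as a tiny fingerprint. This is a minimal sketch with hypothetical names (`response_fingerprint`, `Deduplicator`); it assumes the length is measured in UTF-8 bytes and words are split on whitespace.

```python
def response_fingerprint(body):
    """Fingerprint a response body as (byte length, word count)."""
    return (len(body.encode("utf-8")), len(body.split()))


class Deduplicator:
    def __init__(self):
        self.seen = set()

    def is_duplicate(self, body):
        """True if a body with the same fingerprint was already seen."""
        key = response_fingerprint(body)
        if key in self.seen:
            return True
        self.seen.add(key)
        return False
```

Two different pages can share the same length and word count, so this check trades some accuracy for speed.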

### Development

Editable install (changes to the source are picked up immediately):
```bash
pip3 install -e '.[headless,dev]'
```

Run the test suite:
```bash
pytest
```

### Building & publishing to PyPI

The project is configured with `pyproject.toml` (PEP 621). To build the
distribution artifacts (source distribution + wheel):
```bash
pip3 install build
python3 -m build          # writes dist/ansferatu-<version>.tar.gz and .whl
```

Validate and upload with [Twine](https://twine.readthedocs.io/):
```bash
pip3 install twine
twine check dist/*

# Test upload first (recommended): https://test.pypi.org
twine upload --repository testpypi dist/*

# Real upload
twine upload dist/*
```

Notes:
- Bump `version` in `pyproject.toml` (and `__version__` in `ansferatu/__init__.py`)
  before each release; PyPI rejects re-uploads of an existing version.
- Uploading requires a PyPI account and an API token (configure it via
  `~/.pypirc` or the `TWINE_USERNAME=__token__` / `TWINE_PASSWORD=<token>`
  environment variables).
- The package name `ansferatu` must be available on PyPI for the first upload.
