Metadata-Version: 2.4
Name: memorious4
Version: 4.0.1
Summary: A minimalistic, recursive web crawling library for Python.
License: AGPLv3+
License-File: LICENSE
License-File: NOTICE
Author: Organized Crime and Corruption Reporting Project
Author-email: data@occrp.org
Requires-Python: >=3.11,<3.14
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: ftp
Provides-Extra: postgres
Provides-Extra: redis
Provides-Extra: sql
Requires-Dist: PySocks (==1.7.1) ; extra == "ftp"
Requires-Dist: alephclient (>=2.6.0,<3.0.0)
Requires-Dist: anystore (>=1.1.9,<2.0.0)
Requires-Dist: banal (>=1.0.6,<2.0.0)
Requires-Dist: dateparser (>=1.2.1,<2.0.0)
Requires-Dist: fakeredis (>=2.26.2,<3.0.0) ; extra == "redis"
Requires-Dist: followthemoney (>=4.6.2,<5.0.0)
Requires-Dist: ftm-lakehouse (>=0.4.0,<0.5.0)
Requires-Dist: ftmq (>=4.8.1,<5.0.0)
Requires-Dist: furl (>=2.1.0,<3.0.0)
Requires-Dist: httpx (>=0.28.1,<0.29.0)
Requires-Dist: jinja2 (>=3.0.0,<4.0.0)
Requires-Dist: jq (>=1.6.0,<2.0.0)
Requires-Dist: legacy-cgi (>=2.6.3,<3.0.0)
Requires-Dist: lxml[html-clean] (>=6.0.2,<7.0.0)
Requires-Dist: normality (>=3.0.2,<4.0.0)
Requires-Dist: openaleph-procrastinate (>=5.2.2,<6.0.0)
Requires-Dist: psycopg2 (>=2.9.10,<3.0.0) ; extra == "postgres"
Requires-Dist: python-dateutil (>=2.9.0.post0,<3.0.0)
Requires-Dist: redis (>=4.0.0,<6.0.0) ; extra == "redis"
Requires-Dist: requests-ftp (>=0.3.1,<0.4.0) ; extra == "ftp"
Requires-Dist: requests[security] (>=2.32.3,<3.0.0) ; extra == "ftp"
Requires-Dist: rigour (>=1.7.5,<2.0.0)
Requires-Dist: sqlalchemy (>=2.0.36,<3.0.0) ; extra == "postgres"
Requires-Dist: sqlalchemy (>=2.0.36,<3.0.0) ; extra == "sql"
Requires-Dist: stringcase (>=1.2.0,<2.0.0)
Project-URL: Documentation, https://docs.investigraph.dev/lib/memorious
Project-URL: Homepage, https://docs.investigraph.dev/lib/memorious
Project-URL: Issues, https://github.com/dataresearchcenter/memorious/issues
Project-URL: Repository, https://github.com/dataresearchcenter/memorious
Description-Content-Type: text/markdown

[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://docs.investigraph.dev/lib/memorious/)
[![memorious4 on pypi](https://img.shields.io/pypi/v/memorious4)](https://pypi.org/project/memorious4/)
[![PyPI Downloads](https://static.pepy.tech/badge/memorious4/month)](https://pepy.tech/projects/memorious4)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/memorious4)](https://pypi.org/project/memorious4/)
[![Python test and package](https://github.com/dataresearchcenter/memorious/actions/workflows/python.yml/badge.svg)](https://github.com/dataresearchcenter/memorious/actions/workflows/python.yml)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
[![Coverage Status](https://coveralls.io/repos/github/dataresearchcenter/memorious/badge.svg?branch=main)](https://coveralls.io/github/dataresearchcenter/memorious?branch=main)
[![AGPLv3+ License](https://img.shields.io/pypi/l/memorious4)](./LICENSE)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)

# Memorious

> The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.
>
> -- [Funes the Memorious](http://users.clas.ufl.edu/burt/spaceshotsairheads/borges-funes.pdf),
> Jorge Luis Borges

`memorious` is a lightweight web scraping toolkit. It supports scrapers that
collect structured or unstructured data. Its goals are to:

* Make crawlers modular and simple tasks reusable
* Provide utility functions for common tasks such as data storage and HTTP session management
* Integrate crawlers with the Aleph and FollowTheMoney ecosystem
* Get out of your way as much as possible

`memorious` is part of the [OpenAleph](https://openaleph.org) suite but can be used standalone as well.

## Design

When writing a scraper, you often need to paginate through an index
page, then download an HTML page for each result, and finally parse that page
and insert or update a record in a database.

`memorious` handles this by managing a set of `crawlers`, each of which
can be composed of multiple `stages`. Each `stage` is implemented using a
Python function, which can be reused across different `crawlers`.
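
To illustrate the idea of stages as reusable functions, here is a minimal, self-contained sketch in plain Python. It does not use the `memorious` API: the stage names, the `run_crawler` helper, the `(next_stage, data)` convention and the `example.org` URLs are all assumptions made up for this example, and the HTTP request is replaced by a stand-in so the code runs offline.

```python
from collections import deque
from typing import Callable, Iterable

# Hypothetical convention for this sketch only: a stage receives one data
# dict and yields (next_stage_name, data) pairs for the stages that follow.
Stage = Callable[[dict], Iterable[tuple[str, dict]]]


def paginate(data: dict) -> Iterable[tuple[str, dict]]:
    """Walk an index and hand one URL per page to the 'fetch' stage."""
    for page in range(1, data["pages"] + 1):
        yield "fetch", {"url": f"https://example.org/index?page={page}"}


def fetch(data: dict) -> Iterable[tuple[str, dict]]:
    """Stand-in for an HTTP request: fabricate a tiny HTML document."""
    html = f"<title>{data['url']}</title>"
    yield "parse", {**data, "html": html}


def parse(data: dict) -> Iterable[tuple[str, dict]]:
    """Extract the title from the fake HTML and pass a record to 'store'."""
    title = data["html"].removeprefix("<title>").removesuffix("</title>")
    yield "store", {"url": data["url"], "title": title}


def make_store(records: list[dict]) -> Stage:
    """Build a terminal stage that appends records to an in-memory list."""
    def store(data: dict) -> Iterable[tuple[str, dict]]:
        records.append(data)
        return []  # terminal stage: nothing is passed on
    return store


def run_crawler(stages: dict[str, Stage], start: str, data: dict) -> None:
    """Process queued (stage, data) pairs until no work is left."""
    queue = deque([(start, data)])
    while queue:
        name, item = queue.popleft()
        queue.extend(stages[name](item))


records: list[dict] = []
stages = {
    "paginate": paginate,
    "fetch": fetch,
    "parse": parse,
    "store": make_store(records),
}
run_crawler(stages, "paginate", {"pages": 2})
```

Because each stage only depends on the data it receives, the same `fetch` or `store` function could be wired into a different crawler simply by listing it under another mapping of stage names.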

The basic steps of writing a Memorious crawler:

1. Create a YAML crawler configuration file
2. Add different stages
3. Write code for stage operations (optional)
4. Test, rinse, repeat

## Documentation

The documentation for Memorious is available at
[docs.investigraph.dev/lib/memorious](https://docs.investigraph.dev/lib/memorious).
Feel free to edit the source files in the `docs` folder and send pull requests for improvements.

To serve the documentation locally, run `mkdocs serve`.

## License and Copyright


`memorious`, (C) -2024 Organized Crime and Corruption Reporting Project

`memorious`, (C) 2025 [Data and Research Center – DARC](https://dataresearchcenter.org)

`memorious4`, (C) 2026 [Data and Research Center – DARC](https://dataresearchcenter.org)

`memorious4` is licensed under AGPLv3 or later.

Prior to version 4.0.0, `memorious` was released under the MIT license.

See [NOTICE](./NOTICE) and [LICENSE](./LICENSE).

