Metadata-Version: 2.3
Name: wpextract
Version: 1.1.1
Summary: Create datasets from WordPress sites
License: Apache-2.0
Author: Freddy Heppell
Author-email: f.heppell@sheffield.ac.uk
Requires-Python: >=3.9
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: beautifulsoup4 (>=4.12.0)
Requires-Dist: click (>=8.0.1)
Requires-Dist: click-option-group (>=0.5.3)
Requires-Dist: langcodes (>=3.3.0)
Requires-Dist: lxml (>=5.0.0)
Requires-Dist: numpy (>=1.23.0)
Requires-Dist: pandas (>=1.5.2)
Requires-Dist: requests (>=2.32.3)
Requires-Dist: tqdm (>=4.65.0)
Requires-Dist: urllib3 (>1.21.3,<3)
Project-URL: Documentation, https://wpextract.readthedocs.io/
Project-URL: Homepage, https://wpextract.readthedocs.io/
Project-URL: Repository, https://github.com/GateNLP/wpextract
Description-Content-Type: text/markdown

# WPextract - WordPress Site Extractor

<a href="https://pypi.org/project/wpextract/"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/wpextract"></a>
<a href="https://anaconda.org/conda-forge/wpextract"><img alt="Conda Version" src="https://img.shields.io/conda/vn/conda-forge/wpextract"></a>
<a href="https://zenodo.org/doi/10.5281/zenodo.12725781"><img src="https://zenodo.org/badge/573084559.svg" alt="DOI"></a>

**WPextract is a tool to create datasets from WordPress sites.**

- Archives posts, pages, tags, categories, media (including files), comments, and users
- Uses the WordPress REST API to retrieve accurate and complete content
- Resolves internal links and media to IDs
- Automatically parses multilingual sites to create parallel datasets


## Quickstart

See the [complete documentation](https://wpextract.readthedocs.io/) for more detailed usage.

1. Install with `pipx`
    ```shell-session
    $ pipx install wpextract
    ```
2. Download site data
    ```shell-session
    $ wpextract download "https://example.org" out_dl
    ```
3. Process into a dataset
    ```shell-session
    $ wpextract extract out_dl out_data
    ```
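
The two commands above can also be chained from a script. The following is a minimal sketch, not part of WPextract itself: it assumes the `wpextract` executable is on your `PATH`, and the helper names `build_commands` and `run_pipeline` are hypothetical. It uses only the `download` and `extract` invocations shown in the steps above.

```python
import subprocess


def build_commands(site_url, download_dir, dataset_dir):
    """Return the two CLI invocations from the quickstart, in order.

    Hypothetical helper: it only assembles argument lists and
    does not run anything.
    """
    return [
        # Step 2: download raw site data into download_dir
        ["wpextract", "download", site_url, download_dir],
        # Step 3: process the downloaded data into a dataset
        ["wpextract", "extract", download_dir, dataset_dir],
    ]


def run_pipeline(site_url, download_dir, dataset_dir):
    """Run download then extract, stopping at the first failure."""
    for command in build_commands(site_url, download_dir, dataset_dir):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so extract is never run against a failed download.
        subprocess.run(command, check=True)
```

Calling `run_pipeline("https://example.org", "out_dl", "out_data")` would then be equivalent to running steps 2 and 3 by hand.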

## About WPextract

WPextract was built by [Freddy Heppell](https://freddyheppell.com) of the [GATE Project](https://gate.ac.uk) at the [School of Computer Science, University of Sheffield](https://sheffield.ac.uk/cs). It was originally created to scrape mis/disinformation websites for research.

## License

Available under the Apache 2.0 license. See [LICENSE](LICENSE) for more information.

## Citing

> [!NOTE]
> This software was developed for our EMNLP 2023 paper [_Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study_](https://aclanthology.org/2023.emnlp-main.349/). The code has been updated since the paper was written; for archival purposes, the precise version used for the study is [available on Zenodo](https://zenodo.org/records/10008086).

We'd love to hear about your use of our tool. You can [email us](mailto:frheppell1@sheffield.ac.uk) to let us know! Feel free to create issues or pull requests for new features or bugs.

If you use this tool in published work, please cite [our EMNLP paper](https://aclanthology.org/2023.emnlp-main.349/):

> Freddy Heppell, Kalina Bontcheva, and Carolina Scarton. 2023. Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5729–5741, Singapore. Association for Computational Linguistics.

Permanent references to each release of this software are available from [Zenodo](https://zenodo.org/doi/10.5281/zenodo.12725781).
