Metadata-Version: 2.4
Name: wmsdump
Version: 0.1.6
Summary: Library for extraction of data from WMS/WFS endpoints
Project-URL: homepage, https://github.com/ramSeraph/pywmsdump
Project-URL: repository, https://github.com/ramSeraph/pywmsdump.git
Author-email: Sreeram Kandimalla <kandimalla.sreeram@gmail.com>
License: This is free and unencumbered software released into the public domain.
        
        Anyone is free to copy, modify, publish, use, compile, sell, or
        distribute this software, either in source code form or as a compiled
        binary, for any purpose, commercial or non-commercial, and by any
        means.
        
        In jurisdictions that recognize copyright laws, the author or authors
        of this software dedicate any and all copyright interest in the
        software to the public domain. We make this dedication for the benefit
        of the public at large and to the detriment of our heirs and
        successors. We intend this dedication to be an overt act of
        relinquishment in perpetuity of all present and future rights to this
        software under copyright law.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
        EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
        MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
        IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
        OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
        ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
        OTHER DEALINGS IN THE SOFTWARE.
        
        For more information, please refer to <https://unlicense.org>
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: bs4>=0.0.2
Requires-Dist: click>=8.1.7
Requires-Dist: colorlog>=6.9.0
Requires-Dist: jsonschema>=4.23.0
Requires-Dist: kml2geojson>=5.1.0
Requires-Dist: requests>=2.32.3
Requires-Dist: xmltodict>=0.14.2
Provides-Extra: proj
Requires-Dist: pyproj>=3.7.0; extra == 'proj'
Provides-Extra: punch-holes
Requires-Dist: geoindex-rs>=0.2.0; extra == 'punch-holes'
Requires-Dist: numpy>=2.2.1; extra == 'punch-holes'
Requires-Dist: shapely>=2.0.6; extra == 'punch-holes'
Description-Content-Type: text/markdown

# wmsdump [![PyPI - Latest Version](https://img.shields.io/pypi/v/wmsdump)](https://pypi.org/project/wmsdump/) [![GitHub Tag](https://img.shields.io/github/v/tag/ramSeraph/pywmsdump?filter=v*)](https://github.com/ramSeraph/pywmsdump/releases/latest)

A library and command-line tool for extracting **vector layer** data from OGC services (WMS, WFS).

> **Note:** This tool only supports vector layers. Raster layers are not supported.

## Features

*   **Supports WMS and WFS:** Extracts data from both Web Map Service (WMS) and Web Feature Service (WFS) endpoints.
*   **Flexible Retrieval Modes:** Offers `OFFSET` (paged retrieval) and `EXTENT` (bbox splitting and drilling down by spatial extent) retrieval modes for efficient data extraction. `EXTENT` mode also deduplicates the features it retrieves.
*   **Multiple Retrieval Formats:** Supports KML and GeoRSS formats when retrieving data from WMS GetMap operations. Output is always GeoJSONL (GeoJSONSeq).
*   **Geometry Precision Control:** Allows truncating geometry coordinates to a specified decimal point precision.
*   **State Management:**  Persists extraction state to allow resuming interrupted downloads.
*   **Geoserver and QGIS Server Flavor Support:** Handles vendor-specific differences for GetFeatureInfo based retrieval from WMS.
*   **Error Handling:** Provides informative error messages and handles common service exceptions.
*   **Configuration:** Customizable through command-line options.
*   **KML Postprocessing:** Offers options to strip superfluous points in Polygon/LineString geometry collections and to keep the original style-related properties.
*   **Hole Punching:** Includes a utility that removes overlaps between polygons by punching holes, to work around the shortcomings of GeoRSS-based retrieval.
*   **Capabilities Exploration:** Can explore services via a GetCapabilities request or by scraping the GeoServer web page. Partial parsing of an incomplete or corrupt capabilities XML response is supported.

## Installation

1.  **Using `pip`:**

    ```bash
    pip install wmsdump
    ```

2.  **Using `uv` (recommended):**

    `wmsdump` uses `uv` for package management and dependency resolution. `uv` is a faster alternative to `pip`.

    To install `uv`, see https://docs.astral.sh/uv/getting-started/installation

    ```bash
    # Install wmsdump using uv
    uv pip install wmsdump
    ```

    You can also run the tools directly:

    ```bash
    uvx --from wmsdump wms-extractor <args>
    ```

    For this invocation, `uv` creates a temporary virtual environment and manages the dependencies for you.


    For the optional `punch-holes` extra (needed for the `punch-holes` utility), use:

    ```bash
    uv pip install "wmsdump[punch-holes]"
    ```

    or

    ```bash
    pip install "wmsdump[punch-holes]"
    ```

    For the optional `proj` extra (needed for retrieving data in projections other than EPSG:4326 or EPSG:3857), use:

    ```bash
    uv pip install "wmsdump[proj]"
    ```

    or

    ```bash
    pip install "wmsdump[proj]"
    ```

## Usage

`wmsdump` provides the command-line tool `wms-extractor`, which has two main commands, `explore` and `extract`, along with the standalone utilities `geojsonl-dedupe` and `punch-holes`.

### Common Options

The following options are available on both `explore` and `extract` subcommands:

*   `--log-level`: Log level. One of `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Defaults to `INFO`.
*   `--no-ssl-verify`: Switch off SSL verification for all network calls.
*   `--request-timeout`: Timeout for HTTP requests, in seconds. Defaults to no timeout.
*   `--header`: Header to add to all network requests, in the format `Key:Value`. Can be used multiple times.
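
As an illustration of the `Key:Value` format accepted by `--header`, here is a minimal sketch of how repeated header strings could be turned into a dictionary. The function name `parse_headers` is hypothetical and this is not necessarily how `wms-extractor` parses the option internally.

```python
def parse_headers(raw_headers):
    """Turn ["Key:Value", ...] strings into a dict of HTTP headers."""
    headers = {}
    for raw in raw_headers:
        # Split on the first colon only, so values such as URLs keep their colons.
        key, sep, value = raw.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected 'Key:Value', got {raw!r}")
        headers[key.strip()] = value.strip()
    return headers


example = parse_headers([
    "User-Agent: my-client/1.0",
    "Referer:https://example.com/map",
])
print(example)
```

Splitting on the first colon only matters for values that themselves contain colons, such as the `Referer` URL above.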

### 1. Explore

The `explore` command helps discover available layers and service information.

```bash
wms-extractor explore --help
```

**Options:**

*   `--geoserver-url`: URL of the GeoServer endpoint. The service endpoint is assumed to be `<geoserver_url>/ows`.
*   `--service-url`: URL of the WMS/WFS endpoint from which to probe for capabilities. If not provided, it is derived from `--geoserver-url`.
*   `--service`: Service to use (WMS or WFS). Defaults to WFS.
*   `--service-version`: The protocol version to use. Defaults to '1.1.1' for WMS and '1.0.0' for WFS.
*   `--namespace`: Only look for layers in a given namespace (Geoserver specific).
*   `--output-file`: File to write the layer list to.
*   `--scrape-webpage`: Scrape the GeoServer web page instead of reading capabilities. Useful when capabilities are broken.

**Examples:**

```bash
# Explore WFS layers from a GeoServer endpoint
wms-extractor explore --geoserver-url http://example.com/geoserver

# Explore WMS layers from a specific URL
wms-extractor explore --service-url http://example.com/wms --service WMS

# Scrape the GeoServer web page for layers
wms-extractor explore --geoserver-url http://example.com/geoserver --scrape-webpage

# Write layer list to a file
wms-extractor explore --geoserver-url http://example.com/geoserver --output-file layers.txt
```
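
To make the URL derivation concrete, the following sketch builds a GetCapabilities request URL from a GeoServer base URL, using the `<geoserver_url>/ows` convention and the default versions listed above. The helper name `capabilities_url` is hypothetical; the query parameters are the standard OGC `service`, `version` and `request` keys, and the exact request that `wms-extractor` sends may differ.

```python
from urllib.parse import urlencode

# Default protocol versions as documented for --service-version.
DEFAULT_VERSIONS = {"WMS": "1.1.1", "WFS": "1.0.0"}


def capabilities_url(geoserver_url, service="WFS", version=None):
    """Build a GetCapabilities URL, assuming the endpoint is <geoserver_url>/ows."""
    service_url = geoserver_url.rstrip("/") + "/ows"
    params = {
        "service": service,
        "version": version or DEFAULT_VERSIONS[service],
        "request": "GetCapabilities",
    }
    return f"{service_url}?{urlencode(params)}"


print(capabilities_url("http://example.com/geoserver"))
# http://example.com/geoserver/ows?service=WFS&version=1.0.0&request=GetCapabilities
```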

### 2. Extract

The `extract` command extracts data from a specified layer.

```bash
wms-extractor extract --help
```

**Arguments:**

*   `LAYERNAME`: Name of the layer to extract.
*   `OUTPUT_FILE`: Output file to write the GeoJSONL features to. If not provided, a filename is derived from `LAYERNAME`.
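
Because the output holds one GeoJSON feature per line, it can be consumed as a stream without loading the whole file. A minimal sketch, using an in-memory sample in place of a real output file (the sample features and the `iter_features` helper are made up for illustration):

```python
import io
import json


def iter_features(lines):
    """Yield one GeoJSON feature dict per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open("output.geojsonl", encoding="utf-8").
sample = io.StringIO(
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [77.5, 12.9]}, "properties": {"name": "a"}}\n'
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [78.4, 17.3]}, "properties": {"name": "b"}}\n'
)

names = [feature["properties"]["name"] for feature in iter_features(sample)]
print(names)
# ['a', 'b']
```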

**Options:**

*   `--output-dir`: Directory to write output files in (only used when `OUTPUT_FILE` is not given). Defaults to the current directory.
*   `--geoserver-url`: URL of the GeoServer endpoint. `--service-url` is assumed to be `<geoserver_url>/[<layer_namespace>/]ows`.
*   `--service-url`: URL of the WMS/WFS endpoint from which to retrieve data. If not provided, it is derived from `--geoserver-url`.
*   `--service`: Service to use (WMS or WFS). Defaults to WFS.
*   `--service-version`: The protocol version to use. Defaults to '1.1.1' for WMS and '1.0.0' for WFS.
*   `--retrieval-mode`: Which method to use for batch record retrieval (`OFFSET`, `EXTENT`, or `EXTENT_FIXED_BUFFER`). Defaults to `OFFSET`.
*   `--operation`: Which operation to use for querying the service. WMS supports `GetMap` or `GetFeatureInfo`; WFS uses `GetFeature` (auto-selected). Defaults to `GetMap` for WMS.
*   `--flavor`: Vendor of the WMS service (`Geoserver` or `QGISserver`), useful to specify for GetFeatureInfo based retrieval. Defaults to `Geoserver`.
*   `--sort-key`: Key to sort by for paged retrieval (needed only when the server requires one).
*   `--batch-size`: Batch size to use for retrieval. Defaults to 1000.
*   `--pause-seconds`: Number of seconds to pause between batches of requests. Defaults to 2.
*   `--requests-to-pause`: Number of requests to make before pausing. Defaults to 10.
*   `--max-attempts`: Number of times to attempt a request before giving up. Defaults to 5.
*   `--retry-delay`: Number of seconds to wait before retrying on failure (delay is incremented for each failure). Defaults to 5.
*   `--geometry-precision`: Decimal point precision of geometry to be returned (-1 means no truncation). Defaults to -1.
*   `--getmap-format`: Format to use while pulling using WMS GetMap (`KML` or `GEORSS`). Defaults to `KML`.
*   `--kml-strip-point`: Whether to strip the points in polygons and linestring geomcollections (KML specific). Defaults to `True`.
*   `--kml-keep-original-props`: Whether to keep the original style-related properties in KML conversion. Defaults to `False`.
*   `--out-srs`: CRS to request data in. Defaults to `EPSG:4326`.
*   `--bounds`: Bounding box to restrict the query to (format: `<xmin>,<ymin>,<xmax>,<ymax>`).
*   `--max-box-dims`: When querying using EXTENT mode, the maximum size of the bounding box to use (format: `<deltax>,<deltay>`).
*   `--fixed-buffer`: Pixel buffer size for `EXTENT_FIXED_BUFFER` mode. Required when using `EXTENT_FIXED_BUFFER` retrieval mode with `GetFeatureInfo`.
*   `--wms-map-size`: Virtual map size in pixels for WMS requests (default 256). Primarily affects `GetFeatureInfo` calls where it determines the query point and buffer calculations.
*   `--custom-dumper`: Path to a Python file containing a `SpecialDumper` class that subclasses `OGCServiceDumper` to override default behavior.
*   `--skip-index`: Skip n elements in index (useful to skip records causing failure, only applicable for OFFSET retrieval).  Defaults to 0.
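
The idea behind the `EXTENT` retrieval mode (bbox splitting and drilling down) can be sketched as follows. This is a generic illustration of the technique, not `wmsdump`'s actual implementation: the functions `split_bbox`, `drill_down` and `fake_fetch`, the quadrant split, the `id` field used for deduplication and the `max_depth` guard are all assumptions made for the example.

```python
def split_bbox(bbox):
    """Split (xmin, ymin, xmax, ymax) into four equal quadrants."""
    xmin, ymin, xmax, ymax = bbox
    xmid = (xmin + xmax) / 2
    ymid = (ymin + ymax) / 2
    return [
        (xmin, ymin, xmid, ymid),
        (xmid, ymin, xmax, ymid),
        (xmin, ymid, xmid, ymax),
        (xmid, ymid, xmax, ymax),
    ]


def drill_down(bbox, fetch, batch_size, seen=None, depth=0, max_depth=16):
    """Yield unique features, splitting any box whose result may be truncated."""
    seen = set() if seen is None else seen
    features = fetch(bbox, batch_size)
    if len(features) >= batch_size and depth < max_depth:
        # A full batch may hide more features, so query smaller boxes instead.
        for quadrant in split_bbox(bbox):
            yield from drill_down(quadrant, fetch, batch_size, seen, depth + 1, max_depth)
        return
    for feature in features:
        # Features on a shared edge are returned by several boxes; keep the first.
        if feature["id"] not in seen:
            seen.add(feature["id"])
            yield feature


# Hypothetical stand-in for a server: a 10 x 10 grid of points.
POINTS = [{"id": i, "x": float(i % 10), "y": float(i // 10)} for i in range(100)]


def fake_fetch(bbox, limit):
    xmin, ymin, xmax, ymax = bbox
    hits = [p for p in POINTS if xmin <= p["x"] <= xmax and ymin <= p["y"] <= ymax]
    return hits[:limit]


result = list(drill_down((0.0, 0.0, 8.0, 8.0), fake_fetch, batch_size=20))
print(len(result))
# 81
```

In this example the points with `x == 4` or `y == 4` sit on quadrant edges and are returned by more than one box, which is why the deduplication step is needed.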

**Examples:**

```bash
# Extract data from a WFS layer
wms-extractor extract my_layer output.geojsonl --geoserver-url http://example.com/geoserver

# Extract data from a WMS layer using GetMap with GeoRSS format
wms-extractor extract my_layer output.geojsonl --service WMS --service-url http://example.com/wms --getmap-format GEORSS

# Extract data and truncate geometry to 3 decimal places
wms-extractor extract my_layer output.geojsonl --geoserver-url http://example.com/geoserver --geometry-precision 3

# Extract data with bounding box
wms-extractor extract my_layer output.geojsonl --geoserver-url http://example.com/geoserver --bounds -180,-90,180,90
```
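
To show what a setting such as `--geometry-precision 3` means for the coordinates, here is a minimal sketch that truncates nested GeoJSON coordinate arrays to a fixed number of decimal places. The helper names are hypothetical, and the sketch assumes truncation toward zero; whether the tool truncates or rounds, and whether this happens on the client or the server, is not specified here.

```python
import math


def truncate(value, precision):
    """Truncate a number toward zero to `precision` decimals (-1 keeps it as-is)."""
    if precision < 0:
        return value
    factor = 10 ** precision
    return math.trunc(value * factor) / factor


def truncate_coords(coords, precision):
    """Apply `truncate` to every number in a nested coordinate array."""
    if isinstance(coords, (int, float)):
        return truncate(coords, precision)
    return [truncate_coords(item, precision) for item in coords]


line = [[12.3456789, -77.123987], [12.5, -77.0]]
print(truncate_coords(line, 3))
# [[12.345, -77.123], [12.5, -77.0]]
```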

### 3. Deduplicate GeoJSONL

This command removes duplicate features from a GeoJSONL file. Features are considered duplicates if they have identical geometry and properties. Deduplication is performed by hashing features and detecting collisions.

```bash
geojsonl-dedupe --help
```

**Arguments:**

*   `INPUT-FILE`: The input GeoJSONl file to deduplicate (required)
*   `OUTPUT-FILE`: The output GeoJSONl file. If not provided, writes to `deduped_<INPUT-FILE>`

**Options:**

*   `--log-level`, `-l`: Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to INFO.
*   `--use-offset/--use-ram`: Use file offset for collision checks (default) or keep features in RAM. Using file offset is more memory-efficient for large files.

**Example:**

```bash
# Deduplicate using file offset method (memory-efficient)
geojsonl-dedupe input.geojsonl output.geojsonl

# Deduplicate keeping features in RAM (faster but uses more memory)
geojsonl-dedupe input.geojsonl output.geojsonl --use-ram

# Auto-generate output filename
geojsonl-dedupe input.geojsonl
```

**Note:** The `EXTENT` retrieval mode includes built-in deduplication to handle features that may appear in overlapping spatial extents during extraction. This tool is useful for post-processing or cleaning up data from other sources.
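
The hashing approach described above can be sketched as follows. This is a simplified illustration rather than the tool's actual code: it treats two features with the same SHA-256 digest of their canonical JSON as duplicates and omits the explicit collision check (via file offsets or RAM) that `geojsonl-dedupe` performs. The names `feature_key` and `dedupe` are hypothetical.

```python
import hashlib
import io
import json


def feature_key(feature):
    """Hash geometry and properties in a key-order-independent way."""
    canonical = json.dumps(
        {"geometry": feature.get("geometry"), "properties": feature.get("properties")},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(lines):
    """Yield each distinct feature line once, preserving first-seen order."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue
        key = feature_key(json.loads(line))
        if key in seen:
            continue
        seen.add(key)
        yield line.rstrip("\n")


# Stand-in for an input file; the second line repeats the first with keys reordered.
sample = io.StringIO(
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [1, 2]}, "properties": {"a": 1, "b": 2}}\n'
    '{"type": "Feature", "properties": {"b": 2, "a": 1}, "geometry": {"coordinates": [1, 2], "type": "Point"}}\n'
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [3, 4]}, "properties": {"a": 1, "b": 2}}\n'
)

unique = list(dedupe(sample))
print(len(unique))
# 2
```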

### 4. Punch Holes (Optional)

This command is available if `wmsdump` is installed with the `punch-holes` extra. It removes overlaps in a GeoJSONL file by punching holes where polygons overlap. This is useful for cleaning up problems that arise when data is extracted using the GeoRSS format, which cannot represent polygons with holes.

```bash
punch-holes --help
```

**Arguments:**

*   `INPUT_FILE`: The input GeoJSONl file to process
*   `OUTPUT_FILE`: The output GeoJSONl file. If none provided, writes the results to `fixed_<INPUT_FILE>`

**Options:**

*   `--index-in-mem`: Whether the spatial index keeps the geometry data in memory or just the offset of the features on disk.
*   `--keep-map-file`:  Whether to keep the overlap map temporary file (debugging purposes).

**Example:**

```bash
punch-holes input.geojsonl output.geojsonl
```

## State Management

`wmsdump` automatically creates a `.state` file alongside the output file. This file stores the progress of the extraction. If the extraction is interrupted, `wmsdump` will resume from the last known state when run again with the same parameters. To start a new extraction, delete both the output file and the `.state` file.

## Environment Variables

*   `WMSDUMP_SAVE_RESPONSE_TO_FILE`: If set, the raw HTTP response from the OGC service will be saved to the specified file. This is useful for debugging.

## Dependencies

*   `bs4` (Beautiful Soup 4)
*   `click`
*   `colorlog`
*   `jsonschema`
*   `kml2geojson`
*   `requests`
*   `xmltodict`

**Optional:**

*   `geoindex-rs` (required for `punch-holes`)
*   `numpy` (required for `punch-holes`)
*   `shapely` (required for `punch-holes`)
*   `pyproj` (required for retrieving data in projections other than EPSG:4326 or EPSG:3857)

## Contributing

Contributions are welcome! Please submit bug reports, feature requests, and pull requests through GitHub.

## License

This project is released under the Unlicense - see the `LICENSE` file for details.

## Credits

This was heavily inspired by a similar tool for ESRI endpoints: [openaddresses/pyesridump](https://github.com/openaddresses/pyesridump).

[datta07](https://github.com/datta07) pointed out to me that this was possible. Some of the GeoRSS parsing code is also based on prior work by [datta07](https://github.com/datta07), [answerquest](https://github.com/answerquest) and [devdattaT](https://github.com/devdattaT).

