Metadata-Version: 2.4
Name: groundsource
Version: 0.1.1
Summary: Python package for Google's Groundsource flash flood dataset — 2.6M events, 150+ countries, 2000–2026
Author: Sharath Sivamalaisamy
License-Expression: MIT
Project-URL: Homepage, https://github.com/SharathSivamalaisamy/groundsource
Project-URL: Repository, https://github.com/SharathSivamalaisamy/groundsource
Project-URL: Issues, https://github.com/SharathSivamalaisamy/groundsource/issues
Keywords: flood,flash-flood,climate,groundsource,google,geospatial,dataset,gemini,natural-disaster,news-mining
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Topic :: Scientific/Engineering :: Atmospheric Science
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5
Requires-Dist: pyarrow>=10.0
Requires-Dist: geopandas>=0.13
Requires-Dist: shapely>=2.0
Requires-Dist: matplotlib>=3.6
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# groundsource

**Python package for Google's Groundsource flash flood dataset.**

Google used Gemini to extract 2.6 million flash flood events from news articles across 150+ countries (2000-2026). The raw data is a 667MB Parquet file with undocumented WKB geometries and no location labels. This package decodes the geometries, tags every event with country and continent, and provides a clean search and analysis API.

```python
from groundsource import FloodDB

db = FloodDB()  # auto-downloads + enriches on first run
floods = db.search(country="India", year_range=(2020, 2025))
```

## Installation

```bash
pip install groundsource
```

**Requirements:** Python 3.9+, pandas, pyarrow, geopandas, shapely, matplotlib

On first run, the package downloads the dataset from Zenodo (~667MB), decodes 2.6M WKB polygons, and performs a spatial join against Natural Earth boundaries. This takes 2-3 minutes and is cached locally for instant subsequent loads.

## Usage

### Search

```python
from groundsource import FloodDB
db = FloodDB()

# By country (supports common aliases: "USA", "UK", "UAE", etc.)
db.search(country="India")
db.search(country="USA", year_range=(2020, 2025))

# By city (98 major cities built-in, default 100km radius)
db.search(city="Houston", radius_km=50)

# By continent or bounding box
db.search(continent="Asia")
db.search(bbox=[0, 95, 25, 120])  # [min_lat, min_lon, max_lat, max_lon]
```
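Under the hood, a bounding-box search amounts to a coordinate filter on event centroids. A minimal sketch with pandas, using the `centroid_lat`/`centroid_lon` column names from the package's DataFrame schema (the sample rows here are synthetic):

```python
import pandas as pd

# Synthetic events mirroring the package's centroid columns
events = pd.DataFrame({
    "uuid": ["a", "b", "c"],
    "centroid_lat": [10.0, 30.0, 5.0],
    "centroid_lon": [100.0, 110.0, 90.0],
})

def bbox_filter(df, bbox):
    """Keep events whose centroid lies inside [min_lat, min_lon, max_lat, max_lon]."""
    min_lat, min_lon, max_lat, max_lon = bbox
    mask = (
        df["centroid_lat"].between(min_lat, max_lat)
        & df["centroid_lon"].between(min_lon, max_lon)
    )
    return df[mask]

inside = bbox_filter(events, [0, 95, 25, 120])
print(inside["uuid"].tolist())  # → ['a']
```

`bbox_filter` is a hypothetical helper for illustration, not part of the package API.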

### Trend Analysis

```python
db.trend(country="India")                        # yearly event counts
db.growth(country="India")                       # growth rate between two periods
db.compare(["USA", "UK", "India", "Indonesia"])  # side-by-side comparison
db.top_countries(20)                             # ranked by total events
db.country_growth_ranking(20)                    # ranked by growth acceleration
db.bias_check()                                  # global yearly counts for bias analysis
```
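The exact formula behind `db.growth()` is not documented here, but a growth rate between two periods is commonly computed as the ratio of mean yearly counts. A hedged sketch with pandas (counts are synthetic; `growth_rate` is illustrative, not the package's implementation):

```python
import pandas as pd

# Synthetic yearly event counts for one country
counts = pd.Series(
    {2018: 100, 2019: 120, 2020: 150, 2021: 300, 2022: 330, 2023: 360}
)

def growth_rate(yearly, early, late):
    """Ratio of mean yearly counts in the `late` period vs the `early` period."""
    early_mean = yearly.loc[early[0]:early[1]].mean()
    late_mean = yearly.loc[late[0]:late[1]].mean()
    return late_mean / early_mean

rate = growth_rate(counts, (2018, 2020), (2021, 2023))
```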

### Built-in Charts

```python
db.plot_hockey_stick(save_path="hockey_stick.png")
db.plot_bias(save_path="bias.png")
db.plot_top_countries(save_path="top_countries.png")
db.plot_country_growth(save_path="growth.png")
```

### Raw DataFrame Access

```python
df = db.to_dataframe()
# Columns: uuid, area_km2, start_date, end_date, centroid_lon, centroid_lat,
#           country, iso_a3, continent, year
```
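The returned DataFrame works with any ordinary pandas tooling; for example, aggregating by the columns listed above (the rows here are synthetic):

```python
import pandas as pd

# Synthetic rows mirroring the schema returned by db.to_dataframe()
df = pd.DataFrame({
    "country": ["India", "India", "USA"],
    "continent": ["Asia", "Asia", "North America"],
    "area_km2": [12.0, 20.0, 8.0],
    "year": [2020, 2021, 2020],
})

# Median flood footprint per continent
median_area = df.groupby("continent")["area_km2"].median()

# Event counts per country per year
per_year = df.groupby(["country", "year"]).size()
```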

## What This Package Does

The raw Parquet from Zenodo has 5 columns with no documentation:

| Raw Column | Type | Issue |
|-----------|------|-------|
| `uuid` | string | ID only |
| `area_km2` | float | Usable as-is |
| `geometry` | WKB binary | Requires `shapely` to decode |
| `start_date` | string | Not parsed as datetime |
| `end_date` | string | Not parsed as datetime |

This package enriches each event with:

| Added Column | Source |
|-------------|--------|
| `centroid_lon`, `centroid_lat` | Decoded from WKB polygons |
| `country`, `iso_a3` | Spatial join against Natural Earth |
| `continent` | Natural Earth |
| `year` | Extracted from `start_date` |
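The centroid step can be sketched with shapely's WKB codec. The polygon below is illustrative (round-tripped through WKB to stand in for the raw `geometry` column); the spatial join against Natural Earth, done with geopandas, is not shown:

```python
from shapely import wkb
from shapely.geometry import Polygon

# Round-trip a polygon through WKB, the binary format used by the raw column
poly = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
raw_bytes = wkb.dumps(poly)   # bytes, as stored in the Parquet file
geom = wkb.loads(raw_bytes)   # decode back to a shapely geometry

# Centroid coordinates become centroid_lon / centroid_lat
lon, lat = geom.centroid.x, geom.centroid.y
print(lon, lat)  # → 1.0 1.0
```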

## Reporting Bias

The dataset shows 498 events in 2000 and 402,012 in 2024. This does not mean floods increased 807x. The data is extracted from news articles, and digital news coverage grew dramatically over this period. Any trend analysis should account for this reporting bias. Use `db.bias_check()` and `db.plot_bias()` to visualize this.
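One simple normalization, not necessarily what `bias_check()` implements, is to compare a country's share of global events rather than its raw counts, since coverage growth inflates every country's counts together. A sketch with pandas (the numbers are synthetic):

```python
import pandas as pd

# Synthetic yearly counts: global totals and one country's totals
global_counts = pd.Series({2000: 500, 2012: 50_000, 2024: 400_000})
india_counts = pd.Series({2000: 50, 2012: 6_000, 2024: 48_000})

# Share of global events per year; robust to uniform coverage growth
share = india_counts / global_counts
print(share.round(3).to_dict())  # → {2000: 0.1, 2012: 0.12, 2024: 0.12}
```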

![Bias Analysis](https://raw.githubusercontent.com/SharathSivamalaisamy/groundsource/main/charts/02_bias_normalized.png)

## Top Countries by Events Detected

![Top Countries](https://raw.githubusercontent.com/SharathSivamalaisamy/groundsource/main/charts/04_top_countries.png)

## Dataset

- **Source:** [Google Groundsource](https://research.google/blog/introducing-groundsource-turning-news-reports-into-data-with-gemini/)
- **Download:** [Zenodo](https://zenodo.org/records/18647054) (CC BY 4.0)
- **Records:** 2,646,302 events across 175 countries, 2000-2026
- **Method:** Gemini parsed ~5M news articles
- **Accuracy:** 60% of events have correct location and timing; 82% are practically useful (per Google's evaluation)

## License

MIT. The underlying dataset is licensed CC BY 4.0 by Google.

## Citation

> Google Research. *Groundsource: Turning News Reports into Data with Gemini.* Zenodo, 2026. DOI: [10.5281/zenodo.18647054](https://zenodo.org/records/18647054)
