Metadata-Version: 2.4
Name: arcgis-item-graph
Version: 0.3.1
Summary: CLI tool for building and querying ArcGIS item dependency graphs
License: MIT
Project-URL: Homepage, https://github.com/Global-Geospatial-IT/ArcGIS-Item-Dependency-Management
Project-URL: Repository, https://github.com/Global-Geospatial-IT/ArcGIS-Item-Dependency-Management
Project-URL: Bug Tracker, https://github.com/Global-Geospatial-IT/ArcGIS-Item-Dependency-Management/issues
Keywords: arcgis,esri,gis,dependency,graph
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: End Users/Desktop
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: arcgis>=2.4.0
Requires-Dist: jinja2>=3.0
Requires-Dist: networkx>=2.8
Requires-Dist: openpyxl>=3.0
Requires-Dist: pandas>=1.5
Requires-Dist: python-dotenv>=1.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"

# ArcGIS Item Dependency Management

## Overview

This tool builds and maintains an organization-wide ArcGIS item dependency graph that shows how Web Maps, Dashboards, Feature Services, and other items depend on one another. You can query the graph by item ID or portal search string and receive CSV, Excel, interactive HTML, and GML outputs, which makes it safer to audit, migrate, and clean up portal content without breaking downstream items.

---

## Quick Start

### 1. Install

**Standard Python / Mac / Linux:**

```bash
pip install arcgis-item-graph
```

**ArcGIS Pro (Windows) — uses Pro's bundled Python:**

```bat
"%PROGRAMFILES%\ArcGIS\Pro\bin\Python\Scripts\pip.exe" install arcgis-item-graph
```

**Windows one-click installer (end-user deployment):**

Your GIS admin provides a tool folder containing `install.bat`, `install.ps1`, `launch_query.py`, and a pre-configured `config/config.yaml`. Double-click `install.bat` to install and launch.

What it does automatically:
1. Detects conda (Miniconda / Anaconda / ArcGIS Pro)
2. Downloads and installs Miniconda silently if conda is not found (~75 MB, one-time)
3. Creates an `arcgis-graph` conda environment with Python 3.11
4. Installs the latest ArcGIS API for Python and `arcgis-item-graph`
5. Launches the interactive query tool

Subsequent runs are fast — existing environments and packages are reused, and pip upgrades to the latest version automatically.

### 2. Configure

```bash
arcgis-graph setup
```

The wizard prompts for your portal URL, authentication method (named profile or username/password), and output preferences. Your credentials are never stored in `config.yaml` — they go to a gitignored `.env` file.

### 3. Build the graph (run once)

```bash
arcgis-graph create
```

This crawls your portal, saves a dependency graph (`outputs/graph.gml`), and builds a SQLite metadata cache (`outputs/meta.sqlite`) used by the `health` subcommand and governance risk scoring. For large organizations (5,000+ items) it can take 30–90 minutes.

### 4. Query

```bash
arcgis-graph query --item-id abc123
arcgis-graph query --search "owner:jsmith type:Dashboard"
```

---

## Prerequisites

- Python 3.9 or later
- ArcGIS API for Python 2.4.0 or later (`arcgis>=2.4.0`)

---

## Setup

See [Quick Start](#quick-start) above for installation and configuration.

For development setup, see [For Contributors](#for-contributors) below.

---

## Configuration

`config/config.yaml` controls authentication and all run-time settings. Two auth options are available:

### Option 1 — Named ArcGIS profile (recommended for GIS admins)

Set the `auth.profile` key to the name of a saved ArcGIS credential profile:

```yaml
auth:
  profile: "my_portal_profile"   # created via arcgis.gis.GIS(profile=...)
  verify_cert: true
```

Run `python -c "from arcgis.gis import GIS; GIS(profile='my_portal_profile')"` to verify the profile name is correct.

### Option 2 — Environment variables

Leave `auth.profile` blank and create a `.env` file in the project root:

```
ARCGIS_URL=https://your-portal/portal
ARCGIS_USER=your_username
ARCGIS_PASSWORD=your_password
```

The CLI loads `.env` automatically when a profile is not set.
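
The precedence between the two options can be sketched as follows. This is a minimal illustration, not the tool's actual code: `resolve_auth` is a hypothetical helper, and it takes the parsed config and the environment as plain mappings so that it runs without the `arcgis` or `python-dotenv` packages.

```python
from typing import Mapping


def resolve_auth(config: Mapping, env: Mapping[str, str]) -> dict:
    """Pick an auth method: a named profile wins, otherwise use env vars.

    Hypothetical helper for illustration; the real CLI may behave differently.
    """
    profile = (config.get("auth") or {}).get("profile")
    if profile:
        # Option 1: a saved ArcGIS credential profile, no password needed here.
        return {"method": "profile", "profile": profile}

    # Option 2: the three variables documented for the .env file.
    required = ("ARCGIS_URL", "ARCGIS_USER", "ARCGIS_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "method": "env",
        "url": env["ARCGIS_URL"],
        "username": env["ARCGIS_USER"],
        "password": env["ARCGIS_PASSWORD"],
    }
```

In practice `env` would typically be `os.environ` after `dotenv.load_dotenv()` has read the `.env` file.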

### Other settings

| Key | Default | Description |
|-----|---------|-------------|
| `paths.output_dir` | `outputs/` | Where all output files are written |
| `paths.gml_file` | `outputs/graph.gml` | Persistent graph file |
| `create.max_items` | `10000` | Upper limit on items indexed |
| `update.max_retries` | `5` | Retries on transient API errors |
| `query.output_formats` | `excel, html, gml` | Default outputs for each query (`excel`, `csv`, `html`, `gml`) |
| `query.traversal_direction` | `upstream` | Controls which graph edges are followed: `upstream` — items that reference X (what breaks if X is removed); `downstream` — items X depends on; `both` — union of both without cross-contamination |

---

## Usage

All commands are run via the unified CLI entry point:

```bash
python -m cli [--config /path/to/config.yaml] {create,update,query} [options]
```

### Build the graph (run once)

Crawls the entire portal and saves a GML snapshot. For large organizations (5,000+ items) this can take 30–90 minutes.

```bash
python -m cli create
```

### Keep the graph current (run on a schedule)

Finds items modified since the last run and merges changes into the existing GML. Designed for a daily cron job.

```bash
python -m cli update
```

### Query the graph

```bash
# Query by item ID
python -m cli query --item-id abc123

# Query by portal search string
python -m cli query --search "owner:jsmith type:Dashboard"

# Request specific output formats for a single run
python -m cli query --item-id abc123 --format excel
python -m cli query --item-id abc123 --format csv --format html

# Use a different config file
python -m cli --config /path/to/other/config.yaml query --item-id abc123
```

### Interactive dashboard (live server mode)

Add `--serve` to any query command to start a local HTTP server and open the dashboard in your browser automatically:

```bash
arcgis-graph query --item-id abc123 --serve
arcgis-graph query --search "owner:jsmith" --serve

# Use a different port if 8765 is taken
arcgis-graph query --item-id abc123 --serve --port 9000
```

The server runs at `http://localhost:8765/` by default. It exposes:
- `GET /` — the interactive HTML dashboard
- `GET /query?id=<item_id>` — live re-query from inside the dashboard (click any node)
- `GET /export/excel?ids=<id1>,<id2>` — download an Excel report for selected items

Press **Ctrl+C** in the terminal to stop the server.

> **Note:** Opening the saved `.html` file directly (`file://...`) will not work for node re-queries or Excel exports because those features require the live server. Always use `--serve` for the full interactive experience.
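
If you want to script against a running `--serve` session, the three documented endpoints can be addressed as in the sketch below. `dashboard_urls` is a hypothetical helper that only builds the URLs; the response formats are not documented here, so the sketch does not try to parse them.

```python
from urllib.parse import urlencode, urlunsplit


def dashboard_urls(item_ids, host="localhost", port=8765):
    """Build URLs for the three endpoints exposed by `--serve`.

    Hypothetical helper: only the paths and query parameters come from the
    endpoint list above; host and port default to the documented values.
    """
    netloc = f"{host}:{port}"
    query = urlencode({"id": item_ids[0]})
    # Keep commas literal so the ids parameter reads id1,id2 as documented.
    export = urlencode({"ids": ",".join(item_ids)}, safe=",")
    return {
        "dashboard": urlunsplit(("http", netloc, "/", "", "")),
        "query": urlunsplit(("http", netloc, "/query", query, "")),
        "excel": urlunsplit(("http", netloc, "/export/excel", export, "")),
    }
```

The resulting URLs can be passed to `urllib.request.urlopen` or opened in a browser while the server is running.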

Run `python -m cli --help` or `python -m cli <command> --help` for the full list of options and overrides.

### Triage (migration planning)

Identify the highest-traffic consumer items in your portal and classify the services they depend on — prioritized by view count and dependency breadth. Designed for migration planning and portal housekeeping.

```bash
arcgis-graph triage                    # rank top 50 items (config default)
arcgis-graph triage --top-n 20         # rank top 20
arcgis-graph triage --min-dependents 2 # only items with 2+ service dependencies
arcgis-graph triage --deep             # Tier 3 layer introspection (slower, more accurate)
arcgis-graph triage --no-usage-stats   # rank by dependency count only (skip portal API)
arcgis-graph triage --force-refresh    # bypass the triage_cache_hours window and re-run
```

Outputs to `outputs/reports/triage/<timestamp>/`:

| File | Contents |
|---|---|
| `triage_report.xlsx` | 5-sheet workbook (see below) |
| `triage_manifest.json` | Machine-readable version of all triage data |

**Excel workbook sheets:**

| Sheet | Description |
|---|---|
| **High Traffic Items** | Ranked consumer items (Web Maps, Dashboards, Apps) by composite score (view count + dependency breadth) |
| **Service Inventory** | All map/feature services those items consume, with `data_source_type` (egdb / hosted / fgdb / external) and `combined_view_impact` |
| **Dependency Matrix** | One row per item × service pair — shows which item uses which service |
| **Migration Hotspots** | Services referenced by 2+ items (configurable), sorted by combined view impact — highest-risk services to touch during a migration |
| **Consumer Chain** | Items in the graph that *depend on* each ranked item — useful for understanding blast radius before deprecating or migrating a service |

> **Note:** `data_source_type` classification uses URL pattern matching (Tier 1), service JSON inspection (Tier 2), and optionally layer-level inspection (Tier 3 with `--deep`). Enterprise ArcGIS Server services backed by an Enterprise Geodatabase are classified as `egdb`; hosted services as `hosted_relational`; file-based data as `fgdb`.
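
To make the Migration Hotspots sheet concrete, here is a rough sketch of how such a ranking could be derived from Dependency Matrix rows. It is an illustration only: `migration_hotspots` is a hypothetical function, and it assumes that combined view impact is the sum of the view counts of the items consuming a service, which the tool may compute differently.

```python
from collections import defaultdict


def migration_hotspots(pairs, views, min_items=2):
    """Rank services by the total views of the items that use them.

    pairs: iterable of (item_id, service_id) rows, like the Dependency Matrix.
    views: mapping of item_id -> view count.
    Assumption: combined view impact = sum of the consuming items' views.
    """
    consumers = defaultdict(set)
    for item_id, service_id in pairs:
        consumers[service_id].add(item_id)

    rows = [
        {
            "service": service_id,
            "item_count": len(items),
            "combined_view_impact": sum(views.get(i, 0) for i in items),
        }
        for service_id, items in consumers.items()
        if len(items) >= min_items  # keep only services shared by several items
    ]
    return sorted(rows, key=lambda r: r["combined_view_impact"], reverse=True)
```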

### Health check (broken references and orphan candidates)

After running `arcgis-graph create`, a SQLite metadata cache (`outputs/meta.sqlite`) is built alongside the GML graph. Use the `health` subcommand to query it for org-wide quality issues:

```bash
arcgis-graph health
```

Prints a summary of broken node references (items in the graph that no longer exist in the portal) and orphan candidates (items with no dependents and low recent activity). Writes a `health_report_<timestamp>.xlsx` workbook to `outputs/` with two sheets:

| Sheet | Contents |
|---|---|
| **Broken References** | Node IDs whose items could not be resolved, sorted by governance risk score (🔴 red → 🟡 yellow → 🟢 green) |
| **Orphan Candidates** | Items with zero dependents and below the inactivity threshold (configurable via `cache.orphan_inactive_days`) |

> **Note:** `health` requires the metadata cache. Run `arcgis-graph create` (or `update`) first to build `outputs/meta.sqlite`.

### Remap item references

When an item is replaced (e.g., a service migrated from one URL to another), use `remap` to update all dependent items that reference the old item:

```bash
# Preview what would change (dry run)
arcgis-graph remap --from-id <old-item-id> --to-id <new-item-id> --preview

# Execute the remap
arcgis-graph remap --from-id <old-item-id> --to-id <new-item-id>

# Remap all broken nodes in the health cache (bulk repair workflow)
arcgis-graph remap --from-health-report
```

The `--from-health-report` flag reads broken node IDs directly from the metadata cache and walks you through a remap for each one. A JSON manifest is written to `outputs/` recording every item updated, the old and new references, and success/failure status.

---

## Shared Deployment (Team Use)

For team environments, point `paths.gml_file` and `paths.output_dir` at a UNC share
so all users read from the same graph without running `create` individually.

### 1. Admin: initial setup

```bash
# On the admin machine, configure config.yaml to point at the share:
#   paths.gml_file: "\\\\server\\share\\arcgis-graph\\graph.gml"
#   paths.output_dir: "\\\\server\\share\\arcgis-graph\\outputs"

arcgis-graph create   # one-time full crawl (30-90 min for large orgs)
```

### 2. Automation: scheduled updates

**Windows Task Scheduler** (hourly):
```
arcgis-graph update --config \\server\share\arcgis-graph\config.yaml --skip-if-fresh
```

**Linux/macOS cron** (hourly):
```cron
0 * * * *  arcgis-graph update --config /mnt/share/arcgis-graph/config.yaml --skip-if-fresh
```

`--skip-if-fresh` prevents double-runs if automation fires while a manual update is in progress.

### 3. Users: install and run

**Option A — Windows installer (no Python required):**

Distribute the tool folder (`install.bat`, `install.ps1`, `launch_query.py`, `config/`) to users.
They double-click `install.bat`. The installer handles everything: conda, packages, and launch.

The tool folder can live on a UNC share — users can run it directly from there:
```
\\server\share\arcgis-graph\install.bat
```

**Option B — CLI (Python already installed):**

Users point their local `config.yaml` at the share paths and run:

```bash
arcgis-graph query --item-id <id>
```

If the same item was queried within 24 hours, the cached outputs are returned instantly.
Use `--force-refresh` to bypass the cache and re-run the query.

### Freshness thresholds (configurable)

```yaml
cache:
  update_warn_hours: 24    # Warn in query if graph is older than this (24 = daily, the default)
  query_cache_hours: 24    # Reuse cached query outputs within this window
```
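
Both thresholds come down to the same comparison: is a recorded time still inside a window of N hours? A minimal sketch follows, assuming timezone-aware datetimes; `is_fresh` is a hypothetical helper, not part of the tool's API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(last_run: datetime, window_hours: float,
             now: Optional[datetime] = None) -> bool:
    """Return True if last_run falls within the last window_hours.

    Hypothetical helper; `now` is injectable so the check is easy to test.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_run <= timedelta(hours=window_hours)
```

Under this sketch, a query would warn when `is_fresh(graph_time, update_warn_hours)` is false and would reuse cached outputs when `is_fresh(last_query_time, query_cache_hours)` is true.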

---

## Output files

All output files land in the directory set by `paths.output_dir` (default: `outputs/`).

| Command | Output files |
|---------|-------------|
| `create` | `graph.gml`, `graph.timestamp`, `meta.sqlite` (metadata cache) |
| `update` | Updates `graph.gml` in place |
| `query` | `dependency_report_<timestamp>.csv` — tabular summary; `dependency_report_<timestamp>.xlsx` — 3-sheet Excel workbook (All Items, Dependency Edges, Broken Dependencies); `dependency_graph_<timestamp>.html` — interactive visualization; `query_subgraph_<timestamp>.gml` — sub-graph for further analysis |

---

## Project structure

```
arcgis_item_graph/   Core library
  creator.py           Full org crawl → graph.gml
  updater.py           Incremental update since last run
  query.py             Direction-aware graph traversal + live hydration
  reporter.py          DataFrame → CSV, Excel (All Items, Dependency Edges, Broken Deps, External Refs)
  visualizer.py        Jinja2 + Cytoscape.js → interactive HTML dashboard
  cache.py             SQLite metadata cache (outputs/meta.sqlite) — broken nodes, orphan detection, risk scoring
  parsers.py           Custom JSON parsers (View Admin, Dashboard, ExB) that augment graph edges
  risk.py              Governance risk scoring (RiskScore, score_item) — 0-100 score, green/yellow/red tier
  remapper.py          ItemGraphRemapper — remap item references across all dependents
  auditor.py           ItemDependencyAuditor — audit accuracy via dependent_to() API
  triage.py            ItemTriageRunner — rank high-traffic items and classify service dependencies
  utils.py             Shared helpers (URL classification, batch fetch, retry)
cli/                 Unified CLI entry point (python -m cli ...)
config/              config.example.yaml template (copy to config.yaml and fill in your portal settings)
docs/                Documentation and design plans
lib/                 Vendored frontend assets (cytoscape.js, dagre, cytoscape-dagre) for offline HTML
outputs/             Generated output files (gitignored); outputs/meta.sqlite is the metadata cache
tests/               Unit and integration tests (pytest) — 442 tests
```

The CLI uses [Rich](https://github.com/Textualize/rich) for terminal output. Progress bars, error panels, and completion summaries all go through `arcgis_item_graph/console.py` — the single file in the project that imports Rich. Library modules (creator, updater, triage, etc.) remain UI-free and communicate with the CLI via `on_progress`/`on_warning` callback kwargs.
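
The callback pattern can be illustrated with a small stand-in. Only the `on_progress` and `on_warning` keyword names come from the project; `crawl_items` and the callback signatures (`(done, total)` and a message string) are assumptions made for this sketch.

```python
from typing import Callable, Iterable, Optional


def crawl_items(
    item_ids: Iterable[str],
    on_progress: Optional[Callable[[int, int], None]] = None,
    on_warning: Optional[Callable[[str], None]] = None,
) -> list:
    """Stand-in for a library routine: it never prints, it only calls back."""
    ids = list(item_ids)
    processed = []
    for index, item_id in enumerate(ids, start=1):
        if not item_id:
            if on_warning:
                on_warning(f"Skipping empty item ID at position {index}")
        else:
            processed.append(item_id)
        if on_progress:
            on_progress(index, len(ids))  # (done, total) is an assumed signature
    return processed
```

A CLI layer would pass callbacks that drive a Rich progress bar, while tests can pass `list.append` or nothing at all, which keeps the library importable without Rich.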

---

## For Contributors

**1. Clone the repository**

```bash
git clone https://github.com/Global-Geospatial-IT/ArcGIS-Item-Dependency-Management.git
cd ArcGIS-Item-Dependency-Management
```

**2. Install in editable mode with dev dependencies**

```bash
pip install -e ".[dev]"
```

**3. Activate the commit-message hook**

```bash
git config core.hooksPath .githooks
```

**4. Create your configuration file**

```bash
cp config/config.example.yaml config/config.yaml
# or just run: arcgis-graph setup
```

---

## Running tests

```bash
pytest tests/ -v
```

---

## Performance & Architecture Notes

### Graph Traversal
The query BFS uses `collections.deque`, so each `popleft` is O(1) and the full traversal is O(V+E).
Seed items not found in the cached GML file are fetched live in parallel via
`ThreadPoolExecutor` (default 10 workers, configurable via `fetch_workers`
on `ItemGraphQuery`).

Traversal direction is controlled by `query.traversal_direction` in config:
- **`upstream`** (default) — follows `contained_by()` edges: finds items that reference the queried item.
  Answers "what breaks if X is removed?" — the correct mode for migration impact analysis.
- **`downstream`** — follows `contains()` edges: finds items the queried item depends on.
  Answers "what does X need to function?"
- **`both`** — runs separate upstream and downstream passes. No cross-directional contamination
  (forward deps of upstream-reached nodes are not included).
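
The three modes can be sketched with plain dictionaries in place of the real graph. This is an illustration under stated assumptions: `bfs` and `traverse` are hypothetical names, and `depends_on` maps each item to the items it depends on rather than calling `contains()` or `contained_by()`.

```python
from collections import deque


def bfs(start, neighbors):
    """Breadth-first search; deque.popleft() keeps each dequeue O(1)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # report only the items reached, not the seed
    return seen


def traverse(start, depends_on, direction="upstream"):
    """depends_on maps each item to the set of items it depends on."""
    # Invert the edges once so upstream lookups are as cheap as downstream.
    referenced_by = {}
    for item, deps in depends_on.items():
        for dep in deps:
            referenced_by.setdefault(dep, set()).add(item)

    if direction == "upstream":
        return bfs(start, referenced_by)
    if direction == "downstream":
        return bfs(start, depends_on)
    if direction == "both":
        # Two independent passes, so directions never mix along one path.
        return bfs(start, referenced_by) | bfs(start, depends_on)
    raise ValueError(f"Unknown direction: {direction}")
```

With `depends_on = {"dashboard": {"webmap", "other_svc"}, "webmap": {"svc"}}`, querying `"webmap"` in `both` mode reaches `dashboard` and `svc` but not `other_svc`, because `other_svc` is a forward dependency of an upstream-reached node.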

### Update Hydration
`ItemGraphUpdater` hydrates all cached graph nodes concurrently using
`ThreadPoolExecutor` (default 10 workers, configurable via `hydration_workers`).
Graph mutations (node removal) happen serially on the main thread after all
fetches complete. The modified-items search enforces a `max_items` cap (defaults
to `create.max_items` from config) and warns when results may be truncated.
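
The fetch-concurrently, mutate-serially pattern looks roughly like the sketch below. It is not `ItemGraphUpdater` itself: `hydrate` is a hypothetical function, a plain dict stands in for the graph, and treating a `None` result as "item no longer exists" is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def hydrate(graph_nodes, fetch, workers=10):
    """Fetch every node concurrently, then mutate the graph serially.

    graph_nodes: dict of node_id -> metadata (stands in for the cached graph).
    fetch: callable returning fresh metadata, or None if the item is gone
           (an assumed convention for this sketch).
    """
    node_ids = list(graph_nodes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order; worker threads never touch the graph.
        results = list(pool.map(fetch, node_ids))

    removed = []
    # All fetches are done, so mutation happens on the calling thread only.
    for node_id, fresh in zip(node_ids, results):
        if fresh is None:
            del graph_nodes[node_id]
            removed.append(node_id)
        else:
            graph_nodes[node_id] = fresh
    return removed
```

Keeping mutation out of the worker threads means the graph needs no locking, at the cost of holding all fetch results in memory until the pool finishes.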

### Timestamps
All timestamps are stored as integer milliseconds since the Unix epoch, which preserves
sub-second precision (`int(t.timestamp() * 1000)`).
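
A short round-trip example of that conversion follows. The helper names `to_millis` and `from_millis` are illustrative and are not taken from the codebase.

```python
from datetime import datetime, timezone


def to_millis(t: datetime) -> int:
    """Convert a datetime to integer milliseconds since the Unix epoch."""
    return int(t.timestamp() * 1000)


def from_millis(ms: int) -> datetime:
    """Inverse conversion, returned as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Half a second past midnight UTC on 2024-01-01 keeps its sub-second part.
stamp = to_millis(datetime(2024, 1, 1, 0, 0, 0, 500000, tzinfo=timezone.utc))
```

Passing timezone-aware datetimes avoids the local-time interpretation that `timestamp()` applies to naive values.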

### Excel Reports
`ItemGraphReporter.to_excel()` builds all three sheets from a single pass
through `to_dataframe()` — `node.contains()` is called once per node.
