# BioCypher Adapter Creation (LLM Guide)

> This guide instructs AI coding assistants (Copilot, Cursor, Claude, etc.) how to create BioCypher adapters.
> Adapters transform arbitrary input data (e.g. NCBI GEO metadata) into BioCypher's canonical format, using the schema configuration YAML as the contract.
> Output must be a collection of node tuples (3 elements) and edge tuples (5 elements), aligned with the schema.
> BioCypher provides base classes and utilities to help with common adapter patterns.

Adapters follow the **idiomatic BioCypher interface**: they expose one or more iterables that yield nodes and edges. The key steps for an LLM to generate an adapter are:

## 1. Schema Analysis

- Load and parse the `schema_config.yaml`.
- Identify all node and edge types, their `input_label`s, and required properties.
- **Rule**: all adapter outputs must match the schema's input labels and property names.
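
The analysis step can be sketched as follows. This assumes the schema has already been parsed into a dictionary (for example with `yaml.safe_load`) and uses BioCypher's usual flat layout, where each top-level concept carries `represented_as`, `input_label`, and optionally `properties`. The function name `collect_schema_labels` and the sample schema are hypothetical.

```python
def collect_schema_labels(schema: dict) -> dict:
    """Map every input_label to its kind ("node"/"edge") and property names."""
    labels = {}
    for concept, entry in schema.items():
        # Skip entries without an input_label (e.g. abstract parent concepts)
        if not isinstance(entry, dict) or "input_label" not in entry:
            continue
        input_labels = entry["input_label"]
        if isinstance(input_labels, str):
            input_labels = [input_labels]  # a concept may list several labels
        # Assumption: anything not declared as an edge is treated as a node
        kind = "edge" if entry.get("represented_as") == "edge" else "node"
        for label in input_labels:
            labels[label] = {
                "concept": concept,
                "kind": kind,
                "properties": set(entry.get("properties") or {}),
            }
    return labels


# Hypothetical schema, as it might look after yaml.safe_load(...)
example_schema = {
    "geo series": {
        "represented_as": "node",
        "input_label": "geo_series",
        "properties": {"title": "str", "summary": "str"},
    },
    "geo sample": {
        "represented_as": "node",
        "input_label": "geo_sample",
        "properties": {"disease": "str", "organism": "str"},
    },
    "series has sample": {
        "represented_as": "edge",
        "input_label": "HAS_SAMPLE",
    },
}

schema_labels = collect_schema_labels(example_schema)
```

The resulting lookup table gives the adapter one place to check which labels and property names it is allowed to emit.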

## 2. Data Retrieval

- Use BioCypher's `Resource` and `FileDownload` classes to download and cache plain files.
- Where a dedicated client library exists, call it directly instead (for GEO: `GEOparse.get_GEO("GSE12345")`).
- Consider using external libraries like `GEOparse`, `pandas`, or `requests` for data access.

## 3. Metadata Parsing

- Inspect metadata objects (e.g. GSE, GSM, GPL in GEOparse).
- For each concept in the schema, extract relevant fields.
- Normalize divergent field names (e.g. `"disease_state"` → `"disease"`) to schema properties.
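
A minimal sketch of the normalization step. It assumes metadata arrives as a dictionary whose values are lists of strings (the shape GEOparse uses) and that sample characteristics are stored as `"key: value"` strings under `characteristics_ch1`. The alias table and the function name are illustrative, not part of BioCypher.

```python
# Hypothetical alias table: raw GEO field name -> schema property name
FIELD_ALIASES = {
    "disease_state": "disease",
    "disease state": "disease",
    "organism_ch1": "organism",
}


def normalize_metadata(metadata: dict) -> dict:
    """Flatten list-valued metadata and rename fields to schema properties."""
    flat = {}
    for key, values in metadata.items():
        if key == "characteristics_ch1":
            # Entries such as "disease state: asthma" become separate fields
            for item in values:
                name, sep, value = item.partition(":")
                if sep:
                    flat[name.strip().lower()] = value.strip()
        elif values:
            flat[key] = values[0]

    # Keep only fields the schema knows about, under their schema names
    return {
        FIELD_ALIASES[key]: value
        for key, value in flat.items()
        if key in FIELD_ALIASES
    }


raw = {
    "title": ["Sample 1"],
    "organism_ch1": ["Homo sapiens"],
    "characteristics_ch1": ["disease state: asthma", "tissue: lung"],
}
normalized = normalize_metadata(raw)
```

Fields without an alias (here `title` and `tissue`) are dropped, so the output never contains a property the schema does not define.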

## 4. Node Creation (3-tuple)

Each node is a tuple `(node_id, node_label, attributes_dict)`:

- `node_id`: unique string, ideally CURIE-like (e.g. `GEO:GSM12345`).
- `node_label`: must equal the schema's `input_label`.
- `attributes_dict`: keys are the schema's property names; in strict mode, also include the provenance fields (`source`, `version`, `licence`).
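
One way to keep node tuples uniform is a small builder function. The sketch below is an assumption about how an adapter might organize this; `make_node`, the `GEO:` prefix handling, and the provenance values are placeholders.

```python
from typing import Optional


def make_node(
    accession: str,
    label: str,
    properties: dict,
    provenance: Optional[dict] = None,
) -> tuple:
    """Build a (node_id, node_label, attributes_dict) tuple."""
    # Design choice in this sketch: properties with no value are omitted
    attributes = {k: v for k, v in properties.items() if v is not None}
    if provenance:
        attributes.update(provenance)
    return (f"GEO:{accession}", label, attributes)


node = make_node(
    "GSM12345",
    "geo_sample",
    {"disease": "asthma", "organism": None},
    # Placeholder provenance values for illustration only
    provenance={"source": "GEO", "version": "2024-01-01", "licence": "unknown"},
)
```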

## 5. Edge Creation (5-tuple)

Each edge is a tuple `(edge_id, source_id, target_id, edge_label, attributes_dict)`:

- `edge_id`: optional unique string; pass `None` if the relationship has no identifier of its own.
- `source_id` / `target_id`: must reference valid node IDs created above.
- `edge_label`: must equal the schema's `input_label` for this relation.
- `attributes_dict`: properties defined in the schema (or empty if none).
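
The requirement that both endpoints reference existing nodes can be enforced when the edge is built. A minimal sketch, assuming the adapter keeps a set of the node IDs it has emitted; `make_edge` and `known_node_ids` are hypothetical names.

```python
def make_edge(source_id, target_id, label, known_node_ids,
              properties=None, edge_id=None):
    """Build a 5-tuple edge, refusing endpoints that were never emitted as nodes."""
    for endpoint in (source_id, target_id):
        if endpoint not in known_node_ids:
            raise ValueError(f"Edge endpoint '{endpoint}' has no matching node")
    return (edge_id, source_id, target_id, label, dict(properties or {}))


node_ids = {"GEO:GSE12345", "GEO:GSM12345"}
edge = make_edge("GEO:GSE12345", "GEO:GSM12345", "HAS_SAMPLE", node_ids)
```

For very large datasets, holding every node ID in memory may not be practical; in that case the check could be limited to a test run on a sample of the data.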

## 6. Multiple Metadata Formats

- If series differ in structure, handle the differences with conditional logic or with specialized subclasses.
- Ensure **all schema concepts are extracted** regardless of metadata divergence.
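
Conditional handling can be sketched as a chain of extractors, one per metadata layout, tried in order. The two layouts below (a dedicated `disease_state` field versus a `"key: value"` entry in `characteristics_ch1`) are assumptions about what a GEO adapter might encounter, and all function names are illustrative.

```python
def disease_from_field(metadata: dict):
    """Layout A: the value is exposed as a dedicated field."""
    values = metadata.get("disease_state") or []
    return values[0] if values else None


def disease_from_characteristics(metadata: dict):
    """Layout B: the value is embedded in 'characteristics_ch1' entries."""
    for item in metadata.get("characteristics_ch1", []):
        name, sep, value = item.partition(":")
        if sep and name.strip().lower() in {"disease", "disease state"}:
            return value.strip()
    return None


# Tried in order; add one extractor per metadata layout encountered
DISEASE_EXTRACTORS = (disease_from_field, disease_from_characteristics)


def extract_disease(metadata: dict):
    """Return the first value any extractor finds, or None."""
    for extractor in DISEASE_EXTRACTORS:
        value = extractor(metadata)
        if value is not None:
            return value
    return None


layout_a = {"disease_state": ["asthma"]}
layout_b = {"characteristics_ch1": ["tissue: lung", "disease: copd"]}
```

Both layouts yield the same schema property, so downstream node creation does not need to know which format a series used.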

## 7. Validation

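- Confirm that every node and edge label the adapter emits exists in the schema.
- Do not emit types that the schema does not define.
- In strict mode, check that the provenance fields (`source`, `version`, `licence`) are present.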
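
The strict-mode check might look like the sketch below. It assumes the three provenance fields named in this guide and reports offending records instead of raising, so that all problems surface in one pass; `missing_provenance` is a hypothetical helper.

```python
REQUIRED_PROVENANCE = ("source", "version", "licence")


def missing_provenance(nodes, edges):
    """Return (identifier, missing_fields) pairs for records lacking provenance."""
    problems = []
    for node_id, _label, attributes in nodes:
        missing = [f for f in REQUIRED_PROVENANCE if not attributes.get(f)]
        if missing:
            problems.append((node_id, missing))
    for _edge_id, source_id, target_id, _label, attributes in edges:
        missing = [f for f in REQUIRED_PROVENANCE if not attributes.get(f)]
        if missing:
            problems.append((f"{source_id}->{target_id}", missing))
    return problems


# Placeholder records for illustration
nodes = [
    ("GEO:GSE1", "geo_series", {"source": "GEO", "version": "1", "licence": "x"}),
    ("GEO:GSM1", "geo_sample", {"source": "GEO"}),
]
edges = [(None, "GEO:GSE1", "GEO:GSM1", "HAS_SAMPLE", {})]
problems = missing_provenance(nodes, edges)
```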

## Example Pattern (Pseudo-Python)

```python
import GEOparse


def first(values, default=None):
    """GEOparse stores metadata values as lists; return the first entry."""
    return values[0] if values else default


class GEOAdapter:
    def __init__(self, gse_id: str):
        self.gse_id = gse_id
        self.series = GEOparse.get_GEO(gse_id)
        self.series_id = f"GEO:{self.series.name}"
        self.version = first(self.series.metadata.get("submission_date"))

    def get_nodes(self):
        # Series node
        yield (
            self.series_id,               # node_id
            "geo_series",                 # node_label (matches schema input_label)
            {
                "title": first(self.series.metadata.get("title")),
                "summary": first(self.series.metadata.get("summary")),
                "source": "GEO",
                "version": self.version,
            },
        )

        # Sample nodes
        for sample in self.series.gsms.values():
            yield (
                f"GEO:{sample.name}",     # node_id
                "geo_sample",             # node_label (matches schema input_label)
                {
                    # Some series keep this under "characteristics_ch1" instead
                    "disease": first(sample.metadata.get("disease_state")),
                    "organism": first(sample.metadata.get("organism_ch1")),
                    "source": "GEO",
                    "version": self.version,
                },
            )

    def get_edges(self):
        for sample in self.series.gsms.values():
            yield (
                None,                     # edge_id (optional)
                self.series_id,           # source_id (series)
                f"GEO:{sample.name}",     # target_id (sample)
                "HAS_SAMPLE",             # edge_label (matches schema input_label)
                {},                       # no edge properties
            )
```

## Common Patterns

### Resource Management

```python
from biocypher import BioCypher, FileDownload

class MyAdapter:
    def __init__(self, bc: BioCypher, data_url: str):
        # Describe the remote file; BioCypher caches it for `lifetime` days
        self.resource = FileDownload(
            name="my_data",
            url_s=data_url,
            lifetime=30,  # days
        )
        # Download (or reuse the cached copy) and keep the local file paths
        self.data_files = bc.download(self.resource)
```

### Schema Validation

```python
def validate_schema_compliance(self, schema_config):
    """Ensure adapter outputs match schema requirements."""
    # The parsed schema is a flat mapping: concept name -> settings
    schema_nodes, schema_edges = set(), set()
    for entry in schema_config.values():
        if not isinstance(entry, dict) or "input_label" not in entry:
            continue
        labels = entry["input_label"]
        labels = [labels] if isinstance(labels, str) else labels
        if entry.get("represented_as") == "edge":
            schema_edges.update(labels)
        else:
            schema_nodes.update(labels)

    # Validate node labels
    for node_id, node_label, _ in self.get_nodes():
        if node_label not in schema_nodes:
            raise ValueError(f"Node label '{node_label}' not in schema")

    # Validate edge labels
    for _, _, _, edge_label, _ in self.get_edges():
        if edge_label not in schema_edges:
            raise ValueError(f"Edge label '{edge_label}' not in schema")

### Error Handling

```python
def safe_extract(self, metadata, key, default=None):
    """Safely extract metadata with fallback."""
    try:
        return metadata.get(key, default)
    except (AttributeError, KeyError):
        return default
```

## Key Principles

1. **Schema as Contract**: The schema configuration is the single source of truth
2. **Consistent Naming**: Use schema `input_label`s exactly as defined
3. **Provenance Tracking**: Include `source`, `version`, and `licence` when available
4. **Error Resilience**: Handle missing or malformed data gracefully
5. **Performance**: Use generators for memory efficiency with large datasets
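
To illustrate the generator principle, the sketch below streams node tuples from a tab-separated table one row at a time. The column names (`accession`, `organism`) and the function name are hypothetical; in practice the handle would be an open file, not an in-memory string.

```python
import csv
import io


def iter_sample_nodes(handle):
    """Yield one node tuple per row without loading the whole table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield (
            f"GEO:{row['accession']}",
            "geo_sample",
            {"organism": row["organism"]},
        )


# Small in-memory table standing in for a large file on disk
table = io.StringIO(
    "accession\torganism\nGSM1\tHomo sapiens\nGSM2\tMus musculus\n"
)
nodes = iter_sample_nodes(table)
first_node = next(nodes)  # rows are produced on demand, one at a time
```

Because the function returns a generator, it can be passed to any consumer that accepts an iterable of node tuples without materializing a list first.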

## Related Files

- **llms-example-adapter.txt** - Complete working example
- **llms.txt** - Functionality index and reference
