# BioCypher Functionality Index for LLMs

## Available LLM Documentation Files

For specific guidance, refer to these files in the documentation root, https://biocypher.org/BioCypher/:

- **llms-adapters.txt** - Complete guide for creating BioCypher adapters
- **llms-example-adapter.txt** - Full working example of a GEO adapter
- **llms.md** - Human-readable overview (this page)

## Core Components

### BioCypher Class
- Main entry point for knowledge graph creation
- Handles schema validation, data processing, and output generation
- Methods: `add_node()`, `add_edge()`, `write()`, `get_graph()`

### Adapters
- Transform external data into BioCypher canonical format
- Interface: `get_nodes()` and `get_edges()` methods returning iterables
- Node format: (node_id, node_label, attributes_dict)
- Edge format: (edge_id, source_id, target_id, edge_label, attributes_dict)
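
The adapter interface above can be sketched as a small class. This is a minimal illustration, not BioCypher's own code: the class name `ExampleAdapter`, the `RECORDS` data, the `ex:` identifiers, and the labels are all made up for the example.

```python
from typing import Iterator

# Hypothetical source records; a real adapter would read a file or an API.
RECORDS = [
    {"gene": "ex:g1", "symbol": "G1", "disease": "ex:d1"},
    {"gene": "ex:g2", "symbol": "G2", "disease": "ex:d1"},
]


class ExampleAdapter:
    """Illustrative adapter exposing get_nodes() and get_edges()."""

    def __init__(self, records):
        self.records = records

    def get_nodes(self) -> Iterator[tuple]:
        """Yield (node_id, node_label, attributes_dict) tuples."""
        seen_diseases = set()
        for rec in self.records:
            yield (rec["gene"], "gene", {"symbol": rec["symbol"]})
            # Emit each disease node only once, even if several genes share it.
            if rec["disease"] not in seen_diseases:
                seen_diseases.add(rec["disease"])
                yield (rec["disease"], "disease", {})

    def get_edges(self) -> Iterator[tuple]:
        """Yield (edge_id, source_id, target_id, edge_label, attributes_dict)."""
        for rec in self.records:
            edge_id = f"{rec['gene']}-{rec['disease']}"
            yield (edge_id, rec["gene"], rec["disease"],
                   "gene_disease_association", {})
```

Both methods are generators, so the tuples are produced one at a time rather than held in a list.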

### Schema Configuration
- YAML-based schema definition
- Defines node types, edge types, and their properties
- Uses `input_label` to map adapter outputs to schema concepts
- Supports inheritance and property overrides
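
To illustrate how `input_label` ties adapter output to schema concepts, the sketch below uses a Python dict that mirrors what a parsed schema YAML might look like. The concept names, labels, and the helper `build_label_index` are hypothetical, and treating `input_label` as either a string or a list is an assumption of this sketch.

```python
# Dict mirroring a parsed schema YAML (illustrative content only).
SCHEMA = {
    "gene": {
        "represented_as": "node",
        "input_label": "example_gene",
        "properties": {"symbol": "str"},
    },
    "disease": {
        "represented_as": "node",
        "input_label": ["example_disease", "example_phenotype"],
    },
}


def build_label_index(schema):
    """Map every input_label to the schema concept that declares it."""
    index = {}
    for concept, spec in schema.items():
        labels = spec.get("input_label", [])
        if isinstance(labels, str):
            labels = [labels]
        for label in labels:
            if label in index:
                raise ValueError(f"input_label {label!r} is declared twice")
            index[label] = concept
    return index


LABEL_INDEX = build_label_index(SCHEMA)
```

An adapter tuple labelled `example_phenotype` would then resolve to the `disease` concept, while an unknown label has no entry in the index.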

## Data Processing

### Node Creation
- 3-tuple format: (node_id, node_label, attributes_dict)
- node_id: unique identifier, preferably a CURIE (`prefix:local_id`, e.g. `uniprot:P12345`)
- node_label: must match an `input_label` defined in the schema
- attributes_dict: property key-value pairs

### Edge Creation
- 5-tuple format: (edge_id, source_id, target_id, edge_label, attributes_dict)
- edge_id: optional unique identifier
- source_id/target_id: must reference existing node IDs
- edge_label: must match an `input_label` defined in the schema
- attributes_dict: edge property key-value pairs
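
The rule that `source_id` and `target_id` must reference existing node IDs can be checked before writing. The helper `find_dangling_edges` and the sample tuples below are hypothetical and only illustrate the idea.

```python
def find_dangling_edges(nodes, edges):
    """Return the edges whose source or target is not a known node ID."""
    node_ids = {node_id for node_id, _label, _attrs in nodes}
    return [
        edge for edge in edges
        if edge[1] not in node_ids or edge[2] not in node_ids
    ]


nodes = [("ex:a", "gene", {}), ("ex:b", "disease", {})]
edges = [
    ("e1", "ex:a", "ex:b", "gene_disease_association", {}),
    ("e2", "ex:a", "ex:missing", "gene_disease_association", {}),
]
dangling = find_dangling_edges(nodes, edges)
```

Here `dangling` holds only the edge `e2`, because `ex:missing` never appears as a node ID. The sketch keeps all node IDs in memory, which may not suit very large graphs.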

### Data Validation
- Schema compliance checking
- Node/edge label validation
- Property type validation
- Provenance field validation (strict mode)
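
The checks listed above can be sketched as one function that collects error messages for a node. This is not BioCypher's validator: `validate_node`, `TYPE_NAMES`, and the schema dict are hypothetical, and the provenance field names `source`, `licence`, and `version` are an assumption about what strict mode requires.

```python
TYPE_NAMES = {"str": str, "int": int, "float": float, "bool": bool}
# Assumed provenance fields for strict mode; adjust to the actual requirement.
PROVENANCE_FIELDS = ("source", "licence", "version")


def validate_node(node, schema_by_label, strict=False):
    """Return a list of human-readable problems found in one node tuple."""
    node_id, label, attrs = node
    spec = schema_by_label.get(label)
    if spec is None:
        return [f"{node_id}: label {label!r} is not in the schema"]
    errors = []
    # Property type validation: only properties that are present are checked.
    for key, type_name in spec.get("properties", {}).items():
        if key in attrs and not isinstance(attrs[key], TYPE_NAMES[type_name]):
            errors.append(f"{node_id}: property {key!r} should be {type_name}")
    # Provenance validation applies only in strict mode.
    if strict:
        for field in PROVENANCE_FIELDS:
            if field not in attrs:
                errors.append(f"{node_id}: missing provenance field {field!r}")
    return errors


SCHEMA_BY_LABEL = {
    "example_gene": {"properties": {"symbol": "str", "length": "int"}},
}
```

Returning a list instead of raising on the first problem lets a caller report every issue in a batch at once.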

## Output Formats

### Graph Databases
- Neo4j: Cypher queries and batch operations
- ArangoDB: AQL queries and document operations
- PostgreSQL: SQL operations with graph extensions

### File Formats
- RDF: Turtle, N-Triples, RDF/XML
- OWL: Web Ontology Language
- NetworkX: Python graph library format
- Tabular: CSV, TSV with node/edge tables
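
As a rough illustration of the tabular option, the sketch below writes node tuples to a CSV node table using only the standard library. The column layout (`id`, `label`, then one column per property) is an assumption of this example and is not necessarily the layout BioCypher produces; `nodes_to_csv` is a hypothetical helper.

```python
import csv
import io


def nodes_to_csv(nodes):
    """Serialise (node_id, node_label, attributes_dict) tuples to CSV text."""
    nodes = list(nodes)
    # Union of all property keys, sorted for a stable column order.
    prop_keys = sorted({key for _id, _label, attrs in nodes for key in attrs})
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "label", *prop_keys], lineterminator="\n"
    )
    writer.writeheader()
    for node_id, label, attrs in nodes:
        # Missing properties are written as empty cells (DictWriter default).
        writer.writerow({"id": node_id, "label": label, **attrs})
    return buffer.getvalue()


table = nodes_to_csv([
    ("ex:g1", "gene", {"symbol": "G1"}),
    ("ex:d1", "disease", {}),
])
```

The sketch assumes that no property is itself named `id` or `label`.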

### In-Memory
- NetworkX graph objects
- Pandas DataFrames
- Python dictionaries and lists

## Utility Functions

### Download and Cache
- `download_and_cache_file()`: Download files with caching
- `download_and_cache_ftp()`: FTP file downloads
- `download_and_cache_http()`: HTTP file downloads
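
The download-with-caching idea can be sketched with the standard library alone. The function `fetch_with_cache` below is a hypothetical stand-in, not the signature of the utilities listed above; the cache key (a SHA-256 hash of the URL) and the injectable `fetch` parameter are design choices of this example.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def _fetch_bytes(url):
    """Default fetcher: read the whole response body into memory."""
    with urlopen(url) as response:
        return response.read()


def fetch_with_cache(url, cache_dir, fetch=_fetch_bytes):
    """Return a local path for url, downloading only if it is not cached."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, fixed-length file name.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        target.write_bytes(fetch(url))
    return target
```

Passing the fetcher as a parameter makes the cache logic testable without network access. The sketch never expires entries, so a changed remote file would not be picked up.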

### Ontology Handling
- `load_ontology()`: Load OWL/TTL ontology files
- `get_ontology_mapping()`: Extract entity mappings
- `get_ontology_hierarchy()`: Extract class hierarchies

### Graph Operations
- `get_subgraph()`: Extract subgraphs by criteria
- `merge_graphs()`: Combine multiple graphs
- `deduplicate_nodes()`: Remove duplicate nodes
- `deduplicate_edges()`: Remove duplicate edges
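
Deduplication of node and edge tuples can be illustrated as follows. The helpers `unique_nodes` and `unique_edges` are hypothetical stand-ins written for this example, not the library functions listed above. Keeping the first occurrence, and falling back to (source, target, label) when an edge has no ID, are assumptions of the sketch.

```python
def unique_nodes(nodes):
    """Yield each node once, keeping the first tuple seen for every node_id."""
    seen = set()
    for node in nodes:
        node_id = node[0]
        if node_id in seen:
            continue
        seen.add(node_id)
        yield node


def unique_edges(edges):
    """Yield each edge once, keyed by edge_id or by (source, target, label)."""
    seen = set()
    for edge in edges:
        edge_id, source_id, target_id, label, _attrs = edge
        key = edge_id if edge_id is not None else (source_id, target_id, label)
        if key in seen:
            continue
        seen.add(key)
        yield edge
```

Only the keys are held in memory, so both helpers can wrap a streaming adapter without materialising the full graph.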

## Configuration

### BioCypher Configuration
- `biocypher_config.yaml`: Main configuration file
- Database connection settings
- Output format specifications
- Logging and validation options

### Schema Configuration
- `schema_config.yaml`: Schema definition file
- Node and edge type definitions
- Property specifications
- Inheritance relationships

## Common Patterns

### Adapter Patterns
- Simple adapter pattern: Direct data transformation
- Resource-based pattern: Using BioCypher's Resource classes
- Generator-based pattern: Memory-efficient streaming
- Schema-driven pattern: Validation against schema configuration

### Error Handling
- Graceful handling of missing data
- Schema validation errors
- Network/IO error recovery
- Data type conversion errors
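
Two of the cases above, missing data and type conversion errors, can be handled inside the adapter loop so that one bad row does not stop the run. The function `rows_to_nodes`, the `protein` label, and the `length` field are hypothetical names for this sketch.

```python
import logging

logger = logging.getLogger(__name__)


def rows_to_nodes(rows):
    """Convert dict rows to node tuples, skipping or repairing bad rows."""
    for row in rows:
        node_id = row.get("id")
        if not node_id:
            # Missing data: a node without an ID cannot be written, so skip it.
            logger.warning("skipping row without id: %r", row)
            continue
        attrs = {}
        raw_length = row.get("length")
        if raw_length not in (None, ""):
            try:
                attrs["length"] = int(raw_length)
            except (TypeError, ValueError):
                # Conversion error: keep the node but drop the bad property.
                logger.warning("dropping invalid length %r for %s",
                               raw_length, node_id)
        yield (node_id, "protein", attrs)
```

Whether to skip a row, drop a property, or abort is a policy decision; the sketch logs and continues, which suits exploratory builds more than strict pipelines.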

### Performance Optimization
- Generator-based data processing
- Batch operations for large datasets
- Memory-efficient graph operations
- Caching for repeated operations
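
Generator-based processing and batching combine naturally: a generator produces tuples lazily and a small helper groups them into fixed-size batches. The helper `in_batches` is a hypothetical name for this sketch.

```python
from itertools import islice


def in_batches(iterable, size):
    """Yield lists of up to `size` items without loading everything at once."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Lazily generated node tuples, grouped three at a time.
node_stream = ((f"ex:n{i}", "gene", {}) for i in range(7))
batch_sizes = [len(batch) for batch in in_batches(node_stream, 3)]
```

For seven nodes and a batch size of three, `batch_sizes` is `[3, 3, 1]`; at most one batch is held in memory at any time.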

## Integration Points

### External Libraries
- GEOparse: NCBI GEO data access
- Pandas: Data manipulation
- NetworkX: Graph operations
- RDFLib: RDF processing
- Owlready2: OWL ontology handling

### Database Drivers
- Neo4j: py2neo driver
- PostgreSQL: psycopg2 driver
- ArangoDB: python-arango driver
- SQLite: sqlite3 (built-in)

## Validation Rules

### Schema Compliance
- All node labels must exist in schema
- All edge labels must exist in schema
- Required properties must be present
- Property types must match schema definition

### Data Quality
- Node IDs must be unique
- Edge source/target must reference valid nodes
- CURIE format preferred for node IDs
- Provenance fields required in strict mode
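
Two of these data quality rules, unique node IDs and CURIE-shaped identifiers, can be checked with a short report function. The regular expression is a deliberately simplified approximation of CURIE syntax (for instance, it also accepts full URLs such as `http://...`), and both helper names are hypothetical.

```python
import re
from collections import Counter

# Simplified shape check: a prefix, a colon, and a non-empty local part.
CURIE_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_.-]*:\S+$")


def is_curie_like(identifier):
    """Return True if identifier looks like prefix:local_id."""
    return isinstance(identifier, str) and bool(CURIE_PATTERN.match(identifier))


def duplicate_ids(nodes):
    """Return the sorted node IDs that occur more than once."""
    counts = Counter(node[0] for node in nodes)
    return sorted(node_id for node_id, n in counts.items() if n > 1)
```

Since CURIEs are preferred rather than required, a failed `is_curie_like` check is better treated as a warning than as an error.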

### Performance Constraints
- Memory usage for large datasets
- Network timeouts for external data
- File size limits for downloads
- Processing time for complex operations
