# BioCypher Architecture Migration Guide
This document provides a comprehensive overview of the new BioCypher architecture, the differences between legacy and new implementations, and the roadmap for future integration.
## Overview
BioCypher is evolving from an ETL pipeline to a modular, agentic-first knowledge graph framework. This migration introduces new components while maintaining backward compatibility and providing a clear path forward.
## Architecture Comparison

### Legacy BioCypher Architecture
```mermaid
graph TD
    A[Raw Data] --> B[Translator]
    B --> C[BioCypherNode/Edge]
    C --> D[Deduplicator]
    D --> E[In-Memory KG]
    E --> F[Output Writers]
    G[Schema Config] --> B
    H[Ontology] --> B
    I[BioCypher Config] --> A
```
Characteristics:

- **Monolithic**: A single `BioCypher` class orchestrates everything
- **ETL-Focused**: Designed for batch processing of large datasets
- **Strict Validation**: Mandatory schema validation and deduplication
- **Complex Pipeline**: Multi-step translation and processing
- **Global State**: Singleton deduplicator across all operations
### New BioCypher Architecture
```mermaid
graph TD
    A[Raw Data] --> B[BioCypherWorkflow]
    B --> C[Native Graph]
    C --> D[Wrapper Methods]
    D --> E[NetworkX/Pandas]
    F[Schema Config] --> B
    G[Ontology] --> B
    H[Validation Modes] --> B
    I[Agent Workflows] --> B
    J[Deterministic Workflows] --> B
```
Characteristics:

- **Modular**: Separate `BioCypherWorkflow` and `Graph` classes
- **Agentic-First**: Optimized for LLM agent workflows
- **Progressive Validation**: Optional validation with flexible modes
- **Zero Dependencies**: Pure Python implementation
- **Local State**: Per-workflow validation and deduplication
## Key Differences

### 1. Object Models
| Aspect | Legacy BioCypher | New Implementation |
|---|---|---|
| Node Objects | `BioCypherNode` (complex) | `Node` (simple dataclass) |
| Edge Objects | `BioCypherEdge` (complex) | `Edge` (simple dataclass) |
| Properties | Rich metadata + properties | Simple properties dict |
| Methods | `get_id()`, `get_label()`, etc. | Direct attribute access |
| Serialization | Complex internal format | Simple JSON serialization |
### 2. Validation and Deduplication

| Aspect | Legacy BioCypher | New Implementation |
|---|---|---|
| Validation | `warn`/`strict` | `none`/`warn`/`strict` |
| Deduplication | Global singleton | Per-workflow tracking |
| Schema | Mandatory | Optional |
| Error Handling | Fail-fast | Configurable (warn/continue) |
| Performance | Overhead for all operations | Zero overhead in `"none"` mode |
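The "global singleton" versus "per-workflow tracking" row can be illustrated with a toy sketch. This is illustrative only; the names below are ours, not BioCypher's internal API:

```python
# Legacy-style: one module-level set shared by every pipeline run
GLOBAL_SEEN: set[str] = set()


def global_is_duplicate(node_id: str) -> bool:
    """All callers share GLOBAL_SEEN, so state leaks across pipelines."""
    if node_id in GLOBAL_SEEN:
        return True
    GLOBAL_SEEN.add(node_id)
    return False


# New-style: each workflow tracks its own IDs independently
class Workflow:
    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, node_id: str) -> bool:
        if node_id in self._seen:
            return True
        self._seen.add(node_id)
        return False


global_is_duplicate("protein_1")
wf_a, wf_b = Workflow(), Workflow()
wf_a.is_duplicate("protein_1")
# The same ID is NOT a duplicate in an independent workflow
print(wf_b.is_duplicate("protein_1"))  # False
```

Local state means two workflows can ingest overlapping data without interfering, which is exactly what agent-driven, parallel graph building needs.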
### 3. API Design

| Aspect | Legacy BioCypher | New Implementation |
|---|---|---|
| Entry Point | `BioCypher()` class | `create_workflow()` function |
| Configuration | Complex config files | Simple parameters |
| Workflow | Batch ETL pipeline | Interactive graph building |
| Output | Multiple output formats | Native graph + wrappers |
| Learning Curve | Steep (many concepts) | Gentle (progressive complexity) |
### 4. Use Cases

| Use Case | Legacy BioCypher | New Implementation |
|---|---|---|
| ETL Pipelines | ✅ Excellent | ❌ Not suitable |
| Agent Workflows | ❌ Too rigid | ✅ Excellent |
| Interactive Exploration | ❌ Batch-only | ✅ Excellent |
| Large Datasets | ✅ Optimized | ❌ Memory-bound |
| Schema Validation | ✅ Comprehensive | ⚠️ Basic only |
| Research Prototyping | ❌ Overkill | ✅ Perfect |
| Database Integration | ✅ Full support | ❌ Not supported |
| Ontology Integration | ✅ Advanced | ❌ Basic only |
| Batch Processing | ✅ Optimized | ❌ Not supported |
| Metadata Handling | ✅ Rich support | ❌ Basic only |
## Implementation Details

### New Components

#### 1. `BioCypherWorkflow` Class
```python
# Simple, clean API
workflow = create_workflow(
    name="my_graph",
    validation_mode="warn",  # none/warn/strict
    deduplication=True,
)

# Add nodes and edges
workflow.add_node("protein_1", "protein", name="TP53")
workflow.add_edge("interaction_1", "interaction", "protein_1", "protein_2")
```
Key Features:
- Progressive validation modes
- Optional schema support
- Wrapper methods for compatibility
- Agentic workflow optimization
#### 2. `Graph` Class
```python
# Unified graph representation
graph = Graph(name="my_graph", directed=True)

# Simple operations
graph.add_node("node_1", "protein", {"name": "TP53"})
graph.add_edge("edge_1", "interaction", "node_1", "node_2")
```
Key Features:
- Zero dependencies
- Simple, extensible design
- Native BioCypher objects
- Built-in deduplication
#### 3. Native Objects

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    id: str
    type: str
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    id: str
    type: str
    source: str
    target: str
    properties: dict[str, Any] = field(default_factory=dict)
```
Key Features:
- Simple dataclasses
- Direct attribute access
- JSON serializable
- Extensible for agentic features
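A brief sketch of what "direct attribute access" and "JSON serializable" mean in practice, restating the `Node` dataclass above so the example is self-contained:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Node:
    id: str
    type: str
    properties: dict[str, Any] = field(default_factory=dict)


node = Node(id="protein_1", type="protein", properties={"name": "TP53"})

# Direct attribute access -- no get_id()/get_label() getters needed
print(node.id)                  # protein_1
print(node.properties["name"])  # TP53

# JSON serialization falls out of the dataclass structure
payload = json.dumps(asdict(node))
print(payload)  # {"id": "protein_1", "type": "protein", "properties": {"name": "TP53"}}
```

`dataclasses.asdict` recursively converts the object to plain dicts, so round-tripping through JSON requires no custom serializer.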
### Progressive Validation
The new implementation provides three validation modes:
#### Mode: "none" (Default)

```python
workflow = create_workflow(validation_mode="none")
# No validation overhead
# Maximum flexibility for agents
# Any node/edge types allowed
```
#### Mode: "warn"

```python
workflow = create_workflow(validation_mode="warn", deduplication=True)
# Logs warnings for violations
# Continues processing
# Good for debugging
```
#### Mode: "strict"

```python
workflow = create_workflow(validation_mode="strict", deduplication=True)
# Enforces schema validation
# Fails fast on violations
# Legacy BioCypher behavior
```
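One way the three modes could be dispatched internally is a single check point that either ignores, logs, or raises. This is a hypothetical sketch of the pattern, not BioCypher's actual validation code:

```python
import logging

logger = logging.getLogger("biocypher.sketch")


def check_type(node_type: str, allowed_types: set[str], mode: str = "none") -> bool:
    """Hypothetical single check point implementing the three validation modes."""
    if mode == "none" or node_type in allowed_types:
        return True  # zero overhead in "none" mode; valid type otherwise
    if mode == "warn":
        logger.warning("Unknown node type %r, continuing", node_type)
        return True  # log and continue
    if mode == "strict":
        raise ValueError(f"Unknown node type: {node_type!r}")  # fail fast
    raise ValueError(f"Unknown validation mode: {mode!r}")


allowed = {"protein", "gene"}
print(check_type("protein", allowed, mode="strict"))  # True
print(check_type("mystery", allowed, mode="warn"))    # True (warning logged)
```

The short-circuit on `mode == "none"` is what makes the "zero overhead" claim plausible: no schema lookup happens at all unless validation was requested.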
## Migration Roadmap

### Phase 1: Parallel Development (Current)

**Status:** ✅ Complete
- [x] New `BioCypherWorkflow` and `Graph` classes
- [x] Progressive validation system
- [x] Wrapper methods for compatibility
- [x] Comprehensive test suite
- [x] Documentation and examples
Goals:
- Keep new functionality separate from legacy code
- Provide compatibility wrappers
- Establish clear migration path
### Phase 2: Feature Parity (Next)

**Status:** 🔄 Planned
- [ ] Advanced schema validation
- [ ] Ontology integration improvements
- [ ] Performance optimizations
- [ ] Extended wrapper methods
- [ ] Migration utilities
Goals:
- Match legacy functionality where needed
- Improve performance for large datasets
- Provide migration tools
### Phase 3: Agentic Features (Future)

**Status:** 🔮 Future
- [ ] Computable functions on nodes/edges
- [ ] Decision logging and reasoning traces
- [ ] Counterfactual inference capabilities
- [ ] MCP (Model Context Protocol) integration
- [ ] Local graph computation
Goals:
- Enable advanced agentic workflows
- Support LLM integration
- Provide reasoning capabilities
### Phase 4: Unified Interface (Future)

**Status:** 🔮 Future
- [ ] Integrate with main `BioCypher` class
- [ ] Unified configuration system
- [ ] Backward compatibility layer
- [ ] Legacy API deprecation
- [ ] Single entry point
Goals:
- Provide unified interface
- Maintain backward compatibility
- Simplify user experience
## Migration Guide

### For New Users

Start with the new implementation:
```python
from biocypher import create_workflow

# Simple start
workflow = create_workflow("my_graph")
workflow.add_node("protein_1", "protein", name="TP53")

# Add validation as needed
workflow = create_workflow(
    "my_graph",
    validation_mode="warn",
    deduplication=True,
)
```
### For Existing Users
Gradual migration approach:
- Start with "none" mode for existing workflows
- Add validation gradually as needed
- Use wrapper methods for compatibility
- Migrate to native objects over time
```python
# Legacy approach
from biocypher import BioCypher

bc = BioCypher(offline=True, dbms="neo4j")

# New approach (gradual migration)
from biocypher import create_workflow

workflow = create_workflow(validation_mode="strict")  # Legacy-like behavior

# Use wrapper for compatibility
nx_graph = workflow.to_networkx()
```
### For Agent Developers
Optimized for agentic workflows:
```python
# Maximum flexibility for agents
workflow = create_workflow(validation_mode="none")

# Add nodes/edges dynamically
for entity in agent_discovered_entities:
    workflow.add_node(entity.id, entity.type, **entity.properties)

# Convert to analysis format when needed
analysis_graph = workflow.to_networkx()
```
## Compatibility

### Wrapper Methods
The new implementation provides compatibility wrappers:
```python
# Convert to NetworkX
nx_graph = workflow.to_networkx()

# Convert to Pandas DataFrames
nodes_df, edges_df = workflow.to_pandas()

# JSON serialization
json_str = workflow.to_json()
```
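Conceptually, the Pandas wrapper is just a tabular view of the node and edge lists, and the JSON wrapper serializes the same structures. A dependency-free sketch of the rows such wrappers would produce (the column names here are illustrative, not a documented schema):

```python
import json

# Minimal stand-ins for a workflow's internal node/edge state
nodes = [
    {"id": "protein_1", "type": "protein", "name": "TP53"},
    {"id": "protein_2", "type": "protein", "name": "EGFR"},
]
edges = [
    {"id": "interaction_1", "type": "interaction",
     "source": "protein_1", "target": "protein_2"},
]

# These row dicts are exactly what a to_pandas()-style wrapper would
# feed into pd.DataFrame(nodes) and pd.DataFrame(edges)
node_columns = sorted(nodes[0])
print(node_columns)  # ['id', 'name', 'type']

# A to_json()-style wrapper serializes the same structures directly
json_str = json.dumps({"nodes": nodes, "edges": edges})
print(json.loads(json_str)["edges"][0]["source"])  # protein_1
```

Keeping the internal representation as plain dicts is what makes all three wrappers cheap: each is a thin conversion rather than a second data model.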
### Legacy Integration

Legacy BioCypher can be used alongside the new implementation:
```python
# Legacy for ETL
from biocypher import BioCypher

legacy_bc = BioCypher(offline=True, dbms="neo4j")

# New for interactive work
from biocypher import create_workflow

workflow = create_workflow("interactive_graph")
```
## Performance Considerations

### Memory Usage

| Scenario | Legacy BioCypher | New Implementation |
|---|---|---|
| Small Graphs (< 1K nodes) | High overhead | Minimal overhead |
| Medium Graphs (1K-100K nodes) | Moderate overhead | Low overhead |
| Large Graphs (> 100K nodes) | Optimized | Memory-bound |
### Validation Overhead

| Mode | Overhead | Use Case |
|---|---|---|
| `"none"` | Zero | Agent workflows, prototyping |
| `"warn"` | Low | Development, debugging |
| `"strict"` | Moderate | Production, ETL |
## Best Practices

### Choosing the Right Approach
Use Legacy BioCypher for:
- Large-scale ETL pipelines
- Batch processing of big datasets
- Complex schema validation requirements
- Production systems requiring proven stability
Use New Implementation for:
- Agent workflows and LLM integration
- Interactive graph exploration
- Research prototyping
- Small to medium datasets
- When you need flexibility
### Migration Strategy
- Start Small: Begin with simple use cases
- Progressive Validation: Start with "none", add validation as needed
- Use Wrappers: Leverage compatibility methods during transition
- Test Thoroughly: Validate behavior matches expectations
- Plan Gradually: Don't rush the migration
## Future Vision
The new BioCypher architecture is designed to support the future of biomedical knowledge graphs:
- Agentic Workflows: LLM agents can dynamically build and reason over graphs
- Interactive Exploration: Researchers can explore data interactively
- Flexible Validation: Users choose their validation level
- Extensible Design: Easy to add new features and capabilities
- Unified Interface: Eventually, one API for all use cases
## Limitations and Trade-offs
While the new implementation provides significant advantages for agentic workflows, it comes with important limitations compared to the legacy ETL pipeline:
### 1. Limited Ontology Integration
Legacy BioCypher:
- Full ontology loading and processing (OWL, OBO, RDF/XML)
- Hybrid ontology support (head + tail ontologies)
- Complex ontology mapping and inheritance
- Automatic ontology-based validation
- Support for multiple ontology formats and namespaces
New Implementation:
- Basic ontology URL support (head_ontology_url parameter)
- No tail ontology support
- No complex ontology mapping
- No automatic ontology-based validation
- Limited to simple ontology references
Impact: Users requiring complex ontology integration should continue using legacy BioCypher.
### 2. Simplified Deduplication
Legacy BioCypher:
- Global singleton deduplicator across all operations
- Rich duplicate tracking and reporting
- Per-edge-type deduplication logic
- Comprehensive duplicate statistics
- Integration with batch processing
New Implementation:
- Per-workflow deduplication tracking
- Simple duplicate detection (ID-based only)
- No global duplicate statistics
- No integration with batch processing
- Optional deduplication (can be disabled)
Impact: Large-scale ETL pipelines may miss duplicates that legacy BioCypher would catch.
### 3. Reduced Validation Capabilities
Legacy BioCypher:
- Comprehensive schema validation
- Property whitelist/blacklist filtering
- Complex inheritance validation
- Source, license, and version enforcement
- Multi-level validation pipeline
New Implementation:
- Basic property type validation
- Simple required property checking
- No inheritance validation
- No metadata enforcement
- Single-level validation
Impact: Production systems requiring strict data quality should use legacy BioCypher.
### 4. Limited Output Format Support
Legacy BioCypher:
- 15+ output formats (Neo4j, PostgreSQL, ArangoDB, SQLite, RDF, OWL, CSV, NetworkX, AIRR)
- Batch writers with optimized performance
- Database connectivity and online mode
- Custom import scripts and configurations
- Multi-format export capabilities
New Implementation:
- 3 output formats (JSON, NetworkX, Pandas)
- No database connectivity
- No batch processing
- No custom import scripts
- Limited to in-memory formats
Impact: Users needing database integration or specific output formats must use legacy BioCypher.
### 5. No Batch Processing
Legacy BioCypher:
- Optimized batch processing for large datasets
- Memory-efficient streaming
- Progress tracking and logging
- Error handling and recovery
- Configurable batch sizes
New Implementation:
- In-memory processing only
- No streaming capabilities
- No progress tracking
- Basic error handling
- Memory-bound by design
Impact: Large datasets (>100K nodes) may cause memory issues.
### 6. Limited Metadata Handling
Legacy BioCypher:
- Rich metadata support (source, license, version)
- Automatic metadata injection
- Metadata validation and enforcement
- Configurable metadata requirements
- Integration with provenance tracking
New Implementation:
- Basic property support only
- No automatic metadata injection
- No metadata validation
- No provenance tracking
- User must handle metadata manually
Impact: Research requiring provenance tracking should use legacy BioCypher.
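Where provenance matters but the legacy pipeline is not an option, metadata that legacy BioCypher would inject automatically can be carried in the plain properties dict by hand. A sketch of that workaround; the `with_metadata` helper is ours, not part of any BioCypher API:

```python
from typing import Any


def with_metadata(properties: dict[str, Any], *,
                  source: str, license: str, version: str) -> dict[str, Any]:
    """Hypothetical helper: merge provenance fields into a properties dict."""
    return {**properties, "source": source, "license": license, "version": version}


props = with_metadata(
    {"name": "TP53"},
    source="uniprot",
    license="CC BY 4.0",
    version="2024_01",
)
print(props["source"])  # uniprot

# The enriched dict can then be passed wherever properties are accepted,
# e.g. workflow.add_node("protein_1", "protein", **props)
```

Note that nothing validates or enforces these fields, which is precisely the gap relative to the legacy metadata pipeline.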
### 7. No Data Source Integration
Legacy BioCypher:
- Built-in data source adapters
- Download and caching mechanisms
- API integration capabilities
- File format support (CSV, JSON, XML, etc.)
- Data source configuration management
New Implementation:
- No data source integration
- No download capabilities
- No caching mechanisms
- No API integration
- Manual data loading required
Impact: Users must handle data acquisition separately.
### 8. Simplified Configuration
Legacy BioCypher:
- Complex configuration system
- Multiple config files (biocypher_config.yaml, schema_config.yaml)
- Environment-specific configurations
- Advanced parameter tuning
- Configuration validation
New Implementation:
- Simple parameter-based configuration
- No config file support
- No environment-specific settings
- Limited parameter options
- No configuration validation
Impact: Complex deployments may require legacy BioCypher's configuration system.
### 9. No Advanced Graph Features
Legacy BioCypher:
- Relationship-as-node support
- Complex relationship types
- Multi-graph support
- Advanced indexing
- Graph analytics integration
New Implementation:
- Basic node/edge/hyperedge support
- Simple relationship types
- Single graph per workflow
- Basic indexing
- Limited analytics
Impact: Complex graph structures may not be supported.
### 10. Limited Error Handling
Legacy BioCypher:
- Comprehensive error handling
- Graceful degradation
- Error recovery mechanisms
- Detailed error reporting
- Logging and monitoring
New Implementation:
- Basic error handling
- Fail-fast approach
- No error recovery
- Simple error messages
- Basic logging
Impact: Production systems may need more robust error handling.
### 11. Technical Limitations
Legacy BioCypher:
- Supports 15+ database systems and output formats
- Handles multiple ontology formats (OWL, OBO, RDF/XML)
- Complex inheritance and mapping logic
- Relationship-as-node representation
- Multi-graph support
- Advanced indexing and querying
New Implementation:
- Limited to 3 output formats (JSON, NetworkX, Pandas)
- Basic ontology URL support only
- Simple inheritance (no complex mapping)
- No relationship-as-node support
- Single graph per workflow
- Basic indexing only
Impact: Complex biomedical knowledge representation may not be fully supported.
### 12. Performance Limitations
Legacy BioCypher:
- Optimized for large-scale processing
- Memory-efficient streaming
- Batch processing with configurable sizes
- Progress tracking and monitoring
- Error recovery and retry mechanisms
New Implementation:
- In-memory processing only
- No streaming capabilities
- No batch processing
- No progress tracking
- Basic error handling
Impact: Performance degrades significantly with large datasets.
### When to Use Each Approach
Use Legacy BioCypher for:
- Large-scale ETL pipelines (>100K nodes)
- Database integration (Neo4j, PostgreSQL, etc.)
- Complex ontology requirements
- Production systems requiring robust validation
- Batch processing of large datasets
- Multiple output formats
- Provenance tracking and metadata management
- Data source integration
- Complex configuration requirements
Use New Implementation for:
- Agent workflows and LLM integration
- Interactive exploration and prototyping
- Small to medium datasets (<100K nodes)
- Research and development
- Simple graph operations
- Maximum flexibility requirements
- Rapid prototyping
- Educational purposes
### Migration Strategy

#### Gradual Migration Approach
- Start with New Implementation for new projects
- Use Legacy BioCypher for existing ETL pipelines
- Migrate incrementally as features are added
- Use wrapper methods for compatibility
- Plan for Phase 2 feature additions
#### Hybrid Approach

```python
# Use both approaches together
from biocypher import BioCypher, create_workflow

# Legacy for ETL
legacy_bc = BioCypher(offline=True, dbms="neo4j")
# ... process large dataset ...

# New for interactive work
workflow = create_workflow("interactive_graph")
# ... explore and analyze ...
```
### Future Roadmap
The limitations identified above will be addressed in future phases:
- Phase 2: Enhanced ontology support, batch processing, more output formats
- Phase 3: Advanced validation, metadata handling, data source integration
- Phase 4: Unified interface with full legacy compatibility
This migration represents BioCypher's evolution from a batch ETL tool to a comprehensive knowledge graph framework for the AI era, with a clear path for addressing current limitations.