# BioCypher Architecture Migration Guide
This document provides a comprehensive overview of the new BioCypher architecture, the differences between legacy and new implementations, and the roadmap for future integration.
## Overview
BioCypher is evolving from an ETL pipeline to a modular, agentic-first knowledge graph framework. This migration introduces new components while maintaining backward compatibility and providing a clear path forward.
## Architecture Comparison

### Legacy BioCypher Architecture
```mermaid
graph TD
    A[Raw Data] --> B[Translator]
    B --> C[BioCypherNode/Edge]
    C --> D[Deduplicator]
    D --> E[In-Memory KG]
    E --> F[Output Writers]
    G[Schema Config] --> B
    H[Ontology] --> B
    I[BioCypher Config] --> A
```
Characteristics:

- **Monolithic**: A single `BioCypher` class orchestrates everything
- **ETL-Focused**: Designed for batch processing of large datasets
- **Strict Validation**: Mandatory schema validation and deduplication
- **Complex Pipeline**: Multi-step translation and processing
- **Global State**: Singleton deduplicator across all operations
### New BioCypher Architecture
```mermaid
graph TD
    A[Raw Data] --> B[BioCypherWorkflow]
    B --> C[Native Graph]
    C --> D[Wrapper Methods]
    D --> E[NetworkX/Pandas]
    F[Schema Config] --> B
    G[Ontology] --> B
    H[Validation Modes] --> B
    I[Agent Workflows] --> B
    J[Deterministic Workflows] --> B
```
Characteristics:

- **Modular**: Separate `BioCypherWorkflow` and `Graph` classes
- **Agentic-First**: Optimized for LLM agent workflows
- **Progressive Validation**: Optional validation with flexible modes
- **Zero Dependencies**: Pure Python implementation
- **Local State**: Per-workflow validation and deduplication
## Key Differences

### 1. Object Models
| Aspect | Legacy BioCypher | New Implementation |
|---|---|---|
| Node Objects | `BioCypherNode` (complex) | `Node` (simple dataclass) |
| Edge Objects | `BioCypherEdge` (complex) | `Edge` (simple dataclass) |
| Properties | Rich metadata + properties | Simple properties dict |
| Methods | `get_id()`, `get_label()`, etc. | Direct attribute access |
| Serialization | Complex internal format | Simple JSON serialization |
### 2. Validation and Deduplication

| Aspect | Legacy BioCypher | New Implementation |
|---|---|---|
| Validation | `warn`/`strict` | `none`/`warn`/`strict` |
| Deduplication | Global singleton | Per-workflow tracking |
| Schema | Mandatory | Optional |
| Error Handling | Fail-fast | Configurable (warn/continue) |
| Performance | Overhead for all operations | Zero overhead in `"none"` mode |
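The "global singleton" versus "per-workflow tracking" row can be illustrated with a toy sketch. This is illustrative only; the names below are ours, not BioCypher's internal API:

```python
# Legacy-style: one module-level set shared by every pipeline run
GLOBAL_SEEN: set[str] = set()


def global_is_duplicate(node_id: str) -> bool:
    """All callers share GLOBAL_SEEN, so state leaks across pipelines."""
    if node_id in GLOBAL_SEEN:
        return True
    GLOBAL_SEEN.add(node_id)
    return False


# New-style: each workflow tracks its own IDs independently
class Workflow:
    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, node_id: str) -> bool:
        if node_id in self._seen:
            return True
        self._seen.add(node_id)
        return False


global_is_duplicate("protein_1")
wf_a, wf_b = Workflow(), Workflow()
wf_a.is_duplicate("protein_1")
# The same ID is NOT a duplicate in an independent workflow
print(wf_b.is_duplicate("protein_1"))  # False
```

Local state means two workflows can ingest overlapping data without interfering, which is exactly what agent-driven, parallel graph building needs.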
### 3. API Design

| Aspect | Legacy BioCypher | New Implementation |
|---|---|---|
| Entry Point | `BioCypher()` class | `create_workflow()` function |
| Configuration | Complex config files | Simple parameters |
| Workflow | Batch ETL pipeline | Interactive graph building |
| Output | Multiple output formats | Native graph + wrappers |
| Learning Curve | Steep (many concepts) | Gentle (progressive complexity) |
### 4. Use Cases

| Use Case | Legacy BioCypher | New Implementation |
|---|---|---|
| ETL Pipelines | ✅ Excellent | ❌ Not suitable |
| Agent Workflows | ❌ Too rigid | ✅ Excellent |
| Interactive Exploration | ❌ Batch-only | ✅ Excellent |
| Large Datasets | ✅ Optimized | ❌ Memory-bound |
| Schema Validation | ✅ Comprehensive | ⚠️ Basic only |
| Research Prototyping | ❌ Overkill | ✅ Perfect |
| Database Integration | ✅ Full support | ❌ Not supported |
| Ontology Integration | ✅ Advanced | ❌ Basic only |
| Batch Processing | ✅ Optimized | ❌ Not supported |
| Metadata Handling | ✅ Rich support | ❌ Basic only |
## Implementation Details

### New Components

#### 1. `BioCypherWorkflow` Class
```python
# Simple, clean API
workflow = create_workflow(
    name="my_graph",
    validation_mode="warn",  # none/warn/strict
    deduplication=True,
)

# Add nodes and edges
workflow.add_node("protein_1", "protein", name="TP53")
workflow.add_edge("interaction_1", "interaction", "protein_1", "protein_2")
```
Key Features:
- Progressive validation modes
- Optional schema support
- Wrapper methods for compatibility
- Agentic workflow optimization
#### 2. `Graph` Class
```python
# Unified graph representation
graph = Graph(name="my_graph", directed=True)

# Simple operations
graph.add_node("node_1", "protein", {"name": "TP53"})
graph.add_edge("edge_1", "interaction", "node_1", "node_2")
```
Key Features:
- Zero dependencies
- Simple, extensible design
- Native BioCypher objects
- Built-in deduplication
#### 3. Native Objects

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    id: str
    type: str
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    id: str
    type: str
    source: str
    target: str
    properties: dict[str, Any] = field(default_factory=dict)
```
Key Features:
- Simple dataclasses
- Direct attribute access
- JSON serializable
- Extensible for agentic features
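A brief sketch of what "direct attribute access" and "JSON serializable" mean in practice, restating the `Node` dataclass above so the example is self-contained:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Node:
    id: str
    type: str
    properties: dict[str, Any] = field(default_factory=dict)


node = Node(id="protein_1", type="protein", properties={"name": "TP53"})

# Direct attribute access -- no get_id()/get_label() getters needed
print(node.id)                  # protein_1
print(node.properties["name"])  # TP53

# JSON serialization falls out of the dataclass structure
payload = json.dumps(asdict(node))
print(payload)  # {"id": "protein_1", "type": "protein", "properties": {"name": "TP53"}}
```

`dataclasses.asdict` recursively converts the object to plain dicts, so round-tripping through JSON requires no custom serializer.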
### Progressive Validation
The new implementation provides three validation modes:
#### Mode: "none" (Default)

```python
workflow = create_workflow(validation_mode="none")
# No validation overhead
# Maximum flexibility for agents
# Any node/edge types allowed
```
#### Mode: "warn"

```python
workflow = create_workflow(validation_mode="warn", deduplication=True)
# Logs warnings for violations
# Continues processing
# Good for debugging
```
#### Mode: "strict"

```python
workflow = create_workflow(validation_mode="strict", deduplication=True)
# Enforces schema validation
# Fails fast on violations
# Legacy BioCypher behavior
```
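One way the three modes could be dispatched internally is a single check point that either ignores, logs, or raises. This is a hypothetical sketch of the pattern, not BioCypher's actual validation code:

```python
import logging

logger = logging.getLogger("biocypher.sketch")


def check_type(node_type: str, allowed_types: set[str], mode: str = "none") -> bool:
    """Hypothetical single check point implementing the three validation modes."""
    if mode == "none" or node_type in allowed_types:
        return True  # zero overhead in "none" mode; valid type otherwise
    if mode == "warn":
        logger.warning("Unknown node type %r, continuing", node_type)
        return True  # log and continue
    if mode == "strict":
        raise ValueError(f"Unknown node type: {node_type!r}")  # fail fast
    raise ValueError(f"Unknown validation mode: {mode!r}")


allowed = {"protein", "gene"}
print(check_type("protein", allowed, mode="strict"))  # True
print(check_type("mystery", allowed, mode="warn"))    # True (warning logged)
```

The short-circuit on `mode == "none"` is what makes the "zero overhead" claim plausible: no schema lookup happens at all unless validation was requested.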
## Migration Roadmap

### Phase 1: Parallel Development (Current)

**Status:** ✅ Complete
- [x] New `BioCypherWorkflow` and `Graph` classes
- [x] Progressive validation system
- [x] Wrapper methods for compatibility
- [x] Comprehensive test suite
- [x] Documentation and examples
Goals:
- Keep new functionality separate from legacy code
- Provide compatibility wrappers
- Establish clear migration path
### Phase 2: Feature Parity (Next)

**Status:** 🔄 Planned
- [ ] Advanced schema validation
- [ ] Ontology integration improvements
- [ ] Performance optimizations
- [ ] Extended wrapper methods
- [ ] Migration utilities
Goals:
- Match legacy functionality where needed
- Improve performance for large datasets
- Provide migration tools
### Phase 3: Agentic Features (Future)

**Status:** 🔮 Future
- [ ] Computable functions on nodes/edges
- [ ] Decision logging and reasoning traces
- [ ] Counterfactual inference capabilities
- [ ] MCP (Model Context Protocol) integration
- [ ] Local graph computation
Goals:
- Enable advanced agentic workflows
- Support LLM integration
- Provide reasoning capabilities
### Phase 4: Unified Interface (Future)

**Status:** 🔮 Future
- [ ] Integrate with main `BioCypher` class
- [ ] Unified configuration system
- [ ] Backward compatibility layer
- [ ] Legacy API deprecation
- [ ] Single entry point
Goals:
- Provide unified interface
- Maintain backward compatibility
- Simplify user experience
## Migration Guide

### For New Users

Start with the new implementation:
```python
from biocypher import create_workflow

# Simple start
workflow = create_workflow("my_graph")
workflow.add_node("protein_1", "protein", name="TP53")

# Add validation as needed
workflow = create_workflow(
    "my_graph",
    validation_mode="warn",
    deduplication=True,
)
```
### For Existing Users
Gradual migration approach:
- Start with "none" mode for existing workflows
- Add validation gradually as needed
- Use wrapper methods for compatibility
- Migrate to native objects over time
```python
# Legacy approach
from biocypher import BioCypher

bc = BioCypher(offline=True, dbms="neo4j")

# New approach (gradual migration)
from biocypher import create_workflow

workflow = create_workflow(validation_mode="strict")  # Legacy-like behavior

# Use wrapper for compatibility
nx_graph = workflow.to_networkx()
```
### For Agent Developers
Optimized for agentic workflows:
```python
# Maximum flexibility for agents
workflow = create_workflow(validation_mode="none")

# Add nodes/edges dynamically
for entity in agent_discovered_entities:
    workflow.add_node(entity.id, entity.type, **entity.properties)

# Convert to analysis format when needed
analysis_graph = workflow.to_networkx()
```
## Compatibility

### Wrapper Methods
The new implementation provides compatibility wrappers:
```python
# Convert to NetworkX
nx_graph = workflow.to_networkx()

# Convert to Pandas DataFrames
nodes_df, edges_df = workflow.to_pandas()

# JSON serialization
json_str = workflow.to_json()
```
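Conceptually, the Pandas wrapper is just a tabular view of the node and edge lists, and the JSON wrapper serializes the same structures. A dependency-free sketch of the rows such wrappers would produce (the column names here are illustrative, not a documented schema):

```python
import json

# Minimal stand-ins for a workflow's internal node/edge state
nodes = [
    {"id": "protein_1", "type": "protein", "name": "TP53"},
    {"id": "protein_2", "type": "protein", "name": "EGFR"},
]
edges = [
    {"id": "interaction_1", "type": "interaction",
     "source": "protein_1", "target": "protein_2"},
]

# These row dicts are exactly what a to_pandas()-style wrapper would
# feed into pd.DataFrame(nodes) and pd.DataFrame(edges)
node_columns = sorted(nodes[0])
print(node_columns)  # ['id', 'name', 'type']

# A to_json()-style wrapper serializes the same structures directly
json_str = json.dumps({"nodes": nodes, "edges": edges})
print(json.loads(json_str)["edges"][0]["source"])  # protein_1
```

Keeping the internal representation as plain dicts is what makes all three wrappers cheap: each is a thin conversion rather than a second data model.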
### Legacy Integration

Legacy BioCypher can be used alongside the new implementation:
```python
# Legacy for ETL
from biocypher import BioCypher

legacy_bc = BioCypher(offline=True, dbms="neo4j")

# New for interactive work
from biocypher import create_workflow

workflow = create_workflow("interactive_graph")
```
## Performance Considerations

### Memory Usage

| Scenario | Legacy BioCypher | New Implementation |
|---|---|---|
| Small Graphs (< 1K nodes) | High overhead | Minimal overhead |
| Medium Graphs (1K-100K nodes) | Moderate overhead | Low overhead |
| Large Graphs (> 100K nodes) | Optimized | Memory-bound |
### Validation Overhead

| Mode | Overhead | Use Case |
|---|---|---|
| `"none"` | Zero | Agent workflows, prototyping |
| `"warn"` | Low | Development, debugging |
| `"strict"` | Moderate | Production, ETL |
## Best Practices

### Choosing the Right Approach
Use Legacy BioCypher for:
- Large-scale ETL pipelines
- Batch processing of big datasets
- Complex schema validation requirements
- Production systems requiring proven stability
Use New Implementation for:
- Agent workflows and LLM integration
- Interactive graph exploration
- Research prototyping
- Small to medium datasets
- When you need flexibility
### Migration Strategy
- Start Small: Begin with simple use cases
- Progressive Validation: Start with "none", add validation as needed
- Use Wrappers: Leverage compatibility methods during transition
- Test Thoroughly: Validate behavior matches expectations
- Plan Gradually: Don't rush the migration
## Future Vision
The new BioCypher architecture is designed to support the future of biomedical knowledge graphs:
- Agentic Workflows: LLM agents can dynamically build and reason over graphs
- Interactive Exploration: Researchers can explore data interactively
- Flexible Validation: Users choose their validation level
- Extensible Design: Easy to add new features and capabilities
- Unified Interface: Eventually, one API for all use cases
## Limitations and Trade-offs
While the new implementation provides significant advantages for agentic workflows, it comes with important limitations compared to the legacy ETL pipeline:
### 1. Limited Ontology Integration
Legacy BioCypher:
- Full ontology loading and processing (OWL, OBO, RDF/XML)
- Hybrid ontology support (head + tail ontologies)
- Complex ontology mapping and inheritance
- Automatic ontology-based validation
- Support for multiple ontology formats and namespaces
New Implementation:
- Basic ontology URL support (head_ontology_url parameter)
- No tail ontology support
- No complex ontology mapping
- No automatic ontology-based validation
- Limited to simple ontology references
Impact: Users requiring complex ontology integration should continue using legacy BioCypher.
### 2. Simplified Deduplication
Legacy BioCypher:
- Global singleton deduplicator across all operations
- Rich duplicate tracking and reporting
- Per-edge-type deduplication logic
- Comprehensive duplicate statistics
- Integration with batch processing
New Implementation:
- Per-workflow deduplication tracking
- Simple duplicate detection (ID-based only)
- No global duplicate statistics
- No integration with batch processing
- Optional deduplication (can be disabled)
Impact: Large-scale ETL pipelines may miss duplicates that legacy BioCypher would catch.
### 3. Reduced Validation Capabilities
Legacy BioCypher:
- Comprehensive schema validation
- Property whitelist/blacklist filtering
- Complex inheritance validation
- Source, license, and version enforcement
- Multi-level validation pipeline
New Implementation:
- Basic property type validation
- Simple required property checking
- No inheritance validation
- No metadata enforcement
- Single-level validation
Impact: Production systems requiring strict data quality should use legacy BioCypher.
### 4. Limited Output Format Support
Legacy BioCypher:
- 15+ output formats (Neo4j, PostgreSQL, ArangoDB, SQLite, RDF, OWL, CSV, NetworkX, AIRR)
- Batch writers with optimized performance
- Database connectivity and online mode
- Custom import scripts and configurations
- Multi-format export capabilities
New Implementation:
- 3 output formats (JSON, NetworkX, Pandas)
- No database connectivity
- No batch processing
- No custom import scripts
- Limited to in-memory formats
Impact: Users needing database integration or specific output formats must use legacy BioCypher.
### 5. No Batch Processing
Legacy BioCypher:
- Optimized batch processing for large datasets
- Memory-efficient streaming
- Progress tracking and logging
- Error handling and recovery
- Configurable batch sizes
New Implementation:
- In-memory processing only
- No streaming capabilities
- No progress tracking
- Basic error handling
- Memory-bound by design
Impact: Large datasets (>100K nodes) may cause memory issues.
### 6. Limited Metadata Handling
Legacy BioCypher:
- Rich metadata support (source, license, version)
- Automatic metadata injection
- Metadata validation and enforcement
- Configurable metadata requirements
- Integration with provenance tracking
New Implementation:
- Basic property support only
- No automatic metadata injection
- No metadata validation
- No provenance tracking
- User must handle metadata manually
Impact: Research requiring provenance tracking should use legacy BioCypher.
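Where provenance matters but the legacy pipeline is not an option, metadata that legacy BioCypher would inject automatically can be carried in the plain properties dict by hand. A sketch of that workaround; the `with_metadata` helper is ours, not part of any BioCypher API:

```python
from typing import Any


def with_metadata(properties: dict[str, Any], *,
                  source: str, license: str, version: str) -> dict[str, Any]:
    """Hypothetical helper: merge provenance fields into a properties dict."""
    return {**properties, "source": source, "license": license, "version": version}


props = with_metadata(
    {"name": "TP53"},
    source="uniprot",
    license="CC BY 4.0",
    version="2024_01",
)
print(props["source"])  # uniprot

# The enriched dict can then be passed wherever properties are accepted,
# e.g. workflow.add_node("protein_1", "protein", **props)
```

Note that nothing validates or enforces these fields, which is precisely the gap relative to the legacy metadata pipeline.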
### 7. No Data Source Integration
Legacy BioCypher:
- Built-in data source adapters
- Download and caching mechanisms
- API integration capabilities
- File format support (CSV, JSON, XML, etc.)
- Data source configuration management
New Implementation:
- No data source integration
- No download capabilities
- No caching mechanisms
- No API integration
- Manual data loading required
Impact: Users must handle data acquisition separately.
### 8. Simplified Configuration
Legacy BioCypher:
- Complex configuration system
- Multiple config files (biocypher_config.yaml, schema_config.yaml)
- Environment-specific configurations
- Advanced parameter tuning
- Configuration validation
New Implementation:
- Simple parameter-based configuration
- No config file support
- No environment-specific settings
- Limited parameter options
- No configuration validation
Impact: Complex deployments may require legacy BioCypher's configuration system.
### 9. No Advanced Graph Features
Legacy BioCypher:
- Relationship-as-node support
- Complex relationship types
- Multi-graph support
- Advanced indexing
- Graph analytics integration
New Implementation:
- Basic node/edge/hyperedge support
- Simple relationship types
- Single graph per workflow
- Basic indexing
- Limited analytics
Impact: Complex graph structures may not be supported.
### 10. Limited Error Handling
Legacy BioCypher:
- Comprehensive error handling
- Graceful degradation
- Error recovery mechanisms
- Detailed error reporting
- Logging and monitoring
New Implementation:
- Basic error handling
- Fail-fast approach
- No error recovery
- Simple error messages
- Basic logging
Impact: Production systems may need more robust error handling.
### 11. Technical Limitations
Legacy BioCypher:
- Supports 15+ database systems and output formats
- Handles multiple ontology formats (OWL, OBO, RDF/XML)
- Complex inheritance and mapping logic
- Relationship-as-node representation
- Multi-graph support
- Advanced indexing and querying
New Implementation:
- Limited to 3 output formats (JSON, NetworkX, Pandas)
- Basic ontology URL support only
- Simple inheritance (no complex mapping)
- No relationship-as-node support
- Single graph per workflow
- Basic indexing only
Impact: Complex biomedical knowledge representation may not be fully supported.
### 12. Performance Limitations
Legacy BioCypher:
- Optimized for large-scale processing
- Memory-efficient streaming
- Batch processing with configurable sizes
- Progress tracking and monitoring
- Error recovery and retry mechanisms
New Implementation:
- In-memory processing only
- No streaming capabilities
- No batch processing
- No progress tracking
- Basic error handling
Impact: Performance degrades significantly with large datasets.
### When to Use Each Approach
Use Legacy BioCypher for:
- Large-scale ETL pipelines (>100K nodes)
- Database integration (Neo4j, PostgreSQL, etc.)
- Complex ontology requirements
- Production systems requiring robust validation
- Batch processing of large datasets
- Multiple output formats
- Provenance tracking and metadata management
- Data source integration
- Complex configuration requirements
Use New Implementation for:
- Agent workflows and LLM integration
- Interactive exploration and prototyping
- Small to medium datasets (<100K nodes)
- Research and development
- Simple graph operations
- Maximum flexibility requirements
- Rapid prototyping
- Educational purposes
### Migration Strategy

#### Gradual Migration Approach
- Start with New Implementation for new projects
- Use Legacy BioCypher for existing ETL pipelines
- Migrate incrementally as features are added
- Use wrapper methods for compatibility
- Plan for Phase 2 feature additions
#### Hybrid Approach

```python
# Use both approaches together
from biocypher import BioCypher, create_workflow

# Legacy for ETL
legacy_bc = BioCypher(offline=True, dbms="neo4j")
# ... process large dataset ...

# New for interactive work
workflow = create_workflow("interactive_graph")
# ... explore and analyze ...
```
### Future Roadmap
The limitations identified above will be addressed in future phases:
- Phase 2: Enhanced ontology support, batch processing, more output formats
- Phase 3: Advanced validation, metadata handling, data source integration
- Phase 4: Unified interface with full legacy compatibility
This migration represents BioCypher's evolution from a batch ETL tool to a comprehensive knowledge graph framework for the AI era, with a clear path for addressing current limitations.