BioCypher design philosophy

At its core, BioCypher is designed around the principle of threefold modularity:

Modular data sources – Seamlessly integrate diverse biomedical datasets.
Modular ontology structures – Define flexible, structured knowledge representations.
Modular output formats – Adapt results to various applications and tools.

Design Principles

1. Modular data sources

Resources

Resources are diverse data inputs and sources that feed into the knowledge graph through "adapters." A Resource could be a file, a list of files, an API request, or a list of API requests. BioCypher can download resources from a given URL, cache them, and manage their lifecycle.

Adapters

BioCypher is a modular framework, with the main purpose of avoiding redundant maintenance work for maintainers of secondary resources and end users alike. To achieve this, we use a collection of reusable "adapters" for the different sources of biomedical knowledge as well as for different ontologies.

2. Modular ontology structures

Ontologies

An ontology is a formal, hierarchical representation of knowledge within a specific domain, organizing concepts and their relationships. It structures concepts into subclasses of more general categories, such as a wardrobe being a subclass of furniture. BioCypher requires a certain amount of knowledge about ontologies and how to use them. We try to make dealing with ontologies as easy as possible, but some basic understanding is required.

Philosophically, a lot has changed since the introduction of current-generation large language models (LLMs). For instance, LLMs bring a sophisticated world model without explicitly modelling concepts, which is in stark contrast to the modelling decisions of traditional ontologies. We need to critically re-evaluate the future role of ontologies in the modern scientific knowledge management ecosystem. They provide valuable context via the thousands of hours of human curation, but they also come with many intricacies and inconsistencies.

Our Philosophy

BioCypher aims to disrupt the traditional workflow to boost knowledge management into the AI era. While we hope to preserve the benefits of human curation, we also want to critically re-evaluate the role of all parts of the knowledge representation pipeline.

3. Modular output formats

Outputs

Initially focused on Neo4j due to OmniPath's migration, BioCypher now supports multiple output formats, including RDF, SQL, ArangoDB, CSV, PostgreSQL, SQLite, and NetworkX, specified via the dbms parameter in the biocypher_config.yaml file. Users can choose between online mode (manipulation of a running database) or offline mode.

Configuration

Configuration in BioCypher involves setting up and customizing the system to meet specific needs. BioCypher provides default configuration parameters, which can be overridden by creating a biocypher_config.yaml file in your project's root or config directory, specifying the parameters you wish to change.