Quickstart
Note
If you already know how BioCypher works, we provide here a quickstart into the knowledge graph build process. We provide a template repository on GitHub, which you can use to get started with your own project. You can get it here. To set up a new project, simply follow the instructions in the README.
If you are new to BioCypher and would like a step-by-step introduction to the package, please follow the tutorial.
The BioCypher workflow of creating your own knowledge graph consists of two components:
the host module adapter, a python program, and
the schema configuration file, a YAML file.
The adapter serves as a data interface between the source and BioCypher, piping the “raw” data into BioCypher for the creation of the property graph, while the schema configuration tells BioCypher how the graph should be structured, detailing the names of constituents and how they should be connected.
The host module adapter
BioCypher follows a modular approach to data inputs; to create a knowledge graph, we use at least one adapter module that provides a data stream to build the graph from. Examples for current adapters can be found on the GitHub project adapter view. Adapters can ingest data from many different input sources, including Python modules as in the CROssBAR adapter (which uses the OmniPath backend software, PyPath, for downloading and caching data), advanced file management formats such as Parquet as in the Open Targets adapter, or simple CSV files as in the Dependency Map adapter.
The recommended way of interacting with BioCypher is via the
biocypher._driver.Driver
class. It can be called either starting
in “offline mode” using offline = True
, i.e., without connection to a running
Neo4j instance, or by providing authentication details via arguments or
configuration file:
import biocypher
d = biocypher.Driver(
offline = False,
db_uri = "bolt://localhost:7687",
db_user = "neo4j",
db_passwd = "password",
)
Note
We use the APOC library for Neo4j, which is not included automatically, but needs to be installed as a plugin to the DMBS. For more information, please refer to the APOC documentation.
Hint
The settings for the BioCypher driver can also be specified in a configuration file. For more details, please refer to the Setup instructions.
The main function of the adapter is to pass data into BioCypher, usually as some form of iterable (commonly a list or generator of items). As a minimal example, we load a list of proteins with identifiers, trivial names, and molecular masses from a (fictional) CSV:
# read into data frame
with open("file.csv", "r") as f:
proteins = pd.read_csv(f)
# yield proteins from data frame
def node_generator():
for p in proteins:
_id = p["uniprot_id"]
_type = "protein"
_props = {
"name": p["trivial_name"]
"mm": p["molecular_mass"]
}
yield (_id, _type, _props)
# write biocypher nodes
d.write_nodes(node_generator())
For nodes, BioCypher expects a tuple containing three entries; the preferred identifier of the node, the type of entity, and a dictionary containing all other properties (can be empty). What BioCypher does with the received information is determined largely by the schema configuration detailed below.
For advanced usage, the type of node or edge can be determined programatically. Properties do not need to be explicitly called one by one; they can be passed in as a complete dictionary of all entries and filtered inside BioCypher by detailing the desired properties per node type in the schema configuration file.
The schema configuration YAML file
The second important component of translation into a BioCypher-compatible
property graph is the specification of graph constituents and their mode of
representation in the graph. For instance, we want to add a representation for
proteins to the OmniPath graph, and the proteins should be represented as nodes.
To make this known to the BioCypher module, we use the
schema-config.yaml,
which details only the immediate constituents of the desired graph. Since the
identifier systems in the Biolink schema are not comprehensive and offer many
alternatives, we currently use the CURIE prefixes directly as given by
Bioregistry. For instance, a protein could be
represented, for instance, by a UniProt identifier, the corresponding ENSEMBL
identifier, or an HGNC gene symbol. The CURIE prefix for “Uniprot Protein” is
uniprot
, so a consistent protein schema definition would be:
protein:
represented_as: node
preferred_id: uniprot
input_label: protein
Note
For BioCypher classes, similar to the internal representation in the Biolink
model, we use lower sentence-case notation, e.g., protein
and small molecule
. For file names and Neo4j labels, these are converted to PascalCase.
In the protein case, we are specifying its representation as a node,
that we wish to use the UniProt identifier as the main identifier for
proteins, and that proteins in the input coming from PyPath
carry
the label protein
(in lowercase). Should one wish to use ENSEMBL
notation instead of UniProt, the corresponding CURIE prefix, in this
case, ensembl
, can be substituted.
protein:
represented_as: node
preferred_id: ensembl
input_label: protein
If there exists no identifier system that is suitable for coverage of
the data, the standard field id
can be used; this will not result in
the creation of a named property that reflects the identifier of each
node. See below for an example. The preferred_id
field can in this case also
be omitted entirely; this will lead to the same outcome (id
).
The other slots of a graph constituent entry contain information
BioCypher needs to receive the input data correctly and construct the
graph accordingly. For “Named Thing” entities such as the protein, this
includes the mode of representation (YAML entry represented_as
),
which can be node
or edge
. Proteins can only feasibly
represented as nodes, but for other entities, such as interactions or
aggregates, representation can be both as node or as edge. In Biolink,
these belong to the super-class
Associations.
For associations, BioCypher additionally requires the specification of
the source and target of the association; for instance, a
post-translational interaction occurs between proteins, so the source
and target attribute in the schema-config.yaml
will both be
protein
.
post translational interaction:
represented_as: node
preferred_id: id
source: protein
target: protein
input_label: post_translational
For the post-translational interaction, which is an association, we are specifying representation as a node (prompting BioCypher to create not only the node but also two edges connecting to the proteins participating in any particular post-translational interaction). In other words, we are reifying the post-translational interaction in order to have a node to which other nodes can be linked; for instance, we might want to add a publication to a particular interaction to serve as source of evidence, which is only possible for nodes in a property graph, not for edges.
Since there are no systematic identifiers for post-translational
interactions, we concatenate the protein ids and relevant properties of
the interaction to a new unique id. We prevent creation of a specific
named property by specifying id
as the identifier system in this case.
If a specific property name (in addition to the generic id
field) is
desired, one can use any arbitrary string as a designation for this
identifier, which will then be a named property on the
PostTranslationalInteraction
nodes.
Note
BioCypher accepts non-Biolink IDs since not all possible entries possess a
systematic identifier system, whereas the entity class (protein
, post translational interaction
) has to be included in the Biolink schema and
spelled identically. For this reason, we extend the Biolink
schema in cases where there exists no entry for
our entity of choice. Further, we are specifying the source and target classes
of our association (both protein
), which are optional, and the label we
provide in the input from PyPath
(post_translational
).
If we wanted the interaction to be represented in the graph as an edge,
we would also need to supply an additional - arbitrary - property,
label_as_edge
, which would be used as the relationship type; this
could for instance be INTERACTS_POST_TRANSLATIONALLY
, following the
property graph database consensus that property graph edges are
represented in all upper case form and as verbs, to distinguish from
nodes that are represented in PascalCase and as nouns. This would modify
the above example to the following:
post translational interaction:
represented_as: edge
preferred_id: id
source: protein
target: protein
input_label: post_translational
label_as_edge: INTERACTS_POST_TRANSLATIONALLY
The biocypher configuration YAML file
Most of the configuration options for BioCypher can and should be specified in
the configuration YAML file, biocypher_config.yaml
. While BioCypher comes with
default settings (the ones you can see in the Configuration section),
we can override them by specifying the desired settings in the local
configuration in the root or the config
directory of the project. The
primary BioCypher settings are found in the top-level entry biocypher
. For
instance, you can select your output format (dbms
) and output path, the
location of the schema configuration file, and the ontology to be used.
biocypher:
dbms: postgresql
output_path: postgres_out/
schema_config: config/schema-config.yaml
head_ontology:
url: https://github.com/biolink/biolink-model/raw/v3.2.1/biolink-model.owl.ttl
root_node: entity
You can currently select between postgresql
, neo4j
, and arangodb
(beta) as
your output format; more options will be added in the future. The output_path
is relative to your working directory, as is the schema-config path. The
ontology
should be specified as a (preferably persistent) URL to the ontology
file, and a root_node
to specify the node from which the ontology should be
traversed. We recommend using a URL that specifies the exact version of the
ontology, as in the example above.
DBMS-specific settings
In addition to the general settings, you can specify DBMS-specific settings
under the postgresql
, arangodb
, or neo4j
entry. For instance, you can
specify the database name, the host, the port, and the credentials for your
database. You can also set delimiters for the entities and arrays in your import
files. For a list of all settings, please refer to the Configuration
section.
neo4j:
database_name: biocypher
uri: neo4j://localhost:7687
user: neo4j
password: neo4j
delimiter: ','