Architecture

biotope has two concerns wired into one CLI:

  1. Project & metadata version control — git-like tracking of datasets and their Croissant metadata.
  2. Knowledge-graph construction — Croissant JSON-LD → BioCypher project, deterministically.

Both layers live in this repo.

Modules

biotope/
├── commands/              CLI verbs (Click)
│   ├── map.py             biotope map group: inspect / scaffold / preview / wizard
│   └── map_wizard.py      Rich-based guided wizard
├── croissant/             KG construction backend
│   ├── spec.py            Pydantic models for Croissant 1.1
│   ├── codegen/           Jinja schema codegen (typed Dataset/Field classes)
│   ├── acquisition/       DuckDB row streaming
│   ├── mapping/           semantic mapping IR (entities/relations/selectors/scans)
│   │   ├── model.py       Pydantic IR + legacy nodes/edges rejection
│   │   ├── selectors.py   value-level resolver (passthrough/as_curie/hash_id + $item)
│   │   ├── scans.py       RowScanOperation, ExplodeScanOperation
│   │   ├── inspector.py   deterministic Croissant inspector (records/fields/samples)
│   │   ├── preview.py     validate partial mapping + project BioCypher schema
│   │   ├── compile.py     compile mapping into BioCypher-compatible tuple streams
│   │   ├── defaults.py    unresolved-scaffold builder (heuristic-free)
│   │   └── render.py      semantic YAML renderer + inspector appendix
│   ├── alignment/         alignment.yaml schema + cross-mapping merge
│   ├── scaffold/          emits a runnable BioCypher project
│   ├── registry/          BioCypher-adapter registry client (local + HTTP)
│   └── api.py             pure-function surface for tests + CLI verbs
├── project_model.py       .biotope/project.yaml schema
└── templates/AGENTS.md    agent instructions copied into new projects
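
To make the selector names listed for mapping/selectors.py more concrete, here is a minimal sketch of what value-level resolvers of this kind could look like. Only the names passthrough, as_curie and hash_id come from the listing above; the signatures, the separator, and the hash truncation are assumptions for illustration, not biotope's actual implementation.

```python
import hashlib


def passthrough(value: str) -> str:
    """Return the source value unchanged."""
    return value


def as_curie(value: str, prefix: str) -> str:
    """Turn a bare identifier into a prefix:value CURIE (assumed signature)."""
    return f"{prefix}:{value}"


def hash_id(*values: object, prefix: str = "id") -> str:
    """Derive a stable identifier from one or more values (assumed signature).

    The unit separator keeps ("a", "b") distinct from ("ab",).
    """
    joined = "\x1f".join(str(v) for v in values)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest[:16]}"


curie = as_curie("7157", prefix="ncbigene")
stable = hash_id("sample-1", "2024-01-01")
```

Because every resolver is a pure function of its inputs, the same row always yields the same identifier, which is what the determinism claims later in this page rely on.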

Data flow

biotope init ─► .biotope/project.yaml          (purpose only)
                  │ biotope map --entity ... --relation ...
            project.yaml with required_entities / required_relations
raw files ──► biotope add ──► .biotope/datasets/<name>.jsonld    (Croissant; baker-enriched)
                  │ biotope map scaffold <croissant>
            mappings/<name>.mapping.yaml  (unresolved slots + inspector appendix)
                  │ biotope map (wizard)  OR  edit YAML + biotope map preview --json
            mappings/<name>.mapping.yaml  (fully resolved entities + relations)
                  ├─► biotope propose-alignment ──► alignment.yaml      (optional, multi-mapping)
            biotope build ──► build/
                              ├── config/schema_config.yaml  (namespace + input_label)
                              ├── mappings/                  (copied YAML for provenance)
                              ├── generated/<stem>/          (deterministic adapter.py per mapping)
                              └── create_knowledge_graph.py  (BioCypher entry point)
                              python create_knowledge_graph.py ──► BioCypher CSV/Neo4j
                                   biotope view / benchmark

Each transformation is deterministic. Semantic decisions stay with the human or copilot agent: biotope only enumerates options, validates, and previews. build is strict — it refuses to compile mappings with unresolved slots or the legacy nodes/edges schema.
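
The strictness rule can be sketched as a pre-build check. This is an illustration under stated assumptions, not the check biotope runs: the use of None as the unresolved-slot marker, the exception name, and the dictionary layout are all hypothetical; only the two refusal conditions (unresolved slots, legacy nodes/edges keys) come from the text above.

```python
from typing import Any

UNRESOLVED = None  # hypothetical marker for a slot nobody has filled yet


class MappingNotBuildable(ValueError):
    """Raised when a mapping must not be compiled (hypothetical name)."""


def find_unresolved(node: Any, path: str = "") -> list[str]:
    """Return the dotted paths of every unresolved slot in a parsed mapping."""
    if node is UNRESOLVED:
        return [path or "<root>"]
    found: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            found.extend(find_unresolved(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_unresolved(value, f"{path}[{index}]"))
    return found


def check_buildable(mapping: dict[str, Any]) -> None:
    """Refuse legacy-schema mappings and mappings with unresolved slots."""
    legacy = sorted(mapping.keys() & {"nodes", "edges"})
    if legacy:
        raise MappingNotBuildable(f"legacy schema keys present: {legacy}")
    unresolved = find_unresolved(mapping)
    if unresolved:
        raise MappingNotBuildable(f"unresolved slots: {unresolved}")


draft = {"entities": [{"label": "gene", "id": UNRESOLVED}], "relations": []}
open_slots = find_unresolved(draft)
```

Reporting the path of each open slot, rather than failing on the first one, gives the human or agent a complete to-do list in a single pass.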

Configuration files

File                        Owner          Purpose
.biotope/project.yaml       content        Competence questions: purpose, required_entities, required_relations
.biotope/config.yaml        technical      Croissant schema version, validation rules, registry URLs
.biotope/datasets/*.jsonld  autogenerated  Croissant metadata per tracked file (baker fills structure)
mappings/*.mapping.yaml     authored       Semantic IR: entities + relations over Croissant record sets, with ids for reusable selectors. Compiles directly into BioCypher node/edge tuples.
alignment.yaml              proposed       Cross-mapping same_node equivalences over semantic entity keys
AGENTS.md                   template       Agent instructions; copied at init time
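
The idea of same_node equivalences over entity keys can be illustrated with a small union-find that collapses equivalent keys onto one canonical key. This is a sketch only: the key format ("mapping.entity") and the function name are hypothetical, and the actual alignment.yaml schema is not shown here.

```python
from typing import Iterable


def merge_same_node(pairs: Iterable[tuple[str, str]]) -> dict[str, str]:
    """Map every entity key to the canonical key of its equivalence class.

    The canonical key is the lexicographically smallest member, so the
    result does not depend on the order in which pairs are listed.
    """
    parent: dict[str, str] = {}

    def find(key: str) -> str:
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            low, high = sorted((root_left, root_right))
            parent[high] = low  # the smaller key always stays the root

    return {key: find(key) for key in sorted(parent)}


canonical = merge_same_node([
    ("b.gene", "a.gene"),
    ("c.gene", "b.gene"),
    ("a.disease", "b.disease"),
])
```

Choosing the smallest key as the representative keeps the merge deterministic, in line with the rest of the pipeline.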

Resolution precedence (lowest first): ~/.config/biotope/config.yaml → .biotope/config.yaml → .biotope/project.yaml → CLI flag. CLI flags always win.
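
A layered lookup of this kind is commonly implemented as an ordered merge in which later layers override earlier ones. The sketch below assumes each layer has already been parsed into a flat dictionary; the function name and the setting keys are hypothetical and serve only to show the ordering.

```python
from typing import Any, Mapping


def resolve_settings(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge layers given lowest-precedence first; later layers win."""
    resolved: dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if value is not None:  # an unset CLI flag must not mask a file value
                resolved[key] = value
    return resolved


# Hypothetical contents of the four layers, lowest precedence first.
user_config = {"registry_url": "https://example.org/registry", "croissant_version": "1.0"}
project_config = {"croissant_version": "1.1"}
project_yaml = {"purpose": "demo"}
cli_flags = {"registry_url": None, "croissant_version": None, "purpose": "override"}

settings = resolve_settings(user_config, project_config, project_yaml, cli_flags)
```

The merge here is shallow; nested sections would need a recursive variant if per-key overrides inside a section are wanted.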

Passing --visible to biotope init promotes project.yaml to the project root for users who don't want a dotfolder.

Agent surface

The CLI is the agent contract. An agent reads AGENTS.md, asks the user competence questions, and translates answers into CLI invocations — the same ones a human would type. There is no MCP server; nothing is hidden behind a different protocol.
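
As an illustration of that contract, an agent could turn the user's answers into an argument vector for the same command a human would type. The --entity and --relation flags appear in the data-flow diagram above; that they may be repeated once per value is an assumption, and the helper name is hypothetical.

```python
import shlex


def map_invocation(entities: list[str], relations: list[str]) -> list[str]:
    """Build the argv for a biotope map call from competence-question answers."""
    argv = ["biotope", "map"]
    for entity in entities:
        argv += ["--entity", entity]  # assumed repeatable
    for relation in relations:
        argv += ["--relation", relation]  # assumed repeatable
    return argv


argv = map_invocation(["gene", "disease"], ["gene associated with disease"])
command_line = shlex.join(argv)  # shell-safe rendering for logs or review
```

Keeping the invocation as a list until the last moment avoids quoting mistakes, and the rendered command line can be shown to the user before anything is run.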

Hook points for richer integration exist in biotope.croissant.api and biotope.croissant.registry.client.RegistryClient.

Determinism boundary

Below the CLI everything is deterministic Python. LLMs, when present, sit above the CLI:

human / LLM agent  ── reads AGENTS.md ──► biotope <verb> --flag …
                                              │  deterministic
                                    biotope.croissant.api.*

Builds are reproducible from mapping.yaml + alignment.yaml + the Croissant files in .biotope/datasets/. Rebuilding with the same inputs yields the same graph.
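
One way to check the "same inputs" half of that claim is to fingerprint the build inputs and compare fingerprints across rebuilds. The sketch below is not part of biotope: the function name is hypothetical and the file contents in the demo are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable


def build_fingerprint(root: Path, relative_paths: Iterable[str]) -> str:
    """Hash the names and bytes of all build inputs in a stable order."""
    digest = hashlib.sha256()
    for rel in sorted(relative_paths):  # sorting makes listing order irrelevant
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        digest.update((root / rel).read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "mappings").mkdir()
    # Placeholder contents; real files would come from the project.
    (root / "mappings" / "demo.mapping.yaml").write_text("entities: []\n", encoding="utf-8")
    (root / "alignment.yaml").write_text("placeholder: 1\n", encoding="utf-8")
    inputs = ["mappings/demo.mapping.yaml", "alignment.yaml"]

    first = build_fingerprint(root, inputs)
    second = build_fingerprint(root, reversed(inputs))

    (root / "alignment.yaml").write_text("placeholder: 2\n", encoding="utf-8")
    third = build_fingerprint(root, inputs)
```

Equal fingerprints mean the inputs are byte-identical, so any difference between the resulting graphs would point to nondeterminism in the build itself rather than to changed inputs.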

Relationship to neighbouring repos

  • croissant-baker — invoked by biotope add to autogenerate Croissant field-level metadata (column types, row counts) for handled file formats.
  • BioCypher — the build output is a BioCypher project; biotope does not depend on BioCypher at the library level.
  • epistemic-agent — shares the problem-first / AGENTS.md convention. Integration via biotope read is planned.
  • BioContextAI registry — discovery surface for MCP servers; biotope discover reads from a separate BioCypher-adapter registry that mirrors the shape conventions.