Architecture
biotope has two concerns wired into one CLI:
- Project & metadata version control — git-like tracking of datasets and their Croissant metadata.
- Knowledge-graph construction — Croissant JSON-LD → BioCypher project, deterministically.
Both layers live in this repo.
Modules
biotope/
├── commands/ CLI verbs (Click)
│ ├── map.py biotope map group: inspect / scaffold / preview / wizard
│ └── map_wizard.py Rich-based guided wizard
├── croissant/ KG construction backend
│ ├── spec.py Pydantic models for Croissant 1.1
│ ├── codegen/ Jinja schema codegen (typed Dataset/Field classes)
│ ├── acquisition/ DuckDB row streaming
│ ├── mapping/ semantic mapping IR (entities/relations/selectors/scans)
│ │ ├── model.py Pydantic IR + legacy nodes/edges rejection
│ │ ├── selectors.py value-level resolver (passthrough/as_curie/hash_id + $item)
│ │ ├── scans.py RowScanOperation, ExplodeScanOperation
│ │ ├── inspector.py deterministic Croissant inspector (records/fields/samples)
│ │ ├── preview.py validate partial mapping + project BioCypher schema
│ │ ├── compile.py compile mapping into BioCypher-compatible tuple streams
│ │ ├── defaults.py unresolved-scaffold builder (heuristic-free)
│ │ └── render.py semantic YAML renderer + inspector appendix
│ ├── alignment/ alignment.yaml schema + cross-mapping merge
│ ├── scaffold/ emits a runnable BioCypher project
│ ├── registry/ BioCypher-adapter registry client (local + HTTP)
│ └── api.py pure-function surface for tests + CLI verbs
├── project_model.py .biotope/project.yaml schema
└── templates/AGENTS.md agent instructions copied into new projects
Data flow
biotope init ─► .biotope/project.yaml (purpose only)
│
│ biotope map --entity ... --relation ...
▼
project.yaml with required_entities / required_relations
│
raw files ──► biotope add ──► .biotope/datasets/<name>.jsonld (Croissant; baker-enriched)
│
│ biotope map scaffold <croissant>
▼
mappings/<name>.mapping.yaml (unresolved slots + inspector appendix)
│
│ biotope map (wizard) OR edit YAML + biotope map preview --json
▼
mappings/<name>.mapping.yaml (fully resolved entities + relations)
│
├─► biotope propose-alignment ──► alignment.yaml (optional, multi-mapping)
│
▼
biotope build ──► build/
├── config/schema_config.yaml (namespace + input_label)
├── mappings/ (copied YAML for provenance)
├── generated/<stem>/ (deterministic adapter.py per mapping)
└── create_knowledge_graph.py (BioCypher entry point)
│
▼
python create_knowledge_graph.py ──► BioCypher CSV/Neo4j
│
▼
biotope view / benchmark
Each transformation is deterministic. Semantic decisions stay with the human or copilot agent: biotope only enumerates options, validates, and previews. build is strict — it refuses to compile mappings with unresolved slots or the legacy nodes/edges schema.
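The strictness rule can be illustrated with a small pre-build check. This is a minimal sketch, not biotope's actual validator: the `<unresolved>` placeholder, the helper names, and the shape of the example mapping are all assumptions made for illustration. Only the rule itself (reject unresolved slots and the legacy nodes/edges keys) comes from the text above.

```python
from typing import Any

UNRESOLVED = "<unresolved>"       # hypothetical marker for a slot nobody has filled in yet
LEGACY_KEYS = ("nodes", "edges")  # top-level keys of the legacy schema


def find_unresolved(value: Any, path: str = "") -> list[str]:
    """Return the dotted path of every value that is still the placeholder."""
    if value == UNRESOLVED:
        return [path or "<root>"]
    if isinstance(value, dict):
        found: list[str] = []
        for key, child in value.items():
            found += find_unresolved(child, f"{path}.{key}" if path else str(key))
        return found
    if isinstance(value, list):
        found = []
        for index, child in enumerate(value):
            found += find_unresolved(child, f"{path}[{index}]")
        return found
    return []


def check_buildable(mapping: dict[str, Any]) -> list[str]:
    """Collect every reason a strict build would refuse this mapping."""
    problems = [f"legacy key '{key}' is not supported" for key in LEGACY_KEYS if key in mapping]
    problems += [f"unresolved slot at {path}" for path in find_unresolved(mapping)]
    return problems


# Hypothetical mapping: one unfilled selector plus a leftover legacy key.
example_mapping = {
    "entities": [{"id": "gene", "selector": UNRESOLVED}],
    "nodes": [],
}
problems = check_buildable(example_mapping)
```

Returning a list of problems rather than raising on the first one lets a caller report every blocking issue in a single pass, which suits a validate-then-preview workflow.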
Configuration files
| File | Owner | Purpose |
|---|---|---|
| .biotope/project.yaml | content | Competence questions: purpose, required_entities, required_relations |
| .biotope/config.yaml | technical | Croissant schema version, validation rules, registry URLs |
| .biotope/datasets/*.jsonld | autogenerated | Croissant metadata per tracked file (baker fills structure) |
| mappings/*.mapping.yaml | authored | Semantic IR: entities + relations over Croissant record sets, with ids for reusable selectors. Compiles directly into BioCypher node/edge tuples. |
| alignment.yaml | proposed | Cross-mapping same_node equivalences over semantic entity keys |
| AGENTS.md | template | Agent instructions; copied at init time |
Resolution precedence (lowest first): ~/.config/biotope/config.yaml → .biotope/config.yaml → .biotope/project.yaml → CLI flag. CLI flags always win.
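The precedence chain above amounts to a layered merge in which later layers override earlier ones. The sketch below assumes a shallow, key-by-key merge and treats `None` as "not set at this layer". The function name and all setting keys are hypothetical, and biotope's real loader may merge nested sections differently.

```python
from typing import Any, Mapping


def resolve_settings(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge layers given lowest precedence first; later layers win."""
    resolved: dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if value is not None:  # None means the layer did not set this key
                resolved[key] = value
    return resolved


# Hypothetical contents of each layer, in the documented order.
user_config = {"croissant_version": "1.0", "registry_url": "https://example.org/user"}
project_config = {"croissant_version": "1.1"}
project_yaml = {"purpose": "demo"}
cli_flags = {"registry_url": "https://example.org/cli", "purpose": None}

settings = resolve_settings(user_config, project_config, project_yaml, cli_flags)
```

Here the CLI flag overrides the registry URL from the user-level file, while the unset `purpose` flag leaves the value from project.yaml in place.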
--visible at init promotes project.yaml to the project root for users who don't want a dotfolder.
Agent surface
The CLI is the agent contract. An agent reads AGENTS.md, asks the user competence questions, and translates answers into CLI invocations — the same ones a human would type. There is no MCP server; nothing is hidden behind a different protocol.
Hook points for richer integration exist in biotope.croissant.api and biotope.croissant.registry.client.RegistryClient.
Determinism boundary
Below the CLI everything is deterministic Python. LLMs, when present, sit above the CLI:
human / LLM agent ── reads AGENTS.md ──► biotope <verb> --flag …
│ deterministic
▼
biotope.croissant.api.*
Builds are reproducible from mapping.yaml + alignment.yaml + the Croissant files in .biotope/datasets/. Rebuilding with the same inputs yields the same graph.
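One way to check the reproducibility claim from outside the tool is to fingerprint the build inputs and compare fingerprints across runs. The following is a minimal sketch under stated assumptions: `fingerprint_inputs` is a hypothetical helper, not part of biotope, and the caller decides which files count as inputs.

```python
import hashlib
from pathlib import Path


def fingerprint_inputs(root: Path, relative_names: list[str]) -> str:
    """Hash file names and contents in sorted order, so listing order does not matter."""
    digest = hashlib.sha256()
    for name in sorted(relative_names):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so name and content cannot run together
        digest.update((root / name).read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Hashing names relative to the project root, rather than absolute paths, keeps the fingerprint stable across machines and checkouts. Two builds whose inputs share a fingerprint should then yield the same graph.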
Relationship to neighbouring repos
- croissant-baker: invoked by biotope add to autogenerate Croissant field-level metadata (column types, row counts) for handled file formats.
- BioCypher: the build output is a BioCypher project; biotope does not depend on BioCypher at the library level.
- epistemic-agent: shares the problem-first / AGENTS.md convention. Integration via biotope read is planned.
- BioContextAI registry: discovery surface for MCP servers; biotope discover reads from a separate BioCypher-adapter registry that mirrors the shape conventions.