Annotate Omics
The biotope annotate module provides tools for creating and managing metadata annotations using the Croissant ML schema. This document provides detailed examples and instructions for working with the different layers of Croissant ML.
Installation
Basic Usage
The annotation module can be used in several ways:
# Interactive mode
biotope annotate interactive
# Create metadata from CLI parameters
biotope annotate create
# Validate existing metadata
biotope annotate validate --jsonld <file_name.json>
# Load existing record
biotope annotate load
Croissant ML Layers
Croissant ML organizes metadata in several layers, each serving a specific purpose in describing your dataset.
1. Dataset Layer
The dataset layer provides high-level information about your entire dataset.
Example:
{
  "@type": "sc:Dataset",
  "name": "Example Dataset",
  "description": "A sample dataset for demonstration",
  "license": "MIT",
  "version": "1.0.0",
  "datePublished": "2024-03-20",
  "creator": {
    "@type": "Person",
    "name": "John Doe"
  }
}
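A dataset-layer entry like the one above can also be built programmatically. The following is a minimal sketch using only the Python standard library; the helper names `make_dataset_layer` and `missing_keys` are hypothetical and not part of biotope, and treating every key from the example as expected is an assumption of this sketch.

```python
import json

# Keys copied from the example above. Treating all of them as expected is an
# assumption of this sketch, not a rule stated by biotope or Croissant ML.
EXPECTED_KEYS = ("name", "description", "license", "version", "datePublished", "creator")


def make_dataset_layer(name, description, license_id, version, date_published, creator_name):
    """Build a dataset-layer dictionary shaped like the example above."""
    return {
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        "license": license_id,
        "version": version,
        "datePublished": date_published,
        "creator": {"@type": "Person", "name": creator_name},
    }


def missing_keys(dataset):
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not dataset.get(key)]


dataset = make_dataset_layer(
    "Example Dataset", "A sample dataset for demonstration",
    "MIT", "1.0.0", "2024-03-20", "John Doe",
)
print(json.dumps(dataset, indent=2))

# An incomplete entry reports which keys still need to be filled in.
print(missing_keys({"@type": "sc:Dataset", "name": "Example Dataset"}))
```

A quick check of this kind does not replace schema validation; it only catches empty or forgotten keys before the metadata is written to disk.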
2. Distribution Layer
The distribution layer describes how the dataset is distributed and accessed.
Example:
{
  "@type": "sc:DataDownload",
  "name": "Dataset Distribution",
  "contentUrl": "https://example.com/dataset.zip",
  "encodingFormat": "application/zip",
  "contentSize": "1.2GB",
  "sha256": "abc123..."
}
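The `contentSize` and `sha256` values are easier to compute than to type by hand. The sketch below shows one way to derive them from a local file with the standard library. The function `describe_file` is hypothetical (not a biotope API), and reporting the size as an exact byte count is an assumption made here for simplicity.

```python
import hashlib
import os
import tempfile


def describe_file(path, content_url, encoding_format):
    """Return a distribution-layer dictionary for a local file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "@type": "sc:DataDownload",
        "name": os.path.basename(path),
        "contentUrl": content_url,
        "encodingFormat": encoding_format,
        # Assumption of this sketch: size is reported as an exact byte count.
        "contentSize": f"{os.path.getsize(path)} B",
        "sha256": digest.hexdigest(),
    }


# Demonstration on a throwaway file; replace with the real archive path.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "dataset.zip")
    with open(sample_path, "wb") as handle:
        handle.write(b"hello")
    distribution = describe_file(
        sample_path, "https://example.com/dataset.zip", "application/zip"
    )

print(distribution["contentSize"], distribution["sha256"])
```

Computing the checksum from the file itself keeps the metadata in step with the data whenever the archive is rebuilt.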
3. Record Set Layer
The record set layer defines the structure of your data records.
Example:
{
  "@type": "sc:RecordSet",
  "name": "Main Records",
  "description": "Primary data records",
  "field": [
    {
      "@type": "sc:Field",
      "name": "id",
      "description": "Unique identifier",
      "dataType": "string"
    },
    {
      "@type": "sc:Field",
      "name": "value",
      "description": "Numerical value",
      "dataType": "float"
    }
  ]
}
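Because a record set declares the name and type of every field, it can be used to check individual records. The sketch below illustrates the idea; `check_record` and the `PYTHON_TYPES` mapping are hypothetical, and the mapping from the `dataType` strings in the example to Python types is an assumption of this sketch.

```python
# Assumed mapping from the dataType strings used above to Python types.
PYTHON_TYPES = {"string": str, "float": (int, float)}

record_set = {
    "@type": "sc:RecordSet",
    "name": "Main Records",
    "field": [
        {"@type": "sc:Field", "name": "id", "dataType": "string"},
        {"@type": "sc:Field", "name": "value", "dataType": "float"},
    ],
}


def check_record(record, record_set):
    """Return a list of problems found when comparing a record to its record set."""
    problems = []
    for field in record_set["field"]:
        name = field["name"]
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        expected = PYTHON_TYPES.get(field["dataType"])
        # Unknown dataType strings are skipped rather than reported.
        if expected is not None and not isinstance(record[name], expected):
            problems.append(f"{name}: expected {field['dataType']}")
    return problems


print(check_record({"id": "a1", "value": 3.5}, record_set))
print(check_record({"id": 7}, record_set))
```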
4. Field Layer
The field layer provides detailed information about individual data fields.
Example:
{
  "@type": "sc:Field",
  "name": "temperature",
  "description": "Temperature measurement in Celsius",
  "dataType": "float",
  "unit": "celsius",
  "minimum": -273.15,
  "maximum": 100.0
}
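Field-level bounds such as `minimum` and `maximum` make simple range checks possible. The following is a minimal sketch, assuming both bounds are inclusive and that a missing bound means "unbounded"; the function `in_range` is hypothetical and not part of biotope.

```python
temperature = {
    "@type": "sc:Field",
    "name": "temperature",
    "dataType": "float",
    "unit": "celsius",
    "minimum": -273.15,
    "maximum": 100.0,
}


def in_range(value, field):
    """Check a value against a field's optional, inclusive minimum and maximum."""
    low = field.get("minimum", float("-inf"))
    high = field.get("maximum", float("inf"))
    return low <= value <= high


measurements = [21.5, 150.0, -300.0]
out_of_range = [m for m in measurements if not in_range(m, temperature)]
print(out_of_range)
```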
Best Practices
- Completeness: Always provide as much metadata as possible for each layer
- Consistency: Use consistent naming conventions and data types
- Validation: Regularly validate your metadata using biotope annotate validate
- Versioning: Include version information for both the dataset and metadata
Common Use Cases
Creating a New Dataset Annotation
- Start with the interactive mode by running biotope annotate interactive.
- Follow the prompts to enter:
  - Dataset information (name, description, license)
  - Distribution details (format, size, URL)
  - Record structure (fields, data types)
  - Field-specific metadata (units, ranges, descriptions)
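The four groups of prompts correspond to the four layers described earlier. As a rough illustration of how the answers fit together, the sketch below nests distribution and record-set entries inside a dataset entry. The `distribution` and `recordSet` keys follow the Croissant format; whether biotope writes exactly this layout is an assumption, and `assemble` is a hypothetical helper.

```python
import json


def assemble(dataset, distributions, record_sets):
    """Combine layer dictionaries into one document without mutating the inputs."""
    document = dict(dataset)
    document["distribution"] = list(distributions)
    document["recordSet"] = list(record_sets)
    return document


dataset = {"@type": "sc:Dataset", "name": "Example Dataset", "license": "MIT"}
distribution = {
    "@type": "sc:DataDownload",
    "name": "Dataset Distribution",
    "contentUrl": "https://example.com/dataset.zip",
}
record_set = {
    "@type": "sc:RecordSet",
    "name": "Main Records",
    "field": [{"@type": "sc:Field", "name": "id", "dataType": "string"}],
}

document = assemble(dataset, [distribution], [record_set])
print(json.dumps(document, indent=2))
```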
Validating Existing Annotations
Run biotope annotate validate --jsonld <file_name.json> to check your metadata against the Croissant ML schema and report any issues.
Future Improvements
The following features are planned for future releases:
- Automatic metadata extraction from file contents
- Integration with LLMs for automated annotation
- File download and automatic annotation
- Enhanced validation capabilities
- Support for additional Croissant ML fields