Ensuring that Machine Learning datasets are not just loadable by code, but semantically understood across research infrastructures.
Our core strategy revolves around extending the ML-focused Croissant standard to bridge the gap between raw data and domain-specific knowledge.
Mapping ML dataset layers to high-level ontologies (e.g., Schema.org, PROV-O, and domain-specific vocabularies).
Using RDF-native multilingual properties (language-tagged literals) so that metadata can be queried and read in multiple languages.
Aligning and integrating with other datasets to broaden compatibility across research infrastructures.
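The points above can be made concrete with a small JSON-LD sketch. The snippet below is only an illustration under stated assumptions: the dataset name, the example.org URLs, the `ex:` vocabulary, and the `localized` helper are all hypothetical and not part of the project. It shows a description carried as language-tagged literals next to Schema.org and PROV-O terms, and a lookup that falls back to English when a language is missing.

```python
import json

# Hypothetical dataset description. The name, URLs, and the ex: vocabulary
# are illustrative assumptions, not taken from the project.
dataset = {
    "@context": {
        "@vocab": "https://schema.org/",
        "prov": "http://www.w3.org/ns/prov#",
        "ex": "https://example.org/vocab/",
    },
    "@type": "Dataset",
    "name": "example-survey",
    # Language-tagged literals: one value per language.
    "description": [
        {"@value": "Survey responses.", "@language": "en"},
        {"@value": "Réponses à une enquête.", "@language": "fr"},
    ],
    # PROV-O link to a (hypothetical) source dataset.
    "prov:wasDerivedFrom": {"@id": "https://example.org/datasets/raw-survey"},
    # Domain-specific term from a (hypothetical) vocabulary.
    "ex:studyDomain": {"@id": "https://example.org/vocab/SocialScience"},
}


def localized(value, lang, fallback="en"):
    """Return the literal tagged with `lang`, else the fallback language, else None."""
    literals = value if isinstance(value, list) else [value]
    by_lang = {
        item.get("@language"): item.get("@value")
        for item in literals
        if isinstance(item, dict)
    }
    return by_lang.get(lang, by_lang.get(fallback))


print(localized(dataset["description"], "fr"))
print(json.dumps(dataset["prov:wasDerivedFrom"]))
```

Because each literal carries its own language tag, a consumer can select a language without any change to the dataset structure.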
Croissant is a high-level format for machine learning datasets that brings together four rich layers to ensure discoverability, portability, and reproducibility.
Layer 1: Defines FileObject (individual files) and FileSet (homogeneous collections). This layer handles physical distribution and checksums for integrity.
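As a rough sketch of what this layer records, the following builds a FileObject entry with a SHA-256 checksum and a FileSet that selects files inside an archive. The helper functions, file names, and URLs are hypothetical, and the dictionaries are simplified. The Croissant specification is the authority on the full set of required properties.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def make_file_object(file_id, content_url, encoding_format, data):
    """Simplified FileObject entry describing one file and its checksum."""
    return {
        "@type": "cr:FileObject",
        "@id": file_id,
        "name": file_id,
        "contentUrl": content_url,
        "encodingFormat": encoding_format,
        "sha256": sha256_hex(data),
    }


def make_file_set(set_id, contained_in, encoding_format, includes):
    """Simplified FileSet entry: files matching a glob inside another resource."""
    return {
        "@type": "cr:FileSet",
        "@id": set_id,
        "containedIn": {"@id": contained_in},
        "encodingFormat": encoding_format,
        "includes": includes,
    }


def verify(file_object, data):
    """True if the bytes match the checksum recorded in the FileObject."""
    return file_object["sha256"] == sha256_hex(data)


# Hypothetical payload and URLs, for illustration only.
payload = b"id,label\n1,cat\n2,dog\n"
csv_file = make_file_object(
    "labels.csv", "https://example.org/data/labels.csv", "text/csv", payload
)
images = make_file_set("images", "archive.zip", "image/jpeg", "*.jpg")

print(verify(csv_file, payload))
```

A loader can call a check like `verify` after download and refuse to proceed when the bytes do not match the recorded checksum.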
Layer 2: Describes the semantic structure of data. Whether it's tabular CSV rows or nested JSON objects, RecordSets provide a common field-based mapping for ML loaders.
Layer 3: Extends the metadata to capture data lifecycle, labeling provenance, and AI safety markers, enabling automated RAI metrics computation.
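To illustrate the field-based mapping, here is a toy reader that turns CSV rows into typed records by following a RecordSet definition. It is a minimal sketch, not the mlcroissant loader: `read_records`, the `CASTS` table, and the sample data are hypothetical, and it assumes every field draws from the same CSV file.

```python
import csv
import io

# Simplified RecordSet: each field names its source file and column.
record_set = {
    "@type": "cr:RecordSet",
    "@id": "labels",
    "field": [
        {
            "@type": "cr:Field",
            "@id": "labels/id",
            "dataType": "sc:Integer",
            "source": {"fileObject": {"@id": "labels.csv"}, "extract": {"column": "id"}},
        },
        {
            "@type": "cr:Field",
            "@id": "labels/label",
            "dataType": "sc:Text",
            "source": {"fileObject": {"@id": "labels.csv"}, "extract": {"column": "label"}},
        },
    ],
}

# Hypothetical mapping from declared data types to Python casts.
CASTS = {"sc:Integer": int, "sc:Text": str}


def read_records(record_set, files):
    """Yield one dict per CSV row, keyed by field @id and cast to the declared type.

    `files` maps a FileObject @id to its CSV text. Assumes all fields
    share the same source file.
    """
    fields = record_set["field"]
    file_id = fields[0]["source"]["fileObject"]["@id"]
    reader = csv.DictReader(io.StringIO(files[file_id]))
    for row in reader:
        yield {
            f["@id"]: CASTS[f["dataType"]](row[f["source"]["extract"]["column"]])
            for f in fields
        }


files = {"labels.csv": "id,label\n1,cat\n2,dog\n"}
records = list(read_records(record_set, files))
print(records)
```

The reader never hard-codes column names. It takes them from the field definitions, which is why the same loader logic can serve different datasets.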
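One simple thing that machine-readable RAI metadata makes possible is an automated completeness check. The sketch below is an assumption-laden illustration: the property list is a small sample of names in the style of the Croissant RAI vocabulary, the metadata values are invented, and the coverage ratio is an illustrative metric, not one defined by the specification.

```python
# Sample of RAI-style property names, assumed here for illustration.
RAI_PROPERTIES = [
    "rai:dataCollection",
    "rai:dataBiases",
    "rai:dataLimitations",
    "rai:personalSensitiveInformation",
]


def is_filled(value):
    """Treat None, empty strings, and empty containers as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    return bool(value)


def rai_coverage(metadata, properties=RAI_PROPERTIES):
    """Return (fraction of properties filled, list of missing properties)."""
    missing = [p for p in properties if not is_filled(metadata.get(p))]
    filled = len(properties) - len(missing)
    return filled / len(properties), missing


# Hypothetical dataset metadata with two of the four properties documented.
metadata = {
    "name": "example-survey",
    "rai:dataCollection": "Online questionnaire.",
    "rai:dataBiases": "Respondents self-selected.",
    "rai:dataLimitations": "   ",
}

coverage, missing = rai_coverage(metadata)
print(coverage, missing)
```

A check like this could run in a publishing pipeline and flag datasets whose documentation falls below a chosen threshold.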
Learn more at the official MLCommons Croissant Specification.
By adopting the Semantic Croissant approach, we deliver services that support the next generation of AI-driven research.