AI-ready datasets
Artificial intelligence depends on access to high-quality, reliable and structured data. Training models responsibly requires datasets that are governed, interoperable, traceable and enriched with authoritative metadata. These qualities are often difficult for research and industry to obtain. Within the European ecosystem, this need has been recognised through the European AI policy agenda, which emphasises strengthening data ecosystems and improving access to trustworthy, reusable datasets for AI development.
As the official provider of publishing and reference data services for the European Union, the Publications Office plays a unique enabling role. Through the combination of Cellar, Europe’s common repository for documents and metadata, and EU Vocabularies, Europe’s reference and semantic asset hub, we are able to generate curated, AI-ready corpora. These corpora leverage human-annotated resources, authoritative taxonomies, persistent identifiers, multilingual assets and FAIR-aligned publication practices, making them highly suitable for training, validating and benchmarking AI systems.
This page introduces the first corpus produced under this initiative, namely a set of general publications extracted from Cellar and tagged with EuroVoc concepts. The aim of this dataset is to support the development of machine learning models for document classification, semantic tagging and entity recognition in institutional environments.
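As an illustration of how a concept-tagged corpus can feed a document classification model, the minimal sketch below turns tagged documents into multi-label training targets. It is not the published format of the dataset: the record layout, the field names (`doc_id`, `text`, `concepts`) and the concept identifiers are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TaggedDocument:
    """Hypothetical record: one publication and its assigned concept tags."""
    doc_id: str
    text: str
    concepts: list[str]


def build_label_index(docs: list[TaggedDocument]) -> dict[str, int]:
    """Map every concept seen in the corpus to a stable column position."""
    labels = sorted({concept for doc in docs for concept in doc.concepts})
    return {label: position for position, label in enumerate(labels)}


def to_multi_hot(doc: TaggedDocument, label_index: dict[str, int]) -> list[int]:
    """Encode a document's concepts as a 0/1 vector (multi-label target)."""
    vector = [0] * len(label_index)
    for concept in doc.concepts:
        vector[label_index[concept]] = 1
    return vector


# Placeholder documents and concept identifiers, for illustration only.
docs = [
    TaggedDocument("doc-1", "First example text", ["concept-a", "concept-b"]),
    TaggedDocument("doc-2", "Second example text", ["concept-b", "concept-c"]),
]

label_index = build_label_index(docs)
targets = [to_multi_hot(doc, label_index) for doc in docs]
```

Because a publication can carry several concepts at once, the targets are multi-hot vectors rather than a single class label, which is the usual framing for semantic tagging tasks.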
This first corpus demonstrates how Cellar and EU Vocabularies act as an AI-ready foundation: they transform public sector knowledge into reusable training assets for AI applications, support evaluation of machine learning solutions, and contribute to data spaces, AI factories and semantic services across Europe.
This page will continue to evolve with updates on datasets, usage guidance, governance principles and links for access, contributing to trustworthy European AI development grounded in authoritative public data.