Dataset publication guidelines
To ensure the proper ingestion of your datasets and to facilitate their dissemination on the EU Vocabularies website, we advise you to comply with the following basic rules:
Packaging format and communication
The content of the future publication will be delivered as a zip archive.
The delivery has to take place in accordance with the scheduled “code freeze” date.
Any change of date has to be communicated at least two weeks before the “code freeze”.
Unless defined otherwise, the package will be sent to the following email address:
OP-EU-VOCABULARIES@publications.europa.eu
Content of the publication package
A package will not be accepted for publication unless the following components are included:
Dataset file(s)
- The actual dataset files will always be located in the root folder of the archive
- Depending on the type, the files will be in one of the following formats:
  - Semantic vocabularies: RDF, TTL, XML, JSON-LD
  - Generic vocabularies: CSV, GC, XML, SVG
  - Models: OWL, XML Schema, DTD, XML, TTL
  - Alignments: RDF, TTL, XML
Documentation
- Every dataset type intended for publication will be accompanied by at least one documentation file and one release note
- All documentation files associated with the dataset will be stored in the Documentation folder
- The Documentation folder will be located in the root folder of the main package
- The documentation will be provided only in HTML or PDF format
- Every documentation file will clearly state the dataset name and the title of the document at the beginning (first page or first screen to be displayed)
- If only one documentation file is provided, this file will contain at least the following sections:
  - Title of the document
  - Title of the dataset
  - The scope and intended target of the document
  - A basic description of the dataset
    A main section presenting the dataset at large, as well as its intended use, should be included. Such a description might give details about the structure, usage principles, data models, associated statistics, etc.
- The release notes will be stored in the Release folder, which is located in the root folder of the main package
- The release notes will be delivered as an HTML, PDF or TXT file
- The release note will contain at a minimum: the version ID, a list of the distribution formats included in the release, contact details of the copyright owner and, if possible, a list of the new elements that the release provides
Optionally, and if relevant for the scope of the dataset, a publication package may also contain:
- Sample files – Packed together as a zip file with the name Samples. Stored in the root folder of the main package
- Diff files – Stored as independent files under the folder Diff that is located in the root folder of the main package
Depending on the type of dataset, some elements of the package might differ.
Any such deviation has to be clarified in advance with the publication team (OP-EU-VOCABULARIES@publications.europa.eu)
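As a rough illustration of the package layout described above, the following Python sketch checks an archive listing before delivery. It is not an official validation tool: the function name `check_package`, the mapping from the listed formats to file extensions (for example `.jsonld` for JSON-LD and `.xsd` for XML Schema) and the wording of the messages are all assumptions made for this example.

```python
from pathlib import PurePosixPath

# Assumed mapping from the formats listed above to file extensions.
DATASET_EXTENSIONS = {".rdf", ".ttl", ".xml", ".jsonld", ".csv",
                      ".gc", ".svg", ".owl", ".xsd", ".dtd"}
DOCUMENTATION_EXTENSIONS = {".html", ".pdf"}
RELEASE_NOTE_EXTENSIONS = {".html", ".pdf", ".txt"}


def check_package(names):
    """Return a list of problems found in an archive listing (empty if none)."""
    problems = []
    # Directory entries in a zip listing end with "/"; only files are checked.
    files = [PurePosixPath(n) for n in names if not n.endswith("/")]

    def has_file(folder, extensions):
        # True if the given root-level folder holds a file with an accepted extension.
        return any(
            len(p.parts) >= 2 and p.parts[0] == folder
            and p.suffix.lower() in extensions
            for p in files
        )

    # Dataset files must sit directly in the root folder of the archive.
    if not any(len(p.parts) == 1 and p.suffix.lower() in DATASET_EXTENSIONS
               for p in files):
        problems.append("no dataset file in the root folder")
    if not has_file("Documentation", DOCUMENTATION_EXTENSIONS):
        problems.append("no HTML or PDF file in the Documentation folder")
    if not has_file("Release", RELEASE_NOTE_EXTENSIONS):
        problems.append("no HTML, PDF or TXT release note in the Release folder")
    # Spaces are not accepted in any file name of the package.
    problems.extend(f"space in file name: {n}" for n in names if " " in n)
    return problems
```

The listing can be obtained from a real archive with `zipfile.ZipFile(path).namelist()` from the Python standard library. An empty result only means that the structural rules sketched here are met; it says nothing about the content of the files.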
File naming and conventions
In order to ensure clarity in communicating the scope of each file to the intended users it is advisable to use a proper naming convention for the various files stored in the publication package.
Our preferred file naming structure follows the rules below:
DA – [Required] Dataset name or acronym (e.g. EuroVoc, IMMC, ECLAS, etc.)
FC – [Required] File content, intent or distribution (e.g. Alignment, Example, User_manual, Release_note, Diff, SKOS, MARC, etc.)
VS – [Optional] Version ID or date of the dataset
EXT – File extension (e.g., RDF, TTL, XML, PDF, CSV, etc.)
File name = DA_FC_VS.EXT
Spaces are not accepted in the name of the package or in the name of any file included in it.
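The naming pattern can be illustrated with a small Python helper. This is only a sketch: the function name `build_file_name` and the version value used in the usage example are hypothetical, and the helper simply joins the components in the order DA, FC, VS and rejects empty components or components that contain whitespace.

```python
def build_file_name(dataset, content, extension, version=None):
    """Assemble DA_FC_VS.EXT; the version part (VS) is optional."""
    components = [dataset, content]
    if version is not None:
        components.append(version)
    # Accept the extension with or without a leading dot.
    extension = extension.lstrip(".")
    for value in components + [extension]:
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"empty component or whitespace in: {value!r}")
    return "_".join(components) + "." + extension


# Illustrative values only; "4.20" is a made-up version ID.
print(build_file_name("EuroVoc", "SKOS", "rdf", "4.20"))  # EuroVoc_SKOS_4.20.rdf
print(build_file_name("IMMC", "User_manual", "pdf"))      # IMMC_User_manual.pdf
```

Because the FC component may itself contain underscores (for example User_manual), the helper only builds names and does not try to split an existing name back into its parts.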
In case of non-compliance
If an already existing convention (for content, labels, etc.) was defined and/or used for previously published packages, please inform the publication team (OP-EU-VOCABULARIES@publications.europa.eu) to identify the best approach to be followed.
Conference tutorials
Navigating Reference Data: Advanced Techniques for Editing, Visualising and Retrieving Reference Datasets
This half-day tutorial (ESWC 2026) demonstrates how interoperable, multilingual semantic assets can be collaboratively produced, validated, enriched, and published using mature, large-scale tools deployed by the Publications Office of the European Union. It presents a complete workflow built around VocBench, ShowVoc, SELEN, and Cellar, bridging semantic-web research with production-grade infrastructures that support EU-wide interoperability.
The tutorial showcases an ecosystem where core Semantic Web standards (RDF, SKOS, OWL, SPARQL, SHACL) are used at scale to manage reference data, power legal and open-data portals, and feed constantly maintained “AI-ready” datasets. Participants will see how semantic assets developed in this environment underpin services such as EU Vocabularies, EUR-Lex and data.europa.eu, and how similar approaches can be reused in national administrations, data spaces, research infrastructures and other large organisations interested in FAIR, machine-actionable semantics.
The tutorial covers the full lifecycle of semantic assets, from collaborative editing and validation to enriched publication and programmatic access. It is structured around four core components:
VocBench – a collaborative web-based editor for ontologies, SKOS(/XL) thesauri, lexicons, authority tables and other RDF datasets. It supports multilingual editing (with change tracking, versioning, and SHACL validation) and staged workflows, allowing editors to work on draft projects and promote stable versions for publication.
ShowVoc – a multilingual catalogue and viewer for published vocabularies. It provides an intuitive interface for browsing ontologies and thesauri with search, faceted navigation, concept visualisations, and multi-format exports. ShowVoc also exposes persistent URIs, REST APIs and SPARQL endpoints to enable machine access to the reference data.
SELEN – an AI-assisted auto-tagging and semantic enrichment service integrated into editorial workflows. SELEN analyses textual content (titles, abstracts, descriptions) and suggests relevant concepts (e.g. from EuroVoc) to accelerate and improve the consistency of manual annotations. It uses language-processing pipelines and vector representations of terms to rank candidate annotations, with a human-in-the-loop approach for editors to accept or refine suggestions.
Cellar – the common data repository and RDF knowledge graph of the Publications Office. It stores metadata and content for EU portals (EUR-Lex, EU Publications, data.europa.eu, etc.) and exposes them via RESTful services and a public SPARQL endpoint. In the tutorial, practical SPARQL queries against Cellar will demonstrate how to retrieve vocabularies and their usage in EU content, with discussion of query design, pagination, and performance.
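To give an idea of the query design and pagination topics mentioned for the Cellar session, the following Python sketch builds one page of a SKOS concept listing as a SPARQL query string. It is not taken from the tutorial material: the function name `paged_concept_query`, the query shape and the placeholder scheme URI are assumptions, and the resulting string would still have to be sent to a SPARQL endpoint by a client of your choice.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def paged_concept_query(scheme_uri, lang="en", page=0, page_size=100):
    """Build one page of a SKOS concept listing as a SPARQL query string."""
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size >= 1")
    # Minimal guard against breaking out of the IRI or the string literal.
    if any(ch in scheme_uri for ch in '<>" ') or not lang.replace("-", "").isalnum():
        raise ValueError("unsafe scheme URI or language tag")
    return (
        f"PREFIX skos: <{SKOS}>\n"
        "SELECT ?concept ?label WHERE {\n"
        f"  ?concept skos:inScheme <{scheme_uri}> ;\n"
        "           skos:prefLabel ?label .\n"
        f'  FILTER (lang(?label) = "{lang}")\n'
        "}\n"
        # A fixed sort order keeps consecutive pages from overlapping.
        "ORDER BY ?concept ?label\n"
        f"LIMIT {page_size}\n"
        f"OFFSET {page * page_size}"
    )


# Hypothetical scheme URI, used only to show the shape of the query.
print(paged_concept_query("http://example.org/scheme", page=2, page_size=50))
```

Paging with LIMIT and OFFSET is simple, but the endpoint has to sort the full result for every page, so small page sizes over very large result sets can be slow; this is the kind of trade-off the performance discussion refers to.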
After the tutorial, participants will be able to:
- Grasp the end-to-end lifecycle of reference data management in a large institution – from collaborative creation to publication and querying.
- Use VocBench to model, validate, and maintain multilingual SKOS/OWL vocabularies and authority tables.
- Leverage ShowVoc to disseminate vocabularies via user-friendly web interfaces and APIs for both human and machine consumers.
- Write SPARQL queries on Cellar to retrieve semantic assets and examples of their usage in EU content.
- Understand AI enrichment tools like SELEN for semantic tagging, and evaluate their benefits and limitations in a curation workflow.
Relevant references
VocBench video tutorials
ShowVoc video tutorials
09:00 – 09:15 Introduction and motivation
09:15 – 09:45 Session 1 – VocBench: collaborative editing and governance
09:45 – 10:00 Session 2 – ShowVoc: publication, browsing and APIs
10:00 – 10:30 Session 3 – SELEN: AI-assisted semantic enrichment
10:30 – 11:00 Break
11:00 – 11:15 Session 4 – Cellar: SPARQL access to the Publications Office content
11:15 – 12:15 Hands-on exercises
12:15 – 12:30 Wrap-up and Q&A
Intended audience: This tutorial is designed for semantic web practitioners, ontology engineers, data stewards, data-space architects, researchers and developers who work with reference data, linked data portals or interoperable information systems. It is especially relevant to professionals in public administrations, European institutions, research infrastructures, and other large organisations that maintain controlled vocabularies or ontologies, or that integrate such reference data into applications (including AI systems). A basic familiarity with RDF and SKOS is recommended, but the material will remain accessible to motivated newcomers.
Anikó Gerencsér is the Team leader of the Reference data team at the Publications Office of the European Union. She holds a Master's degree in Library and Information Science and a PhD in Italian Literature from Eötvös Loránd University (ELTE) in Budapest. In her current role she coordinates the maintenance of taxonomies, thesauri, ontologies and authority lists and their publication on the EU Vocabularies website. Her team actively supports EU institutions and Data Spaces by providing reference data maintenance solutions and publication services, data modelling according to Semantic Web standards, and alignments between controlled vocabularies. She is involved in coordinating the development of the VocBench Suite tools and has long experience in giving presentations, demos, training sessions and workshops on the use of the tools to clients from EU institutions and beyond.
Vassilis Tzouvaras received a B.Eng. in Electronic Engineering from Essex University, an M.Eng. in Automatic Control & Systems Engineering from Sheffield University and a Ph.D. in the field of knowledge technologies from the National Technical University of Athens. He is an External Semantic Web Expert in the Reference Data Management team at the Publications Office of the European Union, contributing to the management of EU Vocabularies through VocBench, ShowVoc, and the end-to-end semantic publication workflow. In parallel, he is a Senior Researcher at the National Technical University of Athens (NTUA), where his research focuses on Semantic Web technologies, knowledge graphs, and AI-driven data enrichment. He has experience in the design of AI-assisted semantic annotation workflows, including the integration and evaluation of tools such as SELEN for semantic auto-tagging and the creation of high-quality, AI-ready training datasets.
Sébastien Albouze holds an engineering degree (Master’s level) in Computer Science from the École des Mines de Nancy. He is currently an External Semantic Web Expert in the Reference Data Management team at the Publications Office of the European Union. He is responsible for the operation and evolution of VocBench and ShowVoc, and for supporting the end-to-end publication workflow of EU reference data and vocabularies. His expertise includes SPARQL query design, semantic data validation, and large-scale RDF publishing over the Cellar knowledge graph. He has extensive experience in managing production-grade semantic infrastructures and supporting institutional users in the reuse of authoritative reference data.
Cosimo Palma currently serves as a Knowledge Management Assistant at the Publications Office of the European Union. He recently obtained a PhD in Artificial Intelligence from the University of Naples "L'Orientale" and the University of Pisa, where his research explored integrating Large Language Models with Knowledge Graphs for Automatic Story Generation in Cultural Heritage. Drawing on his background in Philosophy, History, Computational Linguistics and Computer Science, he aims through his work to foster human-centered AI. In the reference data team of the Publications Office, he contributes to the maintenance and use of semantic tools including VocBench and ShowVoc, and applies SPARQL queries on the Cellar knowledge graph to support semantic annotation and vocabulary reuse.