Publications Office of the EU
General publications corpus - EU Vocabularies

EU publications corpus

This corpus contains a curated set of EU publications (including educational materials, reports, briefings and guidelines) extracted from the CELLAR repository, each associated with EuroVoc tags. It was developed to provide a large and robust dataset that covers a broad range of domains and tags.

Methodology 

The creation of the corpus was a multi-phase process designed to systematically refine a vast and noisy initial dataset into a high-quality, balanced corpus.

Phase 1: General Data Extraction from CELLAR 

The foundational step involved a bulk extraction of links to EU publications, together with their metadata and, in particular, their EuroVoc tags, from the CELLAR repository. This initial trawl was deliberately broad, capturing a wide array of documents to form the raw material for the corpus. We restricted the extraction to English-language documents in PDF/A-1b or PDF/A-2a format and collected 13 984 document links.

Phase 2: Exploratory Data Analysis and Frequency Distribution

To further refine this pool, a detailed analysis of the 3 965 EuroVoc tags was performed. As shown in Graph 1 below, the tag EU Member State is a clear outlier, with over 2 500 occurrences.

Graph 1: 50 most frequent EuroVoc tags by Frequency in the general extraction 

 

This tag is so general that including it in a filtering model is unnecessary and would drown out more specific and informative tags. The frequency of the remaining tags decreases gradually, with no clear cutoff, which made it difficult to separate "high-frequency" from "low-frequency" tags and required a selection threshold to be set by judgment rather than derived from the data. Out of the 1 406 active tags, we therefore had to select the ones to use to refine our extraction.
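The frequency analysis described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the `doc_tags` mapping, the document identifiers and the helper name `tag_frequencies` are assumptions made for the example, and the toy data is invented.

```python
from collections import Counter

def tag_frequencies(doc_tags):
    """Count, for each tag, the number of documents that carry it."""
    counts = Counter()
    for tags in doc_tags.values():
        # set() ensures a tag repeated within one record is counted once
        counts.update(set(tags))
    return counts

# Hypothetical toy data: document identifier -> list of EuroVoc tag labels
doc_tags = {
    "doc-1": ["EU Member State", "vocational training"],
    "doc-2": ["EU Member State", "environmental policy"],
    "doc-3": ["vocational training", "EU statistics"],
}

freq = tag_frequencies(doc_tags)

# Tags sorted from most to least frequent, as plotted in a frequency graph
for tag, count in freq.most_common():
    print(f"{count:5d}  {tag}")
```

On the real extraction, an outlier such as EU Member State would appear at the top of this ranking, followed by a long and gradually decreasing tail.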

Phase 3: Scenario Modelling and Selection

Several extraction scenarios were evaluated, varying both the number of tags considered (for example, the 50 vs 150 most frequent EuroVoc tags) and the minimum number of such tags that a document must carry. Our goal was to find a trade-off between tag diversity and the number of unique documents selected. We also wanted to avoid over-representing the least frequent tags, which become increasingly varied further down the frequency ranking.
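A comparison of this kind can be sketched as a small grid evaluation. The function name `evaluate_scenarios`, its parameters and the toy data are assumptions made for illustration; the sketch reports, for each scenario, how many documents are kept and how many distinct tags those documents cover.

```python
from collections import Counter
from itertools import product

def evaluate_scenarios(doc_tags, top_ns, min_tag_counts, excluded=frozenset()):
    """Return {(top_n, min_tags): (documents kept, distinct tags covered)}."""
    counts = Counter()
    for tags in doc_tags.values():
        counts.update(set(tags) - set(excluded))
    # Tags ranked from most to least frequent (ties are ordered arbitrarily)
    ranked = [tag for tag, _ in counts.most_common()]

    results = {}
    for top_n, min_tags in product(top_ns, min_tag_counts):
        allowed = set(ranked[:top_n])
        # For each document, the tags that fall inside the allowed set
        kept = [set(tags) & allowed for tags in doc_tags.values()
                if len(set(tags) & allowed) >= min_tags]
        covered = set().union(*kept) if kept else set()
        results[(top_n, min_tags)] = (len(kept), len(covered))
    return results

# Hypothetical toy data: document identifier -> list of tag labels
doc_tags = {
    "a": ["t1", "t2", "t3"],
    "b": ["t1", "t2"],
    "c": ["t1", "t2"],
    "d": ["t1", "t4"],
}

results = evaluate_scenarios(doc_tags, top_ns=[2, 4], min_tag_counts=[1, 2])
for scenario, (n_docs, n_tags) in sorted(results.items()):
    print(scenario, n_docs, n_tags)
```

Tabulating these two numbers side by side makes the trade-off visible: stricter scenarios keep fewer documents, while a larger tag set widens coverage.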

The chosen scenario requires each document in the golden sample to contain at least two EuroVoc tags drawn from the 100 most frequent tags in the general extraction, excluding the EU Member State outlier. This scenario is a compromise: it scales the original extraction down to a manageable 4 114 documents while still offering good coverage of the tag space.
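The selection rule above can be expressed as a short filter. This is a sketch under stated assumptions: the function name `select_documents` and the toy data are hypothetical, and the sketch assumes the outlier is removed before the most frequent tags are ranked, which the description does not specify.

```python
from collections import Counter

def select_documents(doc_tags, top_n=100, min_tags=2,
                     excluded=("EU Member State",)):
    """Keep documents carrying at least `min_tags` of the `top_n` most frequent tags."""
    counts = Counter()
    for tags in doc_tags.values():
        counts.update(set(tags))
    # Assumption: excluded tags are dropped before ranking
    for tag in excluded:
        counts.pop(tag, None)
    # Ties at the cutoff are broken arbitrarily by most_common()
    allowed = {tag for tag, _ in counts.most_common(top_n)}
    return {doc_id for doc_id, tags in doc_tags.items()
            if len(allowed & set(tags)) >= min_tags}

# Hypothetical toy data: document identifier -> list of EuroVoc tag labels
doc_tags = {
    "doc-1": ["EU Member State", "vocational training"],
    "doc-2": ["vocational training", "EU statistics"],
    "doc-3": ["EU Member State", "EU statistics", "environmental policy"],
    "doc-4": ["environmental policy"],
}

selected = select_documents(doc_tags)
print(sorted(selected))
```

In the toy data, doc-1 is rejected because only one of its tags counts once the outlier is excluded, and doc-4 is rejected because it carries a single tag.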

This selection strategy is appropriate for building models aiming to capture multidimensional content tagging and intermediate diversity of topics. This corpus suits applications requiring moderately granular tagging such as domain-specific content categorization, improved metadata quality, and enhanced document discoverability through more nuanced tag combinations.

Content description

The corpus reflects a clear focus on the EU publications stored in the CELLAR repository, with the data heavily centered on Education and communications (over 3 800 occurrences) and European Union (approx. 2 500), as shown in Graph 2 below. This distribution suggests that the collection is primarily composed of documents designed for public outreach, information sharing, and institutional transparency. Because these are also the most common categories in the original source, the corpus is representative of the EU publications found in CELLAR.

While these two areas lead the dataset, the corpus remains highly diverse. Significant coverage in domains such as EMPLOYMENT AND WORKING CONDITIONS, PRODUCTION, TECHNOLOGY AND RESEARCH, SOCIAL QUESTIONS and ECONOMICS ensures that the selected documents represent a wide range of EU activities.

Graph 2: EuroVoc domains distribution by Frequency in the EU publications corpus 


 

Graph 3 below shows the distribution of the 50 most frequent EuroVoc tags in the selected documents. The corpus gives representative coverage of the most frequent EuroVoc tags in the original extraction without being diluted by extremely rare tags.

The data also reveals that while "European Union" serves as a broad organizational category, the specific tags provide deep insights into specialized fields like environmental policy, vocational training, and regional policy. This makes the corpus a versatile tool for any user looking for more than just high-level policy papers. For instance, the high frequency of tags related to EU statistics, agricultural data, and employment metrics shows a strong foundation of empirical evidence. Furthermore, the inclusion of user guides and research reports highlights the technical and scientific depth available throughout the collection, moving beyond simple administrative documentation.

Graph 3: 50 most frequent EuroVoc tags by Frequency in the EU publications corpus 


 

Recap statistics and link to the corpus 

Size: 4 114 documents

Format: PDF/A-1b or PDF/A-2a

Language: English

Tags: at least 2 of the 100 most frequent EuroVoc tags (excluding EU Member State)

Domain: EU publications

Links to the resources:

Corpus part 1

Corpus part 2

Corpus part 3

Corpus part 4

Corpus part 5

AI corpus tags