Common Data Model
The aim of the Cellar is to allow the sharing and reuse of the publications and metadata produced at the Publications Office. For that purpose, it relies on a production chain that combines different technologies, including semantic technologies such as the CDM.
The creation of a specialized ontology at the Publications Office, described in the Web Ontology Language (OWL), is motivated by guidelines developed by the World Wide Web Consortium (W3C) for a web based on standard development processes, promoting the principles of accessibility, internationalization, privacy, security, and overall efficiency of the web. The use of semantic web technologies in turn supports the European Commission’s priorities of interoperability.
To allow a fast and easy way to make documents accessible, it is important for them to be published in a consistent manner and to keep them updated efficiently. Part of the solution lies in the way those documents and their metadata are received. This is what is at the heart of the IMMC protocol, a standardized way of receiving metadata, commonly agreed on at the inter-institutional level, and aimed at "sustaining data flows in an efficient and scalable way".
Metadata instances compliant with the IMMC schema are then converted to information matching the Common Metadata Modelling system used in the Cellar, i.e., the CDM. The CDM structures the documents’ metadata and allows those documents to be easily retrieved in the Cellar through the formal description and connection of their metadata.
Finally, the CDM is also the foundation of the Virtuoso triple store on which the Cellar relies, which is at the heart of the communication between the SPARQL interface and the documents stored in the Cellar.
The CDM allows for a more precise description of the metadata in the Cellar. It describes all the information related to the stored documents through the definition of concepts. It also describes the links between the documents and their metadata by using diverse properties, so that they follow Linked Data standards.
For example, the CDM can connect different documents created by the same author (e.g., documents 1, 2 and 4) or issued in the same type of publication (e.g., documents 1, 2 and 3), or both, meaning documents created by the same author and issued in the same type of publication (e.g., documents 1 and 2).
Those concepts and properties are partly defined by existing semantic methodologies and models such as the FRBR Model, authority tables, the EuroVoc thesaurus, and taxonomies.
However, the CDM also makes it possible to customize these concepts and properties to fit the business needs for describing documents. Finally, these concepts and properties use external standards of the semantic web such as SKOS, OWL or RDFS, depending on the level of complexity the model displays, OWL being the standard that is usually used for complex models.
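The author and publication-type example given earlier can be sketched with a handful of subject/property/value triples. This is only an illustration of the idea of linked metadata: the document identifiers, the property names `created_by` and `published_in`, and the values are all hypothetical placeholders, not actual CDM identifiers.

```python
# Hypothetical triples (subject, property, value). None of these names
# come from the CDM; they only mirror the "document 1..4" example.
TRIPLES = [
    ("doc1", "created_by", "author_A"),
    ("doc2", "created_by", "author_A"),
    ("doc4", "created_by", "author_A"),
    ("doc3", "created_by", "author_B"),
    ("doc1", "published_in", "type_X"),
    ("doc2", "published_in", "type_X"),
    ("doc3", "published_in", "type_X"),
    ("doc4", "published_in", "type_Y"),
]


def subjects_with(triples, prop, value):
    """Return the set of subjects that carry the given property/value pair."""
    return {s for s, p, o in triples if p == prop and o == value}


same_author = subjects_with(TRIPLES, "created_by", "author_A")
same_type = subjects_with(TRIPLES, "published_in", "type_X")
both = same_author & same_type  # same author AND same publication type

print(sorted(same_author))  # ['doc1', 'doc2', 'doc4']
print(sorted(same_type))    # ['doc1', 'doc2', 'doc3']
print(sorted(both))         # ['doc1', 'doc2']
```

Because every link is an explicit statement, combining criteria reduces to a set intersection over the same data.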
The Common Data Model (CDM) is the formal description of the EU institutions’ official documents, of the legislative decision-making process, and of its related publications.
As part of the foundations of the Cellar, the CDM ontology describes more than 200 different types of documents and thereby supports the Cellar in disseminating them across all 76 EU institutions, bodies, and agencies.
The CDM describes documents created for legal and legislative purposes (e.g., the Official Journal, case law, consolidated legislation, preparatory acts, etc.) as well as general publications usually identified by a catalogue number (e.g., reports, periodical serials, books, brochures, etc.).
In addition to the WEMI architecture entailed by the FRBR model, which focuses mainly on documents as we’ll see in the next section, the CDM also uses additional key concepts related to the creation of official documents in the EU, namely the concepts of AGENT, DOSSIER, and EVENT.
For example, with the CDM, we can formally express the following statement “an advocate general delivers a case law” by defining an AGENT (e.g., class cdm:advocate_general) that acts (e.g., object property cdm:delivers) on a WORK (e.g., class cdm:case-law). In other words, the class AGENT refers to entities that can act in the document’s creation. Those entities can be humans such as advocates and judges, or they can be more abstract, non-human entities, such as institutions, countries, or organizations.
The creation process of a document/object can also be formalized in the CDM through the identification of a TEMPORAL ENTITY (e.g., DOSSIER or EVENT) related to it.
The DOSSIER class is a container class referring to a timespan over which a procedure of the institution can extend. For example, it can refer to the timespans during which cases at court (cdm:case_court) or joint declarations (cdm:joint-declaration) take place. The DOSSIER class is related to, and sometimes must contain, at least one EVENT, like the cdm:case_court we illustrate below. The EVENT class refers to precise moments in time that are related to the realization of procedures and that were then documented.
The CDM is partially based on the model of Functional Requirements for Bibliographic Records (FRBR) published by the International Federation of Library Associations and Institutions (IFLA). The FRBR proposes 4 concepts to formalize relations between bibliographic entities: WORK, EXPRESSION, MANIFESTATION, and ITEM. It is otherwise known as the WEMI model (IFLA, n.d.). We tailored this WEMI model to match the CDM’s purpose, which means the EXPRESSION and ITEM entities differ from their original description.
- A WORK corresponds to the abstract idea of an intellectual or artistic content. For example, it can refer to a legal resource such as the Official Journal, case law, consolidated legislation, and other legal and legislative publications, without referring to a particular EXPRESSION of it.
- An EXPRESSION corresponds to the language used for editing a WORK.
- A MANIFESTATION is the embodiment of an EXPRESSION of a WORK. In our legal resource case, it can refer for example to the PDF format (MANIFESTATION) used to issue the text in English (EXPRESSION) of the Official Journal (WORK).
- An ITEM is a single exemplar of a MANIFESTATION. In our case, it can refer to one specific issue in a physical format (ITEM), usually identified by a cataloguing number, of the Official Journal (WORK). It can take the form of a physical copy of the journal (MANIFESTATION) that contains text in English (EXPRESSION) and that has a unique DOI. As mentioned on the IFLA webpage, “Cataloguing is generally done, based on an item directly available to a cataloguer” (IFLA, n.d.).
That way, our adapted FRBR model entails that an ITEM belongs to one or more MANIFESTATIONS, each of which manifests exactly 1 EXPRESSION, which in turn belongs to exactly 1 WORK.
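The WEMI chain described above can be sketched as a small set of Python data classes. This is a minimal illustration of the cardinalities stated in the text (an ITEM has 1 to N MANIFESTATIONS, a MANIFESTATION has exactly 1 EXPRESSION, an EXPRESSION has exactly 1 WORK). The class and field names are assumptions made for the example and do not reproduce CDM identifiers.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Work:
    label: str


@dataclass(frozen=True)
class Expression:
    language: str
    work: Work  # exactly 1 WORK


@dataclass(frozen=True)
class Manifestation:
    file_format: str
    expression: Expression  # exactly 1 EXPRESSION


@dataclass(frozen=True)
class Item:
    identifier: str
    manifestations: Tuple[Manifestation, ...]  # 1 to N MANIFESTATIONS

    def __post_init__(self):
        # Enforce the "minimum 1" side of the relation at construction time.
        if len(self.manifestations) < 1:
            raise ValueError("an ITEM must belong to at least one MANIFESTATION")


def works_of(item):
    """Walk ITEM -> MANIFESTATION -> EXPRESSION -> WORK."""
    return {m.expression.work for m in item.manifestations}


oj = Work("Official Journal")
oj_en = Expression("English", oj)
oj_en_pdf = Manifestation("PDF", oj_en)
item = Item("item-001", (oj_en_pdf,))  # "item-001" is a made-up identifier

print([w.label for w in works_of(item)])  # ['Official Journal']
```

Making the single-valued links plain attributes and the multi-valued link a tuple keeps the "exactly 1" and "1 to N" rules visible in the types themselves.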
We can further illustrate our adaptation of the FRBR model with an excerpt of the metadata associated with the published issue L_202400568 of the Official Journal:
The CDM uses this WEMI architecture as the foundation for a more complex model representation of publications. Indeed, each of these abstract concepts is described more precisely in the scope of European publications through the CDM.
For example, the WORK class has about 250 subclasses to account for the different types of publications that can be issued by the EU’s institutions. We only display an excerpt of these subclasses of WORK to illustrate the diversity of publications that can be described with the CDM. These subclasses are not fixed and can evolve according to changes in the businesses’ needs.
Though less extensively, the EXPRESSION, MANIFESTATION and ITEM classes are also more detailed than WEMI’s basic architecture to accommodate the Publications Office’s needs.
The CDM also relies on authority tables, otherwise known as Name Authority Lists (NALs), that “help harmonize and standardize the codes and associated labels used in various environments” (Publications Office of the European Union, n.d.). NALs provide a controlled vocabulary that reduces linguistic and orthographic variations, as well as the ambiguity and redundancy of acronyms across institutions.
The NALs are concepts defined with a resource IRI such as:
For example, we mentioned the AGENT cdm:country earlier. This agent is constrained in the CDM as having a skos:inScheme value corresponding to the following authority list:
Any metadata in the Cellar that corresponds to the concept cdm:country has to respect encoding standards defined in the authority table. That way, we avoid as many variations in the metadata as there are ways to refer to a given country across languages (e.g., Belgium, Belgique, Bélgica, etc.) or across possible acronyms (e.g., BE, BEL, etc.).
These authority lists are also used, for example, for metadata referring to languages, corporate bodies, institutions, types of procedures or treaty classification.
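The effect of an authority table on country metadata can be sketched as a lookup that collapses label and acronym variants into a single concept IRI. This is a simplified illustration, not the Cellar's actual mechanism: the scheme IRI below is a placeholder under `example.org`, and the table content is limited to the variants mentioned in the text.

```python
# Placeholder scheme IRI, used only for this sketch (not the real NAL IRI).
SCHEME = "http://example.org/authority/country/"

# Hypothetical authority table: one code per country, with the variants
# that should all resolve to it.
COUNTRY_TABLE = {
    "BEL": {"Belgium", "Belgique", "Bélgica", "BE", "BEL"},
}

# Reverse index: case-folded variant -> authority code.
VARIANT_TO_CODE = {
    variant.casefold(): code
    for code, variants in COUNTRY_TABLE.items()
    for variant in variants
}


def normalise_country(raw):
    """Map a free-text country mention to its authority IRI, or raise."""
    code = VARIANT_TO_CODE.get(raw.strip().casefold())
    if code is None:
        raise ValueError(f"{raw!r} is not in the authority table")
    return SCHEME + code


print(normalise_country("Belgique"))  # http://example.org/authority/country/BEL
print(normalise_country(" be "))      # http://example.org/authority/country/BEL
```

Rejecting unknown values, rather than storing them as-is, is what keeps the vocabulary controlled.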
How do properties create links between concepts?
As an OWL ontology, the CDM allows the creation of as many properties as needed to fully describe the many interactions between concepts. Currently, there are approximately 1000 relations between the objects described in the CDM.
This allows metadata to be connected in the Cellar in both inferred and non-inferred ways.
For example, in the Cellar, the metadata may say that “advocate general Y delivers case law X”. Suppose that the AGENT and the WORK are connected in a non-inferred way: this means that the link between the two objects is obtained directly from the metadata of the document ingested in the Cellar.
Then, because there is an inverse object property defined for cdm:delivers in the CDM (cdm:delivered_by), the Cellar will also be able to infer the following statement: “case law X is delivered by advocate general Y”.
That way, it is possible to retrieve the information for the case law X coming from the agent’s metadata, but it is also possible to retrieve the agent’s information coming from the case law metadata.
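The inference step described above can be sketched as follows. This is a minimal illustration of inverse-property reasoning over a set of triples, not the reasoner used by the Cellar. The property names `cdm:delivers` and `cdm:delivered_by` come from the text, while the `ex:` instance identifiers are hypothetical.

```python
# Inverse pair taken from the text; the mapping structure is an assumption.
INVERSE_OF = {"cdm:delivers": "cdm:delivered_by"}


def with_inverses(triples, inverse_of):
    """Return the asserted triples plus those implied by inverse properties."""
    # Make the mapping usable in both directions.
    both_ways = dict(inverse_of)
    both_ways.update({v: k for k, v in inverse_of.items()})

    closure = set(triples)
    for s, p, o in triples:
        if p in both_ways:
            closure.add((o, both_ways[p], s))
    return closure


# Non-inferred statement, as it would arrive with the ingested metadata.
asserted = {("ex:advocate_general_Y", "cdm:delivers", "ex:case_law_X")}

closure = with_inverses(asserted, INVERSE_OF)
inferred_only = closure - asserted

print(inferred_only)
# {('ex:case_law_X', 'cdm:delivered_by', 'ex:advocate_general_Y')}
```

Keeping the asserted and inferred statements distinguishable, as done here with a set difference, mirrors the distinction between non-inferred and inferred links.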
How do properties shape metadata?
In addition to object properties, OWL ontologies also allow the definition of data properties. These properties describe the type of data that is expected for a given concept.
For example, in the CDM the data property cdm:case-law_affaire_year specifies that this metadata has a domain (i.e., the subject of the data property) that is cdm:case-law and a range (i.e., the value of this property) that is xsd:gYear. In other words, to associate a year of the affaire to a case law, the value needs to be expressed in an xsd:gYear format, and this constraint comes directly from the CDM rules.
We have approximately 900 data properties in the CDM.
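The domain and range rule for cdm:case-law_affaire_year can be sketched as a small validation function. This sketch treats domain and range as validation rules, as the text describes them, and it ignores subclass relations for brevity. The regular expression follows the lexical form of xsd:gYear from XML Schema, and the function and table names are assumptions made for the example.

```python
import re

# Lexical form of xsd:gYear: at least four digits, optional leading minus,
# optional timezone.
GYEAR = re.compile(
    r"-?([1-9][0-9]{3,}|0[0-9]{3})"
    r"(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?"
)

# Data property from the text, with its domain and range.
DATA_PROPERTIES = {
    "cdm:case-law_affaire_year": {"domain": "cdm:case-law", "range": "xsd:gYear"},
}

# One checker per supported datatype (only xsd:gYear in this sketch).
RANGE_CHECKS = {"xsd:gYear": lambda value: GYEAR.fullmatch(value) is not None}


def check(subject_class, prop, value):
    """Return a list of problems; an empty list means the statement is valid."""
    spec = DATA_PROPERTIES[prop]
    problems = []
    if subject_class != spec["domain"]:
        problems.append(f"domain must be {spec['domain']}, got {subject_class}")
    if not RANGE_CHECKS[spec["range"]](value):
        problems.append(f"value {value!r} is not a valid {spec['range']}")
    return problems


print(check("cdm:case-law", "cdm:case-law_affaire_year", "2024"))  # []
print(check("cdm:case-law", "cdm:case-law_affaire_year", "24"))
# ["value '24' is not a valid xsd:gYear"]
```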
What is cardinality in the CDM?
As mentioned in the previous sections, properties can define relationships between concepts and constrain the format of metadata according to specific rules defined in the CDM. Another possibility that the CDM gives is to define cardinality for properties.
For example, the CDM makes it possible to define a property cdm:item_belongs_to_manifestation and give it a cardinality of “minimum 1” (1 to N). Because this is defined as a restriction in the CDM, an instance has to be linked to at least 1 “manifestation” through the property cdm:item_belongs_to_manifestation to be considered an “item”. Likewise, to be considered a “manifestation”, an instance has to be linked to exactly 1 “expression”. Cardinality also applies to data properties, for example by requiring exactly 1 “manifestation type” in the rdfs:Literal format.
In other words, what the illustration below tells us is that for an instance in the Cellar to be considered a manifestation, it must be the manifestation of exactly 1 “expression”, and it needs to have a type formatted in a textual string for example.
Bear in mind that we only show a selection of examples here and that cardinality and property restrictions are a lot more complex than that in the CDM, with dozens of them defined for each class.
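The cardinality restrictions discussed in this section can be sketched as a table of minimum and maximum counts checked against an instance's triples. This is an illustrative simplification rather than the way OWL restrictions are evaluated in the Cellar. The property cdm:item_belongs_to_manifestation comes from the text. The two manifestation property names and the `ex:` instances are assumptions made for the example.

```python
from collections import Counter

# (class, property) -> (min, max); a max of None means "unbounded".
RESTRICTIONS = {
    ("item", "cdm:item_belongs_to_manifestation"): (1, None),
    # The two property names below are assumed for illustration.
    ("manifestation", "cdm:manifestation_manifests_expression"): (1, 1),
    ("manifestation", "cdm:manifestation_type"): (1, 1),
}


def violations(instance, cls, triples):
    """Return (property, actual_count) for every restriction that is not met."""
    counts = Counter(p for s, p, _ in triples if s == instance)
    found = []
    for (restricted_cls, prop), (lo, hi) in RESTRICTIONS.items():
        if restricted_cls != cls:
            continue
        n = counts[prop]
        if n < lo or (hi is not None and n > hi):
            found.append((prop, n))
    return found


triples = [
    ("ex:m1", "cdm:manifestation_manifests_expression", "ex:e1"),
    ("ex:m1", "cdm:manifestation_type", "pdf"),
    ("ex:m2", "cdm:manifestation_manifests_expression", "ex:e1"),
    ("ex:m2", "cdm:manifestation_manifests_expression", "ex:e2"),
    ("ex:i1", "cdm:item_belongs_to_manifestation", "ex:m1"),
]

print(violations("ex:m1", "manifestation", triples))  # []
print(violations("ex:m2", "manifestation", triples))
# [('cdm:manifestation_manifests_expression', 2), ('cdm:manifestation_type', 0)]
```

Here `ex:m2` fails twice: it manifests two expressions instead of exactly one, and it has no manifestation type.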
How do we assess the quality of the metadata?
One key principle of the CDM is to ensure a high level of quality for the metadata while responding to the evolving needs of the businesses. To do so, different tests are implemented to guarantee this level of quality.
One of them consists of making sure the metadata in the CDM corresponds to our standards. This is done, for example, by using a consistent and normalized structure across versions of the CDM, by systematically defining classes and properties, and by applying naming conventions.
With these tests, we also want to ensure the backward compatibility of the CDM, so that business requirements are met without breaking existing functionalities when code is added or changed.