Publications Office of the EU
Resource alignments: Main elements of a proper methodology - EU Vocabularies

Resource alignments: Main elements of a proper methodology

This article describes the outcomes, deliverables and methodology of a resource alignment. The main idea is to automatically align two RDF (mainly SKOS) datasets by comparing their lexical content. The expected result is a set of resource pairs, with one resource from each dataset, that are considered the same or similar with varying degrees of confidence.

Goals

  • One or more alignment files (in SKOS or EDOAL format)
  • One or more files containing evaluation samples
  • A report describing the preliminary dataset assessment, the designed process and parameters, the output alignment files and a final assessment

Methodology

Preliminary assessment

In this step the asset pair, or set of asset pairs, is established and its initial state is assessed to determine whether it is suitable as input for the automatic alignment software. Attention shall be paid to technical and content quality, available languages, presence of duplicates, encoding, the estimated pre-processing operations and other aspects. At this step it is important to document the initial state of the resources, their business relevance, some of their history and their internal structure, then to describe the intended final outcomes and to enumerate the operations to be performed.
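Part of this assessment can be scripted. The following is a minimal sketch, assuming the labels have already been extracted from the dataset as (concept URI, language, label) tuples; the function name `assess_labels` and the `ex:` identifiers are hypothetical and serve only as an illustration. It counts labels per language and flags concepts that share the same label in the same language, which are candidate duplicates.

```python
from collections import Counter, defaultdict


def assess_labels(labels):
    """Summarise (concept_uri, language, label) tuples for a first assessment."""
    # Language coverage: how many labels exist per language tag.
    per_language = Counter(lang for _, lang, _ in labels)

    # Group concepts by (language, trimmed and case-folded label)
    # so that labels differing only in case or outer spacing collide.
    by_label = defaultdict(set)
    for uri, lang, text in labels:
        by_label[(lang, text.strip().casefold())].add(uri)

    # Keep only the labels shared by more than one concept.
    duplicates = {
        key: sorted(uris) for key, uris in by_label.items() if len(uris) > 1
    }
    return {"labels_per_language": dict(per_language), "duplicates": duplicates}


# Illustrative data only, not taken from a real vocabulary.
sample = [
    ("ex:c1", "en", "Fisheries"),
    ("ex:c2", "en", "fisheries "),
    ("ex:c3", "fr", "Pêche"),
]
report = assess_labels(sample)
```

The resulting dictionary can be copied into the report as evidence of the initial state of each dataset.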

Pre-processing

Based on the initial assessment the input datasets are cleaned up, normalised and transformed into a form suitable for the automatic alignment software.

Useful tools during the pre-processing phase are:

  • VocBench3: Sheet2RDF tool
  • KNIME
  • LinkedPipes ETL
  • SKOS Play from Sparna
  • OpenRefine
  • Custom Python scripts
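As an illustration of what a custom Python script might do in this phase, here is a minimal sketch of label normalisation using only the standard library. The function name `normalise_label` and the choice of steps (Unicode NFKC normalisation, whitespace collapsing, case folding) are assumptions; the actual steps depend on the findings of the preliminary assessment.

```python
import re
import unicodedata


def normalise_label(text):
    """Return a comparison-friendly form of a label."""
    # NFKC folds compatibility characters (ligatures, no-break spaces,
    # full-width forms) into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace into a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # casefold() is a more aggressive, language-neutral variant of lower().
    return text.casefold()


cleaned = normalise_label("  Common\u00a0 Fisheries\tPolicy ")
```

Applying the same function to both datasets keeps the later comparison from being skewed by differences in spacing, capitalisation or encoding.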

Alignment design

The following parameters of the project are established in this step:

  • Main inputs: a pair of datasets or, in the case of batch alignments, many-to-one or one-to-many sets (many-to-many alignments should be avoided)
  • Main outputs: SKOS and/or EDOAL formats
  • Matching rules:
    • Exact matches: based only on a perfect equality operator (one output file expected) OR
    • Close matches: based on a designed comparison operator (multiple output files expected, one per degree of confidence: high, medium, low)
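The two matching rules can be sketched as follows. This is an illustration only, not the operator used by the alignment software: it assumes string similarity from Python's `difflib.SequenceMatcher`, and the threshold values in `THRESHOLDS` are hypothetical and would have to be tuned against the evaluation samples.

```python
from difflib import SequenceMatcher

# Hypothetical confidence thresholds, checked from strictest to loosest.
THRESHOLDS = [("high", 0.95), ("medium", 0.85), ("low", 0.75)]


def classify(label_a, label_b):
    """Return 'exact', a confidence level, or None when there is no match."""
    # Exact match: perfect equality only.
    if label_a == label_b:
        return "exact"
    # Close match: similarity ratio in [0, 1] mapped to a confidence level.
    score = SequenceMatcher(None, label_a, label_b).ratio()
    for level, minimum in THRESHOLDS:
        if score >= minimum:
            return level
    return None


exact = classify("fisheries policy", "fisheries policy")
close = classify("fishery policy", "fisheries policy")
unrelated = classify("fisheries policy", "air transport")
```

Pairs can then be written to separate output files according to the returned level, which matches the expectation of one output per degree of confidence.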

Comparison operator(s) design

The operators are encoded in Silk Workbench as a Linking Task.

The main fields considered by the alignment comparison operator are linguistic in nature. This means that aspects such as language, word, spacing, sequencing, capitalisation, script, encoding and transliteration shall be taken into consideration. For SKOS datasets, which most inputs are expected to be, the following properties are considered of primary relevance (with various weights):

  • skos:prefLabel, skos:altLabel
  • skos:definition, skos:scopeNote
  • rdfs:label, rdfs:comment
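How per-property weights might combine into a single score is sketched below. This is not the Silk configuration itself: the weight values in `WEIGHTS`, the function names and the use of `difflib.SequenceMatcher` are all assumptions made for illustration. Each concept is represented as a dictionary that maps a property name to a list of already normalised string values.

```python
from difflib import SequenceMatcher

# Hypothetical weights; labels count more than descriptive notes.
WEIGHTS = {
    "skos:prefLabel": 0.5,
    "skos:altLabel": 0.2,
    "rdfs:label": 0.1,
    "skos:definition": 0.1,
    "skos:scopeNote": 0.05,
    "rdfs:comment": 0.05,
}


def best_similarity(values_a, values_b):
    """Highest pairwise similarity between two lists of strings."""
    return max(
        (SequenceMatcher(None, a, b).ratio() for a in values_a for b in values_b),
        default=0.0,
    )


def weighted_score(concept_a, concept_b, weights=WEIGHTS):
    """Weighted average similarity over properties present on both sides."""
    total = 0.0
    used = 0.0
    for prop, weight in weights.items():
        values_a = concept_a.get(prop, [])
        values_b = concept_b.get(prop, [])
        if not values_a or not values_b:
            continue  # a property missing on either side is ignored
        total += weight * best_similarity(values_a, values_b)
        used += weight
    # Dividing by the weight actually used keeps the score in [0, 1].
    return total / used if used else 0.0


a = {"skos:prefLabel": ["fisheries"], "skos:altLabel": ["fishing industry"]}
b = {"skos:prefLabel": ["fisheries"]}
score = weighted_score(a, b)
```

Renormalising by the weight actually used is one possible design choice: it avoids penalising concepts that lack optional properties such as skos:scopeNote.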


When designing the alignment procedure, consider the relevant factors from the systematisation presented below.
