LINK CONSTRUCTION ¶
Linking co-referent entities across a variety of datasources is a pragmatic and fast way to navigate seamlessly across datasets without having to agree on a uniform vocabulary. This solution, offered by the Semantic Web architecture, is attractive because the ultimate goal of the researcher executing this task is not the integration of data but the extraction of the information needed to reach valid conclusions about the problems under scrutiny. The Lenticular Lens offers the means to reach that goal while making sure that the steps taken by the researcher are documented, so that other researchers can easily regenerate the data leading to specific conclusions if need be.
Along the way of entity-based data integration and data extraction, the Lenticular Lens aims to document among others:
- The datasources to integrate;
- The reasons behind a specific integration;
- The entity types and restrictions that ensure correctness in bridging across datasources of interest;
- The matching methods and specifications justifying the existence of a set of links.
The Lenticular Lens tool aims to provide generic methods that serve a broader audience with the same need for data integration. The first step in creating and documenting links using the Lenticular Lens is defining the scope in which links are to be created and possibly validated. For that, the tool offers the RESEARCH menu, followed by the SELECT and CONSTRUCT menus.
We now go through each of these first three menus underlying the existence of links.
1. Scope¶
The RESEARCH menu is the starting point in learning how to interact with the Lenticular Lens tool. In general, a research question somehow sets the scope in which link creations, manipulations or validation take place. This provides the first building block supporting the user with defining the context in which a particular alignment is generated. Using this menu, part of the context is made explicit by selecting the datasets and entity types necessary to continue the investigation.
As an overview, the RESEARCH menu provides researchers with means to describe the research of interest in terms of:
- Research Question for inserting the main research question driving the integration.
- Hypothesis for pointing out the hypothesis in mind prior to the data extraction.
- Link and Citation to ensure that, if the results happen to be published, the researcher still has the facility to add a link to the publication and a bibliographic reference for future reuse.
Fig. 4.1 illustrates the different fields to be filled in by the researcher to give a quick overview of what can happen in this research project and why. Once the information has been provided, the Save button at the bottom of the page can be clicked to save it and exit the Lenticular Lens if the user wishes to continue with other tasks. Alternatively, should the user choose to continue, the Save and next button can be used to save the project and move to the next window.
2. Data¶
In the previous step, the researcher defined the scope of the research for which data are to be extracted and analysed. In this second window, labelled SELECT, the user describes and selects the entity types involved in the research. For that, the location of the datasource needs to be provided and the datasets in which the respective entities of interest reside need to be selected.
2.1 Data Selection¶
As the user activates the Save and next button at the end of the previous page, she is presented with a new window with a single card labelled Entity-Type Selection 1, as presented in Fig. 4.2. The plus button at the right side of the picture enables the user to create new cards when needed, while the arrow-head button at the left side of the card’s label unfolds the card, as displayed in Fig 3.
Describing the type of an entity can be done using the Description text box for each entity type. To provide the location of the data, the GraphQL Endpoint text box can be used to fill in the URL of any GraphQL endpoint. Once the endpoint is given and loaded, a dataset can be selected from the list of datasets available at the provided endpoint. The selection of a dataset prompts a new dropdown box labelled Entity type, with which the user selects the entity type of interest. After loading the provided URL of the default Golden Agent’s endpoint, Fig. 4.3 shows the list of datasets available at that location to choose from.
2.2 Data Restrictions¶
If entities need to be filtered based on specific conditions, this is possible with the Filter card shown in Fig. 4.6.
Once the button is clicked, this card presents the user with a Filter-Logic box which enables the creation of relatively complex and versatile entity restrictions. Fig. 4.7, for example, shows the list of available filtering options, while Fig. 4.8 illustrates an example where the has minimum date and has maximum date filtering options are used to isolate entities of interest. The selected entities are those with a registration date in the interval [1600, 1659] whose literal name is free of trailing dots (…).
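The combination of restrictions described for Fig. 4.8 can be sketched in a few lines of Python. This is only an illustration of the filter logic, not the tool's implementation: the records, the field names `name` and `registered`, and the helper functions are hypothetical, and the interval [1600, 1659] is assumed to cover whole years.

```python
from datetime import date

# Hypothetical records; field names are illustrative only.
entities = [
    {"id": "p1", "name": "Jan Jansz", "registered": date(1612, 5, 3)},
    {"id": "p2", "name": "Pieter ...", "registered": date(1630, 1, 15)},
    {"id": "p3", "name": "Anna Claes", "registered": date(1702, 8, 21)},
]

def has_minimum_date(entity, lower):
    """True when the registration date is on or after the lower bound."""
    return entity["registered"] >= lower

def has_maximum_date(entity, upper):
    """True when the registration date is on or before the upper bound."""
    return entity["registered"] <= upper

def does_not_contain(entity, fragment):
    """True when the name does not contain the given character sequence."""
    return fragment not in entity["name"]

def keep(entity):
    # The three conditions are joined with AND, as in the Filter-Logic box.
    return (has_minimum_date(entity, date(1600, 1, 1))
            and has_maximum_date(entity, date(1659, 12, 31))
            and does_not_contain(entity, "..."))

selected = [e["id"] for e in entities if keep(e)]
print(selected)
```

Only the first record passes: the second has dots in its name and the third falls outside the date interval.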
2.2.1 Restriction options¶
- Equal to / Not Equal to. This option selects entities whose value for a certain property is equal (or not) to a certain value. For example, all entities with property ex:workLocation equal (or not) to Amsterdam.
- Contains / Does not contain. This option ensures that the property value of the entities of interest contains, or does not contain, a specific sequence of characters. For example, %...% could be used (i) to exclude people whose names contain trailing dots or (ii) to select those entities in order to apply a particular modification to their names, like adding the surname of the father for a baptised child whose surname is given as "...".
- Has property / Has no property. This option selects entities based on the existence (or not) of a certain property. Assume, for example, that the user is interested in entities that are parents. This option allows one to keep all entities for which a value exists for the property ex:parentOf. Conversely, it excludes all entities that are parents if the option Has no property is used instead.
- Has minimum / maximum value. This option restricts entities to those whose value for a user-specified property of type number lies within or outside a specified range. To delimit both upper and lower bounds, the user can combine minimum and maximum using the logical box AND.
- Has minimum / maximum date. This option restricts entities to those whose value for a user-specified property of type date lies within or outside a specified range. Within this option, a date format can be specified. The default format is YYYY-MM-DD. The values 10, 300 and 1990, for example, will be considered as years, while 10-1, 300-1 and 1990-1 will be considered as the first month of the respective years. To delimit both upper and lower bounds, the user can combine minimum and maximum using the logical box AND.
- Has minimum / maximum appearances. This option restricts entities to those for which a given property value occurs a number of times within a specified range. For example, to avoid an excessive number of possible matches, one can require that only entities whose name value occurs fewer than 5 times in the dataset be included. To delimit both upper and lower bounds, the user can combine minimum and maximum using the logical box AND.
- In set. This option filters a collection of resources of interest based on a set of resources. This set of resources is not provided manually but is obtained from a list of existing linksets or lenses. The example below provides a detailed understanding of this filtering approach.
Example 1: IN SET
Two collections A and B, matched via whatever method, would create a set of links labelled linkset-AB. However, we are only interested in a subset of linkset-AB, such that its resources (subject, object or both) are present in another given set, namely an input-linkset I. For efficiency purposes, linkset-AB does not need to be fully created and filtered afterwards. Instead, the collections A and/or B are filtered such that A’ = A ∩ I and/or B’ = B ∩ I before executing the matching algorithm.
######################################################
# Linksets as named graphs #
######################################################
ex:input-linkset
{
A:Chiara owl:sameAs C:Latronico .
A:Al owl:sameAs C:Al_Idrissou .
A:Al owl:sameAs C:Al_Koudous .
}
ex:linkset-AB
{
A:Chiara owl:sameAs B:Chiara .
A:Kerim owl:sameAs B:Kerim .
}
######################################################
# In Resource Set #
######################################################
### The set S of resources from input-linkset is:
### S = {A:Chiara, A:Al, C:Latronico, C:Al_Idrissou, C:Al_Koudous}
ex:linkset-SubjectInSet
{
A:Chiara owl:sameAs B:Chiara .
}
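The IN SET example above can be reproduced with a small Python sketch. It assumes that linksets are plain lists of (subject, object) pairs and that collection A also contains A:Al. The helper names and the "same local name" matcher are hypothetical stand-ins for a real matching method.

```python
def resources_of(linkset):
    """Collect every subject and object appearing in a linkset."""
    return {resource for link in linkset for resource in link}

def in_set_filter(collection, resource_set):
    """Keep only the resources of the collection that occur in the set."""
    return [r for r in collection if r in resource_set]

def local_name(resource):
    """Return the part after the namespace prefix, e.g. 'Chiara'."""
    return resource.split(":", 1)[1]

input_linkset = [
    ("A:Chiara", "C:Latronico"),
    ("A:Al", "C:Al_Idrissou"),
    ("A:Al", "C:Al_Koudous"),
]
collection_a = ["A:Chiara", "A:Kerim", "A:Al"]  # assumed content of A
collection_b = ["B:Chiara", "B:Kerim"]          # assumed content of B

s = resources_of(input_linkset)
a_prime = in_set_filter(collection_a, s)  # A' = A ∩ S, computed before matching

# Stand-in matcher: link resources that share the same local name.
linkset_subject_in_set = [
    (a, b) for a in a_prime for b in collection_b
    if local_name(a) == local_name(b)
]
print(linkset_subject_in_set)
```

Because A is filtered before matching, the pair A:Kerim / B:Kerim is never compared, which is the efficiency gain the example describes.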
2.3 Data Exploration¶
At this point, successfully providing the required information (Dataset and Entity-Type) triggers the appearance of the Explore Sample button at the right side of the card’s label (Entity-type selection 1), as displayed in Fig. 4.4. As illustrated in Fig. 4.5, this button lets users explore information of their choice about the entities of interest by selecting properties describing them. Keep in mind that this feature is only intended as an exploration aid for verifying the choices made (dataset, entity-type and restrictions).
3. Matching in Practice¶
Now that we have gone through the available matching methods and how to combine them in the Lenticular Lens, we show their application in some case studies aligning resources stemming from various datasources of one’s choice. We also provide examples of the RDF export of the resulting linksets with metadata. For this purpose we choose the Turtle format as syntax and RDF-star for reification.
3.1 Simple Methods¶
This case-study section aims to showcase matching problems involving a SINGLE matching method (Embedded, Exact, Intermediate, Levenshtein Distance, Soundex Distance, Gerrit Bloothooft, Word Intersection, List Intersection, Numbers and TeAM) run over one or multiple datasets.
We call them Simple Methods as opposed to the Complex Methods illustrated in the sequel. Keep in mind that the terms Simple and Complex refer to the use of single or combined methods and not to the algorithmic complexity of the underlying method(s).
Case-1: Grid¶
In this case study, displayed in Fig 4.10, the goal is to find out whether there exist duplicate Education instances within the GRID dataset. The dataset is composed of nine types of institutions, including 27715 Companies, 19353 Educations, 12547 Nonprofit institutes, 12465 Healthcare institutes, 8499 Facility institutes, 5762 Government institutes, 2724 Archive institutes and 7823 institutes with no type specified. Although the dataset contains multiple types of entities, the case study here aims only to deduplicate instances of type Education. This is depicted in Fig 4.10, where the Sources and Targets cards are GRID[Education], showing that the entity type Education has been selected within the GRID dataset.
Case-1: Linkset Specifications
Also in the Matching Methods card, it can be seen that on both sides (source and target) two properties are selected for checking whether duplicates exist. This check relies on whether there exist entities that are documented within the GRID dataset with similar names using rdfs_label and skos_prefLabel. As the similarity score is measured in the interval 0 (not similar) to 1 (exactly similar), the threshold defined as 0.9 ensures that only paired entities with a high similarity (0.9 or above) are accepted.
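To make the role of the threshold concrete, the following Python sketch computes a normalised edit-distance similarity and applies the 0.9 acceptance rule. It is a minimal sketch, not the tool's code: normalising by the length of the longer string is an assumption, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1 means identical (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest

def is_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Accept the pair when the similarity reaches the threshold."""
    return similarity(a, b) >= threshold

print(levenshtein("kitten", "sitting"))
print(is_match("University of Amsterdam", "University of Amsterdan"))
```

With this normalisation, one edit in a 23-character name still passes the 0.9 threshold, whereas kitten and sitting (three edits over seven characters) do not.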
The same card shows the selected algorithm as Levenshtein Distance, which is run over the selected predicates generating 1,692 distinct links as shown in the statistics card (on the top). The latter card also provides statistics on:
- The number of entities at the Source and Target. In this particular case, there are over 19K educational institutes at both source and target, as they are the same dataset. Such information provides hints on the maximum number of links to expect in the worst-case scenario as well as an idea of how long the algorithm could take to run.
- The number of entities matched at the subject and object positions.
- The number of clusters derived from the links found. Here, this provides a potentially better picture of the number of real entities, as co-referent entities are grouped together in clusters of various sizes.
- The Runtime durations, informing on the elapsed time for (1) finding links and for (2) clustering them.
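One common way to derive clusters from a set of links is to treat the links as edges of an undirected graph and take its connected components. The sketch below does this with a small union-find structure. It is an assumption that the tool's clusters correspond to connected components, and the function name is hypothetical. The identifiers are taken from the sample links shown further down this page.

```python
def cluster(links):
    """Group linked resources into connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for subject, obj in links:
        root_s, root_o = find(subject), find(obj)
        if root_s != root_o:
            parent[root_o] = root_s  # merge the two components

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

links = [
    ("grid.1017.7", "grid.501980.5"),
    ("grid.10215.37", "grid.10347.31"),
    ("grid.10215.37", "grid.10595.38"),
    ("grid.10347.31", "grid.4462.4"),
]
for members in cluster(links):
    print(sorted(members))
```

Four links over six resources collapse into two clusters here, which illustrates why the cluster count can give a better estimate of the number of real entities than the link count.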
In this example (Image 1), we deliberately choose two properties at both the Source and Target datasets for the deduplication. Choosing more than one property for either the Source or the Target triggers a combination of pairwise property-value matchings joined with the logic operator OR. For example, choosing properties x and y at the source while choosing only z at the target triggers the following pairwise combinations: (x AND z) OR (y AND z).
In the current use-case, choosing for example rdfs_label and skos_prefLabel at both Source and Target generates the following combination: (rdfs_label AND rdfs_label) OR (rdfs_label AND skos_prefLabel) OR (skos_prefLabel AND skos_prefLabel). This explicit combination is implemented as an alternative complex method in the next section, where three executions of the Levenshtein Distance algorithm are required instead of one.
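The pairwise combinations can be enumerated with a short Python sketch. It is one possible reading of the behaviour described above, not the tool's implementation. The function names are hypothetical, and collapsing mirrored pairs when source and target use the same selection is an assumption made to reproduce the three combinations listed for this use-case.

```python
from itertools import product

def property_pairs(source_props, target_props, same_selection=False):
    """List the (source property, target property) pairs to compare.

    When source and target are the same selection (deduplication),
    mirrored pairs such as (x, y) and (y, x) are collapsed into one.
    """
    pairs = []
    for s, t in product(source_props, target_props):
        pair = tuple(sorted((s, t))) if same_selection else (s, t)
        if pair not in pairs:
            pairs.append(pair)
    return pairs

def matches(source, target, pairs, compare):
    """OR over all property pairs: one successful comparison is enough."""
    return any(compare(source[s], target[t]) for s, t in pairs)

print(property_pairs(["x", "y"], ["z"]))
labels = ["rdfs_label", "skos_prefLabel"]
grid_pairs = property_pairs(labels, labels, same_selection=True)
print(grid_pairs)

# Stand-in comparison (case-insensitive equality) instead of an edit distance.
same_text = lambda a, b: a.lower() == b.lower()
source = {"rdfs_label": "Leiden University", "skos_prefLabel": "Universiteit Leiden"}
target = {"rdfs_label": "LEI", "skos_prefLabel": "leiden university"}
print(matches(source, target, grid_pairs, same_text))
```

Collapsing mirrored pairs is only safe if every pair of entities is eventually compared in both roles, which is assumed here for a deduplication run.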
Case-1: RDF Results.
This section provides the complete metadata of the resulting Linkset for the specification above in Example 4.15, plus a sample of 9 links due to space limitation.
From this metadata, a range of general statistics on the linkset can be obtained, such as the number of distinct triples, entities or clusters, the number of links accepted or rejected and more.
The metadata also presents a detailed description of the methods used to generate the links. For example, for each algorithm used, a URI and description are provided. An algorithm can be used in one or more methods, each specified by the link acceptance threshold, the range of the similarity score, and the datasets, data-types and predicate URIs used for finding links.
Furthermore, a specific annotation is provided in RDF-star format for each generated link. In this example, we have the strength of the link and whether the link has been validated (accepted, rejected or not_validated).
Case-1: Turtle file sample
### PREDEFINED SHARED NAMESPACES ###
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix format: <http://www.w3.org/ns/formats/> .
@prefix pav: <http://purl.org/ontology/similarity/> .
@prefix cc: <http://creativecommons.org/ns#> .
### PREDEFINED SPECIFIC NAMESPACES ###
@prefix ll: <http://data.goldenagents.org/ontology/> .
@prefix ll_algo: <http://data.goldenagents.org/ontology/matching-method/> .
@prefix ll_val: <http://data.goldenagents.org/ontology/validation/> .
@prefix linkset: <http://data.goldenagents.org/resource/linkset/> .
@prefix dataset: <http://data.goldenagents.org/resource/dataset/> .
### AUTOMATED NAMESPACES ###
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix institutes_S1: <http://www.grid.ac/institutes/> .
###########################################
# GENERIC METADATA #
###########################################
linkset:Grid
a void:Linkset ;
cc:attributionName "LenticularLens" ;
void:feature format:Turtle ;
cc:license <http://purl.org/NET/rdflicense/W3C1.0> ;
ll:has-logic-formulation <http://data.goldenagents.org/resource/PHbb54a8dab0d2954> ;
void:linkPredicate skos:exactMatch ;
void:subjectsTarget <http://data.goldenagents.org/resource/dataset/Grid> ;
void:objectsTarget <http://data.goldenagents.org/resource/dataset/Grid> ;
dcterms:description "Deduplication of entities of type Education in the GRID dataset"@en ;
void:triples 1692 ;
void:entities 1737 ;
void:distinctSubjects 1737 ;
void:distinctObjects 1737 ;
ll:has-clusters 619 ;
ll_val:has-validations 18 ;
ll_val:has-accepted 3 ;
ll_val:has-rejected 6 ;
ll_val:has-remaining 1683 .
#############################################
# LOGIC FORMULA PARTS #
#############################################
<http://data.goldenagents.org/resource/PHbb54a8dab0d2954>
a ll:LogicFormulation ;
ll:has-method <http://data.goldenagents.org/resource/Normalised-EditDistance-H30d57e26e41bb04> ;
ll:has-formula-description """<http://data.goldenagents.org/resource/Normalised-EditDistance-H30d57e26e41bb04>
""" .
#############################################
# METHOD SIGNATURES #
#############################################
### ll_algo:Normalised-EditDistance ###
<http://data.goldenagents.org/resource/Normalised-EditDistance-H30d57e26e41bb04>
a ll:MatchingMethod ;
ll:has-algorithm ll_algo:Normalised-EditDistance ;
ll:has-threshold 0.9 ;
ll:has-threshold-range "]0, 1]" ;
ll:has-threshold-acceptance-operator <http://data.goldenagents.org/resource/Greater-than-or-equal-to> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHab504e102405ab0> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH0d712649af643f3> ;
ll:has-obj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHab504e102405ab0> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH0d712649af643f3> .
#############################################
# METHOD DESCRIPTIONS #
#############################################
ll_algo:Normalised-EditDistance
a ll:MatchingAlgorithm ;
dcterms:description """
This method is used to align source and target’s IRIs whenever the similarity score of their respective
user selected property values are above a given Levenshtein (edit) Distance threshold.
Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by
counting the minimum number of operations ε (removal, insertion, or substitution of a character in the
string) required to transform one string into the other. For example, the Levenshtein distance between
kitten and sitting is ε = 3 as it requires a two substitutions (s for k and i for e) and one insertion
of g at the end [https://en.wikipedia.org/wiki/Edit_distance].
"""@en .
#############################################
# DATASET AND ENTITY SELECTIONS #
#############################################
### ENTITY SELECTION [SOURCE] N0: 1 ###
<http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2>
a ll:EntitySelection ;
ll:has-dataset <http://data.goldenagents.org/resource/dataset/Grid> ;
ll:has-entity-type <http://www.grid.ac/ontology/Education> .
#############################################
# PREDICATE SELECTIONS #
#############################################
### PREDICATE SELECTED [SOURCE] N0: 1 ###
<http://data.goldenagents.org/resource/PredicateSelection-PHab504e102405ab0>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-predicate <http://www.w3.org/2000/01/rdf-schema#label> .
### PREDICATE SELECTED [SOURCE] N0: 2 ###
<http://data.goldenagents.org/resource/PredicateSelection-PH0d712649af643f3>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-predicate <http://www.w3.org/2004/02/skos/core#prefLabel> .
###########################################
# ANNOTATED LINKSET #
###########################################
linkset:Grid
{
<<institutes_S1:grid.1017.7 skos:exactMatch institutes_S1:grid.501980.5>>
ll_val:has-validation "rejected" ;
ll:has-matching-strength 0.933 .
<<institutes_S1:grid.1019.9 skos:exactMatch institutes_S1:grid.449929.b>>
ll_val:has-validation "accepted" ;
ll:has-matching-strength 1 .
<<institutes_S1:grid.1020.3 skos:exactMatch institutes_S1:grid.266826.e>>
ll_val:has-validation "not_validated" ;
ll:has-matching-strength 1 .
<<institutes_S1:grid.10215.37 skos:exactMatch institutes_S1:grid.10347.31>>
ll_val:has-validation "rejected" ;
ll:has-matching-strength 0.950 .
<<institutes_S1:grid.10215.37 skos:exactMatch institutes_S1:grid.10595.38>>
ll_val:has-validation "rejected" ;
ll:has-matching-strength 0.900 .
<<institutes_S1:grid.10215.37 skos:exactMatch institutes_S1:grid.4462.4>>
ll_val:has-validation "rejected" ;
ll:has-matching-strength 0.900 .
<<institutes_S1:grid.10347.31 skos:exactMatch institutes_S1:grid.10595.38>>
ll_val:has-validation "rejected" ;
ll:has-matching-strength 0.900 .
<<institutes_S1:grid.10347.31 skos:exactMatch institutes_S1:grid.441173.4>>
ll_val:has-validation "rejected" ;
ll:has-matching-strength 0.900 .
<<institutes_S1:grid.10347.31 skos:exactMatch institutes_S1:grid.4462.4>>
ll_val:has-validation "accepted" ;
ll:has-matching-strength 0.900 .
• • •
}
3.2 Complex Methods¶
Case-1: Alternative¶
In Fig 4.11 is displayed an alternative where
Case-1: Linkset Specifications
Case-1: RDF Results
Case-1: Turtle file sample
###########################################
# NAMESPACES #
###########################################
### PREDEFINED SHARED NAMESPACES
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix format: <http://www.w3.org/ns/formats/> .
@prefix pav: <http://purl.org/ontology/similarity/> .
@prefix cc: <http://creativecommons.org/ns#> .
### PREDEFINED SPECIFIC NAMESPACES
@prefix ll: <http://data.goldenagents.org/ontology/> .
@prefix ll_algo: <http://data.goldenagents.org/ontology/matching-method/> .
@prefix ll_val: <http://data.goldenagents.org/ontology/validation/> .
@prefix linkset: <http://data.goldenagents.org/resource/linkset/> .
@prefix dataset: <http://data.goldenagents.org/resource/dataset/> .
### AUTOMATED NAMESPACES
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix institutes_S1: <http://www.grid.ac/institutes/> .
##############################################################################################################
# GENERIC METADATA #
##############################################################################################################
linkset:Grid_2
a void:Linkset ;
cc:attributionName "LenticularLens" ;
void:feature format:Turtle ;
cc:license <http://purl.org/NET/rdflicense/W3C1.0> ;
ll:has-logic-formulation <http://data.goldenagents.org/resource/PH1ec0ee6f368dd62> ;
void:linkPredicate skos:exactMatch ;
void:subjectsTarget <http://data.goldenagents.org/resource/dataset/Grid> ;
void:objectsTarget <http://data.goldenagents.org/resource/dataset/Grid> ;
dcterms:description "Deduplication of entities of type Education in the GRID dataset"@en ;
void:triples 1692 ;
void:entities 1737 ;
void:distinctSubjects 1737 ;
void:distinctObjects 1737 ;
ll:has-clusters 619 ;
ll_val:has-validations 18 ;
ll_val:has-accepted 3 ;
ll_val:has-rejected 6 ;
ll_val:has-remaining 1683 .
################################################################################
# LOGIC FORMULA PARTS #
################################################################################
<http://data.goldenagents.org/resource/PH1ec0ee6f368dd62>
a ll:LogicFormulation ;
ll:has-method <http://data.goldenagents.org/resource/Normalised-EditDistance-H779a0ad1b5e5f93> ;
ll:has-method <http://data.goldenagents.org/resource/Normalised-EditDistance-H3de4966a0b8aa01> ;
ll:has-method <http://data.goldenagents.org/resource/Normalised-EditDistance-H11cbb0cc77c44a9> ;
ll:has-formula-description """<http://data.goldenagents.org/resource/Normalised-EditDistance-H779a0ad1b5e5f93>
and (⊤min) <http://data.goldenagents.org/resource/Normalised-EditDistance-H3de4966a0b8aa01>
and (⊤min) <http://data.goldenagents.org/resource/Normalised-EditDistance-H11cbb0cc77c44a9>
""" .
################################################################################
# METHOD SIGNATURES #
################################################################################
### ll_algo:Normalised-EditDistance
<http://data.goldenagents.org/resource/Normalised-EditDistance-H779a0ad1b5e5f93>
a ll:MatchingMethod ;
ll:has-algorithm ll_algo:Normalised-EditDistance ;
ll:has-threshold 0.9 ;
ll:has-threshold-range "]0, 1]" ;
ll:has-threshold-acceptance-operator <http://data.goldenagents.org/resource/Greater-than-or-equal-to> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHab504e102405ab0> ;
ll:has-obj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHab504e102405ab0> .
### ll_algo:Normalised-EditDistance
<http://data.goldenagents.org/resource/Normalised-EditDistance-H3de4966a0b8aa01>
a ll:MatchingMethod ;
ll:has-algorithm ll_algo:Normalised-EditDistance ;
ll:has-threshold 0.9 ;
ll:has-threshold-range "]0, 1]" ;
ll:has-threshold-acceptance-operator <http://data.goldenagents.org/resource/Greater-than-or-equal-to> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHab504e102405ab0> ;
ll:has-obj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH0d712649af643f3> .
### ll_algo:Normalised-EditDistance
<http://data.goldenagents.org/resource/Normalised-EditDistance-H11cbb0cc77c44a9>
a ll:MatchingMethod ;
ll:has-algorithm ll_algo:Normalised-EditDistance ;
ll:has-threshold 0.9 ;
ll:has-threshold-range "]0, 1]" ;
ll:has-threshold-acceptance-operator <http://data.goldenagents.org/resource/Greater-than-or-equal-to> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH0d712649af643f3> ;
ll:has-obj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH0d712649af643f3> .
################################################################################
# METHOD DESCRIPTIONS #
################################################################################
ll_algo:Normalised-EditDistance
a ll:MatchingAlgorithm ;
dcterms:description """
This method is used to align source and target’s IRIs whenever the similarity score of their respective
user selected property values are above a given Levenshtein (edit) Distance threshold.
Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by
counting the minimum number of operations ε (removal, insertion, or substitution of a character in the
string) required to transform one string into the other. For example, the Levenshtein distance between
kitten and sitting is ε = 3 as it requires a two substitutions (s for k and i for e) and one insertion
of g at the end [https://en.wikipedia.org/wiki/Edit_distance].
"""@en .
################################################################################
# DATASET AND ENTITY SELECTIONS #
################################################################################
### ENTITY SELECTION [SOURCE] N0: 1 ###
<http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2>
a ll:EntitySelection ;
ll:has-dataset <http://data.goldenagents.org/resource/dataset/Grid> ;
ll:has-entity-type <http://www.grid.ac/ontology/Education> .
################################################################################
# PREDICATE SELECTIONS #
################################################################################
### PREDICATE SELECTED [SOURCE] N0: 1 ###
<http://data.goldenagents.org/resource/PredicateSelection-PHab504e102405ab0>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-predicate <http://www.w3.org/2000/01/rdf-schema#label> .
### PREDICATE SELECTED [TARGET] N0: 2 ###
<http://data.goldenagents.org/resource/PredicateSelection-PH0d712649af643f3>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH61bd543e4ce34c2> ;
ll:has-predicate <http://www.w3.org/2004/02/skos/core#prefLabel> .
##############################################################################################################
# ANNOTATED LINKSET #
##############################################################################################################
linkset:Grid_2
{
<<institutes_S1:grid.1017.7 skos:exactMatch institutes_S1:grid.501980.5>>
ll_val:has-validation "rejected" .
<<institutes_S1:grid.1019.9 skos:exactMatch institutes_S1:grid.449929.b>>
ll_val:has-validation "accepted" .
<<institutes_S1:grid.1020.3 skos:exactMatch institutes_S1:grid.266826.e>>
ll_val:has-validation "accepted" .
<<institutes_S1:grid.10215.37 skos:exactMatch institutes_S1:grid.10347.31>>
ll_val:has-validation "rejected" .
<<institutes_S1:grid.10215.37 skos:exactMatch institutes_S1:grid.10595.38>>
ll_val:has-validation "rejected" .
<<institutes_S1:grid.10215.37 skos:exactMatch institutes_S1:grid.4462.4>>
ll_val:has-validation "rejected" .
<<institutes_S1:grid.10347.31 skos:exactMatch institutes_S1:grid.10595.38>>
ll_val:has-validation "rejected" .
<<institutes_S1:grid.10347.31 skos:exactMatch institutes_S1:grid.441173.4>>
ll_val:has-validation "rejected" .
<<institutes_S1:grid.10347.31 skos:exactMatch institutes_S1:grid.4462.4>>
ll_val:has-validation "accepted" .
<<institutes_S1:grid.10373.36 skos:exactMatch institutes_S1:grid.266769.a>>
ll_val:has-validation "not_validated" .
• • •
}
Case-2: Getty¶
In Fig 4.12 is displayed an alternative where
Case-2: Linkset Specifications.
Fig 4.12: An example showing how to deduplicate a dataset using an edit distance with threshold 0.9.
Case-2: RDF Results.
Case-2: Turtle file sample.
###########################################
# NAMESPACES #
###########################################
### PREDEFINED SHARED NAMESPACES
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix format: <http://www.w3.org/ns/formats/> .
@prefix pav: <http://purl.org/ontology/similarity/> .
@prefix cc: <http://creativecommons.org/ns#> .
### PREDEFINED SPECIFIC NAMESPACES
@prefix ll: <http://data.goldenagents.org/ontology/> .
@prefix ll_algo: <http://data.goldenagents.org/ontology/matching-method/> .
@prefix ll_val: <http://data.goldenagents.org/ontology/validation/> .
@prefix linkset: <http://data.goldenagents.org/resource/linkset/> .
@prefix dataset: <http://data.goldenagents.org/resource/dataset/> .
### AUTOMATED NAMESPACES
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix institutes_S1: <http://www.grid.ac/institutes/> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix Person_S1: <http://goldenagents.org/uva/SAA/Person/> .
@prefix PersonName_T1: <https://data.goldenagents.org/datasets/SAA/PersonName/> .
################################################################################
# GENERIC METADATA #
################################################################################
linkset:Getty
a void:Linkset ;
cc:attributionName "LenticularLens" ;
void:feature format:Turtle ;
cc:license <http://purl.org/NET/rdflicense/W3C1.0> ;
ll:has-logic-formulation <http://data.goldenagents.org/resource/PH6d47b550d1695d5> ;
void:linkPredicate owl:sameAs ;
void:subjectsTarget <https://data.goldenagents.org/datasets/ufab7d657a250e3461361c982ce9b38f3816e0c4b/frick_collection_montias_data_20200604> ;
void:subjectsTarget <https://data.goldenagents.org/datasets/ufab7d657a250e3461361c982ce9b38f3816e0c4b/getty_provenance_index_montias_data_20200604> ;
void:objectsTarget <https://data.goldenagents.org/datasets/ufab7d657a250e3461361c982ce9b38f3816e0c4b/index_op_notarieel_archief_enriched_20191202> ;
dcterms:description "Deduplication of entities of type Education in the GRID dataset"@en ;
void:triples 147 ;
void:entities 261 ;
void:distinctSubjects 135 ;
void:distinctObjects 126 ;
ll:has-clusters 117 ;
ll_val:has-remaining 147 .
################################################################################
# LOGIC FORMULA PARTS #
################################################################################
<http://data.goldenagents.org/resource/PH6d47b550d1695d5>
a ll:LogicFormulation ;
ll:has-method <http://data.goldenagents.org/resource/Normalised-Soundex-H4970fc2fe79ea5f> ;
ll:has-method <http://data.goldenagents.org/resource/Exact-H918a02351d48ca9> ;
ll:has-method <http://data.goldenagents.org/resource/Time-Delta-Hdcc5070996853e9> ;
ll:has-formula-description """<http://data.goldenagents.org/resource/Normalised-Soundex-H4970fc2fe79ea5f>
and (⊤min) <http://data.goldenagents.org/resource/Exact-H918a02351d48ca9>
and (⊤min) <http://data.goldenagents.org/resource/Time-Delta-Hdcc5070996853e9>
""" .
################################################################################
# METHOD SIGNATURES #
################################################################################
ll_algo:Normalised-Soundex¶
<http://data.goldenagents.org/resource/Normalised-Soundex-H4970fc2fe79ea5f>
a ll:MatchingMethod ;
ll:has-algorithm ll_algo:Normalised-Soundex ;
ll:has-threshold 0.85 ;
ll:has-threshold-range "]0, 1]" ;
ll:has-threshold-acceptance-operator <http://data.goldenagents.org/resource/Greater-than-or-equal-to> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH4ba00e26b03e5dc> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PHb97c2bc9d29ba36> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH914a94c6f3c93b4> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH10aaeebb6832fdf> ;
ll:has-obj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH7879419327d373d> .
ll_algo:Exact¶
<http://data.goldenagents.org/resource/Exact-H918a02351d48ca9>
a ll:MatchingMethod ;
ll:has-algorithm ll_algo:Exact ;
ll:has-threshold 1 ;
ll:has-threshold-range "1" ;
ll:has-threshold-acceptance-operator <http://data.goldenagents.org/resource/Equal> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PHb97c2bc9d29ba36> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH4ba00e26b03e5dc> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH46d91d4f6e2209e> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH0df0471cb5df515> ;
ll:has-obj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHe3cb3236c5b11b1> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHb2a681013fbb430> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH9f7bd41ea902bd0> .
ll_algo:Time-Delta¶
<http://data.goldenagents.org/resource/Time-Delta-Hdcc5070996853e9>
a ll:MatchingMethod ;
ll:has-algorithm ll_algo:Time-Delta ;
ll:has-threshold 0 ;
ll:has-threshold-range "ℕ" ;
time:unitType time:unitYear ;
ll:has-threshold-acceptance-operator <http://data.goldenagents.org/resource/Equal> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PHb97c2bc9d29ba36> ;
ll:has-subj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH4ba00e26b03e5dc> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHeb98e7f77b22fce> ;
ll:has-subj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHf741a098569afb1> ;
ll:has-obj-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHee886bb3d021a6a> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PH715d032180bd40c> ;
ll:has-obj-predicate-selection <http://data.goldenagents.org/resource/PredicateSelection-PHe349eb18e5ba638> .
################################################################################
# METHOD DESCRIPTIONS #
################################################################################
ll_algo:Normalised-Soundex
a ll:MatchingAlgorithm ;
dcterms:description """
"Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter" [https://en.wikipedia.org/wiki/Soundex].
In the Lenticular Lens, Soundex is used as a normaliser, in the sense that an edit distance is run over the Soundex-coded version of a name. For example, in the table below, both Louijs Rocourt and Lowis Ricourt normalise to L200 R263, leading to an edit distance of 0 and a relative strength of 1. However, computing an edit distance directly over the same names results in an edit distance of 3 and a relative matching strength of 0.79.
--------------
-- Example -- THE USE OF SOUNDEX CODE FOR STRING APPROXIMATION
--------------
The example below shows the implementation of Soundex Distance
in the Lenticular Lens and how it compares with Edit Distance
over the original names (no soundex-based normalisation).
------------------------------------------------------------------------------------------------------------------------------------------------------
Source                      Target                     E. Dist  Rel. distance  Source soundex       Target soundex       Code E. Dist  Code Rel. Dist
------------------------------------------------------------------------------------------------------------------------------------------------------
Jasper Cornelisz. Lodder    Jaspar Cornelisz Lodder    2        0.92           J216 C654 L360       J216 C654 L360       0             1.0
Barent Teunis               Barent Teunisz gen. Drent  12       0.52           B653 T520            B653 T520 G500 D653  10            0.47
Louijs Rocourt              Louys Rocourt              2        0.86           L200 R263            L200 R263            0             1.0
Louijs Rocourt              Lowis Ricourt              3        0.79           L200 R263            L200 R263            0             1.0
Louys Rocourt               Lowis Ricourt              3        0.77           L200 R263            L200 R263            0             1.0
Cornelis Dircksz. Clapmus   Cornelis Clapmuts          10       0.6            C654 D620 C415       C654 C415            5             0.64
Geertruydt van den Breemde  Geertruijd van den Bremde  4        0.85           G636 V500 D500 B653  G636 V500 D500 B653
"""@en .
ll_algo:Exact
a ll:MatchingAlgorithm ;
dcterms:description """
Aligns the source and target IRIs whenever their respective user-selected property values are identical.
"""@en .
ll_algo:Time-Delta
a ll:MatchingAlgorithm ;
dcterms:description """
10.1 Time Delta. This function allows for finding co-referent entities on the basis of a minimum time difference between the times reported by the source and the target entities. For example, if the value zero is assigned to the time difference parameter, then, for a match to be found, the time of the target and that of the source must be exactly the same. While accounting for margins of error, one may consider a pair of entities to be co-referent if the real entities are born lambda days, months or years apart, among other things (similar name, place, ...).
"""@en .
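The Normalised-Soundex description above can be illustrated with a minimal Python sketch. It is not the Lenticular Lens implementation: the function names (`soundex`, `normalise`, `levenshtein`, `relative_strength`) are hypothetical, the Soundex variant is the standard American one, and the relative strength is assumed to be `1 - distance / length of the longer string`, an assumption chosen because it agrees with the example table for the rows checked below.

```python
# Sketch only: names and the strength formula are assumptions, not the tool's API.

# American Soundex digit groups; vowels, Y, H and W carry no digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {letter: digit for letters, digit in _GROUPS for letter in letters}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of one word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]  # assumption: ASCII names
    if not letters:
        return ""
    digits = []
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":  # H and W do not separate two equal codes
            continue
        code = _CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the run, so the same code may repeat
    return (letters[0] + "".join(digits) + "000")[:4]


def normalise(name: str) -> str:
    """Replace every token of a name by its Soundex code."""
    codes = (soundex(token) for token in name.split())
    return " ".join(code for code in codes if code)


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def relative_strength(a: str, b: str) -> float:
    """Assumed formula: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


source, target = "Louijs Rocourt", "Lowis Ricourt"
print(levenshtein(source, target), round(relative_strength(source, target), 2))  # 3 0.79
print(normalise(source), "|", normalise(target))                                 # L200 R263 | L200 R263
print(relative_strength(normalise(source), normalise(target)))                   # 1.0
```

With the threshold of 0.85 used in the method signature, the raw names (0.79) would be rejected while their Soundex-normalised forms (1.0) would be accepted, which is the effect the description aims for.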
################################################################################
# DATASET AND ENTITY SELECTIONS #
################################################################################
ENTITY SELECTION [SOURCE] N0: 1¶
<http://data.goldenagents.org/resource/EntitySelection-PH4ba00e26b03e5dc>
a ll:EntitySelection ;
ll:has-dataset <https://data.goldenagents.org/datasets/ufab7d657a250e3461361c982ce9b38f3816e0c4b/frick_collection_montias_data_20200604> ;
ll:has-entity-type <https://data.goldenagents.org/datasets/SAA/ontology/Person> .
ENTITY SELECTION [SOURCE] N0: 2¶
<http://data.goldenagents.org/resource/EntitySelection-PHb97c2bc9d29ba36>
a ll:EntitySelection ;
ll:has-dataset <https://data.goldenagents.org/datasets/ufab7d657a250e3461361c982ce9b38f3816e0c4b/getty_provenance_index_montias_data_20200604> ;
ll:has-entity-type <https://data.goldenagents.org/datasets/SAA/ontology/Person> .
ENTITY SELECTION [TARGET] N0: 3¶
<http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10>
a ll:EntitySelection ;
ll:has-dataset <https://data.goldenagents.org/datasets/ufab7d657a250e3461361c982ce9b38f3816e0c4b/index_op_notarieel_archief_enriched_20191202> ;
ll:has-entity-type <https://w3id.org/pnv#PersonName> .
################################################################################
# PREDICATE SELECTIONS #
################################################################################
PREDICATE SELECTED [SOURCE] N0: 1¶
<http://data.goldenagents.org/resource/PredicateSelection-PH914a94c6f3c93b4>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH4ba00e26b03e5dc> ;
ll:has-predicate <http://www.w3.org/2000/01/rdf-schema#label> .
PREDICATE SELECTED [SOURCE] N0: 2¶
<http://data.goldenagents.org/resource/PredicateSelection-PH10aaeebb6832fdf>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PHb97c2bc9d29ba36> ;
ll:has-predicate <http://www.w3.org/2000/01/rdf-schema#label> .
PREDICATE SELECTED [TARGET] N0: 3¶
<http://data.goldenagents.org/resource/PredicateSelection-PH7879419327d373d>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-predicate <http://www.w3.org/2000/01/rdf-schema#label> .
PREDICATE SELECTED [SOURCE] N0: 4¶
<http://data.goldenagents.org/resource/PredicateSelection-PH46d91d4f6e2209e>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PHb97c2bc9d29ba36> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH1df7dbfdf1d1eb8> .
<http://data.goldenagents.org/resource/PH1df7dbfdf1d1eb8>
a ll:SequenceSelection ;
rdf:_1 <https://data.goldenagents.org/datasets/SAA/ontology/isInRecord> ;
rdf:_2 <https://data.goldenagents.org/datasets/SAA/ontology/Inventory> ;
rdf:_3 <https://data.goldenagents.org/datasets/SAA/ontology/documentedIn> ;
rdf:_4 <https://data.goldenagents.org/datasets/SAA/ontology/InventoryBook> ;
rdf:_5 <https://data.goldenagents.org/datasets/SAA/ontology/inventoryNumber> .
PREDICATE SELECTED [SOURCE] N0: 5¶
<http://data.goldenagents.org/resource/PredicateSelection-PH0df0471cb5df515>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH4ba00e26b03e5dc> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH1df7dbfdf1d1eb8> .
PREDICATE SELECTED [TARGET] N0: 6¶
<http://data.goldenagents.org/resource/PredicateSelection-PHe3cb3236c5b11b1>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH7334cc09b832a17> .
<http://data.goldenagents.org/resource/PH7334cc09b832a17>
a ll:SequenceSelection ;
rdf:_1 <https://data.goldenagents.org/datasets/SAA/ontology/isInRecord> ;
rdf:_2 <https://data.goldenagents.org/datasets/SAA/ontology/HuwelijkseVoorwaarden> ;
rdf:_3 <https://data.goldenagents.org/datasets/SAA/ontology/inventoryNumber> .
PREDICATE SELECTED [TARGET] N0: 7¶
<http://data.goldenagents.org/resource/PredicateSelection-PHb2a681013fbb430>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH00d91362d72928f> .
<http://data.goldenagents.org/resource/PH00d91362d72928f>
a ll:SequenceSelection ;
rdf:_1 <https://data.goldenagents.org/datasets/SAA/ontology/isInRecord> ;
rdf:_2 <https://data.goldenagents.org/datasets/SAA/ontology/Boedelinventaris> ;
rdf:_3 <https://data.goldenagents.org/datasets/SAA/ontology/inventoryNumber> .
PREDICATE SELECTED [TARGET] N0: 8¶
<http://data.goldenagents.org/resource/PredicateSelection-PH9f7bd41ea902bd0>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH9f325d76d5fa623> .
<http://data.goldenagents.org/resource/PH9f325d76d5fa623>
a ll:SequenceSelection ;
rdf:_1 <https://data.goldenagents.org/datasets/SAA/ontology/isInRecord> ;
rdf:_2 <https://data.goldenagents.org/datasets/SAA/ontology/Boedelscheiding> ;
rdf:_3 <https://data.goldenagents.org/datasets/SAA/ontology/inventoryNumber> .
PREDICATE SELECTED [SOURCE] N0: 9¶
<http://data.goldenagents.org/resource/PredicateSelection-PHeb98e7f77b22fce>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PHb97c2bc9d29ba36> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH22c48ebe6b24223> .
<http://data.goldenagents.org/resource/PH22c48ebe6b24223>
a ll:SequenceSelection ;
rdf:_1 <https://data.goldenagents.org/datasets/SAA/ontology/isInRecord> ;
rdf:_2 <https://data.goldenagents.org/datasets/SAA/ontology/Inventory> ;
rdf:_3 <https://data.goldenagents.org/datasets/SAA/ontology/registrationDate> .
PREDICATE SELECTED [SOURCE] N0: 10¶
<http://data.goldenagents.org/resource/PredicateSelection-PHf741a098569afb1>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH4ba00e26b03e5dc> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH22c48ebe6b24223> .
PREDICATE SELECTED [TARGET] N0: 11¶
<http://data.goldenagents.org/resource/PredicateSelection-PHee886bb3d021a6a>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-predicate <http://data.goldenagents.org/resource/PHda68158d0b9f392> .
<http://data.goldenagents.org/resource/PHda68158d0b9f392>
a ll:SequenceSelection ;
rdf:_1 <https://data.goldenagents.org/datasets/SAA/ontology/isInRecord> ;
rdf:_2 <https://data.goldenagents.org/datasets/SAA/ontology/HuwelijkseVoorwaarden> ;
rdf:_3 <https://data.goldenagents.org/datasets/SAA/ontology/registrationDate> .
PREDICATE SELECTED [TARGET] N0: 12¶
<http://data.goldenagents.org/resource/PredicateSelection-PH715d032180bd40c>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH956023596f37d1b> .
<http://data.goldenagents.org/resource/PH956023596f37d1b>
a ll:SequenceSelection ;
rdf:_1 <https://data.goldenagents.org/datasets/SAA/ontology/isInRecord> ;
rdf:_2 <https://data.goldenagents.org/datasets/SAA/ontology/Boedelinventaris> ;
rdf:_3 <https://data.goldenagents.org/datasets/SAA/ontology/registrationDate> .
PREDICATE SELECTED [TARGET] N0: 13¶
<http://data.goldenagents.org/resource/PredicateSelection-PHe349eb18e5ba638>
a ll:PropertySelection ;
ll:has-entity-selection <http://data.goldenagents.org/resource/EntitySelection-PH769c39438419b10> ;
ll:has-predicate <http://data.goldenagents.org/resource/PH0ef342da86f3226> .
<http://data.goldenagents.org/resource/PH0ef342da86f3226>
a ll:SequenceSelection ;
rdf:_1 <https://data.goldenagents.org/datasets/SAA/ontology/isInRecord> ;
rdf:_2 <https://data.goldenagents.org/datasets/SAA/ontology/Boedelscheiding> ;
rdf:_3 <https://data.goldenagents.org/datasets/SAA/ontology/registrationDate> .
ANNOTATED LINKSET¶
linkset:Getty
{
<
• • •
}
```
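The logic formulation of this linkset joins its three methods with `and (⊤min)`, read here as the minimum t-norm, under which the strength of a conjunction is the smallest strength among its parts. The Python sketch below shows one way such a combination could be evaluated for a single candidate pair. Several points are assumptions rather than documented behaviour: the function names, the record keys `inventoryNumber` and `registrationYear` (the latter standing in for a year taken from `registrationDate`), the sample values, and the choice to treat Exact and Time-Delta as crisp 0/1 outcomes.

```python
from functools import reduce


def t_min(a: float, b: float) -> float:
    """Minimum t-norm: the fuzzy 'and' of two strengths in [0, 1]."""
    return min(a, b)


def exact(source_value, target_value) -> float:
    """Crisp match: 1.0 for identical values, 0.0 otherwise."""
    return 1.0 if source_value == target_value else 0.0


def time_delta(source_year: int, target_year: int, max_years: int = 0) -> float:
    """Assumed crisp outcome: 1.0 if the years differ by at most max_years.

    With max_years = 0 this is the 'Equal' acceptance test of the sample.
    """
    return 1.0 if abs(source_year - target_year) <= max_years else 0.0


def link_strength(name_strength: float, source: dict, target: dict,
                  name_threshold: float = 0.85) -> float:
    """Combine the three method outcomes; 0.0 means no link is produced."""
    # Normalised-Soundex is accepted only at or above its threshold (0.85 in the sample).
    name = name_strength if name_strength >= name_threshold else 0.0
    parts = [
        name,
        exact(source["inventoryNumber"], target["inventoryNumber"]),
        time_delta(source["registrationYear"], target["registrationYear"]),
    ]
    return reduce(t_min, parts)


# Hypothetical records, for illustration only.
source = {"inventoryNumber": "5075", "registrationYear": 1650}
target = {"inventoryNumber": "5075", "registrationYear": 1650}
print(link_strength(0.9, source, target))   # 0.9
print(link_strength(0.8, source, target))   # 0.0
```

Under the minimum t-norm a single failing method is enough to discard a pair, and an accepted pair inherits the strength of its weakest method, here the name similarity.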