
Understanding Matching Results 🤔

Work in progress

This section addresses the problem of understanding the results of the Matching Algorithms and Matching Methods supported in the Lenticular Lens tool, and how to correctly combine them. The former is a set of rules followed by a computer for finding pairs of matching resources. The latter applies the former to generate matching results: one is expected to (i) choose a similarity algorithm, (ii) provide the required input (datasets, entity-type and property-value restrictions, matching properties…) and (iii) provide the conditions (threshold) under which a matched pair is outputted along with its score/weight. Matching methods involving a single entity-matching algorithm are distinguished from those involving more than one: simple versus complex methods (not to be confused with the complexity of the underlying matching algorithm). The latter, naturally, require their results to be combined.

Although combining matching results may seem relatively easy, deciding on the final score and the annotation of links with weights requires some extra thought. First of all, it requires deciding what a matching score is about: a degree of truth or a degree of confidence? In other words, is this a problem of Vagueness, of Uncertainty, or of both? (see Sect. 1). Answering the latter question opens the door to Section 2, where we unveil our take on “how to combine scores?”. In particular, we split the problem into (i) “how to combine degrees of truth?”, (ii) “how to combine degrees of confidence?” and finally (iii) “how/when to transition from degree of truth to degree of confidence?”.

1. Vagueness vs. Uncertainty

This section addresses the issue of “how to interpret the various scores in the process of entity matching?” by digging into “what are vagueness and degrees of truth, and how do they differ from uncertainty and degrees of confidence?”. Roughly, a truth value or degree of truth (the tomato is ripe with degree of truth 0.7) is not to be confused with a degree of certainty / confidence (I am 0.9 certain that the tomato is ripe with degree of truth 0.7): the latter is not an assignment of a truth value but an evaluation of a weighted proposition, regardless of its truth value space ({0, 1} or [0, 1]).

In the next subsections, the distinction between Vagueness and Uncertainty is outlined based on the interpretation we associate to the output scores of similarity algorithms and matching methods.

1.1 Scores of similarity algorithms

Vagueness & Degrees of Truth    Similarity algorithms (Levenshtein, Soundex, Cosine…) meant for computing the overlap between a pair of input-arguments (property-value overlap) output values in the unit interval [0, 1] or values convertible in the unit interval (normalisation). These values are truth values / degrees of truth. For example, the input-arguments “Rembrand van Rijn” and “Rembrandt Harmensz van Rijn” have a similarity degree of truth of 0.63 using the Levenshtein algorithm and a similarity degree of truth of 0.74 using the Soundex algorithm. Transitioning from assigning boolean truth values {0, 1} to assigning continuous truth values in the unit interval [0, 1] to propositions (events) is clearly moving from classical logic to fuzzy logic which is in the modelling paradigm of many-valued Logics. The latter truth space (unit interval) is motivated by the presence of vague concepts in a proposition (the use of ripe in the proposition “the tomato is ripe”), making it hard or sometimes even impossible to establish whether the proposition is completely true or false [Lukasiewicz2008].

1.2 Scores of matching methods

Uncertainty & Degree of confidence    Whatever truth value is assigned to a proposition or event, sometimes, one may wonder about the likelihood that the event will occur or has occurred [Kovalerchuk2017]. This reflects the uncertainty regarding the statement due to, for example, lack of information. In these cases, the use of theories such as Probabilistic or Possibilistic Logics can be considered for the evaluation of the likelihood of a proposition. In other words, the use of degree of (un)certainty, belief or confidence is to emphasise a confidence-evaluation on the assignment of a truth value to a proposition, which may or may not qualify as vague (the tomato is ripe, the sky is blue, the bottle is full…). For example, how certain are we in asserting that the proposition “the tomato is ripe” is true? As [Lukasiewicz2008] illustrates, asserting that “John is a teacher” and “John is a student” with respectively the values 0.3 and 0.7 as degrees of certainty is roughly saying that “John is either a teacher or a student, but more likely a student”. However, the vague statement “John is tall” with the assignment of 0.9 as degree of truth can be roughly translated as “John is quite tall” rather than as “John is likely tall”.

Entity matching methods1 output links optionally annotated with matching scores reflecting (i) the level of confidence of a method (the strength of the evidence) for assimilating a pair of resources as co-referent and (ii) the lower boundary above which two resources can be viewed as co-referent. For example, given the resources e_1 and e_2 respectively labelled Rembrand van Rijn and Rembrandt Harmensz van Rijn, e_1 and e_2 can be linked with a matching score of 0.63 using the Levenshtein algorithm, provided that a matching score above 0.60 is deemed acceptable (more details on how this confidence score can be calculated are provided in Section 2.3).
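
In code, accepting a link then reduces to checking the degree of truth against the user-supplied threshold; a sketch reusing the levenshtein_similarity function from above (the THRESHOLD name and the printed output are illustrative):

THRESHOLD = 0.60  # user-supplied lower boundary for accepting a link

score = levenshtein_similarity("Rembrand van Rijn", "Rembrandt Harmensz van Rijn")
if score > THRESHOLD:
    print(f"e1 owl:sameAs e2 with matching score {score:.2f}")  # 0.63 > 0.60: link accepted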

2. Combination of the Scores

Even though degrees of truth (scores of similarity algorithms) and degrees of confidence (scores of matching methods) are not to be confused, they may be combined (multiple truth values or multiple degrees of confidence) or be subject to a transition. In fact, it may be almost unthinkable to solve real-life problems without doing so. Consider the famous abductive-reasoning example of the duck test, where something probably is a duck if it (i) looks like a duck, (ii) swims like a duck and (iii) quacks like a duck. This requires first combining the truth values associated with propositions (i), (ii) and (iii) as a conjunction (see Sect. 2.1), followed by a transition into a confidence value concluding that it probably is a duck (see Sect. 2.3).

2.1 Combining Truth Values

One thing we know and agree on is that similarity algorithms output boolean or fuzzy truth values in the range [0, 1]. This allows us to make use of the combination functions offered by classical logic (Subsection 2.1.1) or fuzzy logic (Subsection 2.1.2), depending on whether we expect the solution space to be {0, 1} or [0, 1].

2.1.1 Classic Logic

The two standard logic operators or combination functions traditionally used are the classical Boolean Conjunction (∧) and Disjunction (∨). The former takes the minimum strength and the latter the maximum. This applies to both classical values (True = 1 or False = 0) and fuzzy values (between 0 and 1).

Since the results from matching methods are assigned fuzzy values in the interval ]0, 1], the table below illustrates the default behaviour of the Lenticular Lens when combining them.

Example 18: Standard logic operations over conjunction (min) and disjunction (max).
Source                      Target                       Levenshtein  Soundex      OR(max)    AND(min)
------------------------------------------------------------------------------------------------------
Jasper Cornelisz. Lodder    Jaspar Cornelisz Lodder             0.92     1.00        1.00         0.92
Rembrand van Rijn           Rembrandt Harmensz van Rijn         0.63     0.74        0.74         0.63
Barent Teunis               Barent Teunisz gen. Drent           0.52     0.47        0.52         0.47
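
As a minimal sketch, this default behaviour reduces to Python’s built-in min and max over the degrees of truth taken from the table above:

pairs = {  # (source, target): (Levenshtein, Soundex) degrees of truth
    ("Jasper Cornelisz. Lodder", "Jaspar Cornelisz Lodder"): (0.92, 1.00),
    ("Rembrand van Rijn", "Rembrandt Harmensz van Rijn"): (0.63, 0.74),
    ("Barent Teunis", "Barent Teunisz gen. Drent"): (0.52, 0.47),
}
for (src, trg), (lev, sdx) in pairs.items():
    print(f"{src} / {trg}: OR = {max(lev, sdx):.2f}, AND = {min(lev, sdx):.2f}")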

2.1.2 Fuzzy Logic

As presented in the next subsections, sophisticated combination functions such as T-norms (⊗) and S-norms (⊕), developed by scholars like Łukasiewicz, Gödel, Goguen, Zadeh and others, can also be used as alternatives to, respectively, Boolean Conjunction (∧) and Disjunction (∨).

2.1.2.1 T-norms

Six different operations can be applied when methods are combined by Conjunction. Here, we present them:

  • Minimum t-norm

    ⊤_{min} (a, b) = min(a, b) \tag{1}

  • Product t-norm

    ⊤_{prod} (a, b) = a \cdot b \tag{2}

  • Łukasiewicz t-norm

    ⊤_{Luk} (a, b) = max(0, a + b - 1) \tag{3}

  • Drastic

⊤_D(a, b) = \begin{cases} b &\text{if } a = 1 \\ a &\text{if } b = 1 \\ 0 &\text{otherwise} \end{cases} \tag{4}

  • Nilpotent minimum

⊤_{nM}(a, b) = \begin{cases} min(a, b) &\text{if } a + b > 1 \\ 0 &\text{otherwise} \end{cases} \tag{5}

  • Hamacher product

⊤_{H_0}(a, b) = \begin{cases} 0 &\text{if } a = b = 0 \\ \dfrac{ab}{a + b - ab} &\text{otherwise} \end{cases} \tag{6}


The following table provides three case studies to illustrate the application of each of the aforementioned T-norm binary operations. They are presented in order from the least strict (⊤_min) to the strictest (⊤_D).

Source                      Target                       Levenshtein  Soundex    min     H0   prod     nM    Luk      D
-------------------------------------------------------------------------------------------------------------------------
Jasper Cornelisz. Lodder    Jaspar Cornelisz Lodder             0.92     1.00  0.920  0.920  0.920  0.920  0.920  0.920
Rembrand van Rijn           Rembrandt Harmensz van Rijn         0.63     0.74  0.630  0.516  0.466  0.630  0.370  0.000
Barent Teunis               Barent Teunisz gen. Drent           0.52     0.47  0.470  0.328  0.244  0.000  0.000  0.000
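
The six T-norms translate directly into Python; a minimal sketch (the operator names are ours, not necessarily the Lenticular Lens identifiers):

T_NORMS = {
    "min":  lambda a, b: min(a, b),                                        # (1)
    "prod": lambda a, b: a * b,                                            # (2)
    "Luk":  lambda a, b: max(0.0, a + b - 1),                              # (3)
    "D":    lambda a, b: b if a == 1 else (a if b == 1 else 0.0),          # (4)
    "nM":   lambda a, b: min(a, b) if a + b > 1 else 0.0,                  # (5)
    "H0":   lambda a, b: 0.0 if a == b == 0 else a * b / (a + b - a * b),  # (6)
}

a, b = 0.63, 0.74  # Rembrand van Rijn vs. Rembrandt Harmensz van Rijn
for name, t in T_NORMS.items():
    print(f"T_{name}({a}, {b}) = {t(a, b):.3f}")
# min 0.630, prod 0.466, Luk 0.370, D 0.000, nM 0.630, H0 0.516 (second row of the table)
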
2.1.2.2 S-norms

Six different operations can likewise be applied when methods are combined by Disjunction. Here, we present them:

  • Maximum S-norm

    ⊥_{max} (a, b) = max(a, b) \tag{7}

  • Probabilistic sum

    ⊥_{sum} (a, b) = a + b - a \cdot b \tag{8}

  • Bounded sum

    ⊥_{Luk} (a, b) = min(a + b, 1) \tag{9}

  • Drastic S-norm

⊥_D(a, b) = \begin{cases} b &\text{if } a = 0 \\ a &\text{if } b = 0 \\ 1 &\text{otherwise} \end{cases} \tag{10}

  • Nilpotent maximum

⊥_{nM}(a, b) = \begin{cases} max(a, b) &\text{if } a + b < 1 \\ 1 &\text{otherwise} \end{cases} \tag{11}


  • Einstein sum

    ⊥_{H_2} (a, b) = \dfrac{a + b}{1 + ab} \tag{12}


The following table applies each of the aforementioned S-norm binary operations to the same three case studies:

Source                      Target                       Levenshtein  Soundex      D    Luk     H2    sum     nM    max
-------------------------------------------------------------------------------------------------------------------------
Jasper Cornelisz. Lodder    Jaspar Cornelisz Lodder             0.92     1.00  1.000  1.000  1.000  1.000  1.000  1.000
Rembrand van Rijn           Rembrandt Harmensz van Rijn         0.63     0.74  1.000  1.000  0.934  0.904  1.000  0.740
Barent Teunis               Barent Teunisz gen. Drent           0.52     0.47  1.000  0.990  0.796  0.746  0.520  0.520
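
Likewise for the six S-norms (same caveat: a sketch, not the tool’s code):

S_NORMS = {
    "max":  lambda a, b: max(a, b),                                # (7)
    "sum":  lambda a, b: a + b - a * b,                            # (8)
    "Luk":  lambda a, b: min(a + b, 1.0),                          # (9)
    "D":    lambda a, b: b if a == 0 else (a if b == 0 else 1.0),  # (10)
    "nM":   lambda a, b: max(a, b) if a + b < 1 else 1.0,          # (11)
    "H2":   lambda a, b: (a + b) / (1 + a * b),                    # (12)
}

a, b = 0.52, 0.47  # Barent Teunis vs. Barent Teunisz gen. Drent
for name, s in S_NORMS.items():
    print(f"S_{name}({a}, {b}) = {s(a, b):.3f}")
# max 0.520, sum 0.746, Luk 0.990, D 1.000, nM 0.520, H2 0.796 (third row of the table)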

2.1.3 Examples

Suppose that two data items E-1 and E-2 carry the following information:

  • E-1

    • Name: Titus Rembrandtsz. van Rijn
    • Mother: Saskia Uylenburgh
    • Father: Rembrand van Rijn
    • Parent’s Marriage date: 1644-06-22
  • E-2

    • Name: T. Rembrandtszoon van Rijn
    • Mother: Saske van Uijlenburg
    • Father: Rembrandt Harmensz van Rijn
    • Baptism date: 1641-09-22

To interpret E-1 and E-2 as representing co-referent persons, the following four tests are proposed.

Test-1 OR

Here, the names of E-1 and E-2 are to be compared using the Levenshtein and Soundex algorithms at a threshold of at least 0.7.

MATCHING RESULTS
 - Levenshtein(Titus Rembrandtsz van Rijn, T. Rembrandtszoon van Rijn)  => 0.73 ✅
 - sdx_1 = Soundex(Titus Rembrandtsz van Rijn) = T320 R516 V500 R250
 - sdx_2 = Soundex(T. Rembrandtszoon van Rijn) = T000 R516 V500 R250
 - Levenshtein(sdx_1, sdx_2)                                            => 0.89 ✅

DISJUNCTION RESULTS
 - names similarity   = S-norm(0.73, 0.89, 'MAXIMUM')                 => 0.89 ✅
 - names similarity   = S-norm(0.73, 0.89, 'PROBABILISTIC')           => 0.97 ✅
Test-2 AND

Names of the postulated mothers and fathers are to be similar at a threshold of at least 0.6 using the Levenshtein algorithm.

MATCHING RESULTS
 - Levenshtein(Saskia Uylenburgh, Saske van Uijlenburg)                 => 0.65 ✅
 - Levenshtein(Rembrand van Rijn, Rembrandt Harmensz van Rijn)          => 0.63 ✅

CONJUNCTION RESULTS
 - Parent's names similarity  = t_norm(0.65, 0.63, 'MINIMUM')           => 0.63 ✅
 - Parent's names similarity  = t_norm(0.65, 0.63, 'HAMACHER')          => 0.47 ❌
Test-3

The period between the parent’s marriage date on the one side and the child’s baptism date on the other side are to be no more than 25 years apart.

MATCHING RESULTS
 - Delta(1644-06-22, 1641-09-22, 25)                                    => 1.00 ✅
Test-4 AND

Combining all three tests above using a conjunction fuzzy operator should result in a similarity score greater than or equal to 0.8.

--------------------------------------------------------------------------------------
FINAL CONJUNCTIONS WITH A TRUTH VALUE LIST OF [0.89, 0.63, 1]
--------------------------------------------------------------------------------------
  - t_norm_list([0.89, 0.63, 1], 'MINIMUM')                             => 0.63 ❌
  - t_norm_list([0.89, 0.63, 1], 'HAMACHER')                            => 0.58 ❌
  - t_norm_list([0.89, 0.63, 1], 'PRODUCT')                             => 0.56 ❌
  - t_norm_list([0.89, 0.63, 1], 'NILPOTENT')                           => 0.63 ❌
  - t_norm_list([0.89, 0.63, 1], 'LUK')                                 => 0.52 ❌
  - t_norm_list([0.89, 0.63, 1], 'DRASTIC')                             => 0.00 ❌

--------------------------------------------------------------------------------------
CONJUNCTIONS WITH A DIFFERENT LIST OF TRUTH VALUES [0.89, 0.82, 1]
--------------------------------------------------------------------------------------
 - t_norm_list([0.89, 0.82, 1], "MINIMUM")                              => 0.82 ✅
 - t_norm_list([0.89, 0.82, 1], "HAMACHER")                             => 0.74 ❌
 - t_norm_list([0.89, 0.82, 1], "PRODUCT")                              => 0.73 ❌
 - t_norm_list([0.89, 0.82, 1], "NILPOTENT")                            => 0.82 ✅
 - t_norm_list([0.89, 0.82, 1], "LUK")                                  => 0.71 ❌
 - t_norm_list([0.89, 0.82, 1], "DRASTIC")                              => 0.0  ❌

--------------------------------------------------------------------------------------
EXAMPLE USING MORE THAN ONE FUZZY LOGIC OPERATOR
--------------------------------------------------------------------------------------
  - Ops.t_norm(Ops.t_norm(0.89, 0.63, 'HAMACHER'), 1, 'MINIMUM')        => 0.58 ❌
  - Ops.t_norm(Ops.t_norm(0.89, 0.63, 'MINIMUM'), 1, 'HAMACHER')        => 0.63 ❌
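
A plausible reading of the t_norm_list helper used above is a left fold of a binary T-norm over the list; the reconstruction below reuses the T_NORMS table from the sketch in Sect. 2.1.2.1 (the tool’s actual implementation may differ):

from functools import reduce

def t_norm_list(values, name):
    # Fold the chosen binary T-norm over the whole list, left to right.
    return reduce(T_NORMS[name], values)

print(round(t_norm_list([0.89, 0.63, 1], "min"), 2))  # 0.63: fails the 0.8 threshold
print(round(t_norm_list([0.89, 0.82, 1], "min"), 2))  # 0.82: passes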

Conclusion: Given the evidence provided for E-1 and E-2 and the rules described above, the interpretation resulting from the chosen fuzzy logic operations leads to the conclusion that there is insufficient evidence to infer that the underlying data items are co-referent. This rejection is mainly due to the low similarity of the parents’ names. If the resulting similarity were above 0.8, there would be a better chance for the data items to be co-referent; keep in mind that our conjectured rule asserts an identity relation only when the combination of scores is above 0.8. Better data or a more advanced algorithm could have helped.

2.2 Combining Confidence Values

Understanding how to combine uncertain events starts with a better understanding of uncertainty itself. Sentz et al. [Sentz2002] distinguish two important kinds of uncertainty: Aleatory (objective uncertainty originating from random behaviour) and Epistemic (subjective uncertainty originating from ignorance or lack of knowledge).

Whereas traditional probability is clearly applicable to Aleatory Uncertainty, researchers point out its inability to deal with Epistemic Uncertainty. In short, this is because the latter implies neither knowing the probability of all relevant events, nor their uniform distribution, nor even the axiom of additivity (i.e. all probabilities summing up to 1). This has led to the emergence of more general representations of uncertainty as alternatives to traditional probability theory, such as imprecise probabilities, possibility theory and evidence theory. Nonetheless, at present, there is no clear best representation of uncertainty [Sentz2002].

This section introduces alternative representations of uncertainty that are planned to be implemented in the Lenticular Lens. Although the choice among them can ultimately be left to the user, we consider the problem of co-reference search through multiple matching methods to be a case of Epistemic Uncertainty, which is nicely approached in evidence theory.

2.2.1 Probabilistic Logic

Using probability to combine confidence values with the logic operators “AND” and “OR” can, in the context of link manipulation, in theory be done with equations (13) and (14) respectively, under the strong assumption that the events to be combined are independent (the occurrence of one event has no effect on the probability of the occurrence of the other).

P(\text{A and B}) = P(A) \cdot P(B) \tag{13}

P(\text{A or B}) = P(A) + P(B) - P(\text{A and B}) \tag{14}

\footnotesize \text{where } P(\text{A and B}) = 0 \text{ if A and B are mutually exclusive events, meaning that these events have no outcomes in common.}

On the one hand, assuming that events A and B are independent, equations (13) and (14) coincide with ⊗ (⊤_{prod}(a, b) = a \cdot b) and ⊕ (⊥_{sum}(a, b) = a + b - a \cdot b) of Product Logic, and are hence straightforward to implement when manipulating links, for example when applying an intersection or a union operator to two sets of links.

On the other hand, in the event that A and B are not independent, the value of P(\text{A and B}) should be observed (\frac{n(A \text{ and } B)}{n(Sample)}), provided, or computed using conditional probability as in Equation (15).

P(\text{A and B}) = P(A) \cdot P(B|A) \tag{15}

In the context of the Lenticular Lens, the computed confidence values are independent (the computation of a confidence value by method-1 has no effect on the computation of a confidence value by method-2). This being said, equation (15) is not applicable.
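
A sketch of equations (13) and (14) under the independence assumption; the inputs 0.76 and 0.9 anticipate the linkset confidence values used in Section 2.4:

def p_and(pa, pb):
    # Equation (13): conjunction of independent events.
    return pa * pb

def p_or(pa, pb):
    # Equation (14): disjunction, subtracting the overlap.
    return pa + pb - p_and(pa, pb)

print(p_and(0.76, 0.9))  # 0.684
print(p_or(0.76, 0.9))   # 0.976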

2.2.2 Possibilistic Logic

Possibility2 is compositional with respect to the union operator, as the possibility of a union is deducible from the possibilities of its components. Note, however, that it is not compositional with respect to the intersection operator.

pos(A \cup B) = max(pos(A), \: pos(B)) \quad \text{\footnotesize for any subsets A and B} \tag{16}

pos(A \cap B) \leq min(pos(A), \: pos(B)) \leq max(pos(A), \: pos(B)) \tag{17}

2.2.3 Evidence Theory

Evidence theory also provides different ways of combining uncertain scores. We present two of them here: first the original proposal, namely the Dempster-Shafer rule, which is shown to have limitations, followed by a more accepted approach called averaging.

2.2.3.1 Dempster-Shafer

Combining or aggregating confidence values associated with evidence is made possible by the Dempster-Shafer conjunctive combination rule, presented in Equation 18. Here too, the assumption of independence among the sources providing supporting or conflicting assessments for the same frame of discernment [Sentz2002] is of key importance, and it is the basic assumption supporting the Dempster-Shafer combination rule. However, a crucial context-dependent limitation of this rule, pointed out by [Zadeh, 1984], occurs in cases with significant conflict: the denominator of the rule has the effect of completely ignoring conflict while the numerator emphasises agreement, thereby yielding inconsistent (unintuitive) results.

m_{12}(A) = (m_1 \oplus m_2)(A) = \footnotesize{\frac{\text{supporting evidence}}{1 - \text{conflicting evidence}}} = \frac{ \displaystyle\sum_{B \cap C = A \ne \emptyset} m_1(B) m_2(C) }{1 - \displaystyle\sum_{B \cap C = \emptyset} m_1(B) m_2(C) } \tag{18}

\scriptsize \text{where } \begin{cases} m &\text{Basic probability assignment (bpa) function from the power set } P(X) \text{ to } [0, 1]. \\ m(A) &\text{The bpa value or mass of a given set } A \text{, but not of any particular subset of } A. \\ m_1 \text{, } m_2 &\text{Two given basic probability assignments.} \\ m_{12}(A) &\text{The combination, a.k.a. the joint } m_{12}. \\ m(\emptyset) = 0 &\text{The mass of the empty set is zero.} \\ \displaystyle\sum_{A\in P(X)} m(A) = 1 &\text{The masses of all the members of the power set add up to a total of 1.} \end{cases}

This inconsistency is highlighted in Fig. 1: Patient Diagnosis (1), where the joint agreement on the patient’s condition results in m_{12}(brainTumor) = 1 using Dempster’s combination rule, although both doctors agreed that the patient is unlikely to suffer from a brain tumour. Had it been the opposite scenario, as in Fig. 1: Patient Diagnosis (2) (Dr. Green and Dr. House assigning 0.99 as the basic probability of the patient suffering from a brain tumour), the result m_{12}(brainTumor) = 1 would be consistent with our intuition.

Fig. 1: Combining two doctors’ diagnoses of a patient using the Dempster-Shafer combination rule. The results in (1) and (2) highlight the context-dependent inconsistency of the rule.
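
A compact sketch of Equation 18 applied to the scenario of Patient Diagnosis (1); the non-tumour hypotheses (meningitis, concussion) follow Zadeh’s classic version of this example and are our assumption about what Fig. 1 shows:

from itertools import product

def dempster(m1, m2):
    # Masses map frozensets of hypotheses to basic probability assignments (bpa).
    combined, conflict = {}, 0.0
    for (h1, v1), (h2, v2) in product(m1.items(), m2.items()):
        inter = h1 & h2
        if inter:  # supporting evidence
            combined[inter] = combined.get(inter, 0.0) + v1 * v2
        else:      # conflicting evidence
            conflict += v1 * v2
    return {h: v / (1 - conflict) for h, v in combined.items()}

dr_green = {frozenset({"meningitis"}): 0.99, frozenset({"brain tumour"}): 0.01}
dr_house = {frozenset({"concussion"}): 0.99, frozenset({"brain tumour"}): 0.01}
print(dempster(dr_green, dr_house))  # brain tumour: ~1.0, despite both bpa's being 0.01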

2.2.3.2 Averaging

Many alternatives to Equation 18 have been proposed by scholars such as Yager (modified Dempster’s rule), Inagaki (modified combination rule), Zhang (centre combination rule), Dubois and Prade (disjunctive consensus rule), Ferson and Kreinovich (averaging), etc. One particular approach, Averaging (Equation 19), is considered to produce better outcomes [Choi2009] and is therefore the most likely to be implemented in the Lenticular Lens for combining confidence values, for instance in the lens operations union and intersection. It provides the means to calculate an average over several (n) sources while also taking into account possible reliability weights attributed to each source.

m_{1...n}(A) = \frac{1}{n} \sum_{i=1}^{n} w_i m_i(A) \tag{19}

\scriptsize \text{where } \begin{cases} n &\text{Number of sources.} \\ w_i &\text{Reliability weight of source } i. \\ m_i &\text{Basic probability assignment of a body of evidence.} \end{cases}

Applying Equation 19 to the two scenarios illustrated in Fig. 1 will yield the following results:

  • Patient Diagnosis (1): m_{1,2}(brainTumor) = \frac{0.01 + 0.01}{2} = 0.01
  • Patient Diagnosis (2): m_{1,2}(brainTumor) = \frac{0.99 + 0.99}{2} = 0.99
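
For a single hypothesis, Equation 19 is a one-liner; a sketch assuming uniform reliability weights when none are given:

def average_mass(masses, weights=None):
    # Equation (19): (1/n) * sum of w_i * m_i(A) for one hypothesis A.
    weights = weights or [1.0] * len(masses)
    return sum(w * m for w, m in zip(weights, masses)) / len(masses)

print(average_mass([0.01, 0.01]))  # Patient Diagnosis (1) -> 0.01
print(average_mass([0.99, 0.99]))  # Patient Diagnosis (2) -> 0.99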

2.3 From Truth to Confidence

Similarly to the duck test, modelling ways of finding supporting evidence for isolating potential entity-matching candidates is crucial to inferring identity for a pair of resources. Section 2.1 already covers our take on how to combine truth values, and Section 2.2 covers the combination of degrees of confidence. What now remains is to understand “how to transition from a truth value to uncertainty / a degree of confidence?”. For illustration purposes, this means, for example, how to move from [looks like a duck (0.8), swims like a duck (1.0), and quacks like a duck (0.95)] to [it probably is a duck (???)]. Applying ⊤_{prod} to the evidence truth values results in a truth value of 0.76. Assuming that the transition from the evidence’s truth value to the degree of confidence carries a weight of 1, we argue that it is now possible to extrapolate a confidence of 0.76 for asserting that the entity that looks like a duck (0.8), swims like a duck (1.0), and quacks like a duck (0.95) is probably (0.76) a duck. Similarly, if this transition weight is instead set to 0.9, for example, the degree of confidence computed for the entity being a duck reasonably drops to 0.684.
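
The duck computation spelled out (a sketch; the transition weights 1 and 0.9 are the ones discussed above):

from math import prod

evidence = [0.8, 1.0, 0.95]  # looks, swims, quacks like a duck (degrees of truth)
truth = prod(evidence)       # product T-norm: 0.76

for weight in (1.0, 0.9):    # transition weights
    print(f"probably a duck with confidence {truth * weight:.3f}")  # 0.760, then 0.684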

2.4 Implementation

There are several operations in the Lenticular Lens in which one or more of the values discussed above, and their combinations, come into play. We summarise them here, including possible future improvements to allow more control over the produced values.

Link Construction

Simple Matching Method

  • Transition from Truth to Confidence Degree
  • Example: If the entities’ names sound alike with a degree of truth above 0.6, then the resources are probably the same, with sound-alike * 1 as degree of confidence.

Complex Matching Method

  • Combination of truth values via logic-boxes, followed by a Transition from Truth to Confidence Degree
  • Example: If the entities’ names sound alike with a degree of truth above 0.6 OR look alike with a degree of truth above 0.7, AND the date of birth is the same, then the final degree of truth using the classical OR/AND combinations is min(max(sound-alike, look-alike), same-birth), and the resources are probably the same with this final degree of truth as degree of confidence (see the sketch below).
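
A minimal sketch of that combination (the variable names and the illustrative degrees of truth are ours):

sound_alike, look_alike, same_birth = 0.8, 0.75, 1.0         # illustrative degrees of truth
final_truth = min(max(sound_alike, look_alike), same_birth)  # classical OR, then AND
confidence = final_truth * 1  # current fixed transition-weight of 1
print(confidence)             # 0.8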

Currently a fixed transition-weight of 1 is applied, so that the final score (currently only one is outputted) reflects both the degree of truth and the degree of confidence.

Improvements:

  • Explicit output of the degrees of truth, confidence and transition-weight;
  • Allow the user to decide on the transition-weight applied, so that low-power identity criteria, such as the example in the simple method, would not conclude that high name similarity means high confidence.
Link Manipulation

1. Union

  • Possible combinations:

    • Disjunction of the final degrees of truth, followed by the re-assignment of the confidence value given a transition-weight;
    • Disjunction of the final degrees of confidence.

2. Intersection

  • Possible combinations:

    • Conjunction of the final degrees of truth, followed by the re-assignment of the confidence value given a transition-weight;
    • Conjunction of the final degrees of confidence.

3. Difference

This operation does not require any combination of values; it simply selects the matches that do not occur in another link-set.

4. Composition

  • Transitivity over the final degrees of truth ???

  • Composition of confidence attributed to independent events ???

5. In Set

This operation does not require any combination of values; it simply selects the matches whose resources occur in a given resource-set.

Currently, only the combination of the truth values is implemented, with the transition-weight equal to 1.

Improvements: allowing the user to decide which values to combine and how, plus what transition-weight to (re)apply if needed, would render the system more flexible.

Link Validation

On top of the automatically calculated degree of confidence discussed so far, manual validation allows the user to attribute their own confidence.

Currently, such manual attribution consists of simply accepting or rejecting the produced link (which means manual confidence of 1 or 0).

Improvements: allowing the user to attribute a confidence (increasing or decreasing the automatically calculated one) will render the system more flexible in handling matches that cannot easily be accepted or rejected, for example by allowing several experts’ opinions to be registered and letting the final user decide whether or not to take them as acceptable.

This transition weight can easily be applied in the generation and manipulation of links. In the simplest setting, where the transition weight is set to 1, discovered links can be annotated with an estimated degree of confidence by extrapolation of the evidence’s truth value using an appropriate combination function. Things become a bit more complicated when dealing with the manipulation of links, because the options range from classical or fuzzy logics to possibilistic or probabilistic logics or evidence theory. In the score combination examples illustrated below, ex:lens-1 is the result of the union of ex:linkset-1 and ex:linkset-2 using fuzzy logic over the respective evidence truth values of the links being united, while in ex:lens-2 and ex:lens-3 the degree of confidence of a link is computed using evidence theory and probabilistic logic respectively.

Combining Uncertainty in Identity
### Linkset meta-data ###
#########################

ex:linkset-1 
    ex:combination-function      ex:t-norm-product ;
    ex:transition-weight        1.0 .

ex:linkset-2 
    ex:combination-function      ex:s-norm-max ;
    ex:transition-weight        1.0 .

### Annotated Linkset ###
#########################

ex:linkset-1
{ 
    <<ex:e1 owl:sameAs ex:e2>>
        ex:degree-of-truth      0.76 ;
        ex:degree-of-confidence 0.76 .
}

ex:linkset-2 
{   
    <<ex:e1 owl:sameAs ex:e2>>
        ex:degree-of-truth      0.9 ;
        ex:degree-of-confidence 0.9 .
}

#################################################################
### lens-1: Obtaining a degree of confidence by combining    ####
### scores with a UNION operation based on truth values      ####
### using s-norm-sum fuzzy logic operator.                   ####
#################################################################

ex:lens-1
    ex:operator                 ex:UNION ;
    ex:target                   ex:linkset-1, ex:linkset-2 ;
    ex:combination-function     ex:s-norm-sum ;
    ex:transition-weight        1.0 .

ex:lens-1
{       
    <<ex:e1 owl:sameAs ex:e2>>
        ex:degree-of-truth      0.976 ;
        ex:degree-of-confidence 0.976 .
}

#################################################################
### lens-2: Obtaining a degree of confidence by combining    ####
### scores with a UNION operation based on confidence values ####
### using the event averaging operator.                      ####
#################################################################

ex:lens-2
    ex:operator                 ex:UNION ;
    ex:target                   ex:linkset-1, ex:linkset-2 ;
    ex:combination-function     ex:averaging .

ex:lens-2
{
    <<ex:e1 owl:sameAs ex:e2>>
        ex:degree-of-confidence 0.83 .
}

#################################################################
### lens-3: Obtaining a degree of confidence by combining    ####
### scores with a UNION operation based on confidence values ####
### using probabilistic logic.                               ####
#################################################################

ex:lens-3
    ex:operator                 ex:UNION ;
    ex:target                   ex:linkset-1, ex:linkset-2 ;
    ex:combination-function     ex:Probabilistic .

ex:lens-3
{
    <<ex:e1 owl:sameAs ex:e2>>
        ex:degree-of-confidence 0.976 .
}

3. Conclusion

We have shed light on the ambiguity surrounding the concepts of vagueness and uncertainty and their corresponding scores: degree of truth and degree of confidence. This enables us to understand the nature of the computed scores and to label scores obtained by matching algorithms (property-value comparisons) as degrees of truth, and scores assigned to identity links generated by machines (matching methods or their combinations) or by humans (curation) as degrees of confidence. Consequently, different options for aggregating/combining these scores have been presented, depending on whether one is dealing with degrees of truth or degrees of confidence.


  1. Matching methods make explicit all arguments/pre-requisites (datasets, entity-type and property-value restrictions, matching properties…) of a matching algorithm, including the conditions under which the algorithm is to accept a discovered link (threshold). 

  2. For intellectual curiosity, see Wikipedia for more information.