Um blog sobre nada

A collection of useless things that may turn out to be useful

Understanding the set similarity used by the Fuzzy Lookup component in Excel

Posted by Diego on July 17, 2013


Consider the sets (A, B, C) and (A, C, D).

 

To make sure the correct weights are applied, I set the “CustonTokenWeightsRowSetName” property to an Excel table. The values are the same as the ones in the PDF that comes with the Fuzzy Lookup download.


 

According to the explanation in the PDF, the similarity is the weight of the intersection divided by the weight of the union of the two sets. With the token weights used here (A = 2, B = 5, C = 3, D = 7), that would be:

(A+C) / (A + B + C + D) 

5 / 17

0.294
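The calculation above can be sketched in a few lines of Python. This is only an illustration of the weighted Jaccard arithmetic, not the add-in's actual implementation; the `WEIGHTS` dictionary and the function names are hypothetical, and the weights are the ones shown in the comparison details further down.

```python
# Token weights as they appear in the comparison details (a=2, b=5, c=3, d=7).
WEIGHTS = {"a": 2, "b": 5, "c": 3, "d": 7}


def total_weight(tokens):
    """Sum the weights of the tokens in a set."""
    return sum(WEIGHTS[t] for t in tokens)


def weighted_jaccard(left, right):
    """Weight of the intersection divided by the weight of the union."""
    return total_weight(left & right) / total_weight(left | right)


left = {"a", "b", "c"}
right = {"a", "c", "d"}
print(round(weighted_jaccard(left, right), 3))  # 0.294
```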

 

But if you run it, you will get the value of 0.4588.

The reason is that the PDF does not explain the concept of the ContainmentBias.

 

The actual formula to determine how similar 2 sets are is the following:

 

B*C + (1-B)*J

 

Where:

B: ContainmentBias (not to be confused with the token B in the example sets)
C: Jaccard containment (size of the intersection divided by the size of the left set)
J: Jaccard similarity (size of the intersection divided by the size of the union of the two sets), the 0.294 we saw before

 

Note that with a bias of 0 the (B * C) term vanishes and (1 - B) equals 1, so the final result is simply the value of J (0.294).

 

By setting the default value of 0.8, we would have the following:

B: 0.8
C: (A + C) / (A + B + C) = 5/10 = 0.5
J: (A + C) / (A + B + C + D) = 5/17 = 0.294

 

B*C + (1-B)*J

0.8 * 0.5 + 0.2 * 0.294

0.4 + 0.0588

0.4588
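The same arithmetic as a small Python sketch. The function name `biased_similarity` and its parameters are hypothetical, chosen for illustration only; it takes the already summed weights (intersection, left set, union) and blends containment and Jaccard with the bias.

```python
def biased_similarity(intersection, left, union, bias=0.8):
    """Blend containment and Jaccard: bias * C + (1 - bias) * J."""
    containment = intersection / left  # C: intersection over the left set
    jaccard = intersection / union     # J: intersection over the union
    return bias * containment + (1 - bias) * jaccard


# Weighted sizes from the example: A + C = 5, A + B + C = 10, A + B + C + D = 17.
print(round(biased_similarity(5, 10, 17), 4))  # 0.4588
```

Passing `bias=0` returns the plain Jaccard value and `bias=1` returns the containment.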

 


 

Details:

<ComparisonResult similarity="0.458823529411765">

  <LeftTokens>

    <Token domain="Default" id="1" weight="2">a</Token>

    <Token domain="Default" id="4" weight="5">b</Token>

    <Token domain="Default" id="2" weight="3">c</Token>

  </LeftTokens>

  <RightTokens>

    <Token domain="Default" id="1" weight="2">a</Token>

    <Token domain="Default" id="2" weight="3">c</Token>

    <Token domain="Default" id="3" weight="7">d</Token>

  </RightTokens>

</ComparisonResult>
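As a cross-check, the reported similarity can be recomputed from the token weights in this output. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `token_weights` is hypothetical, and the bias of 0.8 is assumed because it does not appear in the XML itself.

```python
import xml.etree.ElementTree as ET

DETAILS = """
<ComparisonResult similarity="0.458823529411765">
  <LeftTokens>
    <Token domain="Default" id="1" weight="2">a</Token>
    <Token domain="Default" id="4" weight="5">b</Token>
    <Token domain="Default" id="2" weight="3">c</Token>
  </LeftTokens>
  <RightTokens>
    <Token domain="Default" id="1" weight="2">a</Token>
    <Token domain="Default" id="2" weight="3">c</Token>
    <Token domain="Default" id="3" weight="7">d</Token>
  </RightTokens>
</ComparisonResult>
"""


def token_weights(root, side):
    """Map each token's text to its weight for one side of the comparison."""
    return {tok.text: float(tok.get("weight")) for tok in root.find(side)}


root = ET.fromstring(DETAILS.strip())
left = token_weights(root, "LeftTokens")
right = token_weights(root, "RightTokens")

shared = sum(w for t, w in left.items() if t in right)  # intersection weight
union = sum({**left, **right}.values())                 # union weight
bias = 0.8  # assumed ContainmentBias, not stored in the XML

recomputed = bias * shared / sum(left.values()) + (1 - bias) * shared / union
reported = float(root.get("similarity"))
print(abs(recomputed - reported) < 1e-9)  # True
```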

 

 

If we run the comparison with ContainmentBias = 1, the (1 - B) * J term vanishes and the result is simply C: 0.5.

You can see that the closer the ContainmentBias is to 1, the more similar the sets appear. That’s because the ContainmentBias effectively controls the penalty for tokens in the right set that are not present in the left set: 1 means no penalty (the score is the containment, the highest value) and 0 means full penalty (the score is the plain Jaccard similarity, the lowest value).

Bias    Similarity
0       0.294
0.8     0.4588
1       0.5
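The effect of the bias is easier to see after rearranging the formula: B*C + (1-B)*J is the same as J + B*(C - J). Since the left set is always contained in the union, C is never smaller than J, so the score can only grow as the bias moves from 0 to 1. A minimal Python sketch with the values from this example (the function name `similarity` is hypothetical):

```python
def similarity(bias, containment=0.5, jaccard=5 / 17):
    """Rearranged form of bias * C + (1 - bias) * J."""
    return jaccard + bias * (containment - jaccard)


# Print one row per bias value, rounded to four decimals.
for bias in (0.0, 0.8, 1.0):
    print(f"{bias:<5} {similarity(bias):.4f}")
```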

 
