{Saket Choudhary}

Primer to LSH

The statistic is characterized by number of matches of length k between any two sequences and allowing upto mis matches. Sequences A and B are represented as respectively.

More generally is characterized by where: = Total length of nucleotides in sequence A = Total length of nucleotides in sequence B k = length of matches t = length of allowable mismatches

Jaccard Similarity

Jaccard similarity: Notion of similarity that associates similarity of sets with the relative size of their intersection A comman example would be determining if a document is a plagiarised copy of another, if not identical. The technique used to convert these documents to a set intersection problem is referred to as “shingling”

Jaccard similarity = $$ jSIM(S,T) = \frac{ S \cap T }{ S \cup T } $$

Let and then

Jaccard distance

and is a distance metric.

Cosine Similarity

As the name suggests, cosine similarity relies on taking cosines between two given vectors to infer the similarity. The range for cosine is [-1,1] When used in positive space it is [0,1]

Thinking in terms of document matching, a document can be represented as a vector of words, where the “words” form the unit vectorsand the associated scalar is the frequency of the word.

Cosine distance

but is not a metric since it does not follow the triangular inequality , neither does it follow the coincidence axiom since cosine(x,y)=0 if

where -1 means entirely opposite, 1 means ‘exactly similar’. In a positive space, such as when using the frequencies of words, the range is limited to

Angular similarity

For vectors in positive+negative space:

For vectors in positive space(only):

Angular difference coefficient

and is a distance metric

Tanimoto similarity

A computer program for Classifying plants describes Tanimoto similarity using bit arrays where each bit is a representative of the presence(1) or absence(0) of the plant.

Thus tanimato similarity is simply defined as the ratio of total number of common bits set to 1 to the total number of bits set to 1 in either(or both) of the vectors.

Tanimato similarity v/s Jacard similarity

A = [“mary”, “little”, “girl”] B = [“mary”, “had”, “a”, “little”, “lamb”, “which”, “bleat”]

Now just similar to the plants context for Tanimato similarity, our plants are: [“mary”, “had”, “a”, “little”, “lamb”, “which”, “bleat”, “girl”] and thus, the corresponding bit vectors are: A = [1,0,0,1,0,0,0,1] B = [1,1,1,1,1,1,1,0]

Also note, $$ A = 3 B = 7 A\cap B = 2 $$

So are in fact one and the same

In fact if jacard similarity were to be written for bit arrays:

where denominator is trivial.

More generally, tanimato or jacard simialruty is given by:

Tanimato distance

To quote wikipedia Wikipedia

This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.

Wrong definition:

Tanimato distance is often confused with Jacard distance and maybe wrongly defined as , thus leading to the wrong conclusion that tanimato distance is indeed a distance metric, when it is not.

A warmup problem

Taken from Chapter 3 of Ullman’s Mining Massive Datasets

Suppose we have a universal set U of n elements, and we
choose two subsets S and T at random, each with m of the n elements. What
is the expected value of the Jaccard similarity of S and T