# Primer to LSH

The statistic is characterized by number of matches of length k between any two sequences and allowing upto mis matches. Sequences A and B are represented as respectively.

More generally is characterized by where: = Total length of nucleotides in sequence A = Total length of nucleotides in sequence B k = length of matches t = length of allowable mismatches

## Jaccard Similarity

Jaccard similarity: Notion of similarity that associates similarity of sets with the relative size of their intersection A comman example would be determining if a document is a plagiarised copy of another, if not identical. The technique used to convert these documents to a set intersection problem is referred to as “shingling”

Jaccard similarity = $$ jSIM(S,T) = \frac{ | S \cap T | }{ | S \cup T | } $$ |

Let and then

#### Jaccard distance

and is a distance metric.

## Cosine Similarity

As the name suggests, cosine similarity relies on taking cosines between two given vectors to infer the similarity. The range for cosine is [-1,1] When used in positive space it is [0,1]

Thinking in terms of document matching, a document can be represented as a vector of words, where the “words” form the unit vectorsand the associated scalar is the frequency of the word.

### Cosine distance

but is not a metric since it does not follow the triangular inequality , neither does it follow the coincidence axiom since cosine(x,y)=0 if

where -1 means entirely opposite, 1 means ‘exactly similar’. In a positive space, such as when using the frequencies of words, the range is limited to

## Angular similarity

For vectors in positive+negative space:

For vectors in positive space(only):

### Angular difference coefficient

and *is* a distance metric

## Tanimoto similarity

A computer program for Classifying plants describes Tanimoto similarity using bit arrays where each bit is a representative of the presence(1) or absence(0) of the plant.

Thus tanimato similarity is simply defined as the ratio of total number of common bits set to 1 to the total number of bits set to 1 in either(or both) of the vectors.

## Tanimato similarity v/s Jacard similarity

A = [“mary”, “little”, “girl”] B = [“mary”, “had”, “a”, “little”, “lamb”, “which”, “bleat”]

Now just similar to the plants context for Tanimato similarity, our plants are: [“mary”, “had”, “a”, “little”, “lamb”, “which”, “bleat”, “girl”] and thus, the corresponding bit vectors are: A = [1,0,0,1,0,0,0,1] B = [1,1,1,1,1,1,1,0]

Also note, $$ | A | = 3 | B | = 7 | A\cap B | = 2 $$ |

So are in fact *one and the same*

In fact if jacard similarity were to be written for bit arrays:

where denominator is trivial.

More generally, tanimato or jacard simialruty is given by:

### Tanimato distance

To quote wikipedia Wikipedia

This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.

*Wrong* definition:

Tanimato distance is often confused with Jacard distance and maybe wrongly defined as , thus leading to the wrong conclusion that tanimato distance is indeed a distance metric, when it is not.

## A warmup problem

Taken from Chapter 3 of Ullman’s Mining Massive Datasets

Suppose we have a universal set U of n elements, and we

choose two subsets S and T at random, each with m of the n elements. What

is the expected value of the Jaccard similarity of S and T

### Solution:

Thus