# Primer to LSH

## $D_2$

The $D_2$ statistic counts the number of matches of length $k$ between two sequences $A$ and $B$, allowing up to $t$ mismatches. The sequences are written as $A = (A_1, A_2, A_3, \ldots, A_{n_A})$ and $B = (B_1, B_2, B_3, \ldots, B_{n_B})$ respectively.

More generally, $D_2$ is characterized by the tuple $(n_A, n_B, k, t, \eta)$ where:

- $n_A$ = total number of nucleotides in sequence $A$
- $n_B$ = total number of nucleotides in sequence $B$
- $k$ = length of the matches
- $t$ = maximum number of allowable mismatches
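As a concrete (if brute-force) illustration, the count can be sketched in Python; the function name and the quadratic scan are my own, assuming $D_2$ counts pairs of $k$-length words, one from each sequence, that differ in at most $t$ positions:

```python
def d2_statistic(a, b, k, t):
    """Count pairs of k-length words (one from A, one from B)
    that match with at most t mismatches."""
    count = 0
    for i in range(len(a) - k + 1):
        for j in range(len(b) - k + 1):
            mismatches = sum(x != y for x, y in zip(a[i:i+k], b[j:j+k]))
            if mismatches <= t:
                count += 1
    return count

# 2-mers of "ACGT": AC, CG, GT; 2-mers of "ACGA": AC, CG, GA
print(d2_statistic("ACGT", "ACGA", 2, 0))  # exact matches AC-AC and CG-CG -> 2
print(d2_statistic("ACGT", "ACGA", 2, 1))  # plus GT-GA (one mismatch) -> 3
```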

## Jaccard Similarity

Jaccard similarity is a notion of similarity that equates the similarity of two sets with the relative size of their intersection. A common application is determining whether a document is a plagiarised, if not identical, copy of another. The technique used to convert documents into a set-intersection problem is referred to as “shingling”.

Jaccard similarity: $$jSIM(S,T) = \frac{ |S \cap T| }{ |S \cup T| }$$

Let $S=\{1,2,3\}$ and $T=\{2,3,4,5,6\}$; then $jSIM(S,T)=\frac{|\{2,3\}|}{|\{1,2,3,4,5,6\}|}=\frac{2}{6}$.
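A minimal Python sketch of this computation (the function name and the empty-set convention are my own):

```python
def jaccard_similarity(s, t):
    """Jaccard similarity: |S ∩ T| / |S ∪ T|."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention: two empty sets are considered identical
    return len(s & t) / len(s | t)

S = {1, 2, 3}
T = {2, 3, 4, 5, 6}
print(jaccard_similarity(S, T))  # 2/6 = 0.333...
```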

#### Jaccard distance

Jaccard distance is defined as $D_J(A,B) = 1 - jSIM(A,B)$, and it is a proper distance metric.

## Cosine Similarity

As the name suggests, cosine similarity uses the cosine of the angle between two vectors to measure their similarity: $$S_C(A,B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$ The range of the cosine is $[-1,1]$; when used in a positive space it is $[0,1]$.

Thinking in terms of document matching, a document can be represented as a vector of words, where the “words” form the unit vectors and the associated scalar is the frequency of each word.
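This word-frequency representation can be sketched as follows (a minimal illustration; the function name and the example documents are my own):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

doc1 = Counter("the cat sat on the mat".split())
doc2 = Counter("the cat sat there".split())
print(cosine_similarity(doc1, doc2))  # partial overlap -> between 0 and 1
print(cosine_similarity(doc1, doc1))  # identical documents -> ~1.0
```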

### Cosine distance

$D_C(A,B)=1-S_C(A,B)$, but this is not a metric: it does not satisfy the triangle inequality $d(x,z) \leq d(x,y)+d(y,z)$, and it also fails the coincidence axiom $d(x,y)=0 \iff x=y$, since any two parallel vectors (e.g. $x$ and $2x$) have cosine distance $0$ without being equal.

$Range(S_C(A,B))=[-1,1]$, where $-1$ means entirely opposite and $1$ means ‘exactly similar’. In a positive space, such as when using the frequencies of words, the range is limited to $[0,1]$.
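The failure of the coincidence axiom can be checked numerically; the vectors below are chosen arbitrarily for illustration:

```python
from math import sqrt

def cosine_distance(a, b):
    """1 minus the cosine of the angle between dense vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1 - dot / norms

x = [1.0, 2.0]
y = [2.0, 4.0]  # y = 2x: a different vector, yet distance ~0
print(cosine_distance(x, y))  # ~0.0 although x != y
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 1.0
```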

## Angular similarity

For vectors in a positive+negative space:

$$aSIM(A,B) = 1 - \frac{\cos^{-1}(S_C(A,B))}{\pi}$$

For vectors in a positive space (only):

$$aSIM(A,B) = 1 - \frac{2\cos^{-1}(S_C(A,B))}{\pi}$$

### Angular difference coefficient

$D_A(A,B) = 1-aSIM(A,B)$ and is a distance metric
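A sketch of angular similarity in Python, assuming the standard normalization of the angle by $\pi$ (or $\pi/2$ when the vectors are restricted to a positive space); the function name and flag are my own:

```python
from math import acos, pi, sqrt

def angular_similarity(a, b, positive_only=False):
    """1 minus the normalized angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    cos = dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift outside [-1, 1]
    scale = pi / 2 if positive_only else pi
    return 1 - acos(cos) / scale

print(angular_similarity([1, 0], [0, 1], positive_only=True))  # orthogonal -> 0.0
print(angular_similarity([1, 0], [1, 0]))                      # identical  -> 1.0
```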

## Tanimoto similarity

*A Computer Program for Classifying Plants* (Rogers and Tanimoto, 1960) describes Tanimoto similarity using bit arrays, where each bit represents the presence (1) or absence (0) of an attribute in a plant.

Thus Tanimoto similarity is simply defined as the ratio of the number of bits set to 1 in both vectors to the number of bits set to 1 in either (or both) of the vectors.

## Tanimoto similarity vs Jaccard similarity

A = ["mary", "little", "girl"]
B = ["mary", "had", "a", "little", "lamb", "which", "bleat"]

Now, just as in the plant-classification context for Tanimoto similarity, our attributes are: ["mary", "had", "a", "little", "lamb", "which", "bleat", "girl"], and thus the corresponding bit vectors are:

A = [1, 0, 0, 1, 0, 0, 0, 1]
B = [1, 1, 1, 1, 1, 1, 1, 0]

Also note that $|A| = 3$, $|B| = 7$, and $|A\cap B| = 2$, so $tSIM(A,B) = \frac{2}{3+7-2} = \frac{2}{8} = jSIM(A,B)$.

So $jSIM$ and $tSIM$ are in fact one and the same on bit vectors.
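The equivalence can be checked directly on the example above (function name is my own):

```python
def tanimoto_bits(a, b):
    """Common 1-bits over 1-bits set in either vector."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either

A = [1, 0, 0, 1, 0, 0, 0, 1]  # {"mary", "little", "girl"}
B = [1, 1, 1, 1, 1, 1, 1, 0]  # {"mary", "had", "a", "little", "lamb", "which", "bleat"}
print(tanimoto_bits(A, B))  # 2/8 = 0.25

# Jaccard on the underlying sets gives the same value:
sA = {"mary", "little", "girl"}
sB = {"mary", "had", "a", "little", "lamb", "which", "bleat"}
print(len(sA & sB) / len(sA | sB))  # 0.25
```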

In fact, if Jaccard similarity were written for bit arrays:

$$jSIM(A,B) = \frac{A \cdot B}{|A| + |B| - A \cdot B}$$

where $A \cdot B = \sum_i A_i B_i = |A \cap B|$, and the denominator is just $|A \cup B|$.

More generally, for real-valued vectors, Tanimoto (or Jaccard) similarity is given by:

$$tSIM(A,B) = \frac{A \cdot B}{\lVert A \rVert^2 + \lVert B \rVert^2 - A \cdot B}$$
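A sketch of this generalized ratio, assuming the standard dot-product form $\frac{A \cdot B}{\lVert A\rVert^2 + \lVert B\rVert^2 - A \cdot B}$ (function name is my own):

```python
def tanimoto(a, b):
    """Tanimoto ratio: A·B / (|A|^2 + |B|^2 - A·B)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

# On 0/1 vectors this reduces to the bit-array ratio defined earlier:
print(tanimoto([1, 0, 0, 1, 0, 0, 0, 1], [1, 1, 1, 1, 1, 1, 1, 0]))  # 2/(3+7-2) = 0.25
```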

### Tanimoto distance

Tanimoto's distance is defined as $D_T(A,B) = -\log_2(tSIM(A,B))$. To quote Wikipedia:

> This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.

Wrong definition:

Tanimoto distance is often confused with Jaccard distance and may be wrongly defined as $D_T(A,B)=1-tSIM(A,B)$, leading to the incorrect conclusion that Tanimoto distance is indeed a distance metric, when it is not.

## A warmup problem

Suppose we have a universal set $U$ of $n$ elements, and we choose two subsets $S$ and $T$ at random, each with $m$ of the $n$ elements. What is the expected value of the Jaccard similarity of $S$ and $T$?

### Solution:

If $|S \cap T| = k$, then $|S \cup T| = 2m - k$, so $jSIM(S,T) = \frac{k}{2m-k}$. Fixing $S$, the number of ways to choose a $T$ sharing exactly $k$ elements with it is ${m \choose k}{n-m \choose m-k}$, out of ${n \choose m}$ choices in total.

Thus $E(jSIM(S,T)) = \sum_k jSIM_k \cdot P(|S\cap T| = k) = \sum_{k=1}^{m} \frac{k}{2m-k} \cdot \frac{ {m \choose k} {n-m \choose m-k} }{ {n \choose m} }$
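The formula can be sanity-checked against a simulation; the sketch below (function names are my own) compares the closed form with a Monte Carlo estimate:

```python
import random
from math import comb

def expected_jaccard_exact(n, m):
    """Sum of (k / (2m - k)) * P(|S ∩ T| = k), with k hypergeometric."""
    return sum(
        (k / (2 * m - k)) * comb(m, k) * comb(n - m, m - k) / comb(n, m)
        for k in range(max(0, 2 * m - n), m + 1)
    )

def expected_jaccard_mc(n, m, trials=100_000, seed=0):
    """Monte Carlo estimate: average Jaccard similarity over random m-subset pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = set(rng.sample(range(n), m))
        t = set(rng.sample(range(n), m))
        total += len(s & t) / len(s | t)
    return total / trials

print(expected_jaccard_exact(10, 4))  # closed form
print(expected_jaccard_mc(10, 4))     # simulation, should agree to a few decimals
```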