# Mmd | Week2

## Set Simialarity

- Many data mining problems boil down to finding similar sets

## Document Simialarity

- Document => Shingling(Set of strings of length k) => Minhashing(Signtaures, short integeres representing words) => LSH
- shingles: K-shingle = Sequence of k characters that appears in the document
- k = 2, abcab = {ab, bc, ca}i
- Changing a word affects only k shingles within k distance from the word
- Reordering paragraphs also would affect the 2k si

## Simarilty

- Jaccard similarity: Sim(C1,C2) =$$\frac{C1 \cap C2}{C1 \cup C2 }
- Rows =elements of universal set(set of all k-shingles)
- Columns=setssi
- 1 in (i,j) if i is a member of j
- Column similarity = Jaccard similarity of sets of rows with entry 1
- E.g C1=[0,1,1,1,1,0] C2=[1,0,0,1,1,1]. sim(C1,C2)=2/5
- Type of rows: row C1 C2 a 1 1 b 1 0 c 0 1 d 0 0
- Sim (C1,C2)= a/a+b+c

## Minhashing

- Minhash = h(C) = Number of first row (among permuted rows) such that Column ‘C’ has 1
- Signaure of a column = Use several hash functions to create a signature for each column
- Signaure Matrix: Rows represent min hash values of columns represented

## Minhashing example

3 1 0 0 1 2 0 1 1 0 1 1 0 0 0 4 0 1 1 1

Think! Minhash 1 2 2 3

- Property: h(C1)=h(C2) = Sim(C1,C2) probability wise = a/a+b+c
- Proof for above is trivial(based on the type of rows we assigned)
- Similairty of signaures = fraction of minhash functions in which they agree

## Problems with minhashing:

- Hard to pick random permutation of 1 billion+ rows
- Representing permutations rquires 1 billoion extra rows

## Aliter Implementation of minhashing

- Pick 100 hash functions
- M(i,C) = Slot for ith hash function for Cth column
- M(i,C) =min() for whcih column C has 1 in row r
- is not necessarily a permutation. h_i can maps two or more rows to same values. But if we increase the number of buckets M theye probability of collision can be ignored
- TODO: Example?

## Locality Sensitive Hashing

- Select small list of candidate pairs for evaluating similiarity
- Hash columns to many buckets. Elements of same buckets for pair
- Pick similiarty threshold t <1. We want the columns c1 and c2 of signature matrix M to be a candiate pair iff their singatures agree in atleast fraction t of the rows
- Divide Matrix M into b bands of rows
- For each band, hash its portion of each column to a hash table with k buckets making k large
- Candidate pairs: One those hash to the same bucketi
- ToDO: Example for comparing probabilites of false positives

## Entity Resolution

- Entity resolution: Examine set of records to determine if they belong to same “entity” and merge

## Fingerprint Comparison

- Represent fingerprint by set of positions of minutiae
- Minutiae: Features of fingerprint such as points where ridges meet
- Place a grid on fingerprint space, normalize
- Set of grid squares where minituae are located represents the fingerprint