# Mmd | Week2

## Set Simialarity

• Many data mining problems boil down to finding similar sets

## Document Simialarity

• Document => Shingling(Set of strings of length k) => Minhashing(Signtaures, short integeres representing words) => LSH
• shingles: K-shingle = Sequence of k characters that appears in the document
• k = 2, abcab = {ab, bc, ca}i
• Changing a word affects only k shingles within k distance from the word
• Reordering paragraphs also would affect the 2k si

## Simarilty

• Jaccard similarity: Sim(C1,C2) =\frac{C1 \cap C2}{C1 \cup C2 }
• Rows =elements of universal set(set of all k-shingles)
• Columns=setssi
• 1 in (i,j) if i is a member of j
• Column similarity = Jaccard similarity of sets of rows with entry 1
• E.g C1=[0,1,1,1,1,0] C2=[1,0,0,1,1,1]. sim(C1,C2)=2/5
• Type of rows: row C1 C2 a 1 1 b 1 0 c 0 1 d 0 0
• Sim (C1,C2)= a/a+b+c

## Minhashing

• Minhash = h(C) = Number of first row (among permuted rows) such that Column ‘C’ has 1
• Signaure of a column = Use several hash functions to create a signature for each column
• Signaure Matrix: Rows represent min hash values of columns represented

## Minhashing example

3 1 0 0 1 2 0 1 1 0 1 1 0 0 0 4 0 1 1 1

Think! Minhash 1 2 2 3

• Property: h(C1)=h(C2) = Sim(C1,C2) probability wise = a/a+b+c
• Proof for above is trivial(based on the type of rows we assigned)
• Similairty of signaures = fraction of minhash functions in which they agree

## Problems with minhashing:

• Hard to pick random permutation of 1 billion+ rows
• Representing permutations rquires 1 billoion extra rows

## Aliter Implementation of minhashing

• Pick 100 hash functions
• M(i,C) = Slot for ith hash function for Cth column
• M(i,C) =min($h_i(r)$) for whcih column C has 1 in row r
• $h_i$ is not necessarily a permutation. h_i can maps two or more rows to same values. But if we increase the number of buckets M theye probability of collision can be ignored
• TODO: Example?

## Locality Sensitive Hashing

• Select small list of candidate pairs for evaluating similiarity
• Hash columns to many buckets. Elements of same buckets for pair
• Pick similiarty threshold t <1. We want the columns c1 and c2 of signature matrix M to be a candiate pair iff their singatures agree in atleast fraction t of the rows
• Divide Matrix M into b bands of $k$ rows
• For each band, hash its portion of each column to a hash table with k buckets making k large
• Candidate pairs: One those hash to the same bucketi
• ToDO: Example for comparing probabilites of false positives

## Entity Resolution

• Entity resolution: Examine set of records to determine if they belong to same “entity” and merge

## Fingerprint Comparison

• Represent fingerprint by set of positions of minutiae
• Minutiae: Features of fingerprint such as points where ridges meet
• Place a grid on fingerprint space, normalize
• Set of grid squares where minituae are located represents the fingerprint