{Saket Choudhary}

Mmd | Week2

Set Simialarity

  • Many data mining problems boil down to finding similar sets

Document Simialarity

  • Document => Shingling(Set of strings of length k) => Minhashing(Signtaures, short integeres representing words) => LSH
  • shingles: K-shingle = Sequence of k characters that appears in the document
  • k = 2, abcab = {ab, bc, ca}i
  • Changing a word affects only k shingles within k distance from the word
  • Reordering paragraphs also would affect the 2k si


  • Jaccard similarity: Sim(C1,C2) =$$\frac{C1 \cap C2}{C1 \cup C2 }
  • Rows =elements of universal set(set of all k-shingles)
  • Columns=setssi
  • 1 in (i,j) if i is a member of j
  • Column similarity = Jaccard similarity of sets of rows with entry 1
  • E.g C1=[0,1,1,1,1,0] C2=[1,0,0,1,1,1]. sim(C1,C2)=2/5
  • Type of rows: row C1 C2 a 1 1 b 1 0 c 0 1 d 0 0
  • Sim (C1,C2)= a/a+b+c


  • Minhash = h(C) = Number of first row (among permuted rows) such that Column ‘C’ has 1
  • Signaure of a column = Use several hash functions to create a signature for each column
  • Signaure Matrix: Rows represent min hash values of columns represented

Minhashing example

3 1 0 0 1 2 0 1 1 0 1 1 0 0 0 4 0 1 1 1

Think! Minhash 1 2 2 3

  • Property: h(C1)=h(C2) = Sim(C1,C2) probability wise = a/a+b+c
  • Proof for above is trivial(based on the type of rows we assigned)
  • Similairty of signaures = fraction of minhash functions in which they agree

Problems with minhashing:

  • Hard to pick random permutation of 1 billion+ rows
  • Representing permutations rquires 1 billoion extra rows

Aliter Implementation of minhashing

  • Pick 100 hash functions
  • M(i,C) = Slot for ith hash function for Cth column
  • M(i,C) =min() for whcih column C has 1 in row r
  • is not necessarily a permutation. h_i can maps two or more rows to same values. But if we increase the number of buckets M theye probability of collision can be ignored
  • TODO: Example?

Locality Sensitive Hashing

  • Select small list of candidate pairs for evaluating similiarity
  • Hash columns to many buckets. Elements of same buckets for pair
  • Pick similiarty threshold t <1. We want the columns c1 and c2 of signature matrix M to be a candiate pair iff their singatures agree in atleast fraction t of the rows
  • Divide Matrix M into b bands of rows
  • For each band, hash its portion of each column to a hash table with k buckets making k large
  • Candidate pairs: One those hash to the same bucketi
  • ToDO: Example for comparing probabilites of false positives

Entity Resolution

  • Entity resolution: Examine set of records to determine if they belong to same “entity” and merge

Fingerprint Comparison

  • Represent fingerprint by set of positions of minutiae
  • Minutiae: Features of fingerprint such as points where ridges meet
  • Place a grid on fingerprint space, normalize
  • Set of grid squares where minituae are located represents the fingerprint