CSO Programming Assignments- Fall 2015

Introduction

In many scenarios, we would like to know how ``similar'' two documents are are to each other. For example, many pages on the web are similar but not entirely identical to each other. Search engines like Google or Bing need to group similar web pages together such that only one among a group of similar documents is displayed in the search result. We refer to this process of measuring document similarity as approximate matching. In this lab, you will write a program to approximately match one input file against another file. The goal is to improve your skills in programming using C, e.g. manipulating arrays, pointers, number and character representation, bit operations etc.

Obtaining the lab code

As usual, do a git pull in your cso-labs directory. This lab's files are located in the rklab/ subdirectory. The files you will be modifying are rkmatch.c and bloom.c.

Part I: Exact Matching

We start by solving the simpler problem of checking whether a document is an exact duplicate of another. However, we will do something slightly more sophisticated by matching documents' raw contents character-by-character. In particular, we ``normalize'' (i.e. clean up) a document by doing the following:

Convert all upper case letters to lower case ones.
Convert different white space characters (e.g. carriage return, tab,...) into the space character,
Shrink any sequence of two or more space characters to exactly one space.
Remove any whitespace at the beginning and end of the text.

As an example, consider a string "I⋄am⋄A⋄⋄⋄⋄Dog" where ⋄ is some white space character such as spaces, tabs, newlines etc. the normalized content whould be ``i am a dog''. We consider document X and Y to be an exact match if their corresponding normalized contents are identical.

Your job: Implement the perfect matching algorithm. The rkmatch.c file already contains a code skeleton with a number of helper procedures. Read rkmatch.c and make sure you understand the basic structure of the program. To implement the exact matching algorithm, you need to complete the procedures exact_match and normalize.

In rkmatch.c, the `main` procedure first invokes normalize to normalize the content of files X and Y. It then invokes exact_match. The invocation passes in---as part of the procedure arguments---the pointer to X's content (const char *qdoc), X's length, the pointer to Y's content (const char *doc), and Y's length. If X and Y are an exact match, the procedure returns 1, otherwise, it returns 0.

Please read all of our scaffolding comments before coding.

Testing: Run the given tester program ./rktest.py -t exact

Simple Approximate Matching

Determining similarity is a much trickier business than performing exact matching. Let us consider a naive algorithm that measures how well document X (of size m) approximately-matches another document Y (of size n). The algorithm considers every substring of length k in X and checks if that substring appears in document Y. For example, if the content of X is "abcd" and k=2, then the algorithm tries to match 3 substrings ("ab", "bc", "cd") in Y. Suppose the algorithm counts ctr number of matched substrings. Since there are total m-k+1 substrings to match, the fraction of matched substrings is ctr/(m-k+1). We can use the calculated fraction as the approximate match score. Intuitively, the more similar file X is to Y, the higher its final score. In particular, if file X is identical to Y, the final score would be 1.0.

Our naive algorithm uses a subroutine to determine whether a shorter string (called the query string) appears as a substring in some other longer string (called the destination string). For the naive algorithm, the query string is a k character-long substring from X and the destination string is the normalized content of Y. The naive way to check for substring match to compare the query string with every substring of length k in Y starting at position 0,1,2,...,(n-k). Thus, each query string takes O(k*n) time. Since there are a total of m-k+1 substrings of X to be matched in Y, the total runtime of our approximate matching algorithm would be O(k*m*n). This runtime is pretty bad and we will ask you to implement a faster version of this algorithm that we refer to as simple approximate matching.

Simple Approximate Matching

We observe that it is not necessary to do substring matching for all m-k+1 substrings of X, as is done in the naive algorithm. Rather, we can "chop" X into m/k non-overlapping substrings (called chunks) and try to match each chunk in Y. For example, if the content of X is "abcd" and k=2, the optimized algorithm only matches 2 chunks ("ab", "cd") instead of 3 substrings as in the original naive algorithm. Doing so cuts down the runtime by a factor of k to O(m*n) Note that this simple algorithm does not produce the exact same score as the naive algorithm in all scenarios, but the resulting score is close enough for practical purposes. We refer to this version as the simple algorithm.

Your job: Implement the simple approximate matching algorithm. You only need to add your code to rkmatch.c. Specifically, you need to complete the procedures simple_substr_match.

In rkmatch.c, the main procedure considers each chunk of X in turn and invokes simple_substr_match to find a match in file Y. The invocation passes in---as part of the procedure arguments---the pointer to the chunk, chunk's length (k), the pointer to Y's content and Y's length. If a match is found, the procedure returns 1, otherwise, it returns 0. (If a chunk of X appears many times in Y, the return value should still be 1.)

Please read all of our scaffolding comments before coding.

Testing: Run the given tester program ./rktest.py -t simple

Rabin-Karp Approximate Matching

Our next optimization comes from using Rabin-Karp substring matching algorithm (RK for short), invented in the eighties by two renowned computer scientists, Michael Rabin and Richard Karp.

RK checks if a given query string P appears as a substring in Y. RK uses the idea of hashing: A hash function turns a string of arbitary length into a b-bit hash value with the property that collision (two different strings having the same hash value) is unlikely. At a high level, RK works by computing a hash for the query string, hash(P), as well as a hash for each of the n-k+1 substrings of length k in Y, hash(Y[0...k-1]), hash(Y[1...k]), ..., hash(Y[n-k...n-1]), where Y[0...k-1] is the first k characters of Y and Y[1...k] is the second k characters of Y and so on. By comparing hash(P) with each of the n-k+1 hashes from Y, we can determine if P appears as a substring in Y. There are many nice hash functions out there (such as MD5, SHA-1), but unfortunately, none of them would make RK any faster since it takes O(k) time to compute each of the n-k+1 hashes from Y.

RK's magical ingredient is its invention of a ``rolling'' hash function. Specifically, given hash(Y[i...i+k-1]), RK takes only constant time instead of O(k) time to compute hash(Y[i+1...i+k]).

We first describe how RK hashes a string. Let's treat each byte as a digit in base-d notation, therefore, we are using base d=256. For example, the string 'ab' corresponds to two digits with one being 97 (the ASCII value of 'a'), and the other being 98 (the ASCII value of 'b'). The decimal value of 'ab' in base-256 can be calculated as 256*97+98 = 24930. The hash of a string P in RK is hash(P[0...k-1]) = 256^k-1*P[0]+256^k-2*P[1]+...+256¹*P[k-2]+P[k-1].

We now see how to do a rolling calculation of the hash values for consecutive substrings of Y. Let y_i=hash(Y[i...i+k-1]). We can compute y_i+1 from y_i in constant time, by observing that

y_i+1 = 256^k-1*Y[i+1] + 256^k-2*Y[i+2] + ... + Y[i+k]
     = 256 * ( 256^k-2*Y[i+1] + ... Y[i+k-1]) + Y[i+k]
     = 256 * ((256^k-1*Y[i] + 256^k-2*Y[i+1] + ... Y[i+k-1]) - 256^k-1*Y[i]) + Y[i+k]
     = 256 * ( y_i - 256^k-1 * Y[i]) + Y[i+k]

In order to achieve constant time calculation, we have to remember the value of 256^k-1 in a variable instead of re-computing it each time.

Now we've seen how rolling hash works. The only fly in the ointment is that these base-256 hash values are too huge to work with efficiently and conveniently. Therefore, we perform all the computation in modulo q, where q is chosen to be a large prime (food for thought: why a prime instead of non-prime?). Hash collisions are infrequent, but still possible. Therefore once we detect some y_i = hash(P), we should compare the actual strings Y[i...i+k-1] and P[0...k-1] to see if they are indeed identical.

Since RK speeds up substring matching to O(n) (assuming hash collusion is unlikely) instead of O(n*k) as in the simple algorithm. However, we still need to run RK m/k times for each of the m/k chunks of X to be matched in Y. Thus, our approximate matching algorithm using RK has an overall runtime of O((m/k)*n).

Your job: Implement the RK substring matching algorithm by completing the rabin_karp_match procedure. When calculating the hash values, you should use the given modulo arithmatic functions, madd, mdel, mmul.

As with simple_substr_match, the main procedure will invoke rabin_karp_match for each chunk of X to be matched. rabin_karp_match has the same interface as simple_match and should return 1 if the chunk appears as a substring in Y or 0 if otherwise.

Read the code comments carefully to guide your implementation.

Testing: Run the given tester program ./rktest.py -t rk

RK Approximate Matching with a Bloom Filter

Our RK-based approximate matching algorithm has a runtime of O((m/k)*n). Now we will boost its speed further by using a Bloom filter.

A Bloom filter is a bitmap of h bits initialized to zeros in the beginning. We insert all m/k RK hash values of X that we want to match into the ``filter''.

To insert a value v into the bitmap, we use f hash functions to map v into f positions in the bitmap and set each position to be one. For example, starting with a 10-bit bitmap and f=2, if v is mapped to positions 1,5 and v' is mapped to 3,9, the bitmap after inserting v,v' would be 0101010001. There are many reasonable conventions with which we can use a byte array to represent a bitmap. In this lab, we expect you follow one specific convention and interpret bits in the Big-Endian convention within a byte. For example, we use a 2 byte array to represent a 10-bit bitmap. The first byte represents the first 8-bit of the bitmap. And the most-significant-bit of the first byte represents the first bit of the bitmap. The most-significant-bit of the second byte represents the 9-th bit of the bitmap. Suppose the bitmap is 0101010001, then the array content is {0x54, 0x40}.

After we have inserted all m/k hash values of X, we proceed to calculate every y_i in Y and check whether it *may* match any of the elements in the filter. This is done by mapping each y_i into f positions using the same f hash functions and checking whether *all* f positions contain value 1. If so, y_i is a probable match. We say the y_i's match is probable because Bloom filter incurs false positives in that y_i may be considered to equal to some of the m/k hash values even though it is not. Note that the correctness of this approach relies on the important property of Bloom filter that it never incurs false negatives. Thus, to confirm that y_i is a real match, we check whether Y[i...i+k-1] is indeed identical to any of the X[0...k-1]_,_X[k...2k-1]... strings.

Using a Bloom filter, our enhanced algorithm has a total runtime of O(m+n) (assuming there are not too many false positives), significantly faster than our previous two versions of approximate matching!

You should complete this part of the lab in two steps. First, implement the bloom filter functions.

Your job: First implement the Bloom filter functions by implementing the bloom_init, bloom_query, and bloom_add in the source file bloom.c.

To help you implement bloom_add and bloom_query, we provide you with a particular hash function implementation for Bloom filter, int hash_i(int i, long long x). This function will hash a given Rabin-Karp hash value x into an int value according to the i-th hash function. The number of hash functions that you should use is given by BLOOM_HASH_NUM, a global constant defined in file bloom.c.

Please read all of our scaffolding comments before coding.

Testing: After you are done with the bloom filter implementation, test its correctness with our test script, using the command ./rktest.py -t bloom The test script invokes bloom_test (see its implementation in bloom_test.c) to test your bloom filter implementation and it compares your output to the correct answer.

Next, implement the RK algorithm using the Bloom filter functions that you just implemeneted.

Your job:Complete the rabin_karp_batchmatch procedure in rkmatch.c. In the template code, we tell you the size of the bitmap to use (in bits) as a procedure argument (bsz). In the rabin_karp_batchmatch procedure, you should invoke the auxilary function bloom_init to allocate and initialize a byte array (i.e. character array) that will pack bsz bits for use as the bloom filter's bitmap. You should then compute all m/k RK hash values corresponding to X's chunks and insert them into the Bloom filter using bloom_add. Subsequently, you scan Y's content and compute a RK value for each of its n-k+1 substrings and query its presence in the Bloom filter using bloom_query. If bloom_query indicates a potential match, you proceed to check that the corresponding substring of Y indeed matches one of the m/k chunks of X.

Keep in mind that unlike simple_substr_match or rabin_karp_match, returning 1 or 0 is no longer sufficient; rabin_karp_batchmatch is invoked only once and therefore must return the total number of chunks of X that have appeared in Y.

Please read all of our scaffolding comments before coding.

Testing: Run the given tester program ./rktest.py -t rkbatch to test your batch Rabin-Karp method. Once you've finished all parts of the lab, you should run the full suite of tests by running the test program without arguments, ./rktest.py.

Style

We will grade your coding style in this lab, please read this style guide carefully and follow the advice.

Evaluation

Your score will be computed out of a maximum of 60 points based on the following distribution:

Correctness points (50 total), each version of the algorithm carries 10 points.
Style points (10 total). We reserve 5 points for a subjective evaluation of the style of your solutions and your commenting. Your solutions should be as clean and straightforward as possible. Your comments should be informative, but they need not be extensive.

Note while the testing script will give you a good idea of your final lab grade, its score does NOT constitute an assurance. In particular, just because they pass the tests does not necessarily mean that they implemented everything perfectly.

Hand-in

Run the following command to generate lab2.tar.gz:

make handin

Submit here

RK Lab: Approximate document matching Due: 10/11 11:59pm

Introduction

Obtaining the lab code

Part I: Exact Matching

Simple Approximate Matching

Simple Approximate Matching

Rabin-Karp Approximate Matching

RK Approximate Matching with a Bloom Filter

Style

Evaluation

Hand-in

RK Lab: Approximate document matching
Due: 10/11 11:59pm