Due: 10/11 11:59pm

- Convert all upper case letters to lower case ones.
- Convert different whitespace characters (e.g. carriage return, tab, ...) into the space character.
- Shrink any sequence of two or more space characters to exactly one space.
- Remove any whitespace at the beginning and end of the text.
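The four rules above can be sketched in a single in-place pass (the function name and signature here are illustrative; the lab's `normalize` may differ):

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch of the normalization rules above. Normalizes `buf`
 * in place and returns the new length. */
size_t normalize_sketch(char *buf, size_t len)
{
    size_t out = 0;
    int pending_space = 0;           /* saw whitespace since the last word? */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)buf[i];
        if (isspace(c)) {
            pending_space = 1;       /* collapse runs of whitespace */
        } else {
            if (pending_space && out > 0)
                buf[out++] = ' ';    /* exactly one space between words;
                                        out == 0 also trims leading space */
            buf[out++] = (char)tolower(c);  /* upper case -> lower case */
            pending_space = 0;
        }
    }
    buf[out] = '\0';                 /* trailing whitespace is dropped */
    return out;
}
```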

In `rkmatch.c`, the `main` procedure first invokes `normalize` to
normalize the content of files X and Y. It then invokes `exact_match`. The invocation
passes in---as part of the procedure arguments---the pointer to X's content (`const char *qdoc`), X's length, the pointer to Y's content (`const char *doc`), and Y's length.
If X and Y are an exact match, the procedure returns 1, otherwise, it returns 0.
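The check itself is just a length comparison followed by a byte comparison; a minimal sketch (the name is hypothetical, following the interface described above):

```c
#include <string.h>

/* Sketch of the exact-match check: X and Y match exactly iff they have
 * the same length and identical bytes. */
int exact_match_sketch(const char *qdoc, int qlen, const char *doc, int len)
{
    return (qlen == len) && (memcmp(qdoc, doc, (size_t)qlen) == 0);
}
```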

Please read all of our scaffolding comments before coding.

**Testing:** Run the given tester program `./rktest.py -t exact`

Determining similarity is a much trickier business than performing exact matching.
Let us consider a naive algorithm that measures how
well document X (of size *m*) approximately-matches another document Y (of
size *n*). The algorithm considers every substring of length *k* in X and
checks if that substring appears in document Y. For example, if the content of
X is "abcd" and *k=2*, then the algorithm tries to match *3* substrings
("ab", "bc", "cd") in Y. Suppose the algorithm counts *ctr* number of matched
substrings. Since there are total *m-k+1* substrings to match, the
fraction of matched substrings is *ctr/(m-k+1)*. We can use the
calculated fraction as the approximate match score. Intuitively, the more
similar file X is to Y, the higher its final score. In particular, if file X
is identical to Y, the final score would be *1.0*.

Our naive algorithm uses a subroutine to determine whether a shorter
string (called the query string) appears as a substring in some other longer
string (called the destination string). For the naive algorithm, the query string is a *k*
character-long substring from X and the destination string is the normalized content of Y.
The naive way to check for a substring match is to compare the query string with
**every** substring of length *k* in Y, starting at positions *0,1,2,...,(n-k)*.
Thus, each query string takes *O(kn)* time. Since there are a
total of *m-k+1* substrings of X to be matched in Y, the total runtime of our
approximate matching algorithm would be *O(kmn)*. This runtime is
pretty bad and we will ask you to implement a faster version of this algorithm
that we refer to as simple approximate matching.

We observe that it is not necessary to do substring
matching for all *m-k+1* substrings of X, as is done in the naive algorithm. Rather, we can "chop" X
into m/k non-overlapping substrings (called chunks)
and try to match each chunk in Y.
For example, if the content of X is "abcd" and k=2, the optimized algorithm
only matches 2 chunks ("ab", "cd") instead of 3 substrings as in the original naive algorithm.
Doing so cuts down the runtime by a factor of k to *O(mn)*. Note that this
simple algorithm does not produce the exact same score as the naive algorithm
in all scenarios, but the resulting score is close enough for practical
purposes. We refer to this version as the simple algorithm.
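The subroutine of the simple algorithm can be sketched as a sliding `memcmp` over Y (the name and signature here are illustrative):

```c
#include <string.h>

/* Sketch of the simple algorithm's subroutine: slide the k-byte chunk
 * across Y, comparing at each of the n-k+1 offsets (O(k*n) per chunk). */
int simple_substr_match_sketch(const char *chunk, int k,
                               const char *doc, int n)
{
    for (int i = 0; i + k <= n; i++)
        if (memcmp(chunk, doc + i, (size_t)k) == 0)
            return 1;        /* one occurrence is enough */
    return 0;
}
```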

In `rkmatch.c`, the `main` procedure considers each chunk of X in turn and invokes
`simple_substr_match` to find a match in file Y. The invocation passes in---as part of
the procedure arguments---the pointer to the chunk, the chunk's length (k),
the pointer to Y's content and Y's length. If a match is found, the
procedure returns 1, otherwise, it returns 0. (If a chunk of X appears many
times in Y, the return value should still be 1.)

Please read all of our scaffolding comments before coding.

**Testing:** Run the given tester program `./rktest.py -t simple`

Our next optimization comes from using the Rabin-Karp substring matching algorithm (RK for short), invented in the eighties by two renowned computer scientists, Michael Rabin and Richard Karp.

RK checks if a given query string P appears as a substring in Y. RK uses the idea of hashing: a hash function turns a string of arbitrary length into a b-bit hash value with the property that collisions (two different strings having the same hash value) are unlikely.

At a high level, RK works by computing a hash for the query string, hash(P), as well as a hash for each of the n-k+1 substrings of length k in Y: hash(Y[0...k-1]), hash(Y[1...k]), ..., hash(Y[n-k...n-1]), where Y[0...k-1] is the first k characters of Y, Y[1...k] is the k characters starting at position 1, and so on. By comparing hash(P) with each of the n-k+1 hashes from Y, we can determine if P appears as a substring in Y. There are many nice hash functions out there (such as MD5, SHA-1), but unfortunately, none of them would make this approach any faster, since it still takes O(k) time to compute each of the n-k+1 hashes from Y.

RK's magical ingredient is its invention of a "rolling" hash function. Specifically, given hash(Y[i...i+k-1]), RK takes only constant time instead of O(k) time to compute hash(Y[i+1...i+k]).

We first describe how RK hashes a string.
Let's treat each byte as a digit in base-d notation; since a byte can take 256 distinct values, we use base d=256.
For example, the string 'ab' corresponds to two digits with one being 97 (the
ASCII value of 'a'), and the other being 98 (the ASCII value of 'b'). The
decimal value of 'ab' in base-256 can be calculated as 256*97+98 = 24930. The
hash of a string P in RK is hash(P[0...k-1]) =
256^{k-1}*P[0]+256^{k-2}*P[1]+...+256^{1}*P[k-2]+P[k-1].
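This hash can be computed with Horner's rule, multiplying by 256 and adding the next byte at each step. The sketch below reduces modulo a prime q so intermediate values stay small (q here is an illustrative stand-in for the lab's modulus):

```c
/* Horner-rule evaluation of the base-256 hash above, reduced mod q.
 * q is an illustrative prime; the lab's modular helpers play this role. */
long long rk_hash_sketch(const char *p, int k, long long q)
{
    long long h = 0;
    for (int i = 0; i < k; i++)
        h = (h * 256 + (unsigned char)p[i]) % q;
    return h;
}
```

For example, with a large enough q, hashing "ab" yields 256*97+98 = 24930, matching the calculation above.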

We now see how to
do a rolling calculation of the hash values for consecutive substrings of Y. Let
y_{i}=hash(Y[i...i+k-1]). We can
compute y_{i+1} from y_{i} in constant time, by observing that

y_{i+1} = 256^{k-1}*Y[i+1] + 256^{k-2}*Y[i+2] + ... + Y[i+k]
        = 256 * (256^{k-2}*Y[i+1] + ... + Y[i+k-1]) + Y[i+k]
        = 256 * ((256^{k-1}*Y[i] + 256^{k-2}*Y[i+1] + ... + Y[i+k-1]) - 256^{k-1}*Y[i]) + Y[i+k]
        = 256 * (y_{i} - 256^{k-1}*Y[i]) + Y[i+k]

In order to achieve this constant-time calculation, we have to remember the value of 256^{k-1}.

Now we've seen how rolling hash works. The only fly in the ointment is that these base-256 hash values are too
huge to work with efficiently and conveniently. Therefore, we perform
all the computation in modulo q, where q is chosen to be a
large prime (food for thought: why a prime instead of non-prime?).
Hash collisions are infrequent, but still possible. Therefore
once we detect some y_{i} = hash(P), we should compare the actual strings
Y[i...i+k-1] and P[0...k-1] to see if they are indeed identical.
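Putting the rolling update and the verification step together gives a sketch like the following. Plain `%` arithmetic on a small illustrative prime q stands in here; the real implementation should use the lab's overflow-safe modular helpers:

```c
#include <string.h>

/* Sketch of RK matching using the rolling update derived above. */
int rabin_karp_match_sketch(const char *p, int k,
                            const char *doc, int n, long long q)
{
    if (k <= 0 || k > n) return 0;
    long long hp = 0, y = 0, msb = 1;      /* msb = 256^{k-1} mod q */
    for (int i = 0; i < k; i++) {
        hp = (hp * 256 + (unsigned char)p[i]) % q;
        y  = (y  * 256 + (unsigned char)doc[i]) % q;
        if (i < k - 1) msb = (msb * 256) % q;
    }
    for (int i = 0; ; i++) {
        /* on a hash hit, compare the actual strings to rule out a collision */
        if (y == hp && memcmp(p, doc + i, (size_t)k) == 0)
            return 1;
        if (i + k >= n) break;
        /* y_{i+1} = 256 * (y_i - 256^{k-1}*Y[i]) + Y[i+k]   (mod q) */
        long long sub = (y + q - (msb * (unsigned char)doc[i]) % q) % q;
        y = (sub * 256 + (unsigned char)doc[i + k]) % q;
    }
    return 0;
}
```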

RK speeds up substring matching to O(n) (assuming hash collisions are unlikely), instead of O(n*k) as in the simple algorithm. However, we still need to run RK once for each of the m/k chunks of X to be matched in Y. Thus, our approximate matching algorithm using RK has an overall runtime of O((m/k)*n).

**Your job:** Implement the RK substring matching algorithm by completing
the `rabin_karp_match` procedure.
When calculating the hash values, you should use the given modular arithmetic
functions, `madd`, `mdel`, `mmul`.

As with `simple_substr_match`, the `main` procedure will invoke
`rabin_karp_match` for each chunk of X to be matched.
`rabin_karp_match` has the same interface as `simple_substr_match` and should
return 1 if the chunk appears as a substring in Y and 0 otherwise.

Read the code comments carefully to guide your implementation.

**Testing:** Run the given tester program `./rktest.py -t rk`

A Bloom filter is a bitmap of h bits, all initialized to zero. We insert all m/k RK hash values of X that we want to match into the "filter".

To insert a value v into the bitmap,
we use *f* hash functions to map v into f positions
in the bitmap and set each position to be one. For example, starting with a
10-bit bitmap and f=2, if v is mapped to positions 1,5 and v' is mapped
to 3,9, the bitmap after inserting v,v' would be 0101010001. There are
many reasonable conventions for using a byte array to represent a bitmap.
In this lab, we expect you to follow one specific convention and interpret bits in
big-endian order within a byte. For example, we use a 2-byte array to
represent a 10-bit bitmap. The first byte represents the first 8 bits of the
bitmap, and the most significant bit of the first byte represents the first bit
of the bitmap. The most significant bit of the second byte represents the 9th
bit of the bitmap. Suppose the bitmap is 0101010001; then the array content is
{0x54, 0x40}.
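Under this convention, setting or testing bit i of the byte-array bitmap can be sketched as follows (helper names are ours, not the lab's):

```c
/* Bit helpers for the big-endian-within-a-byte convention above:
 * bit i lives in byte i/8, under the mask 1 << (7 - i%8). */
void bitmap_set(char *bm, int i)
{
    bm[i / 8] |= (char)(1 << (7 - (i % 8)));
}

int bitmap_test(const char *bm, int i)
{
    return (bm[i / 8] >> (7 - (i % 8))) & 1;
}
```

Setting bits 1, 3, 5, and 9 of a zeroed 2-byte array reproduces the {0x54, 0x40} example above.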

After we have
inserted all m/k hash values of X, we proceed to calculate every y_{i}
in Y and check whether it *may* match any of the elements in the filter.
This is done by mapping each y_{i} into f positions using the same f hash functions and checking
whether *all* f positions contain value 1. If so, y_{i} is a probable
match. We say that y_{i}'s match is only probable because a Bloom filter
can produce false positives: y_{i} may be reported as equal to one of the
m/k hash values even though it is not. Note that the correctness of this approach relies on
an important property of Bloom filters: they never produce false negatives. Thus,
to confirm that y_{i} is a real match, we check whether Y[i...i+k-1] is
indeed identical to any of the strings X[0...k-1], X[k...2k-1], ....
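The insert and query operations described above can be sketched as follows. Here `toy_hash` and `F` are stand-ins for the lab's `hash_i` and `BLOOM_HASH_NUM`, so that the sketch is self-contained:

```c
#define F 2                            /* stand-in for BLOOM_HASH_NUM */

static int toy_hash(int i, long long x)   /* NOT the lab's hash_i */
{
    return (int)((x * (2 * i + 3) + i) & 0x7fffffff);
}

/* Map v to F positions and set each corresponding bit to one. */
void bloom_add_sketch(char *bm, int bsz, long long v)
{
    for (int i = 0; i < F; i++) {
        int pos = toy_hash(i, v) % bsz;
        bm[pos / 8] |= (char)(1 << (7 - pos % 8));
    }
}

/* v is a probable match only if ALL F positions contain a one. */
int bloom_query_sketch(const char *bm, int bsz, long long v)
{
    for (int i = 0; i < F; i++) {
        int pos = toy_hash(i, v) % bsz;
        if (!((bm[pos / 8] >> (7 - pos % 8)) & 1))
            return 0;    /* any zero bit => definitely not inserted */
    }
    return 1;            /* all bits set => probable match */
}
```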

Using a Bloom filter, our enhanced algorithm has a total runtime of O(m+n) (assuming there are not too many false positives), significantly faster than our previous two versions of approximate matching!

You should complete this part of the lab in two steps. First, implement the Bloom filter functions.

**Your job:** Implement the Bloom filter by completing `bloom_init`, `bloom_query`,
and `bloom_add` in the source file `bloom.c`.

To help you implement `bloom_add` and `bloom_query`,
we provide you with a particular hash function implementation for the Bloom filter,
`int hash_i(int i, long long x)`. This function hashes a given
Rabin-Karp hash value x into an `int` value according to the i-th hash
function. The number of hash functions that you should use is given by
`BLOOM_HASH_NUM`, a global constant defined in file `bloom.c`.

Please read all of our scaffolding comments before coding.

**Testing:** After you are done with the bloom filter implementation, test its correctness with
our test script, using the command `./rktest.py -t bloom`
The test script invokes `bloom_test` (see its implementation in `bloom_test.c`)
to test your Bloom filter implementation and compares your output to the correct answer.

Next, implement the RK algorithm using the Bloom filter functions that you just implemented.

**Your job:** Complete the `rabin_karp_batchmatch`
procedure in `rkmatch.c`. In the template code, the size of the bitmap to use (in bits) is given as a
procedure argument (`bsz`). In the `rabin_karp_batchmatch` procedure, you
should invoke the auxiliary function `bloom_init` to allocate and
initialize a byte array (i.e. a character array) that will pack `bsz` bits for use as the Bloom
filter's bitmap. You should then compute all m/k RK
hash values corresponding to X's chunks and insert them into the Bloom filter
using `bloom_add`. Subsequently, you scan Y's content and compute an RK hash value for
each of its n-k+1 substrings and query its presence in the Bloom filter using
`bloom_query`. If `bloom_query` indicates a potential match, you
proceed to check that the corresponding substring of Y indeed matches one of the m/k chunks of X.

Keep in mind that unlike `simple_substr_match` or `rabin_karp_match`,
returning 1 or 0 is no longer sufficient; `rabin_karp_batchmatch` is
invoked only once and therefore must return the total number of chunks of X
that have appeared in Y.
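The whole batch procedure can be sketched end to end with toy stand-ins (a fixed prime `Q`, two toy hash functions instead of `hash_i`/`BLOOM_HASH_NUM`, and `calloc` instead of `bloom_init`); your real implementation must use the lab's helpers instead:

```c
#include <stdlib.h>
#include <string.h>

#define Q 1000003LL   /* illustrative prime modulus */
#define F 2           /* stand-in for BLOOM_HASH_NUM */

static int toy_hash(int i, long long x) { return (int)((x * (2*i+3) + i) & 0x7fffffff); }
static void bset(char *bm, int p)       { bm[p/8] |= (char)(1 << (7 - p%8)); }
static int  bget(const char *bm, int p) { return (bm[p/8] >> (7 - p%8)) & 1; }

/* Returns the number of X's chunks that appear somewhere in Y. */
int batchmatch_sketch(const char *x, int m, const char *y, int n, int k, int bsz)
{
    if (k <= 0 || k > n || k > m) return 0;
    char *bm = calloc((size_t)(bsz + 7) / 8, 1);   /* stands in for bloom_init */
    int nchunks = m / k, found = 0;

    /* 1. insert the RK hash of every chunk of X into the filter */
    for (int c = 0; c < nchunks; c++) {
        long long h = 0;
        for (int j = 0; j < k; j++)
            h = (h * 256 + (unsigned char)x[c*k + j]) % Q;
        for (int i = 0; i < F; i++)
            bset(bm, toy_hash(i, h) % bsz);
    }

    /* 2. roll a hash across Y; on a filter hit, verify against each chunk */
    long long yh = 0, msb = 1;
    for (int j = 0; j < k; j++) {
        yh = (yh * 256 + (unsigned char)y[j]) % Q;
        if (j < k - 1) msb = (msb * 256) % Q;
    }
    char *matched = calloc((size_t)nchunks, 1);    /* count each chunk once */
    for (int i = 0; ; i++) {
        int hit = 1;
        for (int f = 0; f < F; f++)
            if (!bget(bm, toy_hash(f, yh) % bsz)) { hit = 0; break; }
        if (hit)    /* probable match: confirm with a real comparison */
            for (int c = 0; c < nchunks; c++)
                if (!matched[c] && memcmp(x + c*k, y + i, (size_t)k) == 0) {
                    matched[c] = 1;
                    found++;
                }
        if (i + k >= n) break;
        long long sub = (yh + Q - (msb * (unsigned char)y[i]) % Q) % Q;
        yh = (sub * 256 + (unsigned char)y[i + k]) % Q;
    }
    free(matched);
    free(bm);
    return found;
}
```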

Please read all of our scaffolding comments before coding.

**Testing:** Run the given tester program `./rktest.py -t rkbatch` to test
your batch Rabin-Karp method. Once you've finished all parts of the lab,
you should run the full suite of tests by running the test program without
arguments, `./rktest.py`.

- Correctness points (50 total); each version of the algorithm carries 10 points.
- Style points (10 total). We reserve 5 points for a subjective evaluation of the style of your solutions and your commenting. Your solutions should be as clean and straightforward as possible. Your comments should be informative, but they need not be extensive.

Note that while the testing script will give you a good idea of your final lab grade, its score does NOT constitute a guarantee. In particular, passing the tests does not necessarily mean that you have implemented everything perfectly.

**Submit:** Run `make handin`.