String search in O(n + m)

Friday, 08 Jan 2010
search • code • string • personal • it • fast • substring • complexity • english • cool • matching • hashing
1047 words

Searching a string inside another string is a very easy task. Given a string A and a string B, start a loop over A from left to right and, for every letter, start an internal loop to see if there is a match with the letters of B.

A ball of string

Pseudo code would look something like this:

function NaiveSearch(string s[1..n], string sub[1..m])
   for i from 1 to n-m+1
      for j from 1 to m
         if s[i+j-1] ≠ sub[j]
            jump to next iteration of outer loop
      return i
   return not found

Time Complexity? O(n * m), where n is the size of A and m is the size of B.

Wonna do better? Follow me.

Hashing is the key

Let’s build an hashing function that generates a quite big number for every string. It has to depend on multiple factors:

Size of the Alphabet
Position of every char within the string
A function that for a Chars of the Alphabet gives Numbers

Think of Binary Numbers as strings made of 0 and 1, with an alphabet of only 2 characters: 0,1. How do we convert a Binary Number in a Decimal?

[ 1 0 1 1] = 2^3*1 + 2^2*0 + 2^1*1 + 2^0*1 = 8 + 0 + 2 + 1 => 11

Isn’t that the kind of hashing we need?

Size of the Alphabet => 2, made of the Characters {0,1}
Position of every char within the string => The power applied to the 2 depends on the position of that char
A function that for a Chars of the Alphabet gives Numbers => In this case, 0 and 1 have already a value ;)

Why all this stuff about Hashing? Bear with me and I’ll explain.

Our hashing function

We want to search B in A, so let’s take:

'S' is the String in which we are searching (i.e. 'A')
'j' is position of the ~~last~~ first (thanks [John](http://www.whatsthebeef.org/)) char in 'S' included in the Hashing
'alpha' is the size of the Alphabet
'm' is the size of the string being searched (i.e. the size of 'B')

The hashing function will be something like:

Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }

What’s great about this hashing?

Now comes the great part. Once we calculate the hashing of a string K, and then we REMOVE the first character, APPEND a new one, how do we calculate the hashing of this new L string? Like this:

H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])

2 MULTIPLICATIONS, 1 SUBTRACTION and 1 ADDITION. That takes O(1). See where am I heading?

Searching by Hashing

Given the B_size size of B, we calculate the hashing of the first B_size characters of A: an “hashing window” over A. If B is not found, we proceed on moving this window, 1 character to the right, 1 at a time; until either we find (the hashing of) B, or the string finishes.

Here is the code:

#include <stdio.h>
#include <string.h>
#include <math.h>

#define NORM_CHAR(A)    (A - ' ')
#define ALPHABET_SIZE   (NORM_CHAR(255)+1)
#define LLU_POW(x, y)   (unsigned long long)pow(x, y)

unsigned long long hash_char(const char* string, unsigned int pos, unsigned int size) {
    #ifdef DEBUG
    printf("HASHING of char '%c' (of value %d) in position #%d of a string of size %d is: %llu\n",
        string[pos],
        NORM_CHAR(string[pos]),
        pos,
        size,
        LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos])
        );
    #endif
        
    return LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos]);
}

/**
Given that:
- 'S' is the String in which we are searching (i.e. 'A')
- 'j' is position of the last char in 'S' included in the Hashing
- 'alpha' is the size of the Alphabet
- 'm' is the size of the string being searched (i.e. 'B_size')

=> Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }
*/
unsigned long long hash_string(const char* string, unsigned int size) {
    static unsigned int i;
    static unsigned long long hash;

    hash = 0;
    for ( i = 0; i < size; ++i ) {
        hash += hash_char(string, i, size);
    }
    
    return hash;
}

int main(int argc, char** argv)
{
    char* A;
    char* B;
    unsigned long long A_hash;
    unsigned long long B_hash;
    unsigned int B_size;
    unsigned int i;
    
    // Check the Input
	if ( argc == 3 ) {
	   A = argv[1];
	   B = argv[2];
	}
	else
	   return (1);

    // Calculate initial Hashings
    B_size = strlen(B);
    A_hash = hash_string(A, B_size);
    B_hash = hash_string(B, B_size);
    printf("HASHING of B('%s') is: %llu\n", B, B_hash);
    #ifdef DEBUG
    printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
        0, B_size, A_hash);
    #endif

    i = B_size;
    while ( A[i-1] != '\0' ) {
        if ( A_hash == B_hash ) {
            printf("YES! B is a Substring of A, in the range { %d, %d}\n\n",
                i-B_size, i-1);
            return(0);
        } else {
            // 'S' is the String in which we are searching (i.e. 'A')
            // 'j' is position of the last char in 'S' included in the Hashing
            // 'alpha' is the size of the Alphabet
            // 'm' is the size of the string being searched (i.e. 'B_size')
            //
            // H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])
            A_hash = ALPHABET_SIZE
                * (
                    A_hash
                    - LLU_POW(ALPHABET_SIZE, (B_size-1))
                    * NORM_CHAR( A[i-B_size] )
                )
                + NORM_CHAR( A[i++] );
            #ifdef DEBUG
            printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
                i-B_size, B_size, A_hash);
            #endif
        }
    }
    printf("NO. B is not a Substring of A\n\n", B_hash, A_hash);
}

I used unsigned long long so that I could handle the largest numbers possible. Check out large strings to see how easily you can reach enormous values.

Complexity

O( 2 * hashing(B) + (size_of_A - size_of_B) * hashing_window_moves ) = O( 2m + (n-m) * O(1) ) = O( 2m + n -m ) = O( n + m )

Pretty neat, eh?.

Where does it come from?

I didn’t create the algorithm myself: I “only” coded it in C. Officially this algorithm is called Rabin–Karp string search algorithm. But, as for every other things I code, it’s great when you actually spend time coding it: you get a great understanding of the problem and the proposed algorithm.

Too much coffee
too little time

Ivan De Marino