String search in O(n + m)

  • search • code • string • personal • it • fast • substring • complexity • english • cool • matching • hashing
  • 1047 words

Searching a string inside another string is a very easy task. Given a string A and a string B, start a loop over A from left to right and, for every letter, start an internal loop to see if there is a match with the letters of B.

String A ball of string

Pseudo code would look something like this:

function NaiveSearch(string s[1..n], string sub[1..m])
   for i from 1 to n-m+1
      for j from 1 to m
         if s[i+j-1]  sub[j]
            jump to next iteration of outer loop
      return i
   return not found

Time Complexity? O(n * m), where n is the size of A and m is the size of B.

Wonna do better? Follow me.

Hashing is the key

Let’s build an hashing function that generates a quite big number for every string. It has to depend on multiple factors:

  • Size of the Alphabet
  • Position of every char within the string
  • A function that for a Chars of the Alphabet gives Numbers

Think of Binary Numbers as strings made of 0 and 1, with an alphabet of only 2 characters: 0,1. How do we convert a Binary Number in a Decimal?

[ 1 0 1 1] = 2^3*1 + 2^2*0 + 2^1*1 + 2^0*1 = 8 + 0 + 2 + 1 => 11

Isn’t that the kind of hashing we need?

  • Size of the Alphabet => 2, made of the Characters {0,1}
  • Position of every char within the string => The power applied to the 2 depends on the position of that char
  • A function that for a Chars of the Alphabet gives Numbers => In this case, 0 and 1 have already a value ;)

Why all this stuff about Hashing? Bear with me and I’ll explain.

Our hashing function

We want to search B in A, so let’s take:

  • 'S' is the String in which we are searching (i.e. 'A')
  • 'j' is position of the last first (thanks [John](http://www.whatsthebeef.org/)) char in 'S' included in the Hashing
  • 'alpha' is the size of the Alphabet
  • 'm' is the size of the string being searched (i.e. the size of 'B')

The hashing function will be something like:

Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }

What’s great about this hashing?

Now comes the great part. Once we calculate the hashing of a string K, and then we REMOVE the first character, APPEND a new one, how do we calculate the hashing of this new L string? Like this:

H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])

2 MULTIPLICATIONS, 1 SUBTRACTION and 1 ADDITION. That takes O(1). See where am I heading?

Searching by Hashing

Given the B_size size of B, we calculate the hashing of the first B_size characters of A: an “hashing window” over A. If B is not found, we proceed on moving this window, 1 character to the right, 1 at a time; until either we find (the hashing of) B, or the string finishes.

Here is the code:

#include <stdio.h>
#include <string.h>
#include <math.h>

#define NORM_CHAR(A)    (A - ' ')
#define ALPHABET_SIZE   (NORM_CHAR(255)+1)
#define LLU_POW(x, y)   (unsigned long long)pow(x, y)

unsigned long long hash_char(const char* string, unsigned int pos, unsigned int size) {
    #ifdef DEBUG
    printf("HASHING of char '%c' (of value %d) in position #%d of a string of size %d is: %llu\n",
        string[pos],
        NORM_CHAR(string[pos]),
        pos,
        size,
        LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos])
        );
    #endif
        
    return LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos]);
}

/**
Given that:
- 'S' is the String in which we are searching (i.e. 'A')
- 'j' is position of the last char in 'S' included in the Hashing
- 'alpha' is the size of the Alphabet
- 'm' is the size of the string being searched (i.e. 'B_size')

=> Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }
*/
unsigned long long hash_string(const char* string, unsigned int size) {
    static unsigned int i;
    static unsigned long long hash;

    hash = 0;
    for ( i = 0; i < size; ++i ) {
        hash += hash_char(string, i, size);
    }
    
    return hash;
}

int main(int argc, char** argv)
{
    char* A;
    char* B;
    unsigned long long A_hash;
    unsigned long long B_hash;
    unsigned int B_size;
    unsigned int i;
    
    // Check the Input
	if ( argc == 3 ) {
	   A = argv[1];
	   B = argv[2];
	}
	else
	   return (1);

    // Calculate initial Hashings
    B_size = strlen(B);
    A_hash = hash_string(A, B_size);
    B_hash = hash_string(B, B_size);
    printf("HASHING of B('%s') is: %llu\n", B, B_hash);
    #ifdef DEBUG
    printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
        0, B_size, A_hash);
    #endif

    i = B_size;
    while ( A[i-1] != '\0' ) {
        if ( A_hash == B_hash ) {
            printf("YES! B is a Substring of A, in the range { %d, %d}\n\n",
                i-B_size, i-1);
            return(0);
        } else {
            // 'S' is the String in which we are searching (i.e. 'A')
            // 'j' is position of the last char in 'S' included in the Hashing
            // 'alpha' is the size of the Alphabet
            // 'm' is the size of the string being searched (i.e. 'B_size')
            //
            // H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])
            A_hash = ALPHABET_SIZE
                * (
                    A_hash
                    - LLU_POW(ALPHABET_SIZE, (B_size-1))
                    * NORM_CHAR( A[i-B_size] )
                )
                + NORM_CHAR( A[i++] );
            #ifdef DEBUG
            printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
                i-B_size, B_size, A_hash);
            #endif
        }
    }
    printf("NO. B is not a Substring of A\n\n", B_hash, A_hash);
}

I used unsigned long long so that I could handle the largest numbers possible. Check out large strings to see how easily you can reach enormous values.

Complexity

O( 2 * hashing(B) + (size_of_A - size_of_B) * hashing_window_moves ) = O( 2m + (n-m) * O(1) ) = O( 2m + n -m ) = O( n + m )

Pretty neat, eh?.

Where does it come from?

I didn’t create the algorithm myself: I “only” coded it in C. Officially this algorithm is called Rabin–Karp string search algorithm. But, as for every other things I code, it’s great when you actually spend time coding it: you get a great understanding of the problem and the proposed algorithm.