# String search in O(n + m)

• search • code • string • personal • it • fast • substring • complexity • english • cool • matching • hashing
• 1047 words

Searching a string inside another string is a very easy task. Given a string `A` and a string `B`, start a loop over `A` from left to right and, for every letter, start an internal loop to see if there is a match with the letters of `B`. A ball of string

Pseudo code would look something like this:

``````function NaiveSearch(string s[1..n], string sub[1..m])
for i from 1 to n-m+1
for j from 1 to m
if s[i+j-1] ≠ sub[j]
return i
``````

Time Complexity? O(n * m), where `n` is the size of `A` and `m` is the size of `B`.

## Hashing is the key

Let's build an hashing function that generates a quite big number for every string. It has to depend on multiple factors:

• Size of the Alphabet
• Position of every char within the string
• A function that for a Chars of the Alphabet gives Numbers

Think of Binary Numbers as strings made of 0 and 1, with an alphabet of only 2 characters: `0,1`. How do we convert a Binary Number in a Decimal?

``````[ 1 0 1 1] = 2^3*1 + 2^2*0 + 2^1*1 + 2^0*1 = 8 + 0 + 2 + 1 => 11
``````

Isn't that the kind of hashing we need?

• Size of the Alphabet => 2, made of the Characters {0,1}
• Position of every char within the string => The power applied to the 2 depends on the position of that char
• A function that for a Chars of the Alphabet gives Numbers => In this case, 0 and 1 have already a value ;)

Why all this stuff about Hashing? Bear with me and I'll explain.

## Our hashing function

We want to search B in A, so let's take:

• `'S'` is the String in which we are searching (i.e. 'A')
• `'j'` is position of the last first (thanks John) char in `'S'` included in the Hashing
• `'alpha'` is the size of the Alphabet
• `'m'` is the size of the string being searched (i.e. the size of 'B')

The hashing function will be something like:

``````Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }
``````

Now comes the great part. Once we calculate the hashing of a string K, and then we REMOVE the first character, APPEND a new one, how do we calculate the hashing of this new L string? Like this:

``````H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])
``````

2 MULTIPLICATIONS, 1 SUBTRACTION and 1 ADDITION. That takes O(1). See where am I heading?

## Searching by Hashing

Given the `Bsize` size of `B`, we calculate the hashing of the first `B``size` characters of `A`: an "hashing window" over A. If B is not found, we proceed on moving this window, 1 character to the right, 1 at a time; until either we find (the hashing of) B, or the string finishes.

Here is the code:

``````#include <stdio.h>
#include <string.h>
#include <math.h>

#define NORM_CHAR(A)    (A - ' ')
#define ALPHABET_SIZE   (NORM_CHAR(255)+1)
#define LLU_POW(x, y)   (unsigned long long)pow(x, y)

unsigned long long hash_char(const char* string, unsigned int pos, unsigned int size) {
#ifdef DEBUG
printf("HASHING of char '%c' (of value %d) in position #%d of a string of size %d is: %llu\n",
string[pos],
NORM_CHAR(string[pos]),
pos,
size,
LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos])
);
#endif

return LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos]);
}

/**
Given that:
- 'S' is the String in which we are searching (i.e. 'A')
- 'j' is position of the last char in 'S' included in the Hashing
- 'alpha' is the size of the Alphabet
- 'm' is the size of the string being searched (i.e. 'B_size')

=> Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }
*/
unsigned long long hash_string(const char* string, unsigned int size) {
static unsigned int i;
static unsigned long long hash;

hash = 0;
for ( i = 0; i < size; ++i ) {
hash += hash_char(string, i, size);
}

return hash;
}

int main(int argc, char** argv)
{
char* A;
char* B;
unsigned long long A_hash;
unsigned long long B_hash;
unsigned int B_size;
unsigned int i;

// Check the Input
if ( argc == 3 ) {
A = argv;
B = argv;
}
else
return (1);

// Calculate initial Hashings
B_size = strlen(B);
A_hash = hash_string(A, B_size);
B_hash = hash_string(B, B_size);
printf("HASHING of B('%s') is: %llu\n", B, B_hash);
#ifdef DEBUG
printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
0, B_size, A_hash);
#endif

i = B_size;
while ( A[i-1] != '\0' ) {
if ( A_hash == B_hash ) {
printf("YES! B is a Substring of A, in the range { %d, %d}\n\n",
i-B_size, i-1);
return(0);
} else {
// 'S' is the String in which we are searching (i.e. 'A')
// 'j' is position of the last char in 'S' included in the Hashing
// 'alpha' is the size of the Alphabet
// 'm' is the size of the string being searched (i.e. 'B_size')
//
// H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])
A_hash = ALPHABET_SIZE
* (
A_hash
- LLU_POW(ALPHABET_SIZE, (B_size-1))
* NORM_CHAR( A[i-B_size] )
)
+ NORM_CHAR( A[i++] );
#ifdef DEBUG
printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
i-B_size, B_size, A_hash);
#endif
}
}
printf("NO. B is not a Substring of A\n\n", B_hash, A_hash);
}
``````

I used `unsigned long long` so that I could handle the largest numbers possible. Check out large strings to see how easily you can reach enormous values.

## Complexity

O( 2 * hashing(B) + (sizeofA - sizeofB) * hashingwindowmoves ) = O( 2m + (n-m) * O(1) ) = O( 2m + n -m ) = O( n + m )

Pretty neat, eh?.

## Where does it come from?

I didn't create the algorithm myself: I "only" coded it in C. Officially this algorithm is called Rabin–Karp string search algorithm. But, as for every other things I code, it's great when you actually spend time coding it: you get a great understanding of the problem and the proposed algorithm.