String search in O(n + m)
Searching a string inside another string is a very easy task. Given a string A
and a string B
, start a loop over A
from left to right and, for every letter, start an internal loop to see if there is a match with the letters of B
.
Pseudo code would look something like this:
function NaiveSearch(string s[1..n], string sub[1..m])
for i from 1 to n-m+1
for j from 1 to m
if s[i+j-1] ≠ sub[j]
jump to next iteration of outer loop
return i
return not found
Time Complexity? O(n * m), where n
is the size of A
and m
is the size of B
.
Wonna do better? Follow me.
Hashing is the key
Let’s build an hashing function that generates a quite big number for every string. It has to depend on multiple factors:
- Size of the Alphabet
- Position of every char within the string
- A function that for a Chars of the Alphabet gives Numbers
Think of Binary Numbers as strings made of 0 and 1, with an alphabet of only 2 characters: 0,1
. How do we convert a Binary Number in a Decimal?
[ 1 0 1 1] = 2^3*1 + 2^2*0 + 2^1*1 + 2^0*1 = 8 + 0 + 2 + 1 => 11
Isn’t that the kind of hashing we need?
- Size of the Alphabet => 2, made of the Characters {0,1}
- Position of every char within the string => The power applied to the 2 depends on the position of that char
- A function that for a Chars of the Alphabet gives Numbers => In this case, 0 and 1 have already a value ;)
Why all this stuff about Hashing? Bear with me and I’ll explain.
Our hashing function
We want to search B in A, so let’s take:
'S'
is the String in which we are searching (i.e. 'A')'j'
is position of thelastfirst (thanks [John](http://www.whatsthebeef.org/)) char in'S'
included in the Hashing'alpha'
is the size of the Alphabet'm'
is the size of the string being searched (i.e. the size of 'B')
The hashing function will be something like:
Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }
What’s great about this hashing?
Now comes the great part. Once we calculate the hashing of a string K, and then we REMOVE the first character, APPEND a new one, how do we calculate the hashing of this new L string? Like this:
H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])
2 MULTIPLICATIONS, 1 SUBTRACTION and 1 ADDITION. That takes O(1). See where am I heading?
Searching by Hashing
Given the B_size
size of B
, we calculate the hashing of the first B_size
characters of A
: an “hashing window” over A. If B is not found, we proceed on moving this window, 1 character to the right, 1 at a time; until either we find (the hashing of) B, or the string finishes.
Here is the code:
#include <stdio.h>
#include <string.h>
#include <math.h>
#define NORM_CHAR(A) (A - ' ')
#define ALPHABET_SIZE (NORM_CHAR(255)+1)
#define LLU_POW(x, y) (unsigned long long)pow(x, y)
unsigned long long hash_char(const char* string, unsigned int pos, unsigned int size) {
#ifdef DEBUG
printf("HASHING of char '%c' (of value %d) in position #%d of a string of size %d is: %llu\n",
string[pos],
NORM_CHAR(string[pos]),
pos,
size,
LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos])
);
#endif
return LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos]);
}
/**
Given that:
- 'S' is the String in which we are searching (i.e. 'A')
- 'j' is position of the last char in 'S' included in the Hashing
- 'alpha' is the size of the Alphabet
- 'm' is the size of the string being searched (i.e. 'B_size')
=> Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }
*/
unsigned long long hash_string(const char* string, unsigned int size) {
static unsigned int i;
static unsigned long long hash;
hash = 0;
for ( i = 0; i < size; ++i ) {
hash += hash_char(string, i, size);
}
return hash;
}
int main(int argc, char** argv)
{
char* A;
char* B;
unsigned long long A_hash;
unsigned long long B_hash;
unsigned int B_size;
unsigned int i;
// Check the Input
if ( argc == 3 ) {
A = argv[1];
B = argv[2];
}
else
return (1);
// Calculate initial Hashings
B_size = strlen(B);
A_hash = hash_string(A, B_size);
B_hash = hash_string(B, B_size);
printf("HASHING of B('%s') is: %llu\n", B, B_hash);
#ifdef DEBUG
printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
0, B_size, A_hash);
#endif
i = B_size;
while ( A[i-1] != '\0' ) {
if ( A_hash == B_hash ) {
printf("YES! B is a Substring of A, in the range { %d, %d}\n\n",
i-B_size, i-1);
return(0);
} else {
// 'S' is the String in which we are searching (i.e. 'A')
// 'j' is position of the last char in 'S' included in the Hashing
// 'alpha' is the size of the Alphabet
// 'm' is the size of the string being searched (i.e. 'B_size')
//
// H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])
A_hash = ALPHABET_SIZE
* (
A_hash
- LLU_POW(ALPHABET_SIZE, (B_size-1))
* NORM_CHAR( A[i-B_size] )
)
+ NORM_CHAR( A[i++] );
#ifdef DEBUG
printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
i-B_size, B_size, A_hash);
#endif
}
}
printf("NO. B is not a Substring of A\n\n", B_hash, A_hash);
}
I used unsigned long long
so that I could handle the largest numbers possible. Check out large strings to see how easily you can reach enormous values.
Complexity
O( 2 * hashing(B) + (size_of_A - size_of_B) * hashing_window_moves ) = O( 2m + (n-m) * O(1) ) = O( 2m + n -m ) = O( n + m )
Pretty neat, eh?.
Where does it come from?
I didn’t create the algorithm myself: I “only” coded it in C. Officially this algorithm is called Rabin–Karp string search algorithm. But, as for every other things I code, it’s great when you actually spend time coding it: you get a great understanding of the problem and the proposed algorithm.