String search in O(n + m)
Searching a string inside another string is a very easy task. Given a string A and a string B, start a loop over A from left to right and, for every letter, start an internal loop to see if there is a match with the letters of B.
A ball of string
Pseudo code would look something like this:
function NaiveSearch(string s[1..n], string sub[1..m])
for i from 1 to n-m+1
for j from 1 to m
if s[i+j-1] ≠ sub[j]
jump to next iteration of outer loop
return i
return not found
Time Complexity? O(n * m), where n is the size of A and m is the size of B.
Wonna do better? Follow me.
Hashing is the key
Let’s build an hashing function that generates a quite big number for every string. It has to depend on multiple factors:
- Size of the Alphabet
- Position of every char within the string
- A function that for a Chars of the Alphabet gives Numbers
Think of Binary Numbers as strings made of 0 and 1, with an alphabet of only 2 characters: 0,1. How do we convert a Binary Number in a Decimal?
[ 1 0 1 1] = 2^3*1 + 2^2*0 + 2^1*1 + 2^0*1 = 8 + 0 + 2 + 1 => 11
Isn’t that the kind of hashing we need?
- Size of the Alphabet => 2, made of the Characters {0,1}
- Position of every char within the string => The power applied to the 2 depends on the position of that char
- A function that for a Chars of the Alphabet gives Numbers => In this case, 0 and 1 have already a value ;)
Why all this stuff about Hashing? Bear with me and I’ll explain.
Our hashing function
We want to search B in A, so let’s take:
'S'is the String in which we are searching (i.e. 'A')'j'is position of thelastfirst (thanks [John](http://www.whatsthebeef.org/)) char in'S'included in the Hashing'alpha'is the size of the Alphabet'm'is the size of the string being searched (i.e. the size of 'B')
The hashing function will be something like:
Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }
What’s great about this hashing?
Now comes the great part. Once we calculate the hashing of a string K, and then we REMOVE the first character, APPEND a new one, how do we calculate the hashing of this new L string? Like this:
H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])
2 MULTIPLICATIONS, 1 SUBTRACTION and 1 ADDITION. That takes O(1). See where am I heading?
Searching by Hashing
Given the B_size size of B, we calculate the hashing of the first B_size characters of A: an “hashing window” over A. If B is not found, we proceed on moving this window, 1 character to the right, 1 at a time; until either we find (the hashing of) B, or the string finishes.
Here is the code:
#include <stdio.h>
#include <string.h>
#include <math.h>
#define NORM_CHAR(A) (A - ' ')
#define ALPHABET_SIZE (NORM_CHAR(255)+1)
#define LLU_POW(x, y) (unsigned long long)pow(x, y)
unsigned long long hash_char(const char* string, unsigned int pos, unsigned int size) {
#ifdef DEBUG
printf("HASHING of char '%c' (of value %d) in position #%d of a string of size %d is: %llu\n",
string[pos],
NORM_CHAR(string[pos]),
pos,
size,
LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos])
);
#endif
return LLU_POW(ALPHABET_SIZE, size - (pos + 1)) * NORM_CHAR(string[pos]);
}
/**
Given that:
- 'S' is the String in which we are searching (i.e. 'A')
- 'j' is position of the last char in 'S' included in the Hashing
- 'alpha' is the size of the Alphabet
- 'm' is the size of the string being searched (i.e. 'B_size')
=> Hashing(S, j) = SUMMATION [i=0 -> m-1] { alpha^(m -(i+1)) * char(S[i+j]) }
*/
unsigned long long hash_string(const char* string, unsigned int size) {
static unsigned int i;
static unsigned long long hash;
hash = 0;
for ( i = 0; i < size; ++i ) {
hash += hash_char(string, i, size);
}
return hash;
}
int main(int argc, char** argv)
{
char* A;
char* B;
unsigned long long A_hash;
unsigned long long B_hash;
unsigned int B_size;
unsigned int i;
// Check the Input
if ( argc == 3 ) {
A = argv[1];
B = argv[2];
}
else
return (1);
// Calculate initial Hashings
B_size = strlen(B);
A_hash = hash_string(A, B_size);
B_hash = hash_string(B, B_size);
printf("HASHING of B('%s') is: %llu\n", B, B_hash);
#ifdef DEBUG
printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
0, B_size, A_hash);
#endif
i = B_size;
while ( A[i-1] != '\0' ) {
if ( A_hash == B_hash ) {
printf("YES! B is a Substring of A, in the range { %d, %d}\n\n",
i-B_size, i-1);
return(0);
} else {
// 'S' is the String in which we are searching (i.e. 'A')
// 'j' is position of the last char in 'S' included in the Hashing
// 'alpha' is the size of the Alphabet
// 'm' is the size of the string being searched (i.e. 'B_size')
//
// H(S, j+1) = alpha * ( H(S, j) - alpha^(m-1) * char(S[j]) ) + char(S[j+m])
A_hash = ALPHABET_SIZE
* (
A_hash
- LLU_POW(ALPHABET_SIZE, (B_size-1))
* NORM_CHAR( A[i-B_size] )
)
+ NORM_CHAR( A[i++] );
#ifdef DEBUG
printf("Starting from char in position %d, the HASHING of A for length %d is: %llu\n",
i-B_size, B_size, A_hash);
#endif
}
}
printf("NO. B is not a Substring of A\n\n", B_hash, A_hash);
}
I used unsigned long long so that I could handle the largest numbers possible. Check out large strings to see how easily you can reach enormous values.
Complexity
O( 2 * hashing(B) + (size_of_A - size_of_B) * hashing_window_moves ) = O( 2m + (n-m) * O(1) ) = O( 2m + n -m ) = O( n + m )
Pretty neat, eh?.
Where does it come from?
I didn’t create the algorithm myself: I “only” coded it in C. Officially this algorithm is called Rabin–Karp string search algorithm. But, as for every other things I code, it’s great when you actually spend time coding it: you get a great understanding of the problem and the proposed algorithm.