An Aho-Corasick State Machine allows to implement a fully indexed dictionary of words. A word or pattern is to be considered here in its general meaning, that is a sequence of signs or symbols of a given user defined alphabet (set of symbols).
Once the dictionary has been defined with a list of words, it can be used to efficiently search for all of those words in a text composed of a sequence of signs.
The dictionary is used in two steps:
- Register words in the dictionary, and optionally values associated to words.
- Scan a text once to compare it to words in the dictionary.
The algorithm proposed by Aho and Corasick in a paper of 1975 is efficient when words are composed of several of signs from of a limited alphabet.
Note
If the dictionary is populated with words of only one sign from a large or unbound alphabet, a traditional map should be preferred.
This library is distributed under the terms of the GNU Lesser General Public License, version 3 (see COPYING and COPYING.LESSER.
Therefore, you should give prominent notice, with each copy of your files using the library (modified or not), that the library is used in it and that the library and its use are covered by this license, accompanied with a copy of it and the copyright notice (L. Farhi, 2026).
This project offers an efficient implementation of:
- the original Aho-Corasick algorithm from Aho and Corasick (btw exquisitely clear and well written, see Aho, Alfred V.; Corasick, Margaret J., "Efficient string matching: An aid to bibliographic search", Communications of the ACM. 18 (6): 333–340., June 1975) ;
- supplemented by the modified algorithm from Meyer (Meyer, B., Incremental String Matching, Information Processing Letters 21 (1985): 219-227).
The implementation offers several enhancements to the original proposals:
- It is generic in the sense that it works for any kind of alphabet (of any type) and not only for
char. - It works with alphabets of any number of signs (and not limited to 26 or 256). The alphabet needs to be ordered. It requires a comparison operator.
- The logic is refactored so that it can used iteratively so as not be constrained by any given type of container for words.
- The interface is minimal, complete and easy to use.
- The implementation has low memory footprint and is fast.
The code
- sticks, step by step, to the pseudo-code given in the original papers ;
- is commented and annotated with excerpts from the original text of Aho, Corasick and Meyer.
Hope this helps. Let me know !
The usage of the Aho-Corasick dictionary is straight forward.
Here is a full example explained below.
#include "aho_corasick.h"
int
main (void) {
ACMachine *machine = acm_create (ACM_CMP_DEFAULT, &(size_t){ sizeof (char) }, 0);
ACState *state = acm_initiate (machine);
char *words[] = { "he", "she", "his", "hers" };
for (size_t i = 0; i < sizeof (words) / sizeof (*words); i++) {
for (char *p = words[i]; *p; p++)
acm_insert_letter_of_keyword (&state, p);
acm_insert_end_of_keyword (&state, 0, 0);
}
const char *text = "To ushers: he found his pencil, but she could not find hers.";
printf ("%s\n", text);
MatchHolder matcher;
acm_matcher_init (&matcher);
const ACState *cst_state = acm_initiate (machine);
for (size_t i = 0; text[i]; i++)
for (size_t j = acm_match (&cst_state, &text[i]); j > 0; j--) {
acm_get_match (cst_state, j - 1, &matcher);
printf (" %zu:", i + 2 - matcher.length);
for (size_t k = 0; k < matcher.length; k++)
printf ("%c", *(const char *)matcher.letters[k]);
}
printf ("\n");
acm_matcher_release (&matcher);
acm_release (machine);
}It yields
To ushers: he found his pencil, but she could not find hers.
6:he 5:she 6:hers 12:he 21:his 38:he 37:she 56:he 56:hers
You can also look at an example while you read.
Create and returns a new state machine with acm_create.
typedef int (*CMP_TYPE) (const void *letter_a, const void *letter_b, const void *eq_arg);
typedef void (*DESTROY_TYPE) (void *letter); // Compatible with the signature of free.
ACMachine *acm_create (CMP_TYPE cmp, void *cmp_arg, DESTROY_TYPE dtor);- The first argument is a comparison function
cmpfor the signs that will be used by the dictionary. The comparison functioncmpmust return an integer less than, equal to, or greater than zero if the first argument is considered to be respectively less than, equal to, or greater than the second. - The second optional argument
cmp_argwill be passed to the comparison functioncmpwhen called by the machine.
Note
An comparison function cmp must be provided.
If the signs fed to the machine are automatically or statically allocated (until the machine is releases with acm_release), then 0 must be passed as dtor.
If the signs fed to the machine can not be statically allocated until the machine is releases (with acm_release),
- signs (and possibly its internals) must be dynamically allocated before each call to
acm_insert_letter_of_keyword. - a destructor
dtormust be provided. It must deallocate the sign (and possibly its internals) with as many calls tofreeas calls needed tomallocto allocate a sign (seeacm_insert_letter_of_keyword). - signs will be automatically deallocated by the machine by
dtor. - if signs are allocated by a single call to
malloc,freeis a suitable destructor.
acm_release must be subsequently called when the machine is not needed anymore.
Tip
A handy default comparison function ACM_CMP_DEFAULT (based on memcmp) can be used for a simple lexical comparison of a simple type or structure (of fixed size).
&(size_t){ sizeof ( T ) } might then be passed as second argument of acm_create, where T is the type of the signs.
To fill the dictionary with words :
- Call first
acm_initiateonce for all to initialise a internal state.
ACState *acm_initiate (ACMachine *machine);Then, for each word compose of a sequence of sign,
- Send each sign, one after the other, in sequence, with successive calls to
acm_insert_letter_of_keyword.
void acm_insert_letter_of_keyword (ACState **state, void *letter);Note
A state must have been initialised with acm_initiate before the first call to acm_insert_letter_of_keyword.
A letter must be provided. It is a pointer to an allocated sign that must persist until the machine is release (with acm_release). The letter should not be modified after insertion, otherwise undefined behaviour would happen.
If it was allocated dynamically, it will be automatically deallocated by the machine using the destructor previously passed to acm_create.
- Call
acm_insert_end_of_keyword.
void *acm_insert_end_of_keyword (ACState **state, void *value, void (*dtor) (void *));Note
acm_insert_letter_of_keyword must have been previously called at least once before acm_insert_end_of_keyword is called.
A user-defined value, passed as second argument value, can be associated to the keyword.
Returns 0 if the keyword has not been previously inserted yet or has not already an associated value.
Otherwise, returns the associated value of a previous call to acm_insert_letter_of_keyword for the same keyword.
Note
The value fed to the machine can be allocated statically, automatically or dynamically, as long as it persists until the machine is releases with acm_release.
If value is dynamically allocated:
- A destructor
dtormust be provided (ifvalueis allocated by a single call tomalloc,freeis a suitable destructor.). - If
acm_insert_end_of_keywordreturns0,valuecould later be retrieved by a subsequent call toacm_get_match;valuewill be automatically deallocated withdtorby the machine at release (acm_release).
- Otherwise, the passed
valuewon't be managed by the machine and should be handled and free'd by the caller. For instance by:
void *prev;
if (( prev = acm_insert_end_of_keyword (&state, value, destructor)))
{
*prev = appender (prev, value); // Handled by a user defined appender.
destructor (value); // Deallocated by the caller.
}To search for the words of the dictionary in a text composed of a sequence of signs :
- Initialise a matcher with
acm_matcher_init.
void acm_matcher_init (MatchHolder *matcher);Note
A same MatchHolder can be used to retrieve several matches (with acm_get_match) from different match searches (with acm_match).
a MatchHolder must be release by a subsequent call to acm_match_release after use.
- Call
acm_initiateonce for each text to scan to initialise a constant internal state.
Note
The very same previously used to feed the dictionary with words.
Then,
- make successive calls to
acm_matchfor each sign of the text, in sequence, from beginning to end of the text.
size_t acm_match (const ACState **state, void *letter);If one or several matches are found by acm_match while reading the text, acm_match will return a non-zero value of the number of matches found.
Note
A state msut have been initialised with acm_initiate before the first call to acm_match.
A letter must be provided. It is a pointer to a sign that must persist until to call to acm_match has returned.
- Loop on these matches with a call to
acm_get_matchfor each match.
void acm_get_match (const ACState *state, size_t index, MatchHolder *matcher);A matcher is a structure which content is:
{
const void **letters; /* An array of pointers to symbols */
size_t length; /* Length of the array */
void *value; /* Value associated with the matched pattern */
}Note
letters is an array of pointers to already allocated symbols, rather than an array of symbols. Therefore,
- a copy-constructor of symbols is not required ;
*matchershould not be used after the state-machine has been released byacm_release.
If matcher is not null it will be filled with the found match (with a keyword defined by a previous call to acm_insert_end_of_keyword).
Note
state must be the one used by a previous call to acm_match.
The index (starting from 0) must be lower than the number of matches found by the previous call to acm_match.
Note
If matcher is not null,
- it must be initialised by a call to
acm_match_initbefore use ; - it must be released by a call to
acm_matcher_releaseafter use.
void acm_matcher_init (MatchHolder *matcher);
void acm_matcher_release (MatchHolder *matcher);Tip
Adding words to the dictionary and searching can be processed consecutively, alternatively or concurrently (by different threads).
Release the resources uses by the state machine after use with acm_match_release.
void acm_release (ACMachine *machine);Note
The machine must have been initialised by a previous call to acm_create.
Note
The machine must have been initialised by a previous call to acm_create.
size_t acm_nb_keywords (const ACMachine *machine);void acm_foreach_keyword (const ACMachine *machine, void (*operator) (MatchHolder, void *value));Note
The user-declared operator is called successively for each keyword of the dictionary.
aho_corasick.hdefines and fully documents the interface (28 effective lines).aho_corasick.cdefines the fully commented implementation (379 effective lines).
| Description | Function name | |
|---|---|---|
| Dictionary instanciators | ||
| Allocates a dictionary dynamically | acm_create |
|
| Deallocates a dictionary | acm_release |
|
| Keyword management | ||
| Prepares a dictionary for keyword insertion | acm_initiate |
|
| Registers a keyword in a dictionary | acm_insert_letter_of_keyword and acm_insert_end_of_keyword |
|
| Gets the number of registered keywords in a dictionary | acm_nb_keywords |
|
| Calls a callback function for each keyword registered in a dictionary | acm_foreach_keyword |
|
| Keyword matching | ||
| Prepares a dictionary for keyword matching | acm_initiate |
|
| Searches text for matching keywords registered in a dictionary | acm_match |
|
| Initialises a container for registered keywords from a dictionary | acm_matcher_init |
|
| Retrieves one of the found matching keywords | acm_get_match |
|
| Releases a container for registered keywords from a dictionary | acm_matcher_release |
Note
acm_initiate is used both for keyword insertion and searching.
Compared to the implementation proposed by Aho and Corasick, this one adds several original enhancements:
-
First of all, it does not make any assumption on the size of the alphabet used. Particularly, unlike most implementations, the alphabet is not limited to 256 signs of type
char. For instance, if the type of signs is defined aslong long intrather than the usualchar, then the number of possible signs would be 18446744073709551616 !For this to be possible, the three algorithms of the original paper have been adjusted:
-
The assertion "for all a such that g(0, a) =
faildo g(0, a) <- 0" in the Aho-Corasick paper, at the end of algorithm 2 can not be fulfilled because it would require to set g(0, a) for all the values of 'a' in the set of possible values of the alphabet, and thus would require to allocate (if not exhaust) a lot of memory. -
Therefore, rather than being assigned to 0, g(0, a) is kept equal to
fail(i.e. g(0, a) is kept undefined) for all 'a' not being the first sign (g(0, a)) of a registered keyword. -
Nevertheless, for the state machine to work properly, it must behave as if g(0, a) would be equal to 0 whenever g(0, a) = fail after algorithm 2 has ended: thus, algorithms 1 and 3 (called after algorithm 2) must be adapted accordingly (modifications are tagged with [1], [2] and [3] in the code):
- [1] g(0, a) = (resp. !=) 0 must be replaced by: g(0, a) = (resp. !=) fail
- [2] g(state, a) = fail must be replaced by: g(state, a) = fail and state != 0
- [3] s <- g(state, a) must be replaced by: if g(state, a) != fail then s <- g(state, a) else s <- 0
-
-
To reduce the memory footprint, the implementation does not store output keywords associated to states. Instead, it reconstructs the matching keywords by traversing the fail states and then the branch of the tree backward. (Attributes
previousandis_end_of_keywordare added the the state object ACState, see code ofacm_get_match). -
It permits to search for keywords even though all keywords have not been registered yet. In other words, a new keyword can be registered with iterative calls to
acm_insert_letter_of_keywordandacm_insert_end_of_keywordalternatively or concurrently with calls toacm_matchwithout disrupting the current match search. To achieve this, failure states are reconstructed after every registration of a new keyword (seeacm_insert_keywordwhich alternates calls to algorithms 2 and 3.) -
It can associate user allocated and defined values to registered keywords, and retrieve them together with the found keywords:
- two arguments are passed to
acm_insert_end_of_keywordcalls: a pointer to a previously allocated value, and a pointer to function for deallocation of the associated value. This function will be called when the state machine will be release byacm_release. - a fourth argument is passed to
acm_get_matchcalls: the address of a pointer to an associated value. The pointer to associated value is modified byacm_get_matchto the address of the value associated to the keyword.
- two arguments are passed to
-
All functions are thread-safe. Therefore, a given shared Aho-Corasick machine can be used by multiple threads concurrently to insert keywords or to scan different texts for matching keywords.
-
It is short (400 effective lines of code.)
-
Last but not least, it is very fast. On my slow HD and slow CPU old computer, it takes 0.92 seconds to register 370,099 keywords with a total of 3,864,776 characters, and 0.12 seconds to find (and count occurencies of) those keywords in a text of 376,617 characters.
Step 3 uses:
- either, by default, an incremental reconstruction of the failure states (as proposed by B. Meyer in his paper) each time a new keyword is entered, with
one correction to make
Complete_failurealso correctly computeF[T[n,c]](set to state0) for the initial statenequal to state0(this is required byEnter_child) ; - or, when the macro
NMEYER_85is defined at compile time ofaho_corasick.c, a full reconstruction of the failure states (as proposed by Aho and Corasick in their paper) before a match search (the Aho Corasick state machine is made stable: new keywords can be registered (acm_insert_letter_of_keyword) while scanning is already in progress (acm_match)) ;
The implementation makes use of sets and maps provided by the minimaps library.
Type make to run the example.
examples/aho_corasick_generic_test.cgives a complete and commented example.- The novel Mrs Dalloway of Virgina Woolf
mrs_dalloway.txtserves as an input file used by the example.
Have fun !
