aho_corasick

Struct AhoCorasickBuilder

Source
pub struct AhoCorasickBuilder { /* private fields */ }
Expand description

A builder for configuring an Aho-Corasick automaton.

Implementations§

Source§

impl AhoCorasickBuilder

Source

pub fn new() -> AhoCorasickBuilder

Create a new builder for configuring an Aho-Corasick automaton.

If you don’t need fine grained configuration or aren’t sure which knobs to set, try using AhoCorasick::new_auto_configured instead.

Source

pub fn build<I, P>(&self, patterns: I) -> AhoCorasick
where I: IntoIterator<Item = P>, P: AsRef<[u8]>,

Build an Aho-Corasick automaton using the configuration set on this builder.

A builder may be reused to create more automatons.

This method will use the default for representing internal state identifiers, which is usize. This guarantees that building the automaton will succeed and is generally a good default, but can make the size of the automaton 2-8 times bigger than it needs to be, depending on your target platform.

§Examples

Basic usage:

use aho_corasick::AhoCorasickBuilder;

let patterns = &["foo", "bar", "baz"];
let ac = AhoCorasickBuilder::new()
    .build(patterns);
assert_eq!(Some(1), ac.find("xxx bar xxx").map(|m| m.pattern()));
Source

pub fn build_with_size<S, I, P>( &self, patterns: I, ) -> Result<AhoCorasick<S>, Error>
where S: StateID, I: IntoIterator<Item = P>, P: AsRef<[u8]>,

Build an Aho-Corasick automaton using the configuration set on this builder with a specific state identifier representation. This only has an effect when the dfa option is enabled.

Generally, the choices for a state identifier representation are u8, u16, u32, u64 or usize, with usize being the default. The advantage of choosing a smaller state identifier representation is that the automaton produced will be smaller. This might be beneficial for just generally using less space, or might even allow it to fit more of the automaton in your CPU’s cache, leading to overall better search performance.

Unlike the standard build method, this can report an error if the state identifier representation cannot support the size of the automaton.

Note that the state identifier representation is determined by the S type variable. This requires a type hint of some sort, either by specifying the return type or using the turbofish, e.g., build_with_size::<u16, _, _>(...).

§Examples

Basic usage:

use aho_corasick::{AhoCorasick, AhoCorasickBuilder};

let patterns = &["foo", "bar", "baz"];
let ac: AhoCorasick<u8> = AhoCorasickBuilder::new()
    .build_with_size(patterns)?;
assert_eq!(Some(1), ac.find("xxx bar xxx").map(|m| m.pattern()));

Or alternatively, with turbofish:

use aho_corasick::AhoCorasickBuilder;

let patterns = &["foo", "bar", "baz"];
let ac = AhoCorasickBuilder::new()
    .build_with_size::<u8, _, _>(patterns)?;
assert_eq!(Some(1), ac.find("xxx bar xxx").map(|m| m.pattern()));
Source

pub fn auto_configure<B: AsRef<[u8]>>( &mut self, patterns: &[B], ) -> &mut AhoCorasickBuilder

Automatically configure the settings on this builder according to the patterns that will be used to construct the automaton.

The idea here is to balance space and time automatically. That is, when searching a small number of patterns, this will attempt to use the fastest possible configuration since the total space required will be small anyway. As the number of patterns grows, this will fall back to slower configurations that use less space.

This is guaranteed to never set match_kind, but any other option may be overridden.

§Examples

Basic usage:

use aho_corasick::AhoCorasickBuilder;

let patterns = &["foo", "bar", "baz"];
let ac = AhoCorasickBuilder::new()
    .auto_configure(patterns)
    .build(patterns);
assert_eq!(Some(1), ac.find("xxx bar xxx").map(|m| m.pattern()));
Source

pub fn match_kind(&mut self, kind: MatchKind) -> &mut AhoCorasickBuilder

Set the desired match semantics.

The default is MatchKind::Standard, which corresponds to the match semantics supported by the standard textbook description of the Aho-Corasick algorithm. Namely, matches are reported as soon as they are found. Moreover, this is the only way to get overlapping matches or do stream searching.

The other kinds of match semantics that are supported are MatchKind::LeftmostFirst and MatchKind::LeftmostLongest. The former corresponds to the match you would get if you were to try to match each pattern at each position in the haystack in the same order that you give to the automaton. That is, it returns the leftmost match corresponding the earliest pattern given to the automaton. The latter corresponds to finding the longest possible match among all leftmost matches.

For more details on match semantics, see the documentation for MatchKind.

§Examples

In these examples, we demonstrate the differences between match semantics for a particular set of patterns in a specific order: b, abc, abcd.

Standard semantics:

use aho_corasick::{AhoCorasickBuilder, MatchKind};

let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";

let ac = AhoCorasickBuilder::new()
    .match_kind(MatchKind::Standard) // default, not necessary
    .build(patterns);
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("b", &haystack[mat.start()..mat.end()]);

Leftmost-first semantics:

use aho_corasick::{AhoCorasickBuilder, MatchKind};

let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";

let ac = AhoCorasickBuilder::new()
    .match_kind(MatchKind::LeftmostFirst)
    .build(patterns);
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("abc", &haystack[mat.start()..mat.end()]);

Leftmost-longest semantics:

use aho_corasick::{AhoCorasickBuilder, MatchKind};

let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";

let ac = AhoCorasickBuilder::new()
    .match_kind(MatchKind::LeftmostLongest)
    .build(patterns);
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("abcd", &haystack[mat.start()..mat.end()]);
Source

pub fn anchored(&mut self, yes: bool) -> &mut AhoCorasickBuilder

Enable anchored mode, which requires all matches to start at the first position in a haystack.

This option is disabled by default.

§Examples

Basic usage:

use aho_corasick::AhoCorasickBuilder;

let patterns = &["foo", "bar"];
let haystack = "foobar";

let ac = AhoCorasickBuilder::new()
    .anchored(true)
    .build(patterns);
assert_eq!(1, ac.find_iter(haystack).count());

When searching for overlapping matches, all matches that start at the beginning of a haystack will be reported:

use aho_corasick::AhoCorasickBuilder;

let patterns = &["foo", "foofoo"];
let haystack = "foofoo";

let ac = AhoCorasickBuilder::new()
    .anchored(true)
    .build(patterns);
assert_eq!(2, ac.find_overlapping_iter(haystack).count());
// A non-anchored search would return 3 matches.
Source

pub fn ascii_case_insensitive(&mut self, yes: bool) -> &mut AhoCorasickBuilder

Enable ASCII-aware case insensitive matching.

When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only.

Enabling this option does not change the search algorithm, but it may increase the size of the automaton.

NOTE: In the future, support for full Unicode case insensitivity may be added, but ASCII case insensitivity is comparatively much simpler to add.

§Examples

Basic usage:

use aho_corasick::AhoCorasickBuilder;

let patterns = &["FOO", "bAr", "BaZ"];
let haystack = "foo bar baz";

let ac = AhoCorasickBuilder::new()
    .ascii_case_insensitive(true)
    .build(patterns);
assert_eq!(3, ac.find_iter(haystack).count());
Source

pub fn dense_depth(&mut self, depth: usize) -> &mut AhoCorasickBuilder

Set the limit on how many NFA states use a dense representation for their transitions.

A dense representation uses more space, but supports faster access to transitions at search time. Thus, this setting permits the control of a space vs time trade off when using the NFA variant of Aho-Corasick.

This limit is expressed in terms of the depth of a state, i.e., the number of transitions from the starting state of the NFA. The idea is that most of the time searching will be spent near the starting state of the automaton, so states near the start state should use a dense representation. States further away from the start state would then use a sparse representation, which uses less space but is slower to access transitions at search time.

By default, this is set to a low but non-zero number.

This setting has no effect if the dfa option is enabled.

Source

pub fn dfa(&mut self, yes: bool) -> &mut AhoCorasickBuilder

Compile the standard Aho-Corasick automaton into a deterministic finite automaton (DFA).

When this is disabled (which is the default), then a non-deterministic finite automaton (NFA) is used instead.

The main benefit to a DFA is that it can execute searches more quickly than a NFA (perhaps 2-4 times as fast). The main drawback is that the DFA uses more space and can take much longer to build.

Enabling this option does not change the time complexity for constructing the Aho-Corasick automaton (which is O(p) where p is the total number of patterns being compiled). Enabling this option does however reduce the time complexity of non-overlapping searches from O(n + p) to O(n), where n is the length of the haystack.

In general, it’s a good idea to enable this if you’re searching a small number of fairly short patterns (~1000), or if you want the fastest possible search without regard to compilation time or space usage.

Source

pub fn prefilter(&mut self, yes: bool) -> &mut AhoCorasickBuilder

Enable heuristic prefilter optimizations.

When enabled, searching will attempt to quickly skip to match candidates using specialized literal search routines. A prefilter cannot always be used, and is generally treated as a heuristic. It can be useful to disable this if the prefilter is observed to be sub-optimal for a particular workload.

This is enabled by default.

Source

pub fn byte_classes(&mut self, yes: bool) -> &mut AhoCorasickBuilder

👎Deprecated since 0.7.16: not carrying its weight, will be always enabled, see: https://github.com/BurntSushi/aho-corasick/issues/57

Shrink the size of the transition alphabet by mapping bytes to their equivalence classes. This only has an effect when the dfa option is enabled.

When enabled, each a DFA will use a map from all possible bytes to their corresponding equivalence class. Each equivalence class represents a set of bytes that does not discriminate between a match and a non-match in the DFA. For example, the patterns bar and baz have at least five equivalence classes: singleton sets of b, a, r and z, and a final set that contains every other byte.

The advantage of this map is that the size of the transition table can be reduced drastically from #states * 256 * sizeof(id) to #states * k * sizeof(id) where k is the number of equivalence classes. As a result, total space usage can decrease substantially. Moreover, since a smaller alphabet is used, compilation becomes faster as well.

The disadvantage of this map is that every byte searched must be passed through this map before it can be used to determine the next transition. This has a small match time performance cost. However, if the DFA is otherwise very large without byte classes, then using byte classes can greatly improve memory locality and thus lead to better overall performance.

This option is enabled by default.

Source

pub fn premultiply(&mut self, yes: bool) -> &mut AhoCorasickBuilder

👎Deprecated since 0.7.16: not carrying its weight, will be always enabled, see: https://github.com/BurntSushi/aho-corasick/issues/57

Premultiply state identifiers in the transition table. This only has an effect when the dfa option is enabled.

When enabled, state identifiers are premultiplied to point to their corresponding row in the transition table. That is, given the ith state, its corresponding premultiplied identifier is i * k where k is the alphabet size of the automaton. (The alphabet size is at most 256, but is in practice smaller if byte classes is enabled.)

When state identifiers are not premultiplied, then the identifier of the ith state is i.

The advantage of premultiplying state identifiers is that is saves a multiplication instruction per byte when searching with a DFA. This has been observed to lead to a 20% performance benefit in micro-benchmarks.

The primary disadvantage of premultiplying state identifiers is that they require a larger integer size to represent. For example, if the DFA has 200 states, then its premultiplied form requires 16 bits to represent every possible state identifier, where as its non-premultiplied form only requires 8 bits.

This option is enabled by default.

Trait Implementations§

Source§

impl Clone for AhoCorasickBuilder

Source§

fn clone(&self) -> AhoCorasickBuilder

Returns a copy of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for AhoCorasickBuilder

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl Default for AhoCorasickBuilder

Source§

fn default() -> AhoCorasickBuilder

Returns the “default value” for a type. Read more

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dst: *mut T)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.