regex_automata::dense

Enum DenseDFA

Source
pub enum DenseDFA<T: AsRef<[S]>, S: StateID> {
    Standard(Standard<T, S>),
    ByteClass(ByteClass<T, S>),
    Premultiplied(Premultiplied<T, S>),
    PremultipliedByteClass(PremultipliedByteClass<T, S>),
    // some variants omitted
}
Expand description

A dense table-based deterministic finite automaton (DFA).

A dense DFA represents the core matching primitive in this crate. That is, logically, all DFAs have a single start state, one or more match states and a transition table that maps the current state and the current byte of input to the next state. A DFA can use this information to implement fast searching. In particular, the use of a dense DFA generally makes the trade off that match speed is the most valuable characteristic, even if building the regex may take significant time and space. As such, the processing of every byte of input is done with a small constant number of operations that does not vary with the pattern, its size or the size of the alphabet. If your needs don’t line up with this trade off, then a dense DFA may not be an adequate solution to your problem.

In contrast, a sparse DFA makes the opposite trade off: it uses less space but will execute a variable number of instructions per byte at match time, which makes it slower for matching.

A DFA can be built using the default configuration via the DenseDFA::new constructor. Otherwise, one can configure various aspects via the dense::Builder.

A single DFA fundamentally supports the following operations:

  1. Detection of a match.
  2. Location of the end of the first possible match.
  3. Location of the end of the leftmost-first match.

A notable absence from the above list of capabilities is the location of the start of a match. In order to provide both the start and end of a match, two DFAs are required. This functionality is provided by a Regex, which can be built with its basic constructor, Regex::new, or with a RegexBuilder.

§State size

A DenseDFA has two type parameters, T and S. T corresponds to the type of the DFA’s transition table while S corresponds to the representation used for the DFA’s state identifiers as described by the StateID trait. This type parameter is typically usize, but other valid choices provided by this crate include u8, u16, u32 and u64. The primary reason for choosing a different state identifier representation than the default is to reduce the amount of memory used by a DFA. Note though, that if the chosen representation cannot accommodate the size of your DFA, then building the DFA will fail and return an error.

While the reduction in heap memory used by a DFA is one reason for choosing a smaller state identifier representation, another possible reason is for decreasing the serialization size of a DFA, as returned by to_bytes_little_endian, to_bytes_big_endian or to_bytes_native_endian.

The type of the transition table is typically either Vec<S> or &[S], depending on where the transition table is stored.

§Variants

This DFA is defined as a non-exhaustive enumeration of different types of dense DFAs. All of these dense DFAs use the same internal representation for the transition table, but they vary in how the transition table is read. A DFA’s specific variant depends on the configuration options set via dense::Builder. The default variant is PremultipliedByteClass.

§The DFA trait

This type implements the DFA trait, which means it can be used for searching. For example:

use regex_automata::{DFA, DenseDFA};

let dfa = DenseDFA::new("foo[0-9]+")?;
assert_eq!(Some(8), dfa.find(b"foo12345"));

The DFA trait also provides an assortment of other lower level methods for DFAs, such as start_state and next_state. While these are correctly implemented, it is an anti-pattern to use them in performance sensitive code on the DenseDFA type directly. Namely, each implementation requires a branch to determine which type of dense DFA is being used. Instead, this branch should be pushed up a layer in the code since walking the transitions of a DFA is usually a hot path. If you do need to use these lower level methods in performance critical code, then you should match on the variants of this DFA and use each variant’s implementation of the DFA trait directly.

Variants§

§

Standard(Standard<T, S>)

A standard DFA that does not use premultiplication or byte classes.

§

ByteClass(ByteClass<T, S>)

A DFA that shrinks its alphabet to a set of equivalence classes instead of using all possible byte values. Any two bytes belong to the same equivalence class if and only if they can be used interchangeably anywhere in the DFA while never discriminating between a match and a non-match.

This type of DFA can result in significant space reduction with a very small match time performance penalty.

§

Premultiplied(Premultiplied<T, S>)

A DFA that premultiplies all of its state identifiers in its transition table. This saves an instruction per byte at match time which improves search performance.

The only downside of premultiplication is that it may prevent one from using a smaller state identifier representation than you otherwise could.

§

PremultipliedByteClass(PremultipliedByteClass<T, S>)

The default configuration of a DFA, which uses byte classes and premultiplies its state identifiers.

Implementations§

Source§

impl DenseDFA<Vec<usize>, usize>

Source

pub fn new(pattern: &str) -> Result<DenseDFA<Vec<usize>, usize>, Error>

Parse the given regular expression using a default configuration and return the corresponding DFA.

The default configuration uses usize for state IDs, premultiplies them and reduces the alphabet size by splitting bytes into equivalence classes. The DFA is not minimized.

If you want a non-default configuration, then use the dense::Builder to set your own configuration.

§Example
use regex_automata::{DFA, DenseDFA};

let dfa = DenseDFA::new("foo[0-9]+bar")?;
assert_eq!(Some(11), dfa.find(b"foo12345bar"));
Source§

impl<S: StateID> DenseDFA<Vec<S>, S>

Source

pub fn empty() -> DenseDFA<Vec<S>, S>

Create a new empty DFA that never matches any input.

§Example

In order to build an empty DFA, callers must provide a type hint indicating their choice of state identifier representation.

use regex_automata::{DFA, DenseDFA};

let dfa: DenseDFA<Vec<usize>, usize> = DenseDFA::empty();
assert_eq!(None, dfa.find(b""));
assert_eq!(None, dfa.find(b"foo"));
Source§

impl<T: AsRef<[S]>, S: StateID> DenseDFA<T, S>

Source

pub fn as_ref<'a>(&'a self) -> DenseDFA<&'a [S], S>

Cheaply return a borrowed version of this dense DFA. Specifically, the DFA returned always uses &[S] for its transition table while keeping the same state identifier representation.

Source

pub fn to_owned(&self) -> DenseDFA<Vec<S>, S>

Return an owned version of this sparse DFA. Specifically, the DFA returned always uses Vec<u8> for its transition table while keeping the same state identifier representation.

Effectively, this returns a sparse DFA whose transition table lives on the heap.

Source

pub fn memory_usage(&self) -> usize

Returns the memory usage, in bytes, of this DFA.

The memory usage is computed based on the number of bytes used to represent this DFA’s transition table. This corresponds to heap memory usage.

This does not include the stack size used up by this DFA. To compute that, used std::mem::size_of::<DenseDFA>().

Source§

impl<T: AsRef<[S]>, S: StateID> DenseDFA<T, S>

Routines for converting a dense DFA to other representations, such as sparse DFAs, smaller state identifiers or raw bytes suitable for persistent storage.

Source

pub fn to_sparse(&self) -> Result<SparseDFA<Vec<u8>, S>, Error>

Convert this dense DFA to a sparse DFA.

This is a convenience routine for to_sparse_sized that fixes the state identifier representation of the sparse DFA to the same representation used for this dense DFA.

If the chosen state identifier representation is too small to represent all states in the sparse DFA, then this returns an error. In most cases, if a dense DFA is constructable with S then a sparse DFA will be as well. However, it is not guaranteed.

§Example
use regex_automata::{DFA, DenseDFA};

let dense = DenseDFA::new("foo[0-9]+")?;
let sparse = dense.to_sparse()?;
assert_eq!(Some(8), sparse.find(b"foo12345"));
Source

pub fn to_sparse_sized<A: StateID>( &self, ) -> Result<SparseDFA<Vec<u8>, A>, Error>

Convert this dense DFA to a sparse DFA.

Using this routine requires supplying a type hint to choose the state identifier representation for the resulting sparse DFA.

If the chosen state identifier representation is too small to represent all states in the sparse DFA, then this returns an error.

§Example
use regex_automata::{DFA, DenseDFA};

let dense = DenseDFA::new("foo[0-9]+")?;
let sparse = dense.to_sparse_sized::<u8>()?;
assert_eq!(Some(8), sparse.find(b"foo12345"));
Source

pub fn to_u8(&self) -> Result<DenseDFA<Vec<u8>, u8>, Error>

Create a new DFA whose match semantics are equivalent to this DFA, but attempt to use u8 for the representation of state identifiers. If u8 is insufficient to represent all state identifiers in this DFA, then this returns an error.

This is a convenience routine for to_sized::<u8>().

Source

pub fn to_u16(&self) -> Result<DenseDFA<Vec<u16>, u16>, Error>

Create a new DFA whose match semantics are equivalent to this DFA, but attempt to use u16 for the representation of state identifiers. If u16 is insufficient to represent all state identifiers in this DFA, then this returns an error.

This is a convenience routine for to_sized::<u16>().

Source

pub fn to_u32(&self) -> Result<DenseDFA<Vec<u32>, u32>, Error>

Create a new DFA whose match semantics are equivalent to this DFA, but attempt to use u32 for the representation of state identifiers. If u32 is insufficient to represent all state identifiers in this DFA, then this returns an error.

This is a convenience routine for to_sized::<u32>().

Source

pub fn to_u64(&self) -> Result<DenseDFA<Vec<u64>, u64>, Error>

Create a new DFA whose match semantics are equivalent to this DFA, but attempt to use u64 for the representation of state identifiers. If u64 is insufficient to represent all state identifiers in this DFA, then this returns an error.

This is a convenience routine for to_sized::<u64>().

Source

pub fn to_sized<A: StateID>(&self) -> Result<DenseDFA<Vec<A>, A>, Error>

Create a new DFA whose match semantics are equivalent to this DFA, but attempt to use A for the representation of state identifiers. If A is insufficient to represent all state identifiers in this DFA, then this returns an error.

An alternative way to construct such a DFA is to use dense::Builder::build_with_size. In general, using the builder is preferred since it will use the given state identifier representation throughout determinization (and minimization, if done), and thereby using less memory throughout the entire construction process. However, these routines are necessary in cases where, say, a minimized DFA could fit in a smaller state identifier representation, but the initial determinized DFA would not.

Source

pub fn to_bytes_little_endian(&self) -> Result<Vec<u8>, Error>

Serialize a DFA to raw bytes, aligned to an 8 byte boundary, in little endian format.

If the state identifier representation of this DFA has a size different than 1, 2, 4 or 8 bytes, then this returns an error. All implementations of StateID provided by this crate satisfy this requirement.

Source

pub fn to_bytes_big_endian(&self) -> Result<Vec<u8>, Error>

Serialize a DFA to raw bytes, aligned to an 8 byte boundary, in big endian format.

If the state identifier representation of this DFA has a size different than 1, 2, 4 or 8 bytes, then this returns an error. All implementations of StateID provided by this crate satisfy this requirement.

Source

pub fn to_bytes_native_endian(&self) -> Result<Vec<u8>, Error>

Serialize a DFA to raw bytes, aligned to an 8 byte boundary, in native endian format. Generally, it is better to pick an explicit endianness using either to_bytes_little_endian or to_bytes_big_endian. This routine is useful in tests where the DFA is serialized and deserialized on the same platform.

If the state identifier representation of this DFA has a size different than 1, 2, 4 or 8 bytes, then this returns an error. All implementations of StateID provided by this crate satisfy this requirement.

Source§

impl<'a, S: StateID> DenseDFA<&'a [S], S>

Source

pub unsafe fn from_bytes(buf: &'a [u8]) -> DenseDFA<&'a [S], S>

Deserialize a DFA with a specific state identifier representation.

Deserializing a DFA using this routine will never allocate heap memory. This is also guaranteed to be a constant time operation that does not vary with the size of the DFA.

The bytes given should be generated by the serialization of a DFA with either the to_bytes_little_endian method or the to_bytes_big_endian endian, depending on the endianness of the machine you are deserializing this DFA from.

If the state identifier representation is usize, then deserialization is dependent on the pointer size. For this reason, it is best to serialize DFAs using a fixed size representation for your state identifiers, such as u8, u16, u32 or u64.

§Panics

The bytes given should be trusted. In particular, if the bytes are not a valid serialization of a DFA, or if the given bytes are not aligned to an 8 byte boundary, or if the endianness of the serialized bytes is different than the endianness of the machine that is deserializing the DFA, then this routine will panic. Moreover, it is possible for this deserialization routine to succeed even if the given bytes do not represent a valid serialized dense DFA.

§Safety

This routine is unsafe because it permits callers to provide an arbitrary transition table with possibly incorrect transitions. While the various serialization routines will never return an incorrect transition table, there is no guarantee that the bytes provided here are correct. While deserialization does many checks (as documented above in the panic conditions), this routine does not check that the transition table is correct. Given an incorrect transition table, it is possible for the search routines to access out-of-bounds memory because of explicit bounds check elision.

§Example

This example shows how to serialize a DFA to raw bytes, deserialize it and then use it for searching. Note that we first convert the DFA to using u16 for its state identifier representation before serializing it. While this isn’t strictly necessary, it’s good practice in order to decrease the size of the DFA and to avoid platform specific pitfalls such as differing pointer sizes.

use regex_automata::{DFA, DenseDFA};

let initial = DenseDFA::new("foo[0-9]+")?;
let bytes = initial.to_u16()?.to_bytes_native_endian()?;
let dfa: DenseDFA<&[u16], u16> = unsafe {
    DenseDFA::from_bytes(&bytes)
};

assert_eq!(Some(8), dfa.find(b"foo12345"));

Trait Implementations§

Source§

impl<T: Clone + AsRef<[S]>, S: Clone + StateID> Clone for DenseDFA<T, S>

Source§

fn clone(&self) -> DenseDFA<T, S>

Returns a copy of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl<T: AsRef<[S]>, S: StateID> DFA for DenseDFA<T, S>

Source§

type ID = S

The representation used for state identifiers in this DFA. Read more
Source§

fn start_state(&self) -> S

Return the identifier of this DFA’s start state.
Source§

fn is_match_state(&self, id: S) -> bool

Returns true if and only if the given identifier corresponds to a match state.
Source§

fn is_dead_state(&self, id: S) -> bool

Returns true if and only if the given identifier corresponds to a dead state. When a DFA enters a dead state, it is impossible to leave and thus can never lead to a match.
Source§

fn is_match_or_dead_state(&self, id: S) -> bool

Returns true if and only if the given identifier corresponds to either a dead state or a match state, such that one of is_match_state(id) or is_dead_state(id) must return true. Read more
Source§

fn is_anchored(&self) -> bool

Returns true if and only if this DFA is anchored. Read more
Source§

fn next_state(&self, current: S, input: u8) -> S

Given the current state that this DFA is in and the next input byte, this method returns the identifier of the next state. The identifier returned is always valid, but it may correspond to a dead state.
Source§

unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S

Like next_state, but its implementation may look up the next state without memory safety checks such as bounds checks. As such, callers must ensure that the given identifier corresponds to a valid DFA state. Implementors must, in turn, ensure that this routine is safe for all valid state identifiers and for all possible u8 values.
Source§

fn is_match_at(&self, bytes: &[u8], start: usize) -> bool

Returns the same as is_match, but starts the search at the given offset. Read more
Source§

fn shortest_match_at(&self, bytes: &[u8], start: usize) -> Option<usize>

Returns the same as shortest_match, but starts the search at the given offset. Read more
Source§

fn find_at(&self, bytes: &[u8], start: usize) -> Option<usize>

Returns the same as find, but starts the search at the given offset. Read more
Source§

fn rfind_at(&self, bytes: &[u8], start: usize) -> Option<usize>

Returns the same as rfind, but starts the search at the given offset. Read more
Source§

fn is_match(&self, bytes: &[u8]) -> bool

Returns true if and only if the given bytes match this DFA. Read more
Source§

fn shortest_match(&self, bytes: &[u8]) -> Option<usize>

Returns the first position at which a match is found. Read more
Source§

fn find(&self, bytes: &[u8]) -> Option<usize>

Returns the end offset of the longest match. If no match exists, then None is returned. Read more
Source§

fn rfind(&self, bytes: &[u8]) -> Option<usize>

Returns the start offset of the longest match in reverse, by searching from the end of the input towards the start of the input. If no match exists, then None is returned. In other words, this has the same match semantics as find, but in reverse. Read more
Source§

impl<T: Debug + AsRef<[S]>, S: Debug + StateID> Debug for DenseDFA<T, S>

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

§

impl<T, S> Freeze for DenseDFA<T, S>
where S: Freeze, T: Freeze,

§

impl<T, S> RefUnwindSafe for DenseDFA<T, S>

§

impl<T, S> Send for DenseDFA<T, S>
where S: Send, T: Send,

§

impl<T, S> Sync for DenseDFA<T, S>
where S: Sync, T: Sync,

§

impl<T, S> Unpin for DenseDFA<T, S>
where S: Unpin, T: Unpin,

§

impl<T, S> UnwindSafe for DenseDFA<T, S>
where S: UnwindSafe, T: UnwindSafe,

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dst: *mut T)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.