bstr

Function decode_utf8

pub fn decode_utf8<B: AsRef<[u8]>>(slice: B) -> (Option<char>, usize)

UTF-8 decode a single Unicode scalar value from the beginning of a slice.

When successful, the corresponding Unicode scalar value is returned along with the number of bytes it was encoded with. The number of bytes consumed for a successful decode is always between 1 and 4, inclusive.

When unsuccessful, None is returned along with the number of bytes that make up a maximal prefix of a valid UTF-8 code unit sequence. In this case, the number of bytes consumed is always between 0 and 3, inclusive, where 0 is only returned when slice is empty.
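The return contract described above can be illustrated with the standard library alone. The sketch below is not bstr's implementation: `decode_one` and `decode_lossy` are hypothetical helpers that lean on `std::str::from_utf8` and the `valid_up_to` and `error_len` methods of `Utf8Error`, on the assumption that std reports invalid sequences with the same "maximal prefix" convention.

```rust
use std::str;

/// Hypothetical std-only stand-in that follows the contract described above:
/// returns the decoded scalar value (if any) and the number of bytes consumed.
fn decode_one(slice: &[u8]) -> (Option<char>, usize) {
    if slice.is_empty() {
        // 0 is only returned for an empty slice.
        return (None, 0);
    }
    // A scalar value occupies at most 4 bytes, so only the first 4 matter.
    let head = &slice[..slice.len().min(4)];
    let valid = match str::from_utf8(head) {
        Ok(s) => s,
        // Some valid text precedes the error: keep just that valid part.
        Err(e) if e.valid_up_to() > 0 => {
            str::from_utf8(&head[..e.valid_up_to()]).unwrap()
        }
        // The very first sequence is bad. `error_len()` is `None` when the
        // input ends mid-sequence; in that case all of `head` (at most 3
        // bytes) is the incomplete prefix.
        Err(e) => return (None, e.error_len().unwrap_or(head.len())),
    };
    let ch = valid.chars().next().unwrap();
    (Some(ch), ch.len_utf8())
}

/// Decode every scalar value, substituting U+FFFD for each invalid sequence.
fn decode_lossy(mut bytes: &[u8]) -> Vec<char> {
    let mut chars = Vec::new();
    while !bytes.is_empty() {
        let (ch, size) = decode_one(bytes);
        bytes = &bytes[size..];
        chars.push(ch.unwrap_or('\u{FFFD}'));
    }
    chars
}
```

Because the consumed length is 0 only for an empty slice, a loop such as `decode_lossy` always makes progress and terminates.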

Examples

Basic usage:

use bstr::decode_utf8;

// Decoding a valid codepoint.
let (ch, size) = decode_utf8(b"\xE2\x98\x83");
assert_eq!(Some('☃'), ch);
assert_eq!(3, size);

// Decoding an incomplete codepoint.
let (ch, size) = decode_utf8(b"\xE2\x98");
assert_eq!(None, ch);
assert_eq!(2, size);

This example shows how to iterate over all codepoints in UTF-8 encoded bytes, while replacing invalid UTF-8 sequences with the replacement codepoint:

use bstr::{B, decode_utf8};

let mut bytes = B(b"\xE2\x98\x83\xFF\xF0\x9D\x9E\x83\xE2\x98\x61");
let mut chars = vec![];
while !bytes.is_empty() {
    let (ch, size) = decode_utf8(bytes);
    bytes = &bytes[size..];
    chars.push(ch.unwrap_or('\u{FFFD}'));
}
assert_eq!(vec!['☃', '\u{FFFD}', '𝞃', '\u{FFFD}', 'a'], chars);
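As a cross-check, the same input can be run through the standard library's one-shot lossy conversion. This is a minimal sketch using only std; `lossy_chars` is a hypothetical helper name, and it allocates an owned string whenever the input contains invalid bytes, which the incremental loop above avoids.

```rust
/// Hypothetical helper: collect the scalar values produced by std's
/// lossy conversion, which inserts U+FFFD for each invalid sequence.
fn lossy_chars(bytes: &[u8]) -> Vec<char> {
    String::from_utf8_lossy(bytes).chars().collect()
}
```

For the byte string used in the example above, this yields the same five scalar values.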