This is a pitch for some low-level Unicode operations to enable libraries to implement their own String-like functionality and types. The purpose of this API is to present low-level core components out of which libraries can make types and higher-level API (i.e. tools for tool-makers).
I'm interested in hearing about capabilities that libraries may need and getting design feedback.
String performs many Unicode operations using a mix of internal and publicly available functionality in the stdlib. The stdlib provides some limited general-purpose Unicode API, though this is a very dusty corner of the stdlib. We want to enable libraries to vend their own String-like types and functionality.
One interesting use case to consider is ephemeral byte strings backed by chunks of data in contiguous memory. These chunks of data could be synchronous (i.e. Sequence/Iterator) or asynchronous (i.e. AsyncSequence/AsyncIterator). Their buffers are ephemeral, meaning there is no general way to fully reset the stream back to an earlier state (i.e. there is no Index). The buffers might not be segmented along code unit, scalar, or grapheme cluster boundaries, meaning that a produced value might span segments.
We're aiming for a solution that provides:
- Simple, composable pieces to build up abstraction hierarchies
- Efficient, safe buffer-based interfaces to drill down through abstraction hierarchies
- A strategy for library-extensibility as well as compatibility with existing stdlib API and concepts
Note: Many of the approaches proposed are dependent on recently developed features such as typed throws and non-escapable values, whose ABI impacts may not be fully fleshed out. This may motivate splitting functionality across multiple proposals and releases.
Validation API and producing errors as part of decoding are often-requested features. Below are errors related to Unicode encodings:
extension Unicode.UTF8 {
public enum DecodingError: Error {
case expectedStarter
case expectedContinuation
case overlongEncoding
case invalidCodePoint
case invalidStarterByte
}
}
extension Unicode.UTF16 {
public enum DecodingError: Error {
case expectedTrailingSurrogate
case unexpectedTrailingSurrogate
}
}
extension Unicode.UTF32 {
public enum DecodingError: Error {
case invalidCodePoint
}
}
Alternative: a single error enum, noting that some error cases are irrelevant in some encodings
Alternative / Investigation: making the error type an associated type on a protocol for library-provided encodings to customize
When it comes to validating bytes, the byte source might be ephemeral (e.g. a single-pass Sequence) or it might be possible to re-visit contents by position (e.g. a Collection).
extension Unicode.UTFX { // `UTFX` meaning UTF8, UTF16, and UTF32
public struct CollectionDecodingError<Index: Comparable>: Error {
public var kind: Unicode.UTFX.DecodingError
public var range: Range<Index>
}
public struct ByteStreamDecodingError: Error {
public var kind: Unicode.UTFX.DecodingError
public var bytes: (UInt8, UInt8?, UInt8?)
}
}
Knowing the kind of encoding error and bytes involved can be very helpful. Overlong encodings are often an intentional attempt to compromise security. Some systems may want to use custom error-correction (i.e. 128 distinct replacement characters) such that the corrected bytes are valid Unicode while also preserving the original bits. Additionally, knowing the encoding error can be helpful for debugging.
Validation API and concerns are further discussed later in this document.
Endianness (byte ordering or memory ordering) denotes whether the first byte received contains the high bits or low bits of a code unit. It is relevant for multi-byte encodings (UTF-16 and UTF-32, but not UTF-8).
public enum Endianness {
case little
case big
/// The platform's native byte-ordering in memory
public static var native: Self
}
Alternative: Endianness could be considered in combination with the encoding, yielding e.g. UTF16BE and UTF16LE encodings. Or, it could be considered a property of a serialization format.
Alternative: An alternate name for Endianness could be ByteOrder.
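To make the role of byte order concrete, here is a minimal sketch of assembling UTF-16 code units from raw bytes. The `ByteOrderSketch` enum and the `utf16CodeUnits(from:byteOrder:)` function are hypothetical stand-ins written for this illustration; they are not part of the pitched API or the stdlib.

```swift
/// Hypothetical stand-in for the pitched `Endianness` type.
enum ByteOrderSketch { case little, big }

/// Combines byte pairs into UTF-16 code units.
/// A trailing odd byte is simply ignored in this sketch.
func utf16CodeUnits(from bytes: [UInt8], byteOrder: ByteOrderSketch) -> [UInt16] {
    var units: [UInt16] = []
    var index = 0
    while index + 1 < bytes.count {
        let first = UInt16(bytes[index])
        let second = UInt16(bytes[index + 1])
        switch byteOrder {
        case .big:    units.append(first << 8 | second)   // first byte holds the high bits
        case .little: units.append(second << 8 | first)   // first byte holds the low bits
        }
        index += 2
    }
    return units
}

// U+20AC (EURO SIGN) is the single code unit 0x20AC in UTF-16.
let bigEndianUnits = utf16CodeUnits(from: [0x20, 0xAC], byteOrder: .big)
let littleEndianUnits = utf16CodeUnits(from: [0xAC, 0x20], byteOrder: .little)
// Reading big-endian bytes with the wrong byte order yields 0xAC20 instead.
let misreadUnits = utf16CodeUnits(from: [0x20, 0xAC], byteOrder: .little)
```

The misread case is still a well-formed code unit, which is why byte order generally has to be known up front rather than discovered through validation errors.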
The stdlib has some existing functionality along these lines, but it's inadequate for byte stream validation and decoding. Instance methods UnicodeCodec.decode and parseScalar do not produce meaningful error information and they operate only in terms of fully-formed code units and complete scalars.
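As a rough illustration of that limitation, the sketch below drives the existing `UTF8` codec's `decode` method over two different kinds of bad input. The helper name `tallyExistingDecode` is made up for this example; everything it calls is existing stdlib API.

```swift
/// Decodes `bytes` with the existing codec API and tallies what it reports.
func tallyExistingDecode(_ bytes: [UInt8]) -> (scalars: [Unicode.Scalar], errors: Int) {
    var codec = UTF8()
    var iterator = bytes.makeIterator()
    var scalars: [Unicode.Scalar] = []
    var errors = 0
    while true {
        switch codec.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .error:
            // No payload: we cannot tell a truncated scalar from an
            // overlong encoding, nor which bytes were involved.
            errors += 1
        case .emptyInput:
            return (scalars, errors)
        }
    }
}

// "A" followed by a truncated three-byte sequence (the first two bytes of U+20AC).
let truncatedTally = tallyExistingDecode([0x41, 0xE2, 0x82])
// An overlong two-byte encoding of NUL.
let overlongTally = tallyExistingDecode([0xC0, 0x80])
```

Both inputs surface as the same payload-free `.error` case, so a caller that wants to distinguish them has to re-inspect the bytes itself.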
We propose stateful ByteStreamDecoder structs. These are statically associated with an encoding and, at initialization time, a byte order. They store a small internal buffer of bytes until enough bytes have been seen to produce a Unicode scalar. They can be fed data in a byte at a time, which allows developers to feed data in as it is received and not worry about handling scalar alignment.
While all names in this rough draft are strawperson names, the method name consume below is a particularly strawy name. It is fairly unpalatable and meant to be a placeholder. Alternate names such as read, receive, input (as a present-tense verb), streamIn, next, feed, feedIn, etc., are not much better. The name decode doesn't carry the implication of an in-progress or suspended operation.
public protocol UnicodeByteStreamDecoder {
/// Input a byte, returns a finished scalar or `nil`.
/// Throws a decoding error.
mutating func consume(
_ byte: UInt8
) throws -> Unicode.Scalar?
/// We've reached the end of input. If there's an unfinished
/// scalar in progress, throws the appropriate encoding error
func finalize() throws
// Customization points:
/// Read bytes until yielding a decoded scalar.
///
/// Throws validation errors.
///
/// Returns `nil` when `bytes` is done. `bytes` may not
/// have finished a scalar and `self` may contain some
/// bytes of an in-progress scalar value.
mutating func consume(
_ bytes: inout some IteratorProtocol<UInt8>
) throws -> Unicode.Scalar?
/// Read bytes asynchronously until yielding a decoded scalar.
///
/// Throws validation errors and rethrows upstream errors.
///
/// Returns `nil` when `bytes` is done. `bytes` may not
/// have finished a scalar and `self` may contain some
/// bytes of an in-progress scalar value.
mutating func consume<AI: AsyncIteratorProtocol>(
_ bytes: inout AI
) async throws -> Unicode.Scalar?
where AI.Element == UInt8
/// Read bytes starting from `position` and yielding a decoded scalar
/// and the position of the start of the next scalar.
///
/// Throws validation errors.
///
/// Returns `nil` when `bytes` is done. `bytes` may not
/// have finished a scalar and `self` may contain some
/// bytes of an in-progress scalar value.
///
/// INVESTIGATE: take `position` inout so that it gets updated rather
/// than requiring the caller to update a local `var`.
///
/// INVESTIGATE: Alternative: take a slice `inout`, but
/// we'd want to make sure it makes sense for non-copyable
/// slices
mutating func consume<C: Collection<UInt8>>(
_ bytes: C,
startingFrom position: C.Index
) throws -> (Unicode.Scalar, scalarEnd: C.Index)?
/// INVESTIGATE: API to access the internal buffer, such as
/// whether it is empty, its contents, clearing it, etc.
}
extension Unicode.UTF8 {
public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
public init()
/// Input a single byte.
public mutating func consume(
_ byte: UInt8
) throws -> Unicode.Scalar?
public func finalize() throws
}
}
extension Unicode.UTF16 {
public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
public init(byteOrder: Endianness)
public mutating func consume(
_ byte: UInt8
) throws -> Unicode.Scalar?
public func finalize() throws
}
}
extension Unicode.UTF32 {
public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
public init(byteOrder: Endianness)
public mutating func consume(
_ byte: UInt8
) throws -> Unicode.Scalar?
public func finalize() throws
}
}
// Default implementations
extension UnicodeByteStreamDecoder {
/// Read bytes until yielding a decoded scalar.
///
/// Throws validation errors.
///
/// Returns `nil` when `bytes` is done. `bytes` may not
/// have finished a scalar and `self` may contain some
/// bytes of an in-progress scalar value.
public mutating func consume(
_ bytes: inout some IteratorProtocol<UInt8>
) throws -> Unicode.Scalar?
/// Read bytes asynchronously until yielding a decoded scalar.
///
/// Throws validation errors and rethrows upstream errors.
///
/// Returns `nil` when `bytes` is done. `bytes` may not
/// have finished a scalar and `self` may contain some
/// bytes of an in-progress scalar value.
public mutating func consume<AI: AsyncIteratorProtocol>(
_ bytes: inout AI
) async throws -> Unicode.Scalar?
where AI.Element == UInt8
/// Read bytes starting from `position` and yielding a decoded scalar
/// and the position of the start of the next scalar.
///
/// Throws validation errors.
///
/// Returns `nil` when `bytes` is done. `bytes` may not
/// have finished a scalar and `self` may contain some
/// bytes of an in-progress scalar value.
///
/// INVESTIGATE: take `position` inout so that it gets updated rather
/// than requiring the caller to update a local `var`.
///
/// INVESTIGATE: Alternative: take a slice `inout`, but
/// we'd want to make sure it makes sense for non-copyable
/// slices
public mutating func consume<C: Collection<UInt8>>(
_ bytes: C,
startingFrom position: C.Index
) throws -> (Unicode.Scalar, scalarEnd: C.Index)?
}
A decoder with undecoded contents left in its buffer could be an indication of programmer error. The finalize method checks for this. We could also consider exposing some properties or access to the buffer itself.
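To give a feel for the byte-at-a-time contract, here is a minimal sketch of a UTF-8 decoder with `consume` and `finalize` methods. The names `UTF8SketchError`, `UTF8StreamDecoderSketch`, and `decodeAll` are hypothetical and exist only for this example. The sketch makes simplifying assumptions: overlong encodings are only detected once the whole sequence has been read, and a byte that triggers an error is dropped rather than re-examined as a possible starter.

```swift
/// Hypothetical error kinds mirroring the pitched `UTF8.DecodingError`.
enum UTF8SketchError: Error, Equatable {
    case expectedStarter, expectedContinuation, overlongEncoding
    case invalidCodePoint, invalidStarterByte
}

/// A byte-at-a-time UTF-8 decoder sketch; not the pitched implementation.
struct UTF8StreamDecoderSketch {
    var value: UInt32 = 0      // scalar bits accumulated so far
    var remaining = 0          // continuation bytes still expected
    var minimum: UInt32 = 0    // smallest value allowed for this sequence length

    /// Inputs a byte; returns a finished scalar, or `nil` if more bytes are needed.
    mutating func consume(_ byte: UInt8) throws -> Unicode.Scalar? {
        if remaining == 0 {
            switch byte {
            case 0x00...0x7F:
                return Unicode.Scalar(byte)
            case 0x80...0xBF:
                throw UTF8SketchError.expectedStarter
            case 0xC0...0xDF:
                (value, remaining, minimum) = (UInt32(byte & 0x1F), 1, 0x80)
            case 0xE0...0xEF:
                (value, remaining, minimum) = (UInt32(byte & 0x0F), 2, 0x800)
            case 0xF0...0xF7:
                (value, remaining, minimum) = (UInt32(byte & 0x07), 3, 0x1_0000)
            default:
                throw UTF8SketchError.invalidStarterByte
            }
            return nil
        }
        guard byte & 0xC0 == 0x80 else {
            remaining = 0
            throw UTF8SketchError.expectedContinuation
        }
        value = value << 6 | UInt32(byte & 0x3F)
        remaining -= 1
        guard remaining == 0 else { return nil }
        guard value >= minimum else { throw UTF8SketchError.overlongEncoding }
        // The failable initializer rejects surrogates and values above U+10FFFF.
        guard let scalar = Unicode.Scalar(value) else {
            throw UTF8SketchError.invalidCodePoint
        }
        return scalar
    }

    /// Throws if the input ended in the middle of a scalar.
    func finalize() throws {
        if remaining != 0 { throw UTF8SketchError.expectedContinuation }
    }
}

/// Feeds every byte through the sketch decoder and collects the scalars.
func decodeAll(_ bytes: [UInt8]) throws -> [Unicode.Scalar] {
    var decoder = UTF8StreamDecoderSketch()
    var scalars: [Unicode.Scalar] = []
    for byte in bytes {
        if let scalar = try decoder.consume(byte) { scalars.append(scalar) }
    }
    try decoder.finalize()
    return scalars
}
```

Because the decoder carries its own state, the bytes of a single scalar can arrive across separate buffers or separate calls without the caller tracking scalar alignment.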
Alternative: we could attempt to add new byte-stream overloads for decode() that throw errors and have the ability to resume with a different input source. For the purposes of this pitch, a separate type and namespace lets us explore the different semantics pitched.
Alternative: We could consider options to repair invalid input, possibly by also communicating what errors would have been reported.
Rejected Alternative: consume receives endianness. This has the downside of a more complex API contract and more branching for an uncommon use case of interleaving mixed-endianness byte streams.
Alternative: Statically separate out endianness, i.e. have a UTF16BE encoding or alternatively a UTF16BEByteStreamDecoder. It's not clear this would be an API improvement, and it still needs some benchmarking to demonstrate whether there's a meaningful performance difference.
Alternative: A single dynamically-parameterized decoder type. Such a type could receive the encoding at init-time and have a suitably large internal buffer for any anticipated encoding. This would result in many more run-time branches, however.
Alternative: A single type-parameterized decoder type. This would reduce the amount of API, though in effect we'd want to specialize for each of the presented encodings anyways. This may end up being in effect a different spelling of what is pitched here.
It likely makes sense to specify errors using typed throws. This may also motivate including decoding error types in the protocol.
Typed throws would be further motivated by the fact that some highly constrained or embedded systems may not have String available. This could be due to the inability to dynamically allocate memory or not having enough space to bundle the data necessary to implement String's semantics such as grapheme breaking and canonical equivalence data tables. Such environments could benefit from good UTF-8 decoding API and typed errors would make this API available.
The Swift Collections package defines a BigString type, which provides String-like functionality over rope-like storage. It uses underscored functionality in the stdlib to detect grapheme cluster boundaries.
Grapheme breaking requires looking ahead at the next scalar and keeping a few bits of state along the way. We could surface the underscored interfaces as API:
extension Unicode {
public struct GraphemeBreaker {
public init()
/// Returns whether there was a grapheme break _before_
/// `scalar`. Updates internal state and stores `scalar` for
/// the next call.
public mutating func consume(
_ scalar: Unicode.Scalar
) -> Bool
}
}
To build a Character-producing stream out of this, the caller either has to buffer scalars themselves or do some bookkeeping to track positions of the scalars fed in.
The following is an example use of the GraphemeBreaker, and could be additional API or an alternate API for consideration.
public struct GraphemeFormer {
var breaker: GraphemeBreaker = .init()
/// Use String.UnicodeScalarView as our scalar buffer.
///
/// It has a small-form for 15 UTF-8 code units (super common)
/// but can also be dynamically resized as needed
var charInProgress = ""
public init() {}
/// Consumes `scalar`. Returns a completed `Character` if
/// `scalar` was the start of the next grapheme cluster.
public mutating func consume(
_ scalar: Unicode.Scalar
) -> Character? {
guard breaker.consume(scalar) else {
charInProgress.unicodeScalars.append(scalar)
return nil
}
let character = charInProgress.first
charInProgress.removeAll()
charInProgress.unicodeScalars.append(scalar)
return character
}
/// Finishes and returns the in-progress Character
public mutating func flush() -> Character? {
let character = charInProgress.first
charInProgress.removeAll()
breaker = .init()
return character
}
}
Like the decoders, this uses an internal buffer: a String's UnicodeScalarView (which starts off using a small-form).
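Since the pitched GraphemeBreaker does not exist as public API today, the following sketch approximates the same scalar-buffering idea using only public String API: it appends each scalar to a String and lets String's own grapheme breaking reveal when a break has occurred. The names `CharacterFormerSketch` and `characters(from:)` are hypothetical, and re-counting the buffer on every scalar is less efficient than a dedicated breaker would be.

```swift
/// Buffers scalars in a `String` and relies on `String`'s own grapheme
/// breaking to decide when the first `Character` is complete.
struct CharacterFormerSketch {
    var pending = ""

    /// Consumes `scalar`. Returns a completed `Character` if `scalar`
    /// started a new grapheme cluster.
    mutating func consume(_ scalar: Unicode.Scalar) -> Character? {
        pending.unicodeScalars.append(scalar)
        // More than one `Character` means there is a break before `scalar`.
        guard pending.count > 1, let finished = pending.first else { return nil }
        pending = String(scalar)
        return finished
    }

    /// Finishes and returns the in-progress `Character`, if any.
    mutating func flush() -> Character? {
        defer { pending = "" }
        return pending.first
    }
}

/// Groups a sequence of scalars into `Character`s, one scalar at a time.
func characters(from scalars: some Sequence<Unicode.Scalar>) -> [Character] {
    var former = CharacterFormerSketch()
    var result = scalars.compactMap { former.consume($0) }
    if let last = former.flush() { result.append(last) }
    return result
}
```

For example, feeding the scalars of "e" followed by U+0301 (COMBINING ACUTE ACCENT) and then "a" returns nothing for the first two scalars and returns the accented "e" only when "a" arrives, which is the one-scalar lookahead described above.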
On Apple platforms, Foundation's FileHandle can asynchronously vend bytes, Unicode scalars, and Characters.
For an example use of the pitched API, let's implement that on System's FileDescriptor.
import System
extension FileDescriptor {
var bytes: AsyncBytes { ... }
struct AsyncBytes: AsyncSequence {
typealias Element = UInt8
...
}
}
extension FileDescriptor.AsyncBytes {
struct AsyncUnicodeScalarSequence: AsyncSequence {
var bytes: FileDescriptor.AsyncBytes
typealias Element = Unicode.Scalar
func makeAsyncIterator() -> AsyncIterator {
.init(bytes: bytes.makeAsyncIterator())
}
struct AsyncIterator: AsyncIteratorProtocol {
typealias Element = Unicode.Scalar
var bytes: FileDescriptor.AsyncBytes.AsyncIterator
var decoder = Unicode.UTF8.ByteStreamDecoder()
mutating func next() async throws -> Unicode.Scalar? {
while let byte = try await bytes.next() {
if let scalar = try decoder.consume(byte) {
return scalar
}
}
// We're choosing to report un-finished scalar
// as an error
try decoder.finalize()
return nil
}
}
}
var unicodeScalars: AsyncUnicodeScalarSequence { .init(bytes: self) }
struct AsyncCharacterSequence: AsyncSequence {
var bytes: FileDescriptor.AsyncBytes
typealias Element = Character
func makeAsyncIterator() -> AsyncIterator {
.init(scalars: bytes.unicodeScalars.makeAsyncIterator())
}
struct AsyncIterator: AsyncIteratorProtocol {
var scalars: FileDescriptor.AsyncBytes.AsyncUnicodeScalarSequence.AsyncIterator
var breaker = Unicode.GraphemeFormer()
mutating func next() async throws -> Character? {
while let scalar = try await scalars.next() {
if let c = breaker.consume(scalar) {
return c
}
}
return breaker.flush()
}
}
}
var characters: AsyncCharacterSequence { .init(bytes: self) }
}
The given code shows a simple use of the pitched API. However, it also shows the need for an efficient approach that can drill through abstraction layers to underlying bytes in buffers.
When the source of Unicode scalars is backed by a chunk of memory containing validly-encoded UTF-8, it is more efficient to work in terms of positions in that chunk.
Alternative naming: Use FooCharacterBar instead of FooGraphemeBar or FooGraphemeClusterBar throughout this proposal.
There's a naming spectrum between Unicode's preferred terminology and Swift's.
On one end, if this were a standalone package aimed solely around providing an implementation of the Unicode standard via accelerated routines, there is an affordance to stick solely to Unicode terminology. Such a package could use the term "grapheme cluster" or "extended grapheme cluster" because "character" is not Unicode's terminology and could be ambiguous or confusing.
On the other end, anything in the String namespace or which vends a String type (including Character) would use the term "character". Similarly, anything outside of a dedicated Unicode package, library, namespace, module, or sub-module would as well.
As for what we are providing, we don't have a separate Unicode module or sub-module, largely for historical reasons. Unicode is an empty enum that functions as a namespace. Some precedent so far has found it better to lean towards Unicode terminology for items under Unicode. For example, Unicode.Scalar.Property.isGraphemeBase is the better name than Unicode.Scalar.Property.isCharacterBase.
Many byte streams are backed by chunks of contiguous memory. Rather than read byte-at-a-time and return scalar-at-a-time using internal buffering, a byte stream decoder could communicate scalar-aligned positions in its upstream's backing buffers.
We explore functionality that reads from any byte source and vends chunks of validly-encoded, and validly-aligned (for a specified alignment) UTF-8. This enables efficient streaming operations, i.e. those that operate over a properly-aligned moving window of validly encoded UTF-8 bytes in contiguous memory. This involves areas of current investigation and could motivate and incorporate the more advanced lifetime management discussed.
Another important consideration is how to handle when a scalar, normalization, or grapheme cluster segment straddles multiple chunks of data. In that case, API may need to return a view into a new buffer which stores these contiguously.
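As a small illustration of scalar alignment, the sketch below computes how much of a chunk can be handed downstream without splitting a scalar. The function name `scalarAlignedPrefixLength(of:)` is hypothetical. It assumes the chunk is a window into a stream of valid UTF-8, and it leaves it to the caller to carry the unaligned tail over to the next chunk.

```swift
/// Returns how many leading bytes of `chunk` end on a scalar boundary,
/// assuming `chunk` is a window into a stream of valid UTF-8.
func scalarAlignedPrefixLength(of chunk: [UInt8]) -> Int {
    // A scalar is at most 4 bytes, so only one of the last 3 bytes can be
    // the lead byte of a scalar that is cut off by the end of the chunk.
    var index = chunk.count
    while index > 0, chunk.count - index < 3 {
        index -= 1
        let byte = chunk[index]
        if byte & 0xC0 == 0x80 { continue }   // continuation byte, keep looking
        let length: Int
        switch byte {
        case 0x00...0x7F: length = 1
        case 0xC0...0xDF: length = 2
        case 0xE0...0xEF: length = 3
        default:          length = 4
        }
        // Cut before the lead byte if its scalar runs past the chunk.
        return index + length > chunk.count ? index : chunk.count
    }
    return chunk.count
}

let euroBytes = Array("a\u{20AC}".utf8)         // [0x61, 0xE2, 0x82, 0xAC]
let firstChunk = Array(euroBytes[0..<3])        // ends two bytes into U+20AC
let alignedLength = scalarAlignedPrefixLength(of: firstChunk)
```

Here `alignedLength` is 1: only the "a" can be vended from the first chunk, and the two trailing bytes have to be stored contiguously with the start of the next chunk before the euro sign can be produced.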
Validation looks at an entire input to ensure it is validly encoded. While it can be performed by decoding and discarding the contents, it can be done more efficiently as its own standalone operation if the original contents are meant to be kept in their original encoding.
extension Unicode.UTF8 {
public static func validate<C: Collection<UInt8>>(
_ bytes: C
) throws
}
// Available on UTF16 and UTF32, where endianness matters and
// where code units are not individual bytes
extension Unicode.UTF[16/32] {
public static func validate<C: Collection<CodeUnit>>(
_ codeUnits: C
) throws
public static func validate<C: Collection<UInt8>>(
_ bytes: C,
endianness: Endianness
) throws
}
Alternative: Concrete functions taking a BufferView or some BufferViewable-like protocol.
UTF-8 validation is a particularly common concern and the subject of a fair amount of research. Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's String type's native storage is guaranteed-valid-UTF8 for this reason.
However, if the input isn't actually valid, assuming validity leads to a new class of security concerns.
Memory safety is more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it, which differs from the bounds dictated by its contents.
Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an overlong encoding, which would compromise any code that checks for the presence of a scalar value by looking at the encoded bytes.
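The overlong-encoding hazard can be shown with a deliberately naive decoder. In the sketch below, `naiveDecode` is a made-up function that handles only 1- and 2-byte sequences and skips the minimum-value check; it is illustrative only and not something to use in practice.

```swift
/// A deliberately naive decoder for 1- and 2-byte sequences that skips
/// the overlong check. Illustrative only.
func naiveDecode(_ bytes: [UInt8]) -> [Unicode.Scalar] {
    var scalars: [Unicode.Scalar] = []
    var index = 0
    while index < bytes.count {
        let byte = bytes[index]
        if byte & 0xE0 == 0xC0, index + 1 < bytes.count {
            // Two-byte form 110xxxxx 10yyyyyy, with no minimum-value check.
            let value = UInt32(byte & 0x1F) << 6 | UInt32(bytes[index + 1] & 0x3F)
            scalars.append(Unicode.Scalar(value)!)   // at most 0x7FF, never a surrogate
            index += 2
        } else {
            scalars.append(Unicode.Scalar(byte))
            index += 1
        }
    }
    return scalars
}

// 0xC0 0xAF is an overlong two-byte encoding of "/" (U+002F).
let suspiciousBytes: [UInt8] = [0x2E, 0x2E, 0xC0, 0xAF]
let byteScanFindsSlash = suspiciousBytes.contains(0x2F)                 // false
let naiveDecodeFindsSlash = naiveDecode(suspiciousBytes).contains("/")  // true
// The stdlib's repairing initializer never produces "/" from these bytes.
let repairedHasSlash = String(decoding: suspiciousBytes, as: UTF8.self).utf8.contains(0x2F)
```

A check that scans the raw bytes for 0x2F passes this input, while the naive decoder downstream still produces a "/", which is the mismatch that validation is meant to rule out.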
One approach is to define API that takes a parameter assuming that its contents contain correctly-encoded UTF-8. Today, that is often done via an unsafeAssumingValidUTF8: UnsafeRawBufferPointer parameter, but this is unsafe in multiple ways that might not be clear to the caller. The UnsafeRawBufferPointer is memory-unsafe of course, but even if the caller knows the memory itself is safe, the contents might be invalidly encoded in a way that subtly bypasses correct behavior elsewhere in the program.
A type such as BufferView would help mitigate the memory unsafety of the pointer itself, but not the far more subtle problems of assuming valid UTF-8.
The rest of this pitch is interwoven with on-going investigations into non-escapable values and statically-reasoned lifetimes. As such, it could change depending on when or how that support arrives. This is similar to how the new Atomics API was originally implemented with unsafe constructs before ~Copyable support was available.
UTF8.ValidBufferView is a buffer view whose contents are known to be valid UTF-8 as represented in the type system.
extension Unicode.UTF8 {
public struct ValidBufferView {
/// TO INVESTIGATE: This field's lifetime is tied to `self`, i.e. the lifetime
/// of either `owner` or the lexical scope into which it was returned. Any `get`
/// accessors should be non-escapable
public var bytes: BufferView
/// An object that owns the memory, if the API needed to allocate memory.
///
/// This is needed when validation or alignment needs to allocate to ensure
/// the relevant content is in contiguous memory
public var owner: AnyObject?
/// Create from the validated contents of `c`. If `c` contains invalidly encoded
/// UTF-8, throws an error. If `c` is valid and provides a `BufferView`, will
/// borrow that view. If `c` is valid but does not provide a `BufferView`, will
/// allocate memory to provide a contiguous view.
public init(validating c: some Collection<UInt8>) throws
/// As `validating:`, but repairs any encoding errors. If a repair was made, a
/// new allocation must be made for the corrected content.
public init(repairing c: some Collection<UInt8>)
}
}
There are 3 particularly useful alignments to segment content such that common operations can be performed by only looking at one chunk of data at a time.
- Scalar aligned: decoding and validation
- Normalization-segment aligned: canonical comparison
- Grapheme-cluster aligned: forming Characters
Each successive segmentation is broader than the one before: every grapheme-cluster boundary is a normalization-segment boundary and every normalization-segment boundary is a scalar boundary.
Note: Normalization segments being sub-segments of grapheme clusters is not technically guaranteed by the Unicode standard to always be true in future Unicode versions. Unicode is allowed to change the rules of grapheme breaking in future versions. That being said, a normalization segment is defined to start on a non-combining scalar, and grapheme clusters that break before non-combining scalars are nonsensical. Unicode handles nonsensical cases as degenerate cases and those cases do not break, though Unicode could change its mind in future versions. Because of this, the "sub-alignment" relationship between normalization segments and grapheme clusters should be treated as illustrative for the reader and not a formal API guarantee into the future.
extension Unicode.UTF8 {
public struct ValidScalarAlignedBufferView {
public var buffer: ValidBufferView
}
public struct ValidNormalizationSegmentAlignedBufferView {
public var buffer: ValidScalarAlignedBufferView
}
public struct ValidGraphemeClusterAlignedBufferView {
public var buffer: ValidScalarAlignedBufferView
}
}
extension Unicode.UTF8 {
/// Transforms a sequence of buffer views to a sequence of valid buffer views
public struct ValidBufferSource<
Upstream: Sequence<ValidBufferView>
>: Sequence {
public struct Iterator: IteratorProtocol {
public typealias Element = Slice<ValidBufferView>
public mutating func next() -> Element?
}
public func makeIterator() -> Iterator
public struct ScalarAlignedBufferSource: Sequence {
public struct Iterator: IteratorProtocol {
public typealias Element = ValidScalarAlignedBufferView
}
public func makeIterator() -> Iterator
}
public var scalarAligned: ScalarAlignedBufferSource { get }
public struct NormalizationSegmentAlignedBufferSource: Sequence {
public struct Iterator: IteratorProtocol {
public typealias Element = ValidNormalizationSegmentAlignedBufferView
}
public func makeIterator() -> Iterator
}
public var normalizationSegmentAligned: NormalizationSegmentAlignedBufferSource { get }
public struct GraphemeClusterAlignedBufferSource: Sequence {
public struct Iterator: IteratorProtocol {
public typealias Element = ValidGraphemeClusterAlignedBufferView
}
public func makeIterator() -> Iterator
}
public var graphemeClusterAligned: GraphemeClusterAlignedBufferSource { get }
}
/// ... similarly for async
public struct ValidAsyncBufferSource<
Upstream: AsyncSequence
>: AsyncSequence where Upstream.Element == ValidBufferView {
public typealias Element = Slice<ValidBufferView>
public struct AsyncIterator: AsyncIteratorProtocol {
public typealias Element = Slice<ValidBufferView>
public mutating func next() async throws -> Element?
}
public func makeAsyncIterator() -> AsyncIterator
// ... similarly, aligned async sources ...
}
}
Aligning data along these boundaries can be useful for implementing data structures that retain their own copy of the storage. Such a data structure may want to guarantee that it can vend a given view's Element by inspecting only a single chunk.
Alternative: Type-parameterize based on alignment instead, or even dynamic-value-parameterize based on alignment.
The above API, which provides alignment with normalization segments and grapheme clusters, can also provide a view of the bytes which comprise an individual normalization segment or grapheme cluster:
extension Unicode.UTF8.ValidBufferSource.NormalizationSegmentAlignedBufferSource {
/// A sequence of each individual normalization segment
///
/// TO INVESTIGATE: Unfortunately, the process of normalization can produce
/// sub-segments, so need to carefully design a contract here.
public struct NormalizationSegmentView: Sequence {
public struct Iterator: IteratorProtocol {
public typealias Element = ValidNormalizationSegmentAlignedBufferView
public mutating func next() -> Element?
}
public func makeIterator() -> Iterator
}
}
extension Unicode.UTF8.ValidBufferSource.GraphemeClusterAlignedBufferSource {
/// A sequence of each individual grapheme cluster
public struct GraphemeClusterView: Sequence {
public struct Iterator: IteratorProtocol {
public typealias Element = ValidGraphemeClusterAlignedBufferView
public mutating func next() -> Element?
}
public func makeIterator() -> Iterator
}
}
Normalization segments are particularly tricky to account for, as the normalization process could turn a single segment into multiple ones.
Alternative: Pending BufferView's final design with respect to self-slicing, a view of the bytes comprising a single normalization-segment or grapheme cluster might be represented using a Slice.
String's decoding initializers are difficult to discover and use as they make use of metatypes: String(decoding: myBytes, as: UTF8.self). Attempts to rectify this have been saddled with compatibility concerns. This may be a good opportunity to make some progress on this. Alternatively, this is severable should it start to bog down the rest of this pitch.
The below String inits are straw-person named and intentionally presented in a naming-vacuum, that is without consideration for existing String API names. This helps us work on enumerating the functionality and presenting the entire API picture without simultaneously juggling some of the current issues in API names.
For example, SE-0405 String Initializers with Encoding Validation takes a stab at improving the story somewhat with nil-return inits, but it uses the same validating: name as the error-throwing inits below. Depending on exactly how this pitch takes shape and when it is ready for review, the below could be considered an amendment to SE-0405 or a straw-person naming-vacuum investigation.
// Strawperson assuming typed throws
extension String {
/// Sequence-version of stdlib's `String.init(decoding: x, as: UTF8.self)`
public init(repairingUTF8: some Sequence<UInt8>)
/// Puts contents in stdlib-normal-form for fast comparison.
/// `Character`s are the same, but scalars and code unit views
/// could show different (i.e. normalized) contents
public init(normalizingUTF8: some Sequence<UInt8>)
/// Checks for errors and throws them: Sequence error version
public init(
validatingUTF8: some Sequence<UInt8>
) throws(UTF8.ByteStreamDecodingError)
/// Checks for errors and throws them: Collection error version
public init(
validatingUTF8: some Collection<UInt8>
) throws(UTF8.CollectionDecodingError)
/// This is a convenience spelling for either repairing or normalizing.
/// We can pick/debate which would be better, there are reasonable
/// arguments for either.
public init(utf8: some Sequence<UInt8>)
}
Similarly, API which is parameterized over the encoding, as well as API over byte streams associated with endianness.
extension String {
// Repairing
public init<Encoding: Unicode.Encoding>(
repairing: some Sequence<Encoding.CodeUnit>,
as sourceEncoding: Encoding.Type
)
public init<Encoding: Unicode.Encoding>(
repairing: some Sequence<UInt8>,
as sourceEncoding: Encoding.Type,
endianness: Endianness
)
// Normalizing
public init<Encoding: Unicode.Encoding>(
normalizing: some Sequence<Encoding.CodeUnit>,
as sourceEncoding: Encoding.Type
)
public init<Encoding: Unicode.Encoding>(
normalizing: some Sequence<UInt8>,
as sourceEncoding: Encoding.Type,
endianness: Endianness
)
// Validating
public init<Encoding: Unicode.Encoding>(
validating: some Sequence<Encoding.CodeUnit>,
as sourceEncoding: Encoding.Type
) throws(UTF8.ByteStreamDecodingError)
public init<Encoding: Unicode.Encoding>(
validating: some Sequence<UInt8>,
as sourceEncoding: Encoding.Type,
endianness: Endianness
) throws(UTF8.ByteStreamDecodingError) // Or CodeUnitStreamDecodingError?
public init<Encoding: Unicode.Encoding>(
validating: some Collection<Encoding.CodeUnit>,
as sourceEncoding: Encoding.Type
) throws(UTF8.CollectionDecodingError)
public init<Encoding: Unicode.Encoding>(
validating: some Collection<UInt8>,
as sourceEncoding: Encoding.Type,
endianness: Endianness
) throws(UTF8.CollectionDecodingError) // Or ByteCollectionDecodingError?
}
The stdlib has existing protocols, though they can be difficult to conform to, difficult to use, and derived operations can be inefficient. More investigation is needed to see how to improve them or else how to fit new improvements into them.
We could consider adding protocols for encoding errors, decoder structs, etc., seeing if there's a good library-extensibility story here.
Good case studies include CESU-8 which uses UTF-16-style surrogate pairs for non-BMP scalars in a UTF-8-like encoding, resulting in up to 6 bytes per Unicode scalar value. Java's modified UTF-8 further extends CESU-8's approach by using an overlong encoding for NUL. These are not valid UTF-8 encodings, but they are valid Unicode encodings as surrogates must be paired. They trade off some of UTF-8's advantages for compatibility benefits.
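To make the "up to 6 bytes" figure concrete, here is a minimal sketch of CESU-8-style encoding for a single scalar. The function name `cesu8Bytes(for:)` is hypothetical; it simply writes each UTF-16 surrogate with the usual three-byte UTF-8 bit layout, and it does not attempt modified UTF-8's special handling of NUL.

```swift
/// Encodes `scalar` in a CESU-8 style: BMP scalars exactly as in UTF-8,
/// other scalars as a UTF-16 surrogate pair whose halves are each
/// written using the three-byte UTF-8 bit layout.
func cesu8Bytes(for scalar: Unicode.Scalar) -> [UInt8] {
    let units = Array(String(scalar).utf16)
    guard units.count == 2 else {
        return Array(String(scalar).utf8)   // BMP: identical to UTF-8
    }
    return units.flatMap { unit -> [UInt8] in
        [0xE0 | UInt8(unit >> 12),            // 1110xxxx
         0x80 | UInt8((unit >> 6) & 0x3F),    // 10yyyyyy
         0x80 | UInt8(unit & 0x3F)]           // 10zzzzzz
    }
}

// U+1F600 is the surrogate pair 0xD83D 0xDE00 in UTF-16.
let cesu8Emoji = cesu8Bytes(for: "\u{1F600}")   // ED A0 BD ED B8 80
let utf8Emoji = Array("\u{1F600}".utf8)         // F0 9F 98 80
```

A strict UTF-8 validator rejects the six-byte form because each half decodes to a surrogate code point, which is why such encodings would need their own decoder and error types.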
WTF-8 allows unpaired surrogates and thus is not a valid Unicode encoding. It could be interesting to consider how to help support this kind of invalid encoding by creating individual code points instead of whole Unicode scalar values.
There are also encodings that only encode a subset of Unicode, such as ASCII (which UTF-8 is a binary-compatible superset of) and Latin1 (which UTF-8 is not binary compatible with). Supporting these is tricky as transcoding is lossy and otherwise complete functions become partial functions.
The stdlib currently provides ASCII as a Unicode encoding, however as a subset encoding it has some sharp edges and follows different conventions from the actual Unicode encodings. We should consider sunsetting this encoding in favor of UTF-8, which is a strict superset. The stdlib's implementation should detect and fast-path UTF-8 when the contents happen to be only-ASCII anyways. We can provide optimized isASCII queries on String, byte buffers and byte streams, etc.
The Swift Collections package defines a BigString type, which provides String-like functionality over rope-like storage.
Foundation's AttributedString is built on BigString. Additionally, Foundation parses data formats such as plists which are encoded using UTF-16 in big endian byte order.
Foundation also normalizes paths, on some file systems, to a pre-Unicode-3.0 NFD. Unicode version 3.0 is important since it is only afterwards that normalization properties are stable. Other libraries may need to similarly specify a specific Unicode version and bundle their own data tables to drive normalization and, especially, decomposition. This could be done through a data-table provider protocol, though there may be efficiency concerns with working through such an abstraction. Either way, the normalization-segmentation API are helpful for performing custom decomposition.
Relatedly, a server-client library may wish to ensure that both the server and client are using the same version of Unicode for the purposes of canonical equivalence. While the properties for defined code points are stable, it is possible that an undefined code point could normalize differently in future Unicode versions. An alternative approach would be a quick scan for the presence of undefined code points.
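The quick scan can already be approximated with existing stdlib API. Below is a minimal sketch; the function name is made up, and the answer depends on the Unicode version the running stdlib was built against, which is exactly the versioning concern described above.

```swift
/// Hypothetical helper: true if any scalar is unassigned (general category Cn)
/// according to the Unicode data available to the running stdlib.
func containsUnassignedScalars(_ text: String) -> Bool {
    text.unicodeScalars.contains { $0.properties.generalCategory == .unassigned }
}
```

A client could run this before relying on canonical equivalence with a peer that may be using a newer Unicode version.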
We can look at using some of the byte-stream functionality over AsyncIterator to define (combiners / operators?) such as AsyncUnicodeScalarSequence, AsyncCharacterSequence, etc. These could be good API to have in the stdlib proper or in the Swift Async Algorithms package.
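The core difficulty for such sequences is that a scalar may span two chunks. Here is a synchronous sketch of the carry-over logic an AsyncUnicodeScalarSequence would need; the type and its methods are invented for this example. It leans on `String(decoding:as:)`, which repairs invalid input with U+FFFD instead of throwing, so it does not reflect the validating API discussed in this pitch.

```swift
/// Hypothetical sketch: decodes scalars from UTF-8 chunks that may be split
/// in the middle of a multi-byte sequence.
struct ChunkedUTF8ScalarDecoder {
    private var pending: [UInt8] = []
    init() {}

    /// Decodes every complete scalar seen so far; holds back a truncated tail.
    mutating func feed(_ chunk: [UInt8]) -> [Unicode.Scalar] {
        pending.append(contentsOf: chunk)
        let cut = Self.completePrefixLength(of: pending)
        let scalars = Array(String(decoding: pending[..<cut], as: UTF8.self).unicodeScalars)
        pending.removeFirst(cut)
        return scalars
    }

    /// Flushes any held-back bytes at end of stream (repairing them if truncated).
    mutating func finish() -> [Unicode.Scalar] {
        defer { pending.removeAll() }
        return Array(String(decoding: pending, as: UTF8.self).unicodeScalars)
    }

    /// Length of the longest prefix that does not end inside a multi-byte sequence.
    static func completePrefixLength(of bytes: [UInt8]) -> Int {
        var i = bytes.count
        var seen = 0
        while i > 0 && seen < 4 {
            i -= 1
            seen += 1
            let b = bytes[i]
            if b & 0b1100_0000 == 0b1000_0000 { continue }  // continuation byte
            let needed: Int
            if b & 0b1000_0000 == 0 { needed = 1 }
            else if b & 0b1110_0000 == 0b1100_0000 { needed = 2 }
            else if b & 0b1111_0000 == 0b1110_0000 { needed = 3 }
            else if b & 0b1111_1000 == 0b1111_0000 { needed = 4 }
            else { needed = 1 }  // invalid lead byte; let the decoder repair it
            return needed > seen ? i : bytes.count
        }
        return bytes.count
    }
}

var decoder = ChunkedUTF8ScalarDecoder()
let first = decoder.feed([0x63, 0xC3])  // "c" plus the first byte of U+00E9
let second = decoder.feed([0xA9])       // the continuation byte arrives later
```

An async version would wrap the same state machine around an AsyncIterator of byte chunks.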
The WebURL package does a hefty amount of Unicode processing and is a great example of the kinds of libraries that the stdlib should empower. It can serve as a good target for these improvements and many others.
Libraries such as Swift Syntax sometimes roll out their own decoding and would benefit from a standard approach.
I'm interested in hearing about other libraries and potential use cases.
A future direction is for String, UTF8.ValidBufferView, etc., to provide lazily-normalized views of their contents under NFC, NFD, as well as forms provided by libraries.
For the buffer-based API, a future direction could include composing and decomposing API, possibly driven by a library's data tables, along the lines of Foundation's path normalization described above.
A byte order mark (BOM), conventionally U+FEFF, is sometimes used when serializing content using an endianness-sensitive encoding.
When used this way, the BOM is not part of the content but part of the serialization format. These decoders will correctly decode a BOM (or the byte-swapped noncharacter U+FFFE, if given the wrong endianness), which makes it easier to implement encoding detection based on a potential leading BOM.
We should add some additional API to perform this BOM analysis. E.g., returning:
enum BOMAnalysisResult {
  case utf16BE
  case utf16LE
  case utf32BE
  case utf32LE
}
However, BOM analysis should not be conflated with the decoding API. A BOM is actually just a conventional treatment of the Unicode scalar U+FEFF ZERO WIDTH NO-BREAK SPACE, which can be a perfectly valid member of textual content or an encoding signifier that's not part of the textual content.
Also, BOM analysis would require reading from the input, so any API would need to consume enough bytes to detect one. If there were no BOM, a byte-stream API might need to yield that input back in some fashion.
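Here is a minimal sketch of what BOM analysis over leading bytes could look like. The `BOMSketch` enum and `detectBOM(in:)` are hypothetical names; the `utf8` case is my own addition for the `EF BB BF` signature and is not part of the result enum above. Returning the BOM's length lets a caller decide whether to skip those bytes or hand them back.

```swift
/// Hypothetical result type; `utf8` is an assumption beyond the pitch's enum.
enum BOMSketch: Equatable {
    case utf8, utf16BE, utf16LE, utf32BE, utf32LE
}

/// Hypothetical helper: inspects up to 4 leading bytes without consuming input.
func detectBOM(in bytes: [UInt8]) -> (encoding: BOMSketch, length: Int)? {
    // UTF-32LE must be tested before UTF-16LE: FF FE 00 00 starts with FF FE.
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return (.utf32BE, 4) }
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return (.utf32LE, 4) }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) { return (.utf8, 3) }
    if bytes.starts(with: [0xFE, 0xFF]) { return (.utf16BE, 2) }
    if bytes.starts(with: [0xFF, 0xFE]) { return (.utf16LE, 2) }
    return nil
}
```

Note the inherent ambiguity: `FF FE 00 00` could also be a UTF-16LE BOM followed by U+0000, which is one more reason to keep this analysis separate from decoding.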
An often desired feature is to have String or Substring API available on storage that's owned by another object, e.g. shared substrings or using String's ABI support.
One direction could be closure-based API on Substring and UTF8.ValidBufferView:
func withEphemeralString<T>(_: (@nonescaping String) throws -> T) rethrows -> T
The non-escaping string parameter would share storage with the Substring or UTF8.ValidBuffer or slice type. It is marked non-escaping on Substring because it could be referencing a small portion of a much larger allocation, and to escape the string would keep that larger allocation alive. For UTF8.ValidBuffer, it is non-escaping because it could be borrowing the content from the lexical scope of the caller.
Another spelling could be a non-escaping computed var on UTF8.ValidBufferView or Substring, pending further investigation into statically-enforced lifetimes and lifetime inference.
This could be useful when an API strictly needs a (non-escaping) String and performing a copy of the slice is undesirable. However, any function that the string is passed into would either also need to take a non-escaping string or else make a copy (which would defeat the original purpose).
UTF8.ValidBufferView could also have String's API on it, at least in some fashion. Future work could include Regex support, etc.