Pitch: Unicode Processing APIs

This is a pitch for some low-level Unicode operations to enable libraries to implement their own String-like functionality and types. The purpose of this API is to present low-level core components out of which libraries can make types and higher-level API (i.e. tools for tool-makers).

I'm interested in hearing about capabilities that libraries may need and getting design feedback.

Introduction

String performs many Unicode operations using a mix of internal and publicly available functionality in the stdlib. The stdlib provides some limited general-purpose Unicode API, though this is a very dusty corner of the stdlib. We want to enable libraries to vend their own String-like types and functionality.

One interesting use case to consider is ephemeral byte strings backed by chunks of data in contiguous memory. These chunks of data could be synchronous (i.e. Sequence/Iterator) or asynchronous (i.e. AsyncSequence/AsyncIterator). Their buffers are ephemeral, meaning there is no general way to fully reset the stream back to an earlier state (i.e. there is no Index). The buffers might not be segmented along code unit, scalar, or grapheme cluster boundaries, meaning that a produced value might span segments.

We're aiming for a solution that provides:

Simple, composable pieces to build up abstraction hierarchies
Efficient, safe buffer-based interfaces to drill down through abstraction hierarchies
A strategy for library-extensibility as well as compatibility with existing stdlib API and concepts

Note: Many of the approaches proposed are dependent on recently developed features such as typed throws and non-escapable values, whose ABI impacts may not be fully fleshed out. This may motivate splitting functionality across multiple proposals and releases.

Decoding and `Character` API

Errors

Validation API and error reporting as part of decoding are often-requested features. Below are errors related to Unicode encodings:

extension Unicode.UTF8 {
  public enum DecodingError: Error {
    case expectedStarter
    case expectedContinuation
    case overlongEncoding
    case invalidCodePoint
    case invalidStarterByte
  }
}
extension Unicode.UTF16 {
  public enum DecodingError: Error {
    case expectedTrailingSurrogate
    case unexpectedTrailingSurrogate
  }
}
extension Unicode.UTF32 {
  public enum DecodingError: Error {
    case invalidCodePoint
  }
}

Alternative: a single error enum, noting that some error cases are irrelevant in some encodings

Alternative / Investigation: making the error type an associated type on a protocol for library-provided encodings to customize

When it comes to validating bytes, the byte source might be ephemeral (e.g. a single-pass Sequence) or it might be possible to re-visit contents by position (e.g. a Collection).

extension Unicode.UTFX { // `UTFX` meaning UTF8, UTF16, and UTF32
  public struct CollectionDecodingError<Index: Comparable>: Error {
    public var kind: Unicode.UTFX.DecodingError
    public var range: Range<Index>
  }

  public struct ByteStreamDecodingError: Error {
    public var kind: Unicode.UTFX.DecodingError
    public var bytes: (UInt8, UInt8?, UInt8?)
  }
}

Knowing the kind of encoding error and bytes involved can be very helpful. Overlong encodings are often an intentional attempt to compromise security. Some systems may want to use custom error-correction (i.e. 128 distinct replacement characters) such that the corrected bytes are valid Unicode while also preserving the original bits. Additionally, knowing the encoding error can be helpful for debugging.
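As a rough illustration of the custom error-correction idea, the sketch below gives each byte that can appear in an invalid position (always in 0x80...0xFF, so 128 values) its own replacement scalar. The Private Use Area range U+F780...U+F7FF and both function names are assumptions made for this example only; they are not part of the pitch.

```swift
/// Hypothetical mapping: each invalid byte (always in 0x80...0xFF) gets
/// its own Private Use Area scalar in U+F780...U+F7FF.
func replacementScalar(forInvalidByte byte: UInt8) -> Unicode.Scalar? {
  guard byte >= 0x80 else { return nil }  // ASCII bytes are never escaped
  return Unicode.Scalar(0xF700 + UInt32(byte))
}

/// Recovers the original byte from a replacement scalar, if it is one.
func originalByte(for scalar: Unicode.Scalar) -> UInt8? {
  guard (0xF780...0xF7FF).contains(scalar.value) else { return nil }
  return UInt8(scalar.value - 0xF700)
}
```

The corrected output is valid Unicode, and the original bits can be recovered as long as the chosen range does not otherwise occur in the content.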

Validation API and concerns are further discussed later in this document.

Endianness

Endianness (byte ordering or memory ordering) denotes whether the first byte received contains the high bits or low bits of a code unit. It is relevant for multi-byte encodings (UTF-16 and UTF-32, but not UTF-8).

public enum Endianness {
  case little
  case big

  /// The platform's native byte-ordering in memory 
  public static var native: Self
}

Alternative: Endianness could be considered in combination with the encoding, yielding e.g. UTF16BE and UTF16LE encodings. Or, it could be considered a property of a serialization format.

Alternative: An alternate name for Endianness could be ByteOrder.
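To make the role of byte order concrete, here is a minimal sketch that assembles one UTF-16 code unit from two received bytes. `ByteOrder` is a local stand-in for the pitched `Endianness` type and `utf16CodeUnit` is a hypothetical helper, not proposed API.

```swift
/// Local stand-in for the pitched `Endianness` type.
enum ByteOrder { case little, big }

/// Combines two bytes, in the order received, into one UTF-16 code unit.
func utf16CodeUnit(_ first: UInt8, _ second: UInt8, order: ByteOrder) -> UInt16 {
  switch order {
  case .big:    return (UInt16(first) << 8) | UInt16(second)
  case .little: return (UInt16(second) << 8) | UInt16(first)
  }
}

// U+00E9 is the single code unit 0x00E9; only the byte order on the wire differs.
let fromBigEndian = utf16CodeUnit(0x00, 0xE9, order: .big)
let fromLittleEndian = utf16CodeUnit(0xE9, 0x00, order: .little)
```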

Decoding

The stdlib has some existing functionality along these lines, but it's inadequate for byte stream validation and decoding. Instance methods UnicodeCodec.decode and parseScalar do not produce meaningful error information and they operate only in terms of fully-formed code units and complete scalars.

We propose stateful ByteStreamDecoder structs. These are statically associated with an encoding and, at initialization time, a byte order. They store a small internal buffer of bytes until enough bytes have been seen to produce a Unicode scalar. They can be fed data in a byte at a time, which allows developers to feed data in as it is received and not worry about handling scalar alignment.

While all names in this rough draft are strawperson names, the method name consume below is a particularly strawy name. It is fairly unpalatable and meant to be a placeholder. Alternate names such as read, receive, input (as a present-tense verb), streamIn, next, feed, feedIn, etc., are not much better. The name decode doesn't carry the implication of an in-progress or suspended operation.

public protocol UnicodeByteStreamDecoder {
  /// Input a byte, returns a finished scalar or `nil`.
  /// Throws a decoding error.
  mutating func consume(
    _ byte: UInt8
  ) throws -> Unicode.Scalar?

  /// We've reached the end of input. If there's an unfinished
  /// scalar in progress, throws the appropriate encoding error
  func finalize() throws

  // Customization points:

  /// Read bytes until yielding a decoded scalar.
  ///
  /// Throws validation errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  mutating func consume(
    _ bytes: inout some IteratorProtocol<UInt8>
  ) throws -> Unicode.Scalar?

  /// Read bytes asynchronously until yielding a decoded scalar.
  ///
  /// Throws validation errors and rethrows upstream errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  mutating func consume<AI: AsyncIteratorProtocol>(
    _ bytes: inout AI
  ) async throws -> Unicode.Scalar?
  where AI.Element == UInt8

  /// Read bytes starting from `position` and yielding a decoded scalar
  /// and the position of the start of the next scalar.
  ///
  /// Throws validation errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  ///
  /// INVESTIGATE: take `position` inout so that it gets updated rather 
  ///   than requiring the caller to update a local `var`.
  ///
  /// INVESTIGATE: Alternative: take a slice `inout`, but
  ///   we'd want to make sure it makes sense for non-copyable
  ///   slices
  mutating func consume<C: Collection<UInt8>>(
    _ bytes: C,
    startingFrom position: C.Index
  ) throws -> (Unicode.Scalar, scalarEnd: C.Index)?  


  /// INVESTIGATE: API to access the internal buffer, such as
  ///   whether it is empty, its contents, clearing it, etc
}

extension Unicode.UTF8 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init()

    /// Input a single byte.
    public mutating func consume(
      _ byte: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}

extension Unicode.UTF16 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init(byteOrder: Endianness)

    public mutating func consume(
      _ byte: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}

extension Unicode.UTF32 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init(byteOrder: Endianness)

    public mutating func consume(
      _ byte: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}

// Default implementations
extension UnicodeByteStreamDecoder {
  /// Read bytes until yielding a decoded scalar.
  ///
  /// Throws validation errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  public mutating func consume(
    _ bytes: inout some IteratorProtocol<UInt8>
  ) throws -> Unicode.Scalar?

  /// Read bytes asynchronously until yielding a decoded scalar.
  ///
  /// Throws validation errors and rethrows upstream errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  public mutating func consume<AI: AsyncIteratorProtocol>(
    _ bytes: inout AI
  ) async throws -> Unicode.Scalar?
  where AI.Element == UInt8

  /// Read bytes starting from `position` and yielding a decoded scalar
  /// and the position of the start of the next scalar.
  ///
  /// Throws validation errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  ///
  /// INVESTIGATE: take `position` inout so that it gets updated rather 
  ///   than requiring the caller to update a local `var`.
  ///
  /// INVESTIGATE: Alternative: take a slice `inout`, but
  ///   we'd want to make sure it makes sense for non-copyable
  ///   slices
  public mutating func consume<C: Collection<UInt8>>(
    _ bytes: C,
    startingFrom position: C.Index
  ) throws -> (Unicode.Scalar, scalarEnd: C.Index)?

}

A decoder with undecoded contents left in its buffer could be an indication of programmer error. The finalize method checks for this. We could also consider exposing some properties or access to the buffer itself.
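To show how little state such a decoder needs, here is a minimal sketch of a byte-at-a-time UTF-8 decoder with the same `consume`/`finalize` shape. `UTF8SketchError`, `UTF8SketchDecoder`, and `decodeAll` are names invented for this example; the error cases mirror the strawperson `DecodingError` above, and error recovery is deliberately simplistic.

```swift
enum UTF8SketchError: Error, Equatable {
  case expectedStarter, expectedContinuation, overlongEncoding
  case invalidCodePoint, invalidStarterByte
}

struct UTF8SketchDecoder {
  private var value: UInt32 = 0    // payload bits accumulated so far
  private var remaining = 0        // continuation bytes still expected
  private var minimum: UInt32 = 0  // smallest value allowed for this length

  mutating func consume(_ byte: UInt8) throws -> Unicode.Scalar? {
    if remaining == 0 {
      switch byte {
      case 0x00...0x7F: return Unicode.Scalar(byte)
      case 0x80...0xBF: throw UTF8SketchError.expectedStarter
      case 0xC0...0xDF: (value, remaining, minimum) = (UInt32(byte & 0x1F), 1, 0x80)
      case 0xE0...0xEF: (value, remaining, minimum) = (UInt32(byte & 0x0F), 2, 0x800)
      case 0xF0...0xF7: (value, remaining, minimum) = (UInt32(byte & 0x07), 3, 0x10000)
      default: throw UTF8SketchError.invalidStarterByte
      }
      return nil
    }
    guard byte & 0xC0 == 0x80 else {
      remaining = 0  // drop the partial scalar
      throw UTF8SketchError.expectedContinuation
    }
    value = (value << 6) | UInt32(byte & 0x3F)
    remaining -= 1
    guard remaining == 0 else { return nil }
    guard value >= minimum else { throw UTF8SketchError.overlongEncoding }
    // Rejects surrogates and values above U+10FFFF.
    guard let scalar = Unicode.Scalar(value) else {
      throw UTF8SketchError.invalidCodePoint
    }
    return scalar
  }

  /// Throws if the input ended in the middle of a scalar.
  func finalize() throws {
    if remaining != 0 { throw UTF8SketchError.expectedContinuation }
  }
}

/// Feeds every byte through the decoder and collects the scalars.
func decodeAll(_ bytes: [UInt8]) throws -> [Unicode.Scalar] {
  var decoder = UTF8SketchDecoder()
  var scalars: [Unicode.Scalar] = []
  for byte in bytes {
    if let scalar = try decoder.consume(byte) { scalars.append(scalar) }
  }
  try decoder.finalize()
  return scalars
}
```

Because the decoder only stores a partial value and a count, the caller can feed it bytes from any source, in any chunking, without caring where scalar boundaries fall.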

Alternative: we could attempt to add new byte-stream overloads for decode() that throw errors and have the ability to resume with a different input source. For the purposes of this pitch, a separate type and namespace lets us explore the different semantics pitched.

Alternative: We could consider options to repair invalid input, possibly by also communicating what errors would have been reported.

Rejected Alternative: consume receives endianness. This has the downside of a more complex API contract and more branching for an uncommon use case of interleaving mixed-endianness byte streams.

Alternative: Statically separate out endianness, i.e. have a UTF16BE encoding or alternatively a UTF16BEByteStreamDecoder. It's not clear this would be an API improvement, and it still needs some benchmarking to demonstrate whether there's a meaningful performance difference.

Alternative: A single dynamically-parameterized decoder type. Such a type could receive the encoding at init-time and have a suitably large internal buffer for any anticipated encoding. This would result in many more run-time branches, however.

Alternative: A single type-parameterized decoder type. This would reduce the amount of API, though in effect we'd want to specialize for each of the presented encodings anyways. This may end up being in effect a different spelling of what is pitched here.

Typed throws

It likely makes sense to specify errors using typed throws. This may also motivate including decoding error types in the protocol.

Typed throws would be further motivated by the fact that some highly constrained or embedded systems may not have String available. This could be due to the inability to dynamically allocate memory or not having enough space to bundle the data necessary to implement String's semantics such as grapheme breaking and canonical equivalence data tables. Such environments could benefit from good UTF-8 decoding API and typed errors would make this API available.
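As a small illustration of what typed throws buys here, the sketch below uses the `throws(E)` syntax from SE-0413, which assumes a Swift 6 toolchain. `ASCIIError` and `asciiScalar` are hypothetical names for this example and are not part of the pitch.

```swift
enum ASCIIError: Error { case nonASCII(UInt8) }

/// Decodes a single ASCII byte, reporting failures with a concrete error type.
func asciiScalar(_ byte: UInt8) throws(ASCIIError) -> Unicode.Scalar {
  guard byte < 0x80 else { throw .nonASCII(byte) }
  return Unicode.Scalar(byte)
}

var rejectedByte: UInt8? = nil
do {
  _ = try asciiScalar(0xFF)
} catch {
  // `error` is statically `ASCIIError` here, so no existential box or
  // dynamic cast is needed to inspect it.
  switch error {
  case .nonASCII(let byte): rejectedByte = byte
  }
}
```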

Grapheme-breaking API

The Swift Collections package defines a BigString type, which provides String-like functionality over rope-like storage. It uses underscored functionality in the stdlib to detect grapheme cluster boundaries.

Grapheme breaking requires looking ahead at the next scalar and keeping a few bits of state along the way. We could surface the underscored interfaces as API:

extension Unicode {
  public struct GraphemeBreaker {
    public init()

    /// Returns whether there was a grapheme break _before_
    /// `scalar`. Updates internal state and stores `scalar` for
    /// the next call.
    public mutating func consume(
      _ scalar: Unicode.Scalar
    ) -> Bool
  }
}

To build a Character-producing stream out of this, the caller either has to buffer scalars themselves or do some bookkeeping to track positions of the scalars fed in.
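The bookkeeping option can be sketched as follows. `SketchBreaker` is a deliberately simplified stand-in for the pitched `GraphemeBreaker`: it only implements "do not break before an extending scalar or ZWJ", which is a small part of the real rules. `clusterStarts` is a hypothetical helper that records UTF-8 offsets rather than buffering scalars.

```swift
/// Simplified stand-in for the pitched `GraphemeBreaker`.
/// Real grapheme breaking has many more rules than this.
struct SketchBreaker {
  private var isFirst = true

  /// Returns whether there is a break before `scalar`.
  mutating func consume(_ scalar: Unicode.Scalar) -> Bool {
    defer { isFirst = false }
    if isFirst { return true }
    return !(scalar.properties.isGraphemeExtend || scalar.value == 0x200D)
  }
}

/// Returns the UTF-8 offsets at which grapheme clusters start.
func clusterStarts(in text: String) -> [Int] {
  var breaker = SketchBreaker()
  var starts: [Int] = []
  var offset = 0
  for scalar in text.unicodeScalars {
    if breaker.consume(scalar) { starts.append(offset) }
    offset += UTF8.width(scalar)
  }
  return starts
}
```

With the offsets in hand, a caller can slice the original bytes per cluster instead of copying scalars into a side buffer.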

Character streams with buffering

The following is an example use of the GraphemeBreaker, and could be additional API or an alternate API for consideration.

extension Unicode {
  public struct GraphemeFormer {
    var breaker: GraphemeBreaker = .init()

    /// Use String.UnicodeScalarView as our scalar buffer.
    ///
    /// It has a small-form for 15 UTF-8 code units (super common)
    /// but can also be dynamically resized as needed
    var charInProgress = ""

    public init() {}

    /// Consumes `scalar`. Returns a completed `Character` if
    /// `scalar` was the start of the next grapheme cluster.
    public mutating func consume(
      _ scalar: Unicode.Scalar
    ) -> Character? {
      guard breaker.consume(scalar) else {
        charInProgress.unicodeScalars.append(scalar)
        return nil
      }

      // `first` is `nil` when `scalar` is the very first scalar fed in.
      let character = charInProgress.first
      charInProgress.removeAll()
      charInProgress.unicodeScalars.append(scalar)
      return character
    }

    /// Finishes and returns the in-progress Character
    public mutating func flush() -> Character? {
      let character = charInProgress.first
      charInProgress.removeAll()
      breaker = .init()
      return character
    }
  }
}

Like the decoders, this uses an internal buffer: a String's UnicodeScalarView (which starts off using a small-form).

Example: Scalars and Characters from `FileDescriptor`

On Apple platforms, Foundation's FileHandle can asynchronously vend bytes, Unicode scalars, and Characters.

For an example use of the pitched API, let's implement that on System's FileDescriptor.

import System
extension FileDescriptor {
  var bytes: AsyncBytes { ... }

  struct AsyncBytes: AsyncSequence {
    typealias Element = UInt8
    ...
  }
}

extension FileDescriptor.AsyncBytes {
  struct AsyncUnicodeScalarSequence: AsyncSequence {
    var bytes: FileDescriptor.AsyncBytes
    typealias Element = Unicode.Scalar

    func makeAsyncIterator() -> AsyncIterator {
      .init(bytes: bytes.makeAsyncIterator())
    }

    struct AsyncIterator: AsyncIteratorProtocol {
      typealias Element = Unicode.Scalar
      var bytes: FileDescriptor.AsyncBytes.AsyncIterator
      var decoder = Unicode.UTF8.ByteStreamDecoder()

      mutating func next() async throws -> Unicode.Scalar? {
        while let byte = try await bytes.next() {
          if let scalar = try decoder.consume(byte) {
            return scalar
          }
        }
        // We're choosing to report un-finished scalar
        // as an error
        try decoder.finalize()
        return nil
      }
    }
  }
  var unicodeScalars: AsyncUnicodeScalarSequence { .init(bytes: self) }

  struct AsyncCharacterSequence: AsyncSequence {
    var bytes: FileDescriptor.AsyncBytes

    typealias Element = Character

    func makeAsyncIterator() -> AsyncIterator {
      .init(scalars: bytes.unicodeScalars.makeAsyncIterator())
    }

    struct AsyncIterator: AsyncIteratorProtocol {
      var scalars: FileDescriptor.AsyncBytes.AsyncUnicodeScalarSequence.AsyncIterator
      var former = Unicode.GraphemeFormer()

      mutating func next() async throws -> Character? {
        while let scalar = try await scalars.next() {
          if let c = former.consume(scalar) {
            return c
          }
        }
        return former.flush()
      }
    }
  }
  var characters: AsyncCharacterSequence { .init(bytes: self) }
}

The given code shows a simple use of the pitched API. However, it also shows the need for an efficient approach that can drill through abstraction layers to underlying bytes in buffers.

When the source of Unicode scalars is backed by a chunk of memory containing validly-encoded UTF-8, it is more efficient to work in terms of positions in that chunk.

Alternative naming: Use `FooCharacterBar` instead of `FooGraphemeBar` or `FooGraphemeClusterBar` throughout this proposal

There's a naming spectrum between Unicode's preferred terminology and Swift's.

On one end, if this were a standalone package aimed solely at providing an implementation of the Unicode standard via accelerated routines, there is an affordance to stick solely to Unicode terminology. Such a package could use the term "grapheme cluster" or "extended grapheme cluster" because "character" is not Unicode's terminology and could be ambiguous or confusing.

On the other end, anything in the String namespace or which vends a String type (including Character) would use the term "character". Similarly, anything outside of a dedicated Unicode package, library, namespace, module, or sub-module would as well.

As for what we are providing, we don't have a separate Unicode module or sub-module, largely for historical reasons. Unicode is an empty enum that functions as a namespace. Precedent so far has found it better to lean towards Unicode terminology for items under Unicode. For example, Unicode.Scalar.Property.isGraphemeBase is a better name than Unicode.Scalar.Property.isCharacterBase.

Contiguous buffers and segmentation API

Many byte streams are backed by chunks of contiguous memory. Rather than read byte-at-a-time and return scalar-at-a-time using internal buffering, a byte stream decoder could communicate scalar-aligned positions in its upstream's backing buffers.

We explore functionality that reads from any byte source and vends chunks of validly-encoded and validly-aligned (for a specified alignment) UTF-8. This enables efficient streaming operations, i.e. those that operate over a properly-aligned moving window of validly encoded UTF-8 bytes in contiguous memory. This involves areas of current investigation and could motivate and incorporate the more advanced lifetime management discussed.

Another important consideration is how to handle when a scalar, normalization, or grapheme cluster segment straddles multiple chunks of data. In that case, API may need to return a view into a new buffer which stores these contiguously.
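For the scalar-aligned case, the straddling problem can be sketched with plain arrays: find the longest prefix of a chunk that ends on a scalar boundary and carry the rest over to the next chunk. `scalarAlignedPrefixLength` is a hypothetical helper that assumes the overall stream is valid UTF-8; the copying here stands in for the "new buffer" mentioned above.

```swift
/// Returns the length of the longest prefix of `chunk` that ends on a
/// scalar boundary, assuming the overall stream is valid UTF-8.
func scalarAlignedPrefixLength(of chunk: [UInt8]) -> Int {
  var i = chunk.count
  // Look back over at most 4 bytes to find the last starter byte.
  while i > 0, chunk.count - i < 4 {
    i -= 1
    let byte = chunk[i]
    if byte & 0xC0 == 0x80 { continue }  // continuation byte, keep looking
    let width: Int
    switch byte {
    case 0x00...0x7F: width = 1
    case 0xC0...0xDF: width = 2
    case 0xE0...0xEF: width = 3
    default:          width = 4
    }
    // If the last scalar is complete, the whole chunk is aligned;
    // otherwise cut just before its starter byte.
    return i + width <= chunk.count ? chunk.count : i
  }
  return chunk.count
}

// U+00E9 followed by U+1F600, split in the middle of the second scalar.
let incoming: [[UInt8]] = [[0xC3, 0xA9, 0xF0, 0x9F], [0x98, 0x80]]
var carry: [UInt8] = []
var alignedChunks: [[UInt8]] = []
for chunk in incoming {
  let bytes = carry + chunk  // copies, so the straddling scalar is contiguous
  let aligned = scalarAlignedPrefixLength(of: bytes)
  alignedChunks.append(Array(bytes[..<aligned]))
  carry = Array(bytes[aligned...])
}
```

Normalization-segment and grapheme-cluster alignment follow the same pattern but need scalar properties and lookahead to decide where the last safe boundary is.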

Validation API

Validation looks at an entire input to ensure it is validly encoded. While it can be performed by decoding and discarding the contents, it can be done more efficiently as its own standalone operation if the original contents are meant to be kept in their original encoding.

extension Unicode.UTF8 {
  public static func validate<C: Collection<UInt8>>(
    _ bytes: C
  ) throws
}

// Available on UTF16 and UTF32, where endianness matters and
// where code units are not individual bytes
extension Unicode.UTF[16/32] {
  public static func validate<C: Collection<CodeUnit>>(
    _ codeUnits: C
  ) throws

  public static func validate<C: Collection<UInt8>>(
    _ bytes: C,
    endianness: Endianness
  ) throws
}

Alternative: Concrete functions taking a BufferView or some BufferViewable-like protocol
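For a sense of what standalone validation involves, here is a minimal sketch for UTF-16 code units, where the only possible errors are mispaired surrogates. `UTF16SketchError` mirrors the strawperson `UTF16.DecodingError` cases; both it and `validateUTF16` are names made up for this example.

```swift
enum UTF16SketchError: Error, Equatable {
  case expectedTrailingSurrogate
  case unexpectedTrailingSurrogate
}

/// Checks surrogate pairing without decoding or storing any scalars.
func validateUTF16(_ codeUnits: some Collection<UInt16>) throws {
  var awaitingTrail = false
  for unit in codeUnits {
    let isLead = (0xD800...0xDBFF).contains(unit)
    let isTrail = (0xDC00...0xDFFF).contains(unit)
    if awaitingTrail {
      guard isTrail else { throw UTF16SketchError.expectedTrailingSurrogate }
      awaitingTrail = false
    } else if isTrail {
      throw UTF16SketchError.unexpectedTrailingSurrogate
    } else {
      awaitingTrail = isLead
    }
  }
  // A lead surrogate at the very end is also an error.
  if awaitingTrail { throw UTF16SketchError.expectedTrailingSurrogate }
}
```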

UTF-8 validity and efficiency

UTF-8 validation is a particularly common concern and the subject of a fair amount of research. Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's String type's native storage is guaranteed-valid-UTF8 for this reason.

However, if the input isn't actually valid, assuming validity leads to a new class of security concerns.

Memory safety is more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it, which differs from the bounds dictated by its contents.

Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an overlong encoding, which would compromise any code that checks for the presence of a scalar value by looking at the encoded bytes.
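A small sketch of the overlong problem: the byte pair 0xC0 0xAF is an invalid two-byte spelling of U+002F ("/"). A byte-level check that is sound for valid UTF-8 misses it, while a decoder that skips the overlong check recovers the value. The helper name `containsSlash` is hypothetical.

```swift
// "/" is U+002F. Its only valid UTF-8 encoding is the single byte 0x2F.
let validInput: [UInt8] = [0x61, 0x2F, 0x62]            // "a/b"
// 0xC0 0xAF is an overlong (invalid) two-byte encoding of the same value.
let overlongInput: [UInt8] = [0x61, 0xC0, 0xAF, 0x62]

/// A byte-level check that is only sound when the input is valid UTF-8.
func containsSlash(_ bytes: [UInt8]) -> Bool { bytes.contains(0x2F) }

// A naive two-byte decode that skips the overlong check yields U+002F.
let smuggledValue =
  (UInt32(overlongInput[1] & 0x1F) << 6) | UInt32(overlongInput[2] & 0x3F)
```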

One approach is to define API that takes a parameter assuming that its contents contain correctly-encoded UTF-8. Today, that is often done via an unsafeAssumingValidUTF8: UnsafeRawBufferPointer parameter, but this is unsafe in multiple ways that might not be clear to the caller. The UnsafeRawBufferPointer is memory-unsafe of course, but even if the caller knows the memory itself is safe, the contents might be invalidly encoded in a way that subtly bypasses correct behavior elsewhere in the program.

A type such as BufferView would help mitigate the memory unsafety of the pointer itself, but not the far more subtle problems of assuming valid UTF-8.

The rest of this pitch is interwoven with on-going investigations into non-escapable values and statically-reasoned lifetimes. As such, it could change depending on when or how that support arrives. This is similar to how the new Atomics API was originally implemented with unsafe constructs before ~Copyable support was available.

Valid UTF8 buffer views

UTF8.ValidBufferView is a buffer view whose contents are known to be valid UTF-8 as represented in the type system.

extension Unicode.UTF8 {
  public struct ValidBufferView {
    /// TO INVESTIGATE: This field's lifetime is tied to `self`, i.e. the lifetime
    /// of either `owner` or the lexical scope into which it was returned. Any `get` 
    /// accessors should be non-escapable
    public var bytes: BufferView

    /// An object that owns the memory, if the API needed to allocate memory.
    ///
    /// This is needed when validation or alignment needs to allocate to ensure
    /// the relevant content is in contiguous memory
    public var owner: AnyObject?

    /// Create from the validated contents of `c`. If `c` contains invalidly encoded
    /// UTF-8, throws an error. If `c` is valid and provides a `BufferView`, will
    /// borrow that view. If `c` is valid but does not provide a `BufferView`, will
    /// allocate memory to provide a contiguous view.
    public init(validating c: some Collection<UInt8>) throws

    /// As `validating:`, but repairs any encoding errors. If a repair was made, a
    /// new allocation must be made for the corrected content.
    public init(repairing c: some Collection<UInt8>)
  }
}

Alignments

There are 3 particularly useful alignments to segment content such that common operations can be performed by only looking at one chunk of data at a time.

Scalar aligned: decoding and validation
Normalization-segment aligned: canonical comparison
Grapheme-cluster aligned: forming Characters

Each successive segmentation is broader than the one before: every grapheme-cluster boundary is a normalization-segment boundary and every normalization-segment boundary is a scalar boundary.

Note: Normalization segments being sub-segments of grapheme clusters is not technically guaranteed by the Unicode standard to always be true in future Unicode versions. Unicode is allowed to change the rules of grapheme breaking in future versions. That being said, a normalization segment is defined to start on a non-combining scalar, and grapheme clusters that break before combining scalars are nonsensical. Unicode handles nonsensical cases as degenerate cases and those cases do not break, though Unicode could change its mind in future versions. Because of this, the "sub-alignment" relationship between normalization segments and grapheme clusters should be treated as illustrative for the reader and not a formal API guarantee into the future.
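The nesting of boundaries can be observed with today's String views. The sketch below only covers scalar and grapheme-cluster boundaries, since it relies solely on existing public String API; all variable names are just for illustration.

```swift
let sample = "e\u{301}x"  // "e" + combining acute accent, then "x"

let codeUnitCount = sample.utf8.count          // 4 UTF-8 code units
let scalarCount = sample.unicodeScalars.count  // 3 scalars
let characterCount = sample.count              // 2 grapheme clusters

// UTF-8 offsets at which each kind of segment starts.
let scalarStarts = sample.unicodeScalars.indices.map {
  sample.utf8.distance(from: sample.utf8.startIndex, to: $0)
}
let characterStarts = sample.indices.map {
  sample.utf8.distance(from: sample.utf8.startIndex, to: $0)
}

// Every grapheme-cluster boundary is also a scalar boundary.
let nested = Set(characterStarts).isSubset(of: scalarStarts)
```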

extension Unicode.UTF8 {
  public struct ValidScalarAlignedBufferView {
    public var buffer: ValidBufferView
  }

  public struct ValidNormalizationSegmentAlignedBufferView {
    public var buffer: ValidScalarAlignedBufferView
  }

  public struct ValidGraphemeClusterAlignedBufferView {
    public var buffer: ValidScalarAlignedBufferView
  }
}

extension Unicode.UTF8 {
  /// Transforms a sequence of buffer views to a sequence of valid buffer views
  public struct ValidBufferSource<
    Upstream: Sequence<ValidBufferView>
  >: Sequence {
    public struct Iterator: IteratorProtocol {
      public typealias Element = Slice<ValidBufferView>
      public mutating func next() -> Element?
    }
    public func makeIterator() -> Iterator

    public struct ScalarAlignedBufferSource: Sequence {
      public struct Iterator: IteratorProtocol {
        public typealias Element = ValidScalarAlignedBufferView
      }
      public func makeIterator() -> Iterator
    }
    public var scalarAligned: ScalarAlignedBufferSource { get }

    public struct NormalizationSegmentAlignedBufferSource: Sequence {
      public struct Iterator: IteratorProtocol {
        public typealias Element = ValidNormalizationSegmentAlignedBufferView
      }
      public func makeIterator() -> Iterator
    }
    public var normalizationSegmentAligned: NormalizationSegmentAlignedBufferSource { get }

    public struct GraphemeClusterAlignedBufferSource: Sequence {
      public struct Iterator: IteratorProtocol {
        public typealias Element = ValidGraphemeClusterAlignedBufferView
      }
      public func makeIterator() -> Iterator
    }
    public var graphemeClusterAligned: GraphemeClusterAlignedBufferSource { get }
  }

  /// ... similarly for async
  public struct ValidAsyncBufferSource<
    Upstream: AsyncSequence
  >: AsyncSequence where Upstream.Element == ValidBufferView {
    public typealias Element = Slice<ValidBufferView>
    public struct AsyncIterator: AsyncIteratorProtocol {
      public typealias Element = Slice<ValidBufferView>
      public mutating func next() async throws -> Element?
    }
    public func makeAsyncIterator() -> AsyncIterator

    // ... similarly, aligned async sources ...
  }
}

Aligning data along these boundaries can be useful for implementing data structures that retain their own copy of the storage. Such a data structure may want to guarantee that it can vend a given view's Element by inspecting only a single chunk.

Alternative: Type-parameterize based on alignment instead, or even dynamic-value-parameterize based on alignment.

Accessing ranges

The above API, which provides alignment with normalization segments and grapheme clusters, can also provide a view of the bytes which comprise an individual normalization segment or grapheme cluster:

extension Unicode.UTF8.ValidBufferSource.NormalizationSegmentAlignedBufferSource {
  /// A sequence of each individual normalization segment
  ///
  /// TO INVESTIGATE: Unfortunately, the process of normalization can produce
  /// sub-segments, so need to carefully design a contract here.
  public struct NormalizationSegmentView: Sequence {
    public struct Iterator: IteratorProtocol {
      public typealias Element = ValidNormalizationSegmentAlignedBufferView
      public mutating func next() -> Element?
    }
    public func makeIterator() -> Iterator
  }
}

extension Unicode.UTF8.ValidBufferSource.GraphemeClusterAlignedBufferSource {
  /// A sequence of each individual grapheme cluster
  public struct GraphemeClusterView: Sequence {
    public struct Iterator: IteratorProtocol {
      public typealias Element = ValidGraphemeClusterAlignedBufferView
      public mutating func next() -> Element?
    }
    public func makeIterator() -> Iterator
  }
}

Normalization segments are particularly tricky to account for, as the normalization process could turn a single segment into multiple ones.

Alternative: Pending BufferView's final design with respect to self-slicing, a view of the bytes comprising a single normalization-segment or grapheme cluster might be represented using a Slice.

Creating Strings

String's decoding initializers are difficult to discover and use as they make use of metatypes: String(decoding: myBytes, as: UTF8.self). Attempts to rectify this have been saddled with compatibility concerns. This may be a good opportunity to make some progress on this. Alternatively, this is severable should it start to bog down the rest of this pitch.
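For reference, this is how the existing metatype-based initializer behaves today: it never fails and instead repairs invalid input with U+FFFD.

```swift
let rawBytes: [UInt8] = [0x61, 0xFF, 0x62]  // "a", an invalid byte, "b"

// The metatype argument selects the source encoding.
let repaired = String(decoding: rawBytes, as: UTF8.self)

// The invalid byte was replaced with U+FFFD REPLACEMENT CHARACTER.
let repairedScalars = repaired.unicodeScalars.map(\.value)
```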

The below String inits are straw-person named and intentionally presented in a naming-vacuum, that is without consideration for existing String API names. This helps us work on enumerating the functionality and presenting the entire API picture without simultaneously juggling some of the current issues in API names.

For example, SE-0405 String Initializers with Encoding Validation takes a stab at improving the story somewhat with nil-return inits, but it uses the same validating: name as the error-throwing inits below. Depending on exactly how this pitch takes shape and when it is ready for review, the below could be considered an amendment to SE-0405 or a straw-person naming-vacuum investigation.

// Strawperson assuming typed throws
extension String {
  /// Sequence-version of stdlib's `String.init(decoding: x, as: UTF8.self)`
  public init(repairingUTF8: some Sequence<UInt8>)

  /// Puts contents in stdlib-normal-form for fast comparison.
  /// `Character`s are the same, but scalars and code unit views
  /// could show different (i.e. normalized) contents
  public init(normalizingUTF8: some Sequence<UInt8>)

  /// Checks for errors and throws them: Sequence error version
  public init(
    validatingUTF8: some Sequence<UInt8>
  ) throws(UTF8.ByteStreamDecodingError)

  /// Checks for errors and throws them: Collection error version
  public init(
    validatingUTF8: some Collection<UInt8>
  ) throws(UTF8.CollectionDecodingError)

  /// This is a convenience spelling for either repairing or normalizing.
  /// We can pick/debate which would be better, there are reasonable
  /// arguments for either.
  public init(utf8: some Sequence<UInt8>)
}

Similarly, we could provide API that is parameterized over the encoding, as well as API over byte streams with an associated endianness.

extension String {
  // Repairing
  public init<Encoding: Unicode.Encoding>(
    repairing: some Sequence<Encoding.CodeUnit>, 
    as sourceEncoding: Encoding.Type
  )

  public init<Encoding: Unicode.Encoding>(
    repairing: some Sequence<UInt8>, 
    as sourceEncoding: Encoding.Type,
    endianness: Endianness
  )

  // Normalizing
  public init<Encoding: Unicode.Encoding>(
    normalizing: some Sequence<Encoding.CodeUnit>, 
    as sourceEncoding: Encoding.Type
  )

  public init<Encoding: Unicode.Encoding>(
    normalizing: some Sequence<UInt8>, 
    as sourceEncoding: Encoding.Type,
    endianness: Endianness
  )

  // Validating
  public init<Encoding: Unicode.Encoding>(
    validating: some Sequence<Encoding.CodeUnit>,
    as sourceEncoding: Encoding.Type
  ) throws(UTF8.ByteStreamDecodingError)

  public init<Encoding: Unicode.Encoding>(
    validating: some Sequence<UInt8>,
    as sourceEncoding: Encoding.Type,
    endianness: Endianness
  ) throws(UTF8.ByteStreamDecodingError) // Or CodeUnitStreamDecodingError?

  public init<Encoding: Unicode.Encoding>(
    validating: some Collection<Encoding.CodeUnit>,
    as sourceEncoding: Encoding.Type
  ) throws(UTF8.CollectionDecodingError)

  public init<Encoding: Unicode.Encoding>(
    validating: some Collection<UInt8>,
    as sourceEncoding: Encoding.Type,
    endianness: Endianness
  ) throws(UTF8.CollectionDecodingError) // Or ByteCollectionDecodingError?
}

Library extensibility and use cases

Encodings and protocols

The stdlib has existing protocols, though they can be difficult to conform to, difficult to use, and derived operations can be inefficient. More investigation is needed to see how to improve them or else how to fit new improvements into them.

We could consider adding protocols for encoding errors, decoder structs, etc., seeing if there's a good library-extensibility story here.

Good case studies include CESU-8, which uses UTF-16-style surrogate pairs for non-BMP scalars in a UTF-8-like encoding, resulting in up to 6 bytes per Unicode scalar value. Java's modified UTF-8 further extends CESU-8's approach by using an overlong encoding for NUL. These are not valid UTF-8 encodings, but they are valid Unicode encodings as surrogates must be paired. They trade off some of UTF-8's advantages for compatibility benefits.
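To make the CESU-8 case concrete, here is a sketch of an encoder for a single scalar. It assumes the usual description of CESU-8: BMP scalars are written as in UTF-8, and every other scalar is split into a UTF-16 surrogate pair with each surrogate written in the 3-byte UTF-8 pattern. cesu8Bytes(for:) is a hypothetical helper, not existing or proposed API.

```swift
// Sketch of CESU-8 encoding for one scalar (hypothetical helper).
func cesu8Bytes(for scalar: Unicode.Scalar) -> [UInt8] {
  let text = String(scalar)
  // BMP scalars are encoded exactly as in UTF-8.
  if scalar.value <= 0xFFFF { return Array(text.utf8) }
  // Non-BMP scalars: encode each UTF-16 surrogate as a 3-byte sequence.
  var bytes: [UInt8] = []
  for unit in text.utf16 {
    bytes.append(0xE0 | UInt8(unit >> 12))
    bytes.append(0x80 | UInt8((unit >> 6) & 0x3F))
    bytes.append(0x80 | UInt8(unit & 0x3F))
  }
  return bytes
}
```

For U+10400 this yields the 6 bytes ED A0 81 ED B0 80, where UTF-8 would use the 4 bytes F0 90 90 80.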

WTF-8 allows unpaired surrogates and thus is not a valid Unicode encoding. It could be interesting to consider how to help support this kind of invalid encoding by creating individual code points instead of whole Unicode scalar values.

There are also encodings that only encode a subset of Unicode, such as ASCII (which UTF-8 is a binary-compatible superset of) and Latin1 (which UTF-8 is not binary compatible with). Supporting these is tricky, as transcoding is lossy and otherwise complete functions become partial functions.
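The partial-function problem can be seen with a small Latin1 sketch. Decoding is total because every Latin1 byte maps to the scalar with the same value, while encoding has to fail (or lose data) for any scalar above U+00FF. decodeLatin1 and encodeLatin1 are hypothetical helpers for illustration only.

```swift
// Total: every byte 0x00...0xFF is the scalar with the same value.
func decodeLatin1(_ bytes: some Sequence<UInt8>) -> String {
  var scalars = String.UnicodeScalarView()
  for byte in bytes { scalars.append(Unicode.Scalar(byte)) }
  return String(scalars)
}

// Partial: returns nil when a scalar does not fit in a single byte.
func encodeLatin1(_ string: String) -> [UInt8]? {
  var bytes: [UInt8] = []
  for scalar in string.unicodeScalars {
    guard let byte = UInt8(exactly: scalar.value) else { return nil }
    bytes.append(byte)
  }
  return bytes
}
```

Note that the Latin1 byte 0xE9 (é) is not the same as its UTF-8 encoding C3 A9, which is why the two encodings are not binary compatible outside of ASCII.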

The stdlib currently provides ASCII as a Unicode encoding. However, as a subset encoding, it has some sharp edges and follows different conventions from the actual Unicode encodings. We should consider sunsetting this encoding in favor of UTF-8, which is a strict superset. The stdlib's implementation should detect and fast-path UTF-8 when the contents happen to be ASCII-only anyway. We can provide optimized isASCII queries on String, byte buffers, byte streams, etc.
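A naive version of such a query over bytes might look like the sketch below. The free function isASCII(_:) is a hypothetical name, and a real implementation would presumably scan a word at a time rather than byte by byte.

```swift
// Sketch: ASCII bytes are exactly those with the high bit clear.
func isASCII(_ bytes: some Sequence<UInt8>) -> Bool {
  bytes.allSatisfy { $0 < 0x80 }
}

// ASCII-only bytes are already valid UTF-8, so no transcoding is needed.
let asciiBytes: [UInt8] = [0x48, 0x69, 0x21]
let decoded = String(decoding: asciiBytes, as: UTF8.self)
```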

Libraries

The Swift Collections package defines a BigString type, which provides String-like functionality over rope-like storage.

Foundation's AttributedString is built on BigString. Additionally, Foundation parses data formats such as plists which are encoded using UTF-16 in big endian byte order.

Foundation also normalizes paths, on some file systems, to a pre-Unicode-3.0 NFD. Unicode version 3.0 is important since it is only afterwards that normalization properties are stable. Other libraries may need to similarly specify a specific Unicode version and bundle their own data tables to drive normalization and, especially, decomposition. This could be done through a data-table provider protocol, though there may be efficiency concerns with working through such an abstraction. Either way, the normalization-segmentation API are helpful for performing custom decomposition.

Relatedly, a server-client library may wish to ensure that both the server and client are using the same version of Unicode for the purposes of canonical equivalence. While the properties for defined code points are stable, it is possible that an undefined code point could normalize differently in future Unicode versions. An alternative approach would be a quick scan for the presence of undefined code points.
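The quick scan could be sketched with the existing Unicode.Scalar.Properties API. containsUnassignedScalars(_:) is a hypothetical helper, and its answer depends on the Unicode version of the stdlib it runs against.

```swift
// Sketch: report whether any scalar is unassigned (general category Cn)
// in the Unicode version known to the running stdlib.
func containsUnassignedScalars(_ string: String) -> Bool {
  string.unicodeScalars.contains {
    $0.properties.generalCategory == .unassigned
  }
}
```

Noncharacters such as U+FFFF also report the unassigned category, so a production check might want to treat them separately.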

We can look at using some of the byte-stream functionality over AsyncIterator to define adapters (combiners? operators?) such as AsyncUnicodeScalarSequence, AsyncCharacterSequence, etc. These could be good API to have in the stdlib proper or in the Swift Async Algorithms package.
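For a feel of what such an adapter does, here is a synchronous analogue built on the existing Unicode.UTF8.ForwardParser. ScalarIterator is a hypothetical type; an async version would need a decoder that can be driven by an AsyncIteratorProtocol, which today's parser API cannot do.

```swift
// Sketch: decode scalars from any iterator of UTF-8 bytes, repairing
// ill-formed input with U+FFFD. Chunk boundaries are invisible here
// because the parser pulls one byte at a time.
struct ScalarIterator<Bytes: IteratorProtocol>: IteratorProtocol
where Bytes.Element == UInt8 {
  var bytes: Bytes
  var parser = Unicode.UTF8.ForwardParser()

  mutating func next() -> Unicode.Scalar? {
    let replacement: Unicode.Scalar = "\u{FFFD}"
    switch parser.parseScalar(from: &bytes) {
    case .valid(let encoded): return Unicode.UTF8.decode(encoded)
    case .error: return replacement
    case .emptyInput: return nil
    }
  }
}

// Usage: the Euro sign (E2 82 AC) spans two chunks.
let chunks: [[UInt8]] = [[0xE2, 0x82], [0xAC, 0x41]]
var scalarIterator = ScalarIterator(bytes: chunks.joined().makeIterator())
var decodedScalars: [Unicode.Scalar] = []
while let scalar = scalarIterator.next() { decodedScalars.append(scalar) }
```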

The WebURL package does a hefty amount of Unicode processing and is a great example of the kinds of libraries that the stdlib should empower. It can serve as a good target for these improvements and many others.

Libraries such as Swift Syntax sometimes roll out their own decoding and would benefit from a standard approach.

I'm interested in hearing about other libraries and potential use cases.

Future directions

Normalization

A future direction is for String, UTF8.ValidBufferView, etc., to provide lazily-normalized views of their contents under NFC, NFD, as well as forms provided by libraries.

For the buffer-based API, a future direction could include composing and decomposing API, possibly driven by a library's data tables, along the lines of Foundation's path normalization described above.

BOM

A byte order mark (BOM), conventionally U+FEFF, is sometimes used when serializing content using an endianness-sensitive encoding.

When used this way, the BOM is not part of the content but part of the serialization format. These decoders will correctly decode a BOM (or the byte-swapped noncharacter U+FFFE, if given the wrong endianness), which makes it easier to implement encoding detection based on a potential leading BOM.

We should add some additional API to perform this BOM analysis. E.g., returning:

enum BOMAnalysisResult {
  case utf16BE
  case utf16LE
  case utf32BE
  case utf32LE
}

However, BOM analysis should not be conflated with the decoding API. A BOM is actually just a conventional treatment of the Unicode scalar U+FEFF ZERO WIDTH NO-BREAK SPACE which can be a perfectly valid member of textual content or it can be an encoding signifier that's not part of the textual content.

Also, BOM analysis would require reading from the input, so any API would need to consume enough bytes to detect one. If there were no BOM, a byte-stream API might need to yield that input back in some fashion.
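When the input is a Collection rather than a single-pass stream, the analysis can peek at a prefix without consuming anything, which sidesteps the yield-back problem. The sketch below assumes the result enum above; analyzeBOM(_:) is a hypothetical name.

```swift
enum BOMAnalysisResult {
  case utf16BE
  case utf16LE
  case utf32BE
  case utf32LE
}

// Sketch: inspect at most 4 leading bytes, leaving the input untouched.
// The UTF-32 patterns are checked first because FF FE 00 00 also
// begins with the UTF-16LE pattern FF FE.
func analyzeBOM(_ bytes: some Collection<UInt8>) -> BOMAnalysisResult? {
  let head = Array(bytes.prefix(4))
  if head == [0x00, 0x00, 0xFE, 0xFF] { return .utf32BE }
  if head == [0xFF, 0xFE, 0x00, 0x00] { return .utf32LE }
  if head.starts(with: [0xFE, 0xFF]) { return .utf16BE }
  if head.starts(with: [0xFF, 0xFE]) { return .utf16LE }
  return nil
}
```

The FF FE 00 00 case is inherently ambiguous, since it is also a UTF-16LE BOM followed by U+0000; this sketch simply prefers UTF-32LE.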

Shared and ephemeral strings

An often desired feature is to have String or Substring API available on storage that's owned by another object, e.g. shared substrings or using String's ABI support.

One direction could be closure-based API on Substring and UTF8.ValidBufferView:

func withEphemeralString<T>(_: (@nonescaping String) throws -> T) rethrows -> T

The non-escaping string parameter would share storage with the Substring, UTF8.ValidBufferView, or slice type. It is marked non-escaping on Substring because it could be referencing a small portion of a much larger allocation, and escaping the string would keep that larger allocation alive. For UTF8.ValidBufferView, it is non-escaping because it could be borrowing the content from the lexical scope of the caller.

Another spelling could be a non-escaping computed var on UTF8.ValidBufferView or Substring, pending further investigation into statically-enforced lifetimes and lifetime inference.

This could be useful when an API strictly needs a (non-escaping) String and performing a copy of the slice is undesirable. However, any function that string is passed into would either also need to take a non-escaping string or else make a copy (which would defeat the original purpose).

String API on validated UTF-8 bytes

UTF8.ValidBufferView could also have String's API on it, at least in some fashion. Future work could include Regex support, etc.

milseman/unicode_processing.md