The Enduring Legacy of a Very Naughty Hack from 1995 / A Bit of Intl
This is my first attempt at a tech talk thing. The focus here is strings in ES. First we’ll talk about what a string is in ES, then some of the consequences that fall out of that (a few aren’t super obvious), and finally some of the cool tools available for working with strings safely & effectively. Because of the nature of the subject we’ll also get into some more general information about Unicode, encodings and broad concepts related to strings and language.
Caveat: I’m hardly an expert in this stuff, so there will likely be some details I describe inaccurately, but hopefully the truth:lies ratio is pretty good.
- What is a string in ES
- How can we deal with this
- Intl and string comparison
- About Intl
- Bonus round: String canonicalization
- Some additional info on encodings
So, one common way to think of strings is as lists of characters. But that’s not what they are in ES (nor most languages). Another common way to think of strings is as lists of codepoints. That’s also not what they are in ES.
They’re lists of code units. Specifically, uint16 code units that are almost, but not quite, the code units of the UTF-16 encoding scheme.
I’ll step back a moment though and provide some definitions, cause the terminology here is probably unfamiliar if you aren’t super into encodings and stuff.
Simple primer first — we’ve got three tiers of abstraction here:
- character: very human concept, think like ... the "idea" of the letter "A"
- codepoint: an underlying integer that "backs" characters (not always 1:1)
- code unit: an underlying integer specific to an encoding scheme that "backs" codepoints (not always 1:1). usually these are not something you need to be aware of when handling strings in a high level language
Let’s start going deeper with character. This is actually the hardest one.
Characters are super abstract. In Unicode, one character usually maps to one codepoint ('a', 0x61), but sometimes multiple codepoints can make one character (like when combining marks are used with letters), and sometimes one codepoint can make multiple characters (like the ligature ﬁ, a single codepoint standing in for "f" and "i").
Now, keep in mind that Unicode is not an encoding. We’re not talking about encoding yet. Unicode is a standard that defines a lot of different stuff, and what we’re talking about here is the mappings between codepoints and characters. Very few programming languages will have you work with real characters as such. Most applications don’t really need to worry about characters, and most of the time, one character maps directly to one codepoint, or it just doesn’t matter much, and that’s the end of the story. But take for example the letter ä. There’s this character, a lowercase a with a dieresis floating on top, and actually there’s not just one way to describe this character with codepoints.
The obvious way is via the codepoint 0xE4. That one codepoint maps to this character.

A less obvious way is via the codepoint sequence 0x61 0x308: ä. It’s the same character, but we "made it" with two different codepoint representations. The second one uses a combining mark. You see this stuff in action also when you use emojis with skin color modifiers. Those are one character, two codepoints. We’ll come back to this stuff later a bit. This terminology has nothing to do with ES, but I wanted to start clarifying these words.
In programming, when we say "character" we usually are speaking very loosely. We often really mean something like "string value of one codepoint". Until we get to the section about canonicalization, we won’t worry too much about this.
So, I mentioned before, codepoints aren’t part of any particular encoding. These are numbers that describe either single characters or parts of composed characters or in a few cases one even describes multiple characters (really).
All the integers from 0x0 to 0x10FFFF are "codepoints" in Unicode, but not all of these are valid codepoints. There are two ranges within there that are "blacked out" — the high and low surrogates, 0xD800–0xDBFF and 0xDC00–0xDFFF — plus a handful of special cases where it gets fuzzy, like the noncharacter values involved in byte order detection (0xFFFE and friends). To be valid, though, a codepoint does not have to have a defined meaning.
The term for a "valid" codepoint — one which may appear in a well-formed Unicode string — is "Unicode scalar value".
In really nice modern languages, strings are lists of codepoints. But ES isn’t that modern.
This is where we get to encoding-specific stuff. There are many ways to encode Unicode (utf8, utf16le, utf32, etc) or subsets of Unicode (ucs2, etc). The most common is UTF8. UTF8 is very efficient and clean. It’s usually the most compact way to represent Unicode text. But from a programming perspective, it’s sort of quirky. It doesn’t map neatly to how we like to pretend strings are simpler than they are. Every codepoint over 0x7F has to be described using 2 to 4 bytes. Unless you’re absolutely certain the text is constrained to ASCII codepoints, you can’t just iterate over bytes and get "characters". You always have to decode.
Each encoding may have its own set of code units. Code units are words that are all the same length. In UTF8, that length is a single byte, though some byte values can never appear (0xC0, 0xC1, and 0xF5–0xFF). So UTF8 has 243 distinct code units.
UTF32 is an interesting case. In UTF32, the word length is four bytes. That’s actually big enough to describe every codepoint in Unicode without any shenanigans — it’s just 1:1. That makes it awesome for programming, actually. In a UTF32 byte sequence, every four-byte word is just a codepoint, no decoding needed (though you still need sanity checks for whether it’s a valid codepoint). The thing is, intuitive as this is, it’s kind of crazy to use a uint32 for every codepoint when the vast majority of text out there is composed of codepoints below 0x80, or at least below 0x10000.
Caring about how many bytes are in play was even more important in 1995. That’s when ES was invented. And at that point, the codepoints beyond the BMP (basic multilingual plane — 0x0 to 0xFFFF) were basically hypothetical. No software supported them, and the handful of codepoints defined there were for shit like Byzantine musical symbols.
So Eich took a shortcut (which is not unique to ES). He went for the shittiest of all full Unicode encodings, UTF16. In UTF16, all codepoints in the BMP map to one two-byte word. So at the time, that meant it effectively had those nice 1:1 properties from UTF32. You could access a string by index like "foo"[0] and you’d probably never think about what it really meant. Pseudocode:
function getCharAt(underlyingBuffer, index) {
  // Assumes every "character" is exactly one two-byte code unit.
  return String.fromCharCode(underlyingBuffer.readUInt16LE(index * 2));
}
Fast forward to 2009. There’s lots of stuff above the BMP now, in the SMP (supplemental multilingual plane) — even scripts used by a few obscure but living languages. But who cares? Nobody programming, anyway. None of this seems to matter yet.
Okay, fast forward one more year. It’s October of 2010 and...
EMOJI 👏 LAND 👏 IN 👏 THE 👏 SMP.
OH NO WE CARE WE CARE SO MUCCCCCH 🍆🍆🍆.
You can see this early shortcut "leak" in two ways in ES today.
The first is that String.prototype.length abuses the shortcut by reporting the number of code units (pseudocode: buffer.length / 2), not the number of codepoints or characters:
assert("💩".length === 2);
The second comes into play with indexed string access and methods that do stuff like slicing strings. The indices point to these two-byte code units, not to codepoints. Code units don’t have string values. They are the symbols that get decoded to get codepoints, not the codepoints themselves, and it just so happens that for all Unicode scalar values below 0x10000, the code units used to represent them are the same as their own values.
So ... what do we get?
assert("💩"[0].length === 1); // hmm
assert("💩".charCodeAt(0) === 0xD83D); // WHO ARE YOU???
assert("💩".charCodeAt(1) === 0xDCA9); // WHAT DO U WANT
If we log either of these we’ll see "�", the replacement character that gets rendered for invalid values. Those two items are code units that are not also Unicode scalar values — the mask is off, they were all code units all along!
Code units of what though? As mentioned earlier, code units are the symbols that appear in a specific encoding. In this case these two are code units of UTF16 specifically: the high and low surrogates. A pair of these together gets decoded to a single codepoint in UTF16. This is the crux of the quirk: ES permits addressing and even handling, as strings, sequences which use surrogate code units (an internal facet of the UTF16 encoding) as if they were scalar value codepoints / characters. This is an abstraction leak.
Excepting cases where you know in advance that a string will be constrained to codepoints that can be described with a single code unit in UTF16, we should be handling strings quite differently than we may be used to. ES2015 introduced the tools to do so.
Well, first off I’m gonna point out that, annoying as this is, it’s not unusual. A lot of languages have similar quirks, and it’s actually kind of a miracle that JS is natively Unicode (or Unicode-ish) to begin with, given it’s 22 years old. So much of the shitstorm around Python 3 concerned them finally fixing the incredibly broken strings from Python <= 2, and people were like waaaah but I love having all these bugs and casually alienating users who don’t speak English by accident waaah. (Python 3 does strings excellently now, I think; they cleaned up hardcore.)
Because ES cannot break the web, it’s hard to entirely fix something this systemically deep. So indexed access is still unsafe, as are charAt and charCodeAt. I mentioned earlier that it’s okay to use these if you’re totally confident about the nature of the string content, but personally I’d recommend just treating them as deprecated — we have better ways now.
I’m going to use the word "character" a bit loosely now, to refer to the string value of a given codepoint. This is what we usually think of when programming.
ES2015 introduced iterators, which are a natural fit for cleaning up strings. Any operation which accesses String.prototype[Symbol.iterator] will correctly yield the right stuff!
const str = '💩💩💩';
let i = 0;
for (const char of str) {
assert(char === '💩');
i++;
}
assert(i === 3);
Same dealie!
assert(Array.from(str).length === 3);
// or
assert([ ...str ].length === 3);
Same dealie!
assert([ ...str ][0] === '💩');
But if we do want the first char and the string is long, we may want to take a different approach. Above, we iterate over the whole string, but we really only need to pump the iterator one time:
assert(str[Symbol.iterator]().next().value === '💩');
This one is a bit different. On one hand it’s simple: swap out charCodeAt and use codePointAt. But the index you pass in is still a code unit index. Kind of lame, but it’s maybe understandable that they didn’t want to introduce an opaquely different concept of indexing in strings. If you try to access a codepoint at an index which points to the second code unit of a surrogate pair, it will return the surrogate code unit value instead, just like charCodeAt.
assert(str.codePointAt(0) === 0x1F4A9);
...but watch out for:
assert(str.codePointAt(1) === 0xDCA9);
For this reason you typically don’t want to use it this way, though you could if you advanced the index by 2 rather than 1 whenever you see a codepoint over 0xFFFF. However, in the most common cases for accessing codepoints, you want all of them — so you can just do this:
const codepoints = [ ...str ].map(char => char.codePointAt(0));
// Or better yet:
const codepoints = Uint32Array.from(str, char => char.codePointAt(0));
That will be safe, but similar issues come up with things like slice(); these take code unit indices. So when figuring out what the indices are, keep that in mind.
Alright, so let’s move on to something a bit different.
A classic "wat" people point to with js is something like the following:
[ 1, 2, 11, 22, 'abattoir', 'YOLO' ].sort();
// -> [ 1, 11, 2, 22, 'YOLO', 'abattoir' ]
The story here is pretty simple: the default sort is a string comparison. It’s a bit silly — numeric comparison is surely the more commonly useful case. The string comparisons in question give the same results you would get from using > or < to compare strings.
If we put aside the odd choice of default type comparison, there are still other aspects of the result which may be unintuitive, but they are not a wat. In fact this result is very much the only correct answer possible without additional context, and any assumptions about such a context would be PURE, HIGH OCTANE EVIL.
Specifically, the string comparison is code unit order (comparing the uint16 code units one by one — effectively codepoint order for BMP text). The reason you can’t get 'abattoir' before 'YOLO' out of the box is that collation order is not an intrinsic property of a codepoint. It is a property of languages.
In other words, anything other than raw code unit order would suck because it would either be nondeterministic (user locale) or default to, probably, English (colonialism is shitty).
Take, for example, the following array of strings:
const words = [ '1', '11', '2', '22', 'rövhål', 'Rugburn' ];
What order should these be in? The question makes no sense if you can’t say what language’s collation order to use. And there are a lot of languages, so this could be tricky ... but fortunately, since ES 5.1, it hasn’t been.
const english = new Intl.Collator('en');
const swedish = new Intl.Collator('sv');
words.sort(english.compare); // -> ["1", "11", "2", "22", "rövhål", "Rugburn"]
words.sort(swedish.compare); // -> ["1", "11", "2", "22", "Rugburn", "rövhål"]
So, cool, we have correct collation for alphabetic words per language, but we still have those annoying numbers. We can fix that, though.
const english = new Intl.Collator('en', { numeric: true });
words.sort(english.compare); // -> ["1", "2", "11", "22", "rövhål", "Rugburn"]
In fact there are quite a few additional options, because collation is a much more complex concern than may be obvious.
Many people don’t know about the Intl object, but it is not a DOM API — it is an intrinsic namespace object in ES like JSON and Math. Like the rest of ES, it is governed by TC39, but it has a distinct spec (ECMA-402, as opposed to ECMA-262) which is not tied to annual ES versions.
It also provides functionality for date and time formatting and number formatting. Because Intl is an unusual case in terms of size (it needs complete knowledge of Unicode, among other things), implementations of ES are permitted to omit it. Fortunately, the last holdout, iOS Safari, now supports it too — but to be safe, you should still typically wrap Intl functionality in a try-catch, with a "less smart" simple fallback in the catch block.
I said earlier that we would come back to "characters", how they can be created with different codepoint sequences. Unicode explicitly defines how these alternative representations of character identities work, because it’s sometimes desirable to normalize strings so that two strings don’t use different representations for the same content, e.g. when deduplicating.
ES2015 introduced this as String.prototype.normalize(). It takes an argument indicating whether you want to normalize to canonical compositions (i.e. prefer, where available, pre-composed single-codepoint forms) or canonical decompositions (always prefer forms using the atomic combining characters). There are two other options for special backwards compatibility needs, but those are lossy transformations which you would rarely want. Usually you want the default, 'NFC' (canonical composition):
assert('\u00E4'.normalize() === '\u0061\u0308'.normalize());
If you’re curious about more details about implementation and usage in node:
From the perspective of an ES program, strings don’t "have" an encoding, so to speak, but the abstraction leaks described above imply an encoding which is like UCS2 except with surrogates permitted (i.e., all values from 0x0 to 0xFFFF are legal, and values from 0x10000 to 0x10FFFF are just "phenomena" that emerge when the string is viewed or accessed via ES2015 interfaces).
However in practice, this is not how engines actually encode strings internally, and even if it were, real world needs (where strings leave ES) mean there has to be a way to transcode the potentially-illegal Unicode into legal Unicode ... or at least into some other well-defined byte encoding.
Almost all text today is exchanged in UTF-8, so what emerged from this is a pseudo-standard called WTF-8.* For any valid Unicode string, WTF-8 and UTF-8 are identical: SMP codepoints are still encoded in the normal UTF-8 manner (UTF-8 does not use surrogates as code units; the code units in UTF8 are each a single byte — surrogates are code units only in UTF-16 and CESU-8). However for an ES-style string which may contain unpaired surrogates as if they were Unicode scalar values on their own, WTF-8 permits encoding these occurrences in the regular UTF-8 fashion (that is, they become three-byte sequences beginning with 0xED, which is technically not UTF-8).
That said, WTF-8 is mainly intended for internal interchange; you wouldn’t want to expose it, since it’s (potentially) garbage. Rather, one normally would convert illegal unpaired surrogates to 0xFFFD (the famous question-mark-in-a-diamond replacement character). This is indeed what, for example, node’s Buffer will do when it creates buffers from strings. If you need to actually write such strings as-is to a file, you’d have to do so in userland, e.g. via a custom WTF-8 encoder. But you would not want to do that, really.
* At one point WTF-8 was used to mean something else, a joke proposal inspired by the (now fortunately uncommon) problem of agents presuming everything is in oldskool CP-1252 and dumping mojibake.
Although not super relevant to the above, I mention it because it’s classic. Some example passages:
- Terminology
The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE", "RIDICULOUS", "BLOODY HELL", and "DIE IN A GREAT BIG CHEMICAL FIRE" in this memo are to be interpreted as described in [RFC2119].
For example, code that reads cuneiform text encoded in UTF-16 ignoring the surrogate pairs and the byte order mark, then writes out the 16-bit numbers in UTF-8 thereby making the previous data illegible. The justification for allowing such incompetent code was that there were no major implementations of the Unicode supplementary planes and no significant amounts of data containing bronze age writing. The issue has been dubbed the "Babylonian mess", and the relevant programmers have pledged to produce different bugs in the future.
- Security Considerations
Implementers of WTF-8 should not consider the security aspects of how they handle character data. After all, it is inconceivable that in any circumstances an attacker would be able to exploit an incautious parser by sending it an octet sequence.