String Representations and String APIs

See also Unicode.

Notes on the underlying data representation of strings and the appropriate APIs for handling them.

How to Implement Strings

Overview of approaches to implementing strings. Primarily covers the underlying data representation, without going deep on public APIs or the handling of Unicode.

Why is Swift's String API So Hard

Argues that the complexity of Swift's string types is necessary. In different contexts, you may want to work with a sequence of bytes, code points or grapheme clusters.

Swift strings have no canonical representation, but offer different views. The easiest way to treat them is as a sequence of characters (grapheme clusters), but it's not the only way. Several consequences fall out of this: O(1) indexing is not possible, and the Swift APIs are designed to make that obvious.
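To make the "different views" idea concrete, here is a minimal sketch in Python rather than Swift. The function name `string_views` is made up for illustration, and the grapheme count is only a rough approximation (it skips combining marks); proper grapheme segmentation follows the Unicode text segmentation rules, which the Python standard library does not implement.

```python
import unicodedata


def string_views(s: str) -> dict:
    """Return the length of `s` under several different views."""
    return {
        # Number of bytes in the UTF-8 encoding.
        "utf8_bytes": len(s.encode("utf-8")),
        # Number of 16-bit code units in the UTF-16 encoding.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # Python's own len() counts code points.
        "code_points": len(s),
        # Rough approximation only: count code points that are not
        # combining marks. This is NOT full grapheme segmentation.
        "approx_graphemes": sum(1 for ch in s if unicodedata.combining(ch) == 0),
    }


# "café" spelled as 'e' followed by COMBINING ACUTE ACCENT (U+0301).
word = "cafe\u0301"
views = string_views(word)
# views == {"utf8_bytes": 6, "utf16_code_units": 5,
#           "code_points": 5, "approx_graphemes": 4}
```

The same four visible characters have three different lengths depending on the view, which is the situation the Swift API is trying to make explicit.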

(Side note: some of the grapheme clusters in this article won't display properly, depending on your OS and browser.)

Strings, Bytes, Runes and Characters in Go

Philosophically, I dislike Go's "a string is just an arbitrary []byte" approach, but I haven't written enough Go to judge whether it is a problem in practice. Even among practitioners, opinions seem to differ by use case: some people swear that Python 2's treatment of bytes vs. strings is fine, or even better than Python 3's, arguably because they worked on a subset of problems where ignoring the distinction was safe.

History of Null-Terminated Strings

The PDP-11 had a special assembly instruction for dealing with null-terminated strings.

Relitigating history: the argument that null-termination saves memory doesn't make a lot of sense to me. I would think you could have used a variable-length representation for the length, with 1 bit indicating whether the length is greater than 127. That gives an overhead of 1 byte for most strings (equal to the overhead of the null byte) and an overhead of <1% on large strings. It would require a few extra cycles, but null-terminated strings already require extra cycles to get the length.
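A minimal sketch of one way such a variable-length prefix could work, assuming a LEB128-style scheme where the high bit of each length byte says "more length bytes follow". This is my own illustration of the idea, not something any historical system used; the names `encode_length`, `decode_length`, `pack` and `unpack` are hypothetical.

```python
def encode_length(n: int) -> bytes:
    """Encode a non-negative length as a variable-length prefix."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more length bytes follow
        else:
            out.append(low7)  # high bit clear: this is the last length byte
            return bytes(out)


def decode_length(buf: bytes) -> tuple:
    """Return (length, number of prefix bytes consumed)."""
    n = 0
    shift = 0
    for i, b in enumerate(buf):
        n |= (b & 0x7F) << shift
        if not (b & 0x80):
            return n, i + 1
        shift += 7
    raise ValueError("truncated length prefix")


def pack(payload: bytes) -> bytes:
    """Prefix the payload with its encoded length."""
    return encode_length(len(payload)) + payload


def unpack(buf: bytes) -> bytes:
    """Read one length-prefixed payload from the start of `buf`."""
    n, used = decode_length(buf)
    return buf[used:used + n]
```

Under this scheme strings of up to 127 bytes pay 1 byte of overhead, the same as a terminator, and a 1000-byte string pays 2 bytes.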

duclare's comment on C Strings

One interesting point: because C strings are null-terminated, you can sometimes split a string into substrings in place with zero allocations (for instance, splitting on line boundaries after reading the string in from a file).
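A rough simulation of that trick in Python, using a mutable `bytearray` as a stand-in for a C buffer. Python itself still allocates objects here, so this only illustrates the technique: the buffer is modified in place and never grows, and each substring is identified by nothing more than its start offset. The helper names are made up for this sketch.

```python
NUL = 0


def split_lines_in_place(buf: bytearray) -> list:
    """Overwrite each newline with NUL; return the start offset of each line."""
    starts = []
    line_start = 0
    for i in range(len(buf)):
        if buf[i] == ord("\n"):
            buf[i] = NUL  # the terminator reuses the byte the newline occupied
            starts.append(line_start)
            line_start = i + 1
    if line_start < len(buf):
        # Final line without a trailing newline. In C the buffer would need
        # one spare byte at the end to hold its terminator.
        starts.append(line_start)
    return starts


def c_string_at(buf: bytearray, start: int) -> bytes:
    """Read the NUL-terminated string beginning at `start`, as a C reader would."""
    end = buf.find(b"\x00", start)
    if end == -1:
        end = len(buf)
    return bytes(buf[start:end])


data = bytearray(b"alpha\nbeta\ngamma\n")
starts = split_lines_in_place(data)
lines = [c_string_at(data, s) for s in starts]
# starts == [0, 6, 11]; lines == [b"alpha", b"beta", b"gamma"]
```

A length-prefixed representation cannot do this without extra bookkeeping, since there is no room in the buffer to store a length in front of each line.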

Why Rust Didn't Implement Small String Optimization

The usage patterns of Rust make copying strings somewhat less common than in C++, so the payoff of the optimization no longer justifies the code bloat and branch-prediction cost (TODO: better summary). Comments on Reddit

UTF-16 Strings

Primarily of interest for older languages/platforms, as I don't think any recent programming language has used UTF-16.

JEP 254

This feature from Java 9 made Java strings internally use either a 1-byte-per-character representation (ISO-8859-1/Latin-1) when every character fits in one byte, or UTF-16 otherwise. Contains links to extensive performance analysis.
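The core decision can be sketched in a few lines. This is a Python illustration of the idea, not the JDK's implementation: JEP 254 describes the one-byte form as ISO-8859-1/Latin-1 and stores an encoding flag alongside the bytes, which the hypothetical `compact_encode` below mimics by returning a tag with the data.

```python
def compact_encode(s: str) -> tuple:
    """Pick the narrower of two fixed-width encodings that can hold `s`."""
    try:
        # One byte per character; only works if every code point is <= U+00FF.
        return ("latin1", s.encode("latin-1"))
    except UnicodeEncodeError:
        # Fall back to two-byte code units for everything else.
        return ("utf16", s.encode("utf-16-le"))


def compact_decode(coder: str, data: bytes) -> str:
    """Invert compact_encode using the stored tag."""
    return data.decode("latin-1" if coder == "latin1" else "utf-16-le")


# compact_encode("hello")        -> ("latin1", b"hello"), 5 bytes
# compact_encode("\u4f60\u597d") -> tagged "utf16", 4 bytes for 2 characters
```

The cost of the scheme is that every string operation has to check the tag first, which is presumably why the JEP comes with so much performance analysis.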

Slimmer and Faster Javascript Strings In Firefox

Gecko also migrated to a representation that can special-case purely Latin-1 strings. HN Comments


Swift 4 String Manifesto

Why does the size of this Python String change on a failed int conversion

Unicode caching weirdness in Python

Perl6: Unicode

How Perl6 handles normalization of input and output. By default, all strings are normalized except for filenames (which are treated as plain bytes). Strings that cannot be round-tripped through encoding/decoding use a special normalization form, NFG.
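For a feel of what normalizing on input buys you, here is a small Python sketch. Python's standard library offers NFC, NFD, NFKC and NFKD but not NFG, so NFC is used here purely as a stand-in; `same_text` is a hypothetical helper.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Without normalization the two spellings compare unequal, even though
# they render identically; after normalization they compare equal.
raw_equal = decomposed == composed
normalized_equal = same_text(decomposed, composed)
```

A language that normalizes every string on the way in gets the second behavior by default, at the price of not preserving the original bytes, which is presumably why filenames are exempted.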