String Representations and String APIs
See also Unicode.
The underlying data representation and the appropriate APIs for handling strings.
Overview of approaches to implementing strings. Primarily covers the underlying data representation, without going deep on public APIs or the handling of Unicode.
Argues that the complexity of Swift's string types is necessary. In different contexts, you may want to work with a sequence of bytes, code points or grapheme clusters.
Swift strings have no canonical representation, but offer different views. The easiest way to treat them is as a sequence of characters (grapheme clusters), but it's not the only way. Several consequences fall out of this: O(1) indexing is not possible, and the Swift APIs are designed to make that obvious.
(Side note: some of the grapheme clusters in this article won't display properly, depending on your OS/browser.)
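To make the "different views" idea concrete, here is a minimal Python sketch (not Swift, and not taken from the article) that measures the same text as UTF-8 bytes, UTF-16 code units, and code points. The helper name `view_sizes` is made up for illustration. Python's standard library does not segment grapheme clusters, so that count is only given in comments.

```python
def view_sizes(s):
    """Return the size of `s` under three different views."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        # UTF-16 code units are 2 bytes each; "-le" avoids emitting a BOM.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        # Python's len() counts code points, not grapheme clusters.
        "code_points": len(s),
    }

# 'e' followed by COMBINING ACUTE ACCENT: one grapheme cluster on screen.
accented = "e\u0301"

# THUMBS UP SIGN followed by a skin-tone modifier: also one grapheme cluster.
thumbs_up = "\U0001F44D\U0001F3FD"

for text in (accented, thumbs_up):
    print(view_sizes(text))
```

None of the three numbers matches what a reader would count as "characters" (one in both cases), which is roughly the case for exposing the views separately.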
Philosophically, I dislike the "strings are just arbitrary bytes" approach of Go, but I haven't done enough Go to determine whether it's problematic in practice. Even in practice, opinions seem to differ based on use case: there are individuals who swear Python 2's treatment of bytes vs. strings is ok, or even better than Python 3's, arguably because they dealt with a subset of problems where they were ok ignoring the distinction.
The PDP-11 had a special assembly instruction for dealing with null-terminated strings.
Relitigating history: the argument that null-termination saves memory doesn't make a lot of sense to me. I would think you could have used a variable-length representation for the length (with 1 bit indicating whether the length is 128 or more), giving you an overhead of 1 byte for most strings (which equals the overhead of the null byte), and thereby only incurring an overhead of <1% on large strings. That would require a few extra cycles, but null-terminated strings already require extra cycles to get the length.
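One way to read the variable-length idea is a LEB128-style prefix: 7 bits of length per byte, with the high bit meaning "more bytes follow". That specific encoding is an assumption on my part, and the function names below are hypothetical. A rough sketch:

```python
def encode_length(n):
    """Encode a non-negative length, 7 bits per byte, low-order group first.
    The high bit of each byte is set when more bytes follow."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit set
        else:
            out.append(low7)         # final byte, continuation bit clear
            return bytes(out)

def decode_length(buf):
    """Return (length, number of prefix bytes consumed)."""
    n = 0
    shift = 0
    for i, b in enumerate(buf):
        n |= (b & 0x7F) << shift
        if not (b & 0x80):
            return n, i + 1
        shift += 7
    raise ValueError("truncated length prefix")

def pack(payload):
    """Prefix `payload` (bytes) with its encoded length."""
    return encode_length(len(payload)) + payload

def unpack(buf):
    """Inverse of pack(): read the prefix, then slice out the payload."""
    n, k = decode_length(buf)
    return buf[k:k + n]
```

Under this scheme, strings shorter than 128 bytes pay 1 byte (the same as a terminator), strings up to 16,383 bytes pay 2, and a 1,000,000-byte string pays 3.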
One interesting point: because C strings are null-terminated, you can sometimes split a string into substrings in place with zero allocations (for instance, splitting on line boundaries after reading the string in from a file).
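The trick is easiest to see in C, but it can be imitated in Python with a mutable `bytearray`: overwrite each newline with a NUL byte and remember where each line starts. This is only an illustration of the idea, assuming the buffer already ends in a NUL as a C string would. The function names are made up, and Python still allocates the list of offsets, which C code would also need if it kept an array of pointers.

```python
def split_lines_in_place(buf):
    """Overwrite every b'\\n' in `buf` with NUL; return the start offset of each line."""
    starts = [0]
    for i in range(len(buf)):
        if buf[i] == 0x0A:        # '\n'
            buf[i] = 0            # terminate the current line in place
            starts.append(i + 1)  # the next line begins right after it
    return starts

def c_string_at(buf, start):
    """Return a zero-copy view of the NUL-terminated string beginning at `start`."""
    end = buf.index(0, start)     # scan forward to the terminator, like strlen()
    return memoryview(buf)[start:end]

# Pretend this came from a file read, with a trailing NUL as in C.
data = bytearray(b"alpha\nbeta\ngamma\x00")
offsets = split_lines_in_place(data)
lines = [bytes(c_string_at(data, off)) for off in offsets]
```

The string contents are never copied during the split: every "substring" is just an offset into the original buffer, and the copy in the last line exists only to make the result easy to print.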
The usage patterns of Rust make copying strings somewhat less common than in C++, and therefore the payoff of the optimization is no longer worth the code bloat and branch prediction cost (TODO: better summary). Comments on Reddit
Primarily of interest for older languages/platforms, as I don't think any recent programming language has used UTF-16.
This feature from Java 9 made Java strings internally use either a 1-byte-per-character representation (Latin-1) for strings that fit in it, or UTF-16 otherwise. Contains links to extensive performance analysis.
Gecko also migrated to a representation which can special case purely Latin1 strings. HN Comments
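The special-casing of Latin1 can be sketched in a few lines: store one byte per character when every code point is below U+0100, and fall back to UTF-16 otherwise. This is a toy model in Python, not how Java or Gecko actually implement it. The `coder` tag and the function names are assumptions for illustration.

```python
def compact_encode(s):
    """Return (coder, payload): one byte per character when every code point
    fits in Latin-1 (U+0000..U+00FF), UTF-16 otherwise."""
    if all(ord(ch) < 0x100 for ch in s):
        return "latin1", s.encode("latin-1")
    return "utf16", s.encode("utf-16-le")

def compact_decode(coder, payload):
    """Inverse of compact_encode()."""
    encoding = "latin-1" if coder == "latin1" else "utf-16-le"
    return payload.decode(encoding)

for text in ("hello", "h\u00e9llo", "h\u20acllo"):
    coder, payload = compact_encode(text)
    print(coder, len(payload))
```

A single character outside Latin-1 (here U+20AC, the euro sign) doubles the storage for the whole string, so the payoff depends on how much of a workload's text stays within that range.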
Unicode caching weirdness in Python
How Perl6 handles normalization of input and output. By default, all strings are normalized except for filenames (which are treated as plain bytes). Strings that cannot be round-tripped through encoding/decoding use a special normalization form, NFG.
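A small Python example (using the standard `unicodedata` module, not Perl 6) of why normalizing on input is lossy at the byte level, which is presumably the reason for exempting filenames: two spellings of the same visible text normalize to one form, so the original byte sequence cannot be recovered afterwards.

```python
import unicodedata

decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

# As code point sequences the two differ, although they render identically.
print(decomposed == composed)

# NFC normalization maps both spellings to the precomposed form.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)

# The UTF-8 bytes differ before normalization, so a file created under the
# decomposed name would not be found by a lookup using the normalized bytes.
original_bytes = decomposed.encode("utf-8")
normalized_bytes = normalized.encode("utf-8")
print(len(original_bytes), len(normalized_bytes))
```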