Unicode

Introductions

Awesome Unicode

An introduction to Unicode, along with the best collection of resources I've seen.

The Absolute Minimum Every Software Developer Absolutely Must Know About Unicode and Character Sets (No Excuses!)

Pretty good introduction, much of which may be common sense to developers in 2018. The advice to use UCS-2 is outdated, since the article is 15 years old. I think that today you probably want to use UTF-8 for any general-purpose string if you have a choice (this is what Rust did, though not Swift).

Why Is Swift's String API So Hard?

Although the article is about Swift's string API, the first several paragraphs are a great introduction to the distinctions between characters, grapheme clusters, code points, etc.
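To make those distinctions concrete, here is a small Python sketch (not taken from the article) that counts the same string in different units. The helper name `measure` is made up for illustration. Python's standard library has no grapheme cluster segmentation, so this only shows code points, UTF-8 bytes, and UTF-16 code units; counting grapheme clusters would need a third-party library.

```python
import unicodedata


def measure(text: str) -> dict:
    """Count one string in three different units (hypothetical helper)."""
    return {
        # Python's len() counts code points, not user-perceived characters.
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is two bytes.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


# 'e' followed by U+0301 COMBINING ACUTE ACCENT: reads as one character.
decomposed = "e\u0301"
# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
# U+1F600 lies outside the Basic Multilingual Plane.
emoji = "\U0001F600"

for label, text in [("decomposed", decomposed), ("composed", composed), ("emoji", emoji)]:
    print(label, measure(text))
```

The two spellings of "é" look identical on screen but differ in every count, and the emoji is one code point that needs two UTF-16 code units.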

Dark Corners of Unicode

Introduction to a few concepts, and examples of programs and programming languages that handle Unicode very poorly.

Unipain

A talk on handling Unicode in Python (2 & 3). Some considerations are language-specific, but a lot of it translates to other contexts.

Variable Names

It's generally considered a bad thing to only allow ASCII names in your programming language. However, simply allowing all of Unicode seems like a bridge too far, because you need to be able to visually distinguish identifiers (this is also an issue in browser security). Unfortunately, I've never seen a thorough guide to allowing Unicode in identifiers without allowing shenanigans.

Getty Ritter on Whitespace Identifiers

U+2800 BRAILLE PATTERN BLANK is a character that renders as a space, but is not actually whitespace. Ruby will allow it in identifiers....
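As a quick check of the "not actually whitespace" claim, here is a Python sketch using only the standard `unicodedata` module. It says nothing about how Ruby or Haskell classify the character; it only shows what the Unicode database bundled with Python reports, and how Python's own identifier rule treats it.

```python
import unicodedata

BRAILLE_BLANK = "\u2800"

# The official character name from the Unicode database.
name = unicodedata.name(BRAILLE_BLANK)

# General category: "So" means "Symbol, other", not a separator or space.
category = unicodedata.category(BRAILLE_BLANK)

# str.isspace() follows the Unicode whitespace definition, so this is False.
is_space = BRAILLE_BLANK.isspace()

# Python rejects the character inside identifiers.
valid_identifier = ("a" + BRAILLE_BLANK + "b").isidentifier()

# Whitespace splitting does not split on it either.
pieces = ("a" + BRAILLE_BLANK + "b").split()

print(name, category, is_space, valid_identifier, pieces)
```

Because the character is a symbol rather than a space, any tool that strips or splits on whitespace will leave it in place.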

U+2800 is valid in Haskell source code, but it treats it as an operator symbol, so it looks like I'm somehow overloading whitespace here:

Fake Unicode on Unicode Identifiers

Unicode maintains recommended guidelines for use of Unicode in identifiers.

I think that these guidelines aren't enough, since they include characters that appear to be whitespace. It also seems reasonable to include some limitations on how characters from different alphabets can be combined.
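One way to picture a limitation on combining alphabets is a check that flags identifiers whose letters come from more than one script. The Python sketch below is a rough heuristic only: the standard library does not expose the Unicode Script property, so it uses the first word of each letter's character name as a stand-in. The function names `rough_scripts` and `looks_mixed` are hypothetical, and the name-based guess is wrong for some characters (for example U+00B5 MICRO SIGN).

```python
import unicodedata


def rough_scripts(identifier: str) -> set:
    """Guess the scripts used by the letters of an identifier.

    Assumption: the first word of the character name approximates the
    script (e.g. "LATIN SMALL LETTER A" -> "LATIN"). This is a heuristic,
    not the real Unicode Script property.
    """
    scripts = set()
    for ch in identifier:
        # Only look at letters; digits and underscores are shared by all scripts.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed(identifier: str) -> bool:
    """True when the letters appear to come from more than one script."""
    return len(rough_scripts(identifier)) > 1


plain = "paypal"
# Second letter is U+0430 CYRILLIC SMALL LETTER A, which looks like Latin "a".
spoofed = "p\u0430ypal"

print(plain, rough_scripts(plain), looks_mixed(plain))
print(spoofed, rough_scripts(spoofed), looks_mixed(spoofed))
```

Both strings pass Python's own `str.isidentifier()` check, which is the kind of gap a mixed-script rule would be meant to close.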

Han Unification

I don't know of any reason Han Unification is something a programmer working today needs to worry about, but eevee says she has "seen it end friendships", and it seems like an important piece of history.

Han unification - Wikipedia

One of the major controversies surrounding Unicode: how to handle glyphs that are semantically the same in CJK languages, but have different graphical representations.

Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations

2001 article on why Han unification will present a barrier to using Unicode on the internet.

Heap

WTF-8

UTF-8 superset used for working with strings from UTF-16 implementations that don't enforce the invariants of Unicode. The Rust implementation of OMG-WTF-8 Encoding is related, and has some very "interesting" diagrams.
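The invariant in question is that well-formed UTF-16 never contains an unpaired surrogate, and strict UTF-8 cannot encode one at all. The Python sketch below illustrates the problem using the standard `surrogatepass` error handler. This is not a WTF-8 implementation, and `surrogatepass` does not enforce WTF-8's additional rules; it only shows the three-byte form a lone surrogate takes when the UTF-8 bit layout is applied to it.

```python
# A lone high surrogate, as might leak out of a sloppy UTF-16 API.
lone_surrogate = "\ud800"

# Strict UTF-8 refuses to encode it.
try:
    lone_surrogate.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# The "surrogatepass" handler applies the ordinary UTF-8 bit layout anyway.
encoded = lone_surrogate.encode("utf-8", "surrogatepass")

# Strict UTF-8 also refuses to decode the resulting bytes.
try:
    encoded.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# With the same handler, the bytes round-trip back to the lone surrogate.
round_tripped = encoded.decode("utf-8", "surrogatepass")

print(strict_encode_failed, encoded, strict_decode_failed, round_tripped == lone_surrogate)
```

A superset encoding exists so that such strings can be stored and passed along without losing data or raising errors at every boundary.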

Breaking Our Latin-1 Assumptions

Overview of several scripts that break the assumptions English-speaking developers might have.

I highly recommend checking against this relatively small list of scripts the next time you are writing code that does heavy manipulation of user-provided strings.