I'm praising UTF-8 (without normalization) as a wonderful format where you
can do 99.9% of everything without ever caring about all the expensive
stuff.
But in order to do that, you really need to avoid normalization, and you
also need to accept malformed UTF-8 strings (because even if it is real
UTF-8, the string may actually be just a fragment of some larger string).
Once you do that (and _only_ if you do that), then UTF-8 is actually a
wonderful thing. You can consider it to be a traditional "everything is a
stream of bytes", and everything that only cares about a stream of bytes
will work wonderfully well.
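As a rough sketch of that byte-stream view (the helper names bytes_equal
and find_bytes are made up for illustration, they are not from git or the
kernel): comparison and searching need no decoding, no validation and no
normalization, so they behave the same on well-formed UTF-8, on fragments
and on plain binary data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte-wise equality of two NUL-terminated strings.
 * No decoding, no normalization, no validation of the contents. */
int bytes_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* Hypothetical helper: find a byte pattern inside a buffer of known length.
 * It never looks at character boundaries, so it works equally well on
 * well-formed UTF-8, on fragments of it, and on non-text data. */
const char *find_bytes(const char *hay, size_t hay_len,
                       const char *needle, size_t needle_len)
{
    if (needle_len == 0)
        return hay;
    if (needle_len > hay_len)
        return NULL;
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len) == 0)
            return hay + i;
    }
    return NULL;
}
```

Note that bytes_equal() treats a precomposed "é" (C3 A9) and "e" followed
by a combining acute accent (65 CC 81) as different strings. That is
exactly the non-normalized behaviour being argued for here: equivalence is
a separate, higher-level question.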
And then, the (actually relatively few) things that want to do things like
show things on the screen, or check for equivalence, or worry about width
of the characters, *those* can still do so.
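For the few places that do care about characters, here is a minimal
sketch of one such operation, assuming all you want is a code point count
(count_codepoints is a hypothetical name, and this is deliberately not a
validator). It counts every byte that is not a continuation byte
(10xxxxxx), so it does not fall over on malformed input or on a fragment
that starts in the middle of a sequence.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a byte buffer by counting
 * the bytes that are NOT UTF-8 continuation bytes (bit pattern 10xxxxxx).
 * Stray continuation bytes at the start of a fragment are simply skipped. */
size_t count_codepoints(const char *s, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

This says nothing about display width or about what the user perceives
as one character: "e" plus a combining accent counts as two code points
here, while the precomposed form counts as one. Code that cares about
that needs the real Unicode tables, but only that code does.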
So the beauty of UTF-8 is that you can switch between thinking of it like
just a binary blob and thinking of it like text, and everything works
(including the traditional C null-termination).
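The null-termination point can be checked mechanically. The following is
only a sketch under stated assumptions: utf8_encode is a hypothetical,
simplified encoder (it does not reject surrogates), and the second
function just brute-forces every code point from 1 to 0x10FFFF to confirm
that none of them encodes to a sequence containing a zero byte.

```c
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written (1-4), or 0 if cp is out of range.
 * Simplified for illustration: surrogates are not rejected. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if any code point other than U+0000 encodes to a sequence
 * that contains a zero byte, 0 otherwise. */
int any_nonzero_codepoint_has_zero_byte(void)
{
    unsigned char buf[4];

    for (unsigned long cp = 1; cp < 0x110000; cp++) {
        size_t n = utf8_encode(cp, buf);
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == 0)
                return 1;
        }
    }
    return 0;
}
```

Every byte of a multi-byte sequence has its high bit set, so a zero byte
can only ever mean U+0000, and strlen(), strcpy() and friends keep working
on UTF-8 unchanged.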
And yes, that was obviously the explicit design goal. It's a good thing.
Sure. And I'm not arguing against them. Knowing the rules for combining
characters is really important for input and output.
Absolutely. It's what the kernel does, and I think that's what perl does
too for their "strings". It works really well. It also allows you to
handle binary data (i.e. data that *really* isn't text) with shared
routines, etc.
And that's the beauty of non-normalized (and possibly badly formed) UTF-8.
Linus