2010/04/30

Character Sets

The character set used for Java programming is Unicode, whereas the Mjolner compiler allows only ASCII characters.  The ASCII character set is embedded in Unicode, as is the Latin-1 (ISO 8859-1) character set, so that upward compatibility is maintained.  Unicode allows non-English-speaking programmers to write identifier names, comments, and text strings in their own languages.  Loki will extend this ability to Beta programmers as well.  A trivial change to the Mjolner compiler (not involving extending it to Unicode!) will permit easy interchange between Mjolner and Loki Beta programs.

The Java compiler accepts programs in one of two transformation formats: UTF-8 and Unicode escape mode.  Both of these have the useful property that ASCII characters are represented by themselves, so that ASCII-only programs are immediately compatible.  UTF-8 is sufficiently documented elsewhere, and I will simply say that Loki will accept it.

Unicode escape mode is more interesting.  Most characters outside the ASCII range are represented by an escape sequence "\uxxxx" where "xxxx" is four hexadecimal digits.  (Some characters are represented by two consecutive escape sequences.)  These sequences are interpreted immediately on reading in the source code, and thus they may be used anywhere: in identifiers, comments, or strings.  It is legal in Java to use values of "xxxx" that represent an ASCII character (0000-007f), but I propose to forbid this usage in Beta code.  To the Java compiler, "\u002c" is equivalent to a comma in every way: it can be used to separate arguments in a method call or for any other purpose.  Such usage does nothing but confuse the reader.
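
To make the proposed rule concrete, here is a minimal sketch in Java of a pre-pass that expands escape sequences before any other lexing and rejects those in the ASCII range.  The class name UnicodeEscapes and the method expand are hypothetical, invented for illustration; this is not Loki's or the Java compiler's actual code.  It also simplifies the real Java rules: it assumes exactly one "u" after the backslash and does not treat a doubled backslash specially.

```java
// Hypothetical helper, for illustration only; not part of Loki or of javac.
final class UnicodeEscapes {

    // Expands each backslash-u escape (four hex digits) into the UTF-16 code
    // unit it names. Escapes in the ASCII range 0000-007f are rejected, as
    // proposed for Beta code. Simplifying assumptions: a single 'u' per
    // escape, and no special handling of a doubled backslash.
    static String expand(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean isEscape = c == '\\'
                    && i + 1 < src.length()
                    && src.charAt(i + 1) == 'u';
            if (!isEscape) {
                out.append(c);          // ordinary characters stand for themselves
                i++;
                continue;
            }
            if (i + 6 > src.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int value = 0;
            for (int k = i + 2; k < i + 6; k++) {
                int digit = Character.digit(src.charAt(k), 16);
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + k);
                }
                value = value * 16 + digit;
            }
            if (value <= 0x7f) {
                throw new IllegalArgumentException("ASCII escape forbidden at index " + i);
            }
            out.append((char) value);   // one escape yields one UTF-16 code unit
            i += 6;
        }
        return out.toString();
    }
}
```

Because each escape yields a single UTF-16 code unit, a character outside the 16-bit range arrives as two consecutive escapes (a surrogate pair) and is appended as two code units.  Under this sketch, an argument list written with "\u002c" in place of the comma is rejected with an exception rather than silently accepted.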
