
Representing Characters


This page can be safely skipped by readers concerned only with ASCII source texts.

Background

Because, at the time of this writing, we have no personal experience with non-ASCII Unicode characters, we considered specifying that source text for E 0.9 be restricted to ASCII, and therefore that all non-ASCII source characters may only be expressed using Backslash-u Decoding. However, this would be too great a burden on non-English-based programmers wishing to use E. Instead, we obtain a very similar effect indirectly.

The following text from the Java Language Specification (the JLS) effectively defines a character encoding form for Unicode:

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u -- for example, \uxxxx becomes \uuxxxx -- while simultaneously converting non-ASCII characters in the source text to a \uxxxx escape containing a single u.

The JLS defines a Unicode escape effectively as

'\\' 'u'+ <hexDigit> <hexDigit> <hexDigit> <hexDigit>

where the total number of backslashes (if any) immediately preceding this sequence is even.

[Spec] The following Bug MUST be fixed: What if a non-ASCII character occurs immediately after an odd number of backslashes? The above encoding will place a Unicode escape sequence immediately after this odd number of backslashes, so the escape's own backslash is preceded by an odd number of backslashes, and the sequence will therefore no longer be considered an actual Unicode escape. Is this also a bug in the JLS?

Written out at one byte per resulting ASCII character, this encoding form also defines a character encoding scheme. We call this encoding form/scheme UTF-J2, since the Unicode escape defined above can only represent a 16 bit (2 byte) code point. The same section of the JLS also defines two ways of decoding such text back into a sequence of 16 bit code points. The first reverses the above encoding with no loss of information:

The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

The other decoding method simply decodes each Unicode escape into the Unicode code point it encodes. The first decoding method would be used to preserve appearance of the source to those using Unicode editors and mixing Unicode characters with Unicode escape sequences. We call this first decoding method a UTF-J2 presentational decode, and consider it no further. The second would be used prior to all other forms of further processing, which we call simply a UTF-J2 decode.
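
For concreteness only, the following Java sketch shows what such a UTF-J2 decode amounts to; the method name is ours, error handling for malformed escapes is omitted, and nothing here is part of the spec.

    // Sketch of a UTF-J2 decode (illustrative, not normative).  Each
    // Unicode escape -- a backslash preceded by an even number of
    // backslashes, then one or more u's, then four hex digits -- is
    // replaced by the 16 bit code point it encodes.
    static String utfJ2Decode(String ascii) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < ascii.length()) {
            if (ascii.charAt(i) != '\\') {
                out.append(ascii.charAt(i));
                i += 1;
                continue;
            }
            // Count the run of backslashes starting at i.
            int j = i;
            while (j < ascii.length() && ascii.charAt(j) == '\\') { j += 1; }
            boolean escape = (j - i) % 2 == 1
                && j < ascii.length() && ascii.charAt(j) == 'u';
            if (!escape) {
                out.append(ascii, i, j);    // ordinary backslashes
                i = j;
                continue;
            }
            out.append(ascii, i, j - 1);    // the even prefix of backslashes
            int k = j;
            while (k < ascii.length() && ascii.charAt(k) == 'u') { k += 1; }
            // Malformed escapes (fewer than four hex digits) will throw here.
            out.append((char) Integer.parseInt(ascii.substring(k, k + 4), 16));
            i = k + 4;
        }
        return out.toString();
    }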

UTF-J4

To handle any Unicode character, we extend the above scheme by defining a Unicode escape to be a sequence of characters accepted either by the above pattern, or:

'\\' 'u'+ '{' '0' 'x' <hexDigit>+ '}'

We call this extended encoding scheme UTF-J4. A UTF-J4 encode, when generating a Unicode escape for a non-ASCII code point, SHOULD always use the first form for 16 bit code points, and SHOULD always use the shortest encoding in the second form for supplementary characters.
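
For illustration only (a Java sketch; the method name is ours and not part of any API), a UTF-J4 encoder obeying these two RECOMMENDATIONS could render a single non-ASCII code point as follows:

    // Sketch: render one non-ASCII code point as a UTF-J4 Unicode escape,
    // using the four-hex-digit form for BMP code points and the shortest
    // curly-brace form for supplementary code points.
    static String escapeCodePoint(int codePoint) {
        if (codePoint <= 0xFFFF) {
            return String.format("\\u%04x", codePoint);
        } else {
            return String.format("\\u{0x%x}", codePoint);
        }
    }

A full UTF-J4 encode would also add an extra u to each Unicode escape already present in the source, exactly as in the JLS transformation quoted above.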

Rationale: Pleasing Regularities

In the second form of Unicode escape, we include the '0' 'x' prefix so the string between the curlies will appear to be a numeric literal. This leaves us open to eventually allowing, for example, a character name to appear between the curlies instead of a hex code point.

For purposes of specification, we suppose the following functions:

  • utfJ4Encode(CodePoint[]) -> AsciiByte[]
  • utfJ4Decode(CodePoint[]) -> CodePoint[]
  • utf8Decode(UTF8Byte[]) -> CodePoint[]

[Src] The octet sequence input to utf8Decode MAY begin with the UTF-8 BOM sequence 0xEF 0xBB 0xBF, which utf8Decode MUST skip.

Since ASCII is the 7-bit byte subset of both UTF-8 code units and Unicode code points, we consider AsciiByte[] to be a subtype of both CodePoint[] and UTF8Byte[].

For all sequences of Unicode code points u:

    utfJ4Decode(u) == utfJ4Decode(utfJ4Encode(u))
                   == utfJ4Decode(utfJ4Encode(utfJ4Encode(u)))
                   == ... # and so on, for any number of UTF-J4 encodings prior to the UTF-J4 decode.

Therefore, given that we're going to do a utfJ4Decode prior to further processing, we don't care whether our input is the true source, or is a UTF-J4 encoding of the source. (If we change the spec below to track source positions on one of the representations prior to the utfJ4Decode, then these alternatives would no longer be strictly equivalent, so under some circumstances we would care.)
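
For example, given any implementations of the supposed functions (here sketched in Java, with Strings standing in for code point sequences; utfJ4Encode and utfJ4Decode are the names supposed above, not an existing library), the equivalence can be spot-checked directly:

    // Sketch: any number of UTF-J4 encodes prior to the single UTF-J4
    // decode yields the same result as decoding the original directly.
    static void spotCheckEquivalence(String source) {
        String direct = utfJ4Decode(source);
        String once   = utfJ4Encode(source);
        String twice  = utfJ4Encode(once);
        assert utfJ4Decode(once).equals(direct);
        assert utfJ4Decode(twice).equals(direct);
    }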

From Bytes (Octets) to Source Text

  • [*] When f is a sequence of octets to be decoded into source (such as the contents of a file), f MUST be double-decoded as follows:
    sourceText == utfJ4Decode(utf8Decode(f))

The double-decode above yields the same result as

utfJ4Decode(utfJ4Encode(utf8Decode(f))). 

If f is in ASCII, then utfJ4Decode(f) also yields the same result.
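
A Java sketch of this double-decode (illustrative only; utfJ4Decode is the function supposed above, not an existing API), including the optional UTF-8 BOM skip that utf8Decode MUST perform:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Sketch: decode a file's octets into source text by a UTF-8 decode
    // (skipping an initial BOM, if present) followed by a UTF-J4 decode.
    static String readSourceText(Path file) throws java.io.IOException {
        byte[] f = Files.readAllBytes(file);
        int start = 0;
        if (f.length >= 3 && (f[0] & 0xFF) == 0xEF
                          && (f[1] & 0xFF) == 0xBB
                          && (f[2] & 0xFF) == 0xBF) {
            start = 3;                                   // skip the UTF-8 BOM
        }
        String text = new String(f, start, f.length - start,
                                 StandardCharsets.UTF_8);
        return utfJ4Decode(text);                        // supposed above, not a real API
    }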

  • [Src] When f is a sequence of octets to be decoded into source, utfJ4Encode(utf8Decode(f)) SHOULD be in Wysiwyg-ASCII Format.

  • [Src] When a source language's grammar uses matched brackets to indicate nesting structure, source text in this language SHOULD use spaces for indentation to signal this nesting structure accurately to the human eye. Further, source text SHOULD NOT include any tab characters at all.

  • When rendering text in a fixed width font, tab characters SHOULD be rendered as whitespace extending to the next modulo-8 tab stop, as sketched below.

[Advisor] An advisor therefore SHOULD alert reviewers to violations of the above Src RECOMMENDATIONS.
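
The tab-rendering rule above can be made concrete with a small Java sketch; the method name is ours and the code is only illustrative:

    // Sketch: expand each tab to spaces up to the next modulo-8 tab stop,
    // as a fixed width renderer should display it.
    static String expandTabs(String line) {
        StringBuilder out = new StringBuilder();
        int col = 0;
        for (char c : line.toCharArray()) {
            if (c == '\t') {
                do { out.append(' '); col += 1; } while (col % 8 != 0);
            } else {
                out.append(c);
                col += 1;
            }
        }
        return out.toString();
    }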

Rationale: Adversarial Code Reviews

Depending on the density of Unicode escape sequences, the UTF-J4 encoding of the source may or may not be adequately readable for a review. If this format is adequately readable, reviewers are advised to look at a rendering of this encoding in a font in which ASCII printing characters may be easily distinguished. For example, the following are distinct ASCII printing characters, and should each be unambiguously recognizable:

1l|!oO0`'

If Raven the reviewer is looking at a readable UTF-J4 encoding of conforming sources in Wysiwyg-ASCII format, in a font in which all ASCII printing characters are unambiguously recognizable, then Raven has grounds for some confidence that the appearance of the text encodes all the meaning of the text as it will be interpreted by a conforming language processor. Of course, Arthur the author can still write code that will confuse Raven the reviewer. But we hope we've made it hard for Arthur to also confuse Raven about whether she's confused. If Raven knows she's confused, she can simply reject Arthur's code.

Newline Canonicalization

Once we have source text that passes the above checks, the following transformations are then applied, logically in order, to create the source text used for lexical analysis:

  1. MS-DOS Newline Canonicalization. All occurrences of the sequence '\r' '\n' (or CRLF) are replaced with '\n' (LF).

  2. Mac OS <= 9 Newline Canonicalization. All remaining occurrences of '\r' (or CR) are replaced with '\n' (LF).

  3. Line and Column Numbering. Line and column numbers designate positions in the source text after the above steps. The first line is line number 1. The first column is column number 0.

(The JLS also says that newline canonicalization happens after interpreting Unicode escapes. Is this really true? It seems silly, but I'd rather follow Java's lead on this than to try reversing the order. What does Java do about source positions? Does it say anywhere?)
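
Steps 1 and 2 amount to the following Java sketch (the method name is ours):

    // Sketch: canonicalize newlines prior to lexical analysis.
    // Step 1: replace each CRLF with LF.
    // Step 2: replace each remaining CR with LF.
    static String canonicalizeNewlines(String sourceText) {
        return sourceText.replace("\r\n", "\n").replace("\r", "\n");
    }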

Only BMP Characters

  • [Src] Following the above double-decode, the source text MUST consist only of a sequence of Unicode encoded characters.

  • [Src] As of E 0.9, source text MUST contain only BMP characters, i.e., only those Unicode encoded characters whose code points fit within 16 bits. (From this, it would seem that UCS-2 characters might be what I mean, but I'm not sure.)

(Is this too strict? Should we say instead only that source text MUST contain only 16-bit code points and MUST NOT contain surrogate code points? Should we demote the other RULES to RECOMMENDATIONS? That would seem to be the minimal restriction needed to satisfy the following issue.)

Rationale: Indecision is the mother of convention

Unicode has had a complex but understandable history. As of the Unicode 3.0 standard or so, it was thought that Unicode could fit all the world's characters into a 16 bit character set. Based on this, the Java and Python languages defined a "char" as 16 bits. Java provided good support for handling Unicode, and became a leading platform for developing Unicode-ready software. Unfortunately, the Unicode consortium found that 16 bits was too tight, and expanded Unicode into a 21-bit character set. It was then unclear what to do about legacy formerly-Unicode-ready libraries. The litmus test is indexing: How does one interpret a source position? What is the counting unit for determining the length of a string?

Java further defines and uses "Modified UTF-8" rather than standard UTF-8. In Java's Modified UTF-8, a supplementary character is represented by UTF-8 encoding each of the surrogate code points in the UTF-16 encoding of the character. This is explicitly forbidden by the Unicode spec (D36):

Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.

We thank David Hopwood for pointing this out.

Currently, the dominant approaches are:

  • The XPath and Python way (see PEP 0263, PEP 261): A counting unit is a Unicode encoded character.

  • The DOM and Java 1.5 way: A counting unit is a UTF-16 code unit. A Java char no longer represents a character -- it represents a UTF-16 code unit.

  • IBM's ICU library supports both, although it's heavily biased towards the Java way.

Although the XPath and Python approach is clearly more right (and is recommended by CharMod), we wish to postpone choosing sides until it's clear who the winner will be. Therefore

  • [Spec] The E 0.9 specs MUST be downward compatible from any of the above choices.

  • [Producer][Validator][Advisor] Until a decision is made, programs written to handle text SHOULD be compatible with any of these choices being made in the future.

The E 0.9 requirement that the source text MUST contain only BMP characters implies that it MUST NOT contain any

  • supplementary characters -- characters whose code points are larger than 16 bits, i.e., are in the range 0x1_0000..0x10_FFFF.

  • surrogate code points -- code points in the range 0xD800 through 0xDFFF. The general category of these is "Cs".

  • undesignated code points -- also called reserved or unassigned code points. These are either noncharacters, or code points whose interpretation is not yet specified as of that version of Unicode. The general category of these is "Cn".

  • private-use code points -- those whose interpretation will not be specified by the Unicode consortium. The general category for these is "Co".

A validator MUST therefore statically reject source text containing code points that are not encodings of BMP characters.

? pragma.syntax("0.8")

? def makeChar := <import:java.lang.makeCharacter>

? def isBMPChar(codePoint :(0..0x10_FFFF)) :boolean {
>     # If it's not in the range 0..0x10_FFFF, then it's not a valid Unicode code point
> 
>     if (codePoint > 0xFFFF) {
>         # If it's larger than 0xFFFF, then it's a supplementary code point
>         # rather than a BMP code point.
>         return false
>     }
>     # If it's in the BMP, then, even in E 0.9, we can convert it to a char.
>     def ch := codePoint.asChar()
>     def cat := ch.getCategory()
> 
>     # If its general category isn't SURROGATE (Cs) or UNASSIGNED (Cn),
>     # does that mean it must be a Unicode encoded character?
>     # What about Private Use (Co)?
>     return !(["Cs", "Cn", "Co"].contains(cat))
> }
# value: <isBMPChar>

Source Text SHOULD be in NFC

[Src] Source text SHOULD conform to CharMod and CharNorm. In particular, it SHOULD be in Unicode Normalization Form C (NFC), and SHOULD NOT contain Characters not Suitable for use With Markup.

(Should we further recommend that source text be include-normalized or fully-normalized? What would these mean in this context?)
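
As an illustration of the NFC recommendation, here is a Java sketch using java.text.Normalizer; it checks NFC only and says nothing about the stronger CharMod notions asked about above:

    import java.text.Normalizer;

    // Sketch: verify that source text is in Normalization Form C, and
    // normalize it if it is not.
    static String toNFC(String sourceText) {
        if (Normalizer.isNormalized(sourceText, Normalizer.Form.NFC)) {
            return sourceText;
        }
        return Normalizer.normalize(sourceText, Normalizer.Form.NFC);
    }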

Rationale: Caught in the Web

E is a distributed programming language, and E code is often mobile code. It can therefore be considered a kind of web content, even though it is not a kind of markup. For possible ease of integration with other tools, and to reduce the number of cases those tools must handle, it would be good to stay within the W3C's character model.

 
Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.