
Common Syntactic Element
For some of the E family of languages


Obligated Parties

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

A language spec places requirements on the (authors of) source text written in that language, and on the actions of the canonical kind of "language processor", such as a compiler or interpreter, that is to interpret source text written in that language. (A syntax highlighter is an example of a non-canonical kind of language processor.) When we state that the source text MUST (or MUST NOT, SHALL, SHALL NOT, REQUIRED), unless stated otherwise, there is a corresponding requirement on the canonical kinds of language processor for that language to reject programs that violate that obligation.

When we state that the source text SHOULD (or SHOULD NOT, RECOMMENDED, NOT RECOMMENDED), unless stated otherwise, we make a corresponding recommendation that lint-like tools SHOULD be built and employed during code reviews, so that code reviewers are aware of all violations of these recommendations, or of other surprises that may cause them to misread what they are looking at.

Rationale

The E family is intended to support secure programming and adversarial code reviews. The author of the source text, the reviewers reading it to understand what it might do, and the owner of the language processor on which it runs may all be mutually suspicious parties with somewhat divergent interests. The rules below on what source text MUST and MUST NOT do are designed to enable these parties to safely cooperate by use of these source texts. If source-author Alice has a reasonable expectation that a rule violation by Alice might not be caught by code reviewer Bob or language processor owner Carol, then Alice may violate that rule, independent of the wording in standards documents such as this. To avoid confusion, we only say "MUST", etc., when the rule can be enforced by the canonical language processor.

The reverse is not the case: If Carol operates her language processor in ways that violate these rules, no one is necessarily in a position to catch her. Therefore, the obligations on her language processor are primarily for her own protection.

Dependencies and Versions

Version Dependencies

  Version                                       Dependencies
  the E language, Kernel-E, TermL 0.9 / alpha   Java 1.3.1
                                                Unicode 4.0 (NFC)
                                                W3C CharMod 1.0
                                                IETF rfc2396, rfc2119

Some of the E family of languages -- the E language, Kernel-E, and TermL -- delegate aspects of their specifications to other specifications, such as this page. This page in turn delegates to yet other specifications, such as Java, Unicode, and various IETF and W3C standards. This page currently intends to support the specifications of the E family as of the upcoming E 0.9 release (also to be known as E alpha). For each E release, the Version Dependencies table specifies the minimal version of each of these other standards that it depends on. An E program written to run on a given version of E SHOULD be compatible with the spec of that version as interpreted in terms of those minimal versions, or any later versions, of the other standards. For example, the E language 0.9 spec depends on Java 1.3.1 or later. Therefore, an E 0.9 program SHOULD be compatible with, for example, the E 0.9 spec interpreted in light of the Java 1.4 spec. Below, we generally leave out the version numbers of dependent specifications, and try to link to the latest version of each one. Please consult the Version Dependencies table for the actual correspondences.

Besides covering E 0.9, we will also explain how we hope to evolve in upcoming E releases, and discuss the potential compatibility implications.

Bytes to Raw Characters

Source text consists of a "raw sequence" of the subset of Unicode characters whose code points fit within 16 bits, which we will here call "Unicode16". When source text is obtained from a source of octets (such as a file), then these octets are to be interpreted according to UTF-8 in order to create the raw sequence of Unicode16 characters.
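As an illustration only, the octets-to-raw-characters step might be sketched in Python as follows. The function name decode_source_octets is hypothetical, and raising an error for characters outside Unicode16 is an assumption; the text above does not say how such input is to be handled.

```python
def decode_source_octets(octets: bytes) -> str:
    """Decode UTF-8 octets into a raw sequence of Unicode16 characters.

    Hypothetical helper: rejecting characters above U+FFFF is an assumption,
    not something the specification text spells out.
    """
    # bytes.decode raises UnicodeDecodeError on malformed UTF-8.
    text = octets.decode("utf-8")
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                f"character U+{ord(ch):X} at index {index} does not fit in 16 bits"
            )
    return text
```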

Rationale & Future Directions

Because we do not yet have any personal experience with non-Ascii Unicode characters at the time of this writing, we considered specifying that raw source text for 0.9 be restricted to Ascii, and therefore that all non-Ascii source characters may only be expressed using Backslash-u Decoding (see below). However, this seems like too great a burden to place on non-English-based programmers wishing to use E.

In the other direction, unless we hear otherwise from Unicode users, we expect we will wish to eventually support the full Unicode character set, rather than just Unicode16. However, at the time of this writing, a Java char is only 16 bits, so larger characters cannot yet be supported with reasonable effort. We could allow raw characters to represent UTF-16 elements, and for sequences of such to represent Unicode characters according to the UTF-16 encoding, but this destroys much of the point of making chars larger than octets in the first place. This issue shows up in the semantics of the E languages as well -- their Strings and chars are defined for now to be Unicode16 characters, but we hope to eventually allow any Unicode character.

Note that Java is considering the UTF-16 direction. If they do, we expect that E will not follow it there.

Raw to Normalized Characters

Once we have a raw sequence of Unicode16 characters (whether from UTF-8 decoding or otherwise), the following transformations are then applied, logically in order, to the raw characters prior to lexical analysis:

  1. Newline Canonicalization. The three common encodings of newline -- "\r\n", "\r", and "\n" -- are all canonicalized to "\n".

  2. Trailing Whitespace Truncation. Sequences of non-newline whitespace characters immediately preceding a newline are removed, leaving just the newline. Similarly, trailing newlines at the end of a compilation unit are replaced with a single newline.

  3. Non-Whitespace Control Character Rejection. If the source text contains control characters other than the whitespace characters ' ' (space), '\t' (tab), and '\n', then it must be statically rejected.

  4. (Not yet implemented) Non-Canonical Unicode Rejection. If the source text is not in Unicode Normalized Form C (NFC) (See also this W3C recommendation), then it must be statically rejected.

  5. Backslash-u Decoding. A '\\' (backslash) followed by 'u' and four digit16s (hex digits) is decoded into the Unicode16 character with that code point. Likewise, a "\\u{" followed by some number of digit16s and a "}" is decoded into a Unicode16 character with that code point. As with the corresponding Java spec, such interpretation is suppressed if this sequence is preceded by an odd number of backslashes. If this results in any code points that don't represent Unicode16 characters, then it must be statically rejected.

By "logically in order", we mean the result must be equivalent to what it would be if each stage were applied to the output of the previous stage, not including itself. For example, a "\r\t \r\n" sequence must be transformed into "\n\n" rather than "\n".
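The first three stages, applied strictly in order, can be sketched in Python roughly as below. All function names are hypothetical. Two assumptions are worth flagging: Python's \s class stands in for Java's isWhitespace, and "control character" is read as Unicode general category Cc.

```python
import re
import unicodedata


def canonicalize_newlines(text: str) -> str:
    # Stage 1: "\r\n" first, so it becomes one newline rather than two.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def truncate_trailing_whitespace(text: str) -> str:
    # Stage 2: drop non-newline whitespace that immediately precedes a newline.
    text = re.sub(r"[^\S\n]+\n", "\n", text)
    # Collapse the run of newlines at the very end of the unit to one newline.
    return re.sub(r"\n+\Z", "\n", text)


def reject_control_characters(text: str) -> str:
    # Stage 3: tab and newline are the only permitted control characters.
    # (Assumption: "control character" means general category Cc.)
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n":
            raise ValueError(f"control character U+{ord(ch):04X} at index {index}")
    return text


def normalize_stages_1_to_3(raw: str) -> str:
    # Each stage sees only the output of the previous stage.
    return reject_control_characters(
        truncate_trailing_whitespace(canonicalize_newlines(raw))
    )
```

With this ordering, the "\r\t \r\n" example comes out as two newlines when it occurs in the middle of a unit, because the first "\r" has already become a newline by the time whitespace is truncated. At the very end of a unit the second half of stage 2 would collapse the pair to one.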

Source positions are in terms of line and column after Newline Canonicalization and Trailing Whitespace Truncation, but before Backslash-u Decoding. The first line in a source unit is 1. The first column is 0.
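A small sketch of this position convention, assuming the text passed in has already been through Newline Canonicalization and Trailing Whitespace Truncation but not Backslash-u Decoding. The name line_and_column is hypothetical.

```python
def line_and_column(text: str, offset: int) -> tuple:
    """Map a character offset to (line, column): lines start at 1, columns at 0."""
    if not 0 <= offset <= len(text):
        raise IndexError("offset outside the source unit")
    # Every newline strictly before the offset starts a new line.
    line = text.count("\n", 0, offset) + 1
    # rfind returns -1 when there is no earlier newline, so the column is the offset.
    last_newline = text.rfind("\n", 0, offset)
    column = offset - (last_newline + 1)
    return line, column
```

Because positions are taken before Backslash-u Decoding, an escape such as \u0041 occupies six columns here, even though it later decodes to a single character.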

The E family of languages uses matched brackets to indicate nesting structure to a language processor; source text SHOULD also use indentation to signal this nesting structure to the human eye. Further, source text SHOULD avoid including any tab characters at all. Code reviewers SHOULD be alerted to violations of these recommendations.

Rationale

Line and column positions are robust in the face of Newline Canonicalization and Trailing Whitespace Truncation. If such positions are interpreted (e.g., by an IDE) on the source text after UTF-8 Decoding but prior to these steps, the results will still be sensible.

Since Ascii (meaning standard 7-bit Ascii) is a subset of UTF-8, Backslash-u Decoding allows Ascii source text to represent any sequence of Unicode16 characters.

We introduce the curly-bracketed backslash-u form to prepare for eventually allowing longer Unicode characters.
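Both backslash-u forms, together with the odd-number-of-backslashes rule, might be decoded along the following lines. This is a sketch, not the reference implementation: decode_backslash_u is a hypothetical name, treating surrogate code points (U+D800 to U+DFFF) as "not Unicode16 characters" is an assumption, and malformed escapes such as an empty "\u{}" are simply left untouched here because the text above does not say what to do with them.

```python
import re

# Matches what follows the backslash: u + 4 hex digits, or u{ + hex digits + }.
_ESCAPE_TAIL = re.compile(r"u(?:([0-9a-fA-F]{4})|\{([0-9a-fA-F]+)\})")


def decode_backslash_u(text: str) -> str:
    out = []
    i = 0
    backslashes = 0  # raw backslashes immediately preceding position i
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Only a backslash preceded by an even count may start an escape.
            match = _ESCAPE_TAIL.match(text, i + 1) if backslashes % 2 == 0 else None
            if match:
                code_point = int(match.group(1) or match.group(2), 16)
                # Assumption: surrogates do not count as Unicode16 characters.
                if code_point > 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
                    raise ValueError(f"U+{code_point:X} is not a Unicode16 character")
                out.append(chr(code_point))
                i = match.end()
                backslashes = 0
                continue
            backslashes += 1
        else:
            backslashes = 0
        out.append(ch)
        i += 1
    return "".join(out)
```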

Making Sources Wysiwyg

We wish to make sources as wysiwyg as possible, in order to support adversarial code reviews.

Trailing Whitespace Truncation can only be significant within quoted literal character sequences, and helps ensure that these character sequences are wysiwyg -- that their visual depiction is an accurate portrayal of their contents.

We reject non-NFC-canonical Unicode sequences, rather than perform NFC canonicalization, to avoid the danger of silently changing the intended meaning of the source text. When such normalization is desired, it's easy enough to apply a separate tool to perform such normalization as an edit to the source text.

Backslash-u Decoding occurs after Non-Canonical Unicode Rejection so that a normalized source character sequence may be essentially any sequence of Unicode16 characters, but non-canonical sequences can only be expressed by the visible use of the backslash-u.
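The reject-rather-than-normalize policy can be sketched with Python's standard unicodedata module. Note the assumptions: reject_non_nfc is a hypothetical name, and Python applies the normalization tables of whichever Unicode version it bundles, not necessarily Unicode 4.0.

```python
import unicodedata


def reject_non_nfc(text: str) -> str:
    """Return the text unchanged if it is already in NFC; otherwise refuse it."""
    # Compare against the normalized form instead of silently substituting it.
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("source text is not in Unicode Normalized Form C")
    return text
```

Because this check runs before Backslash-u Decoding, a source containing the six visible characters \u0301 after an "e" passes: at this stage it is plain Ascii. The raw combining sequence "e" followed by U+0301 is refused.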

These rules by themselves are not sufficient to make sources wysiwyg.

  • The NFC canonicalization rules may leave two different strings that appear to be the same to the untrained eye, or depending on the choice of font. As such ambiguities are noticed, we recommend building lint-like tools to spot these and alerting reviewers. For example, we recommend that source text, prior to Backslash-u Decoding, SHOULD be in the narrower Unicode Normalized Form KC (NFKC), and that reviewers should check whether this is the case prior to conducting a review. (However, we only enforce the less strict NFC normalization, because NFKC is too severe for case sensitive languages.)

  • The remaining whitespace difference between ' ' (space) and '\t' (tab) is not visible. Rather than reject sources that contain harmless tabs, we defer this issue to the tokenization rules. The tokenization rules MUST reject tabs where they can be semantically significant, such as within literal character sequences (i.e., between quote marks or within DocComments).
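A lint-like check for the two concerns above could look roughly like this. It is a sketch under assumptions: review_warnings is a hypothetical name, the input is source text prior to Backslash-u Decoding with newlines already canonicalized, and the findings are reported as warnings for the reviewer, not as rejections.

```python
import unicodedata


def review_warnings(text: str) -> list:
    """Return (line number, message) pairs for things a reviewer should look at."""
    warnings = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        # NFKC folds compatibility characters (ligatures, fullwidth forms, ...).
        if unicodedata.normalize("NFKC", line) != line:
            warnings.append((line_number, "not in NFKC"))
        # Tabs are indistinguishable from spaces on screen.
        if "\t" in line:
            warnings.append((line_number, "contains a tab"))
    return warnings
```

For instance, a line containing the single-character ligature U+FB01 is valid NFC but is flagged here, since NFKC would rewrite it as the two letters "fi".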

Character Classes

Name         ::=   Definition                          Ascii subset
hspace       ::=   !'\n' isWhitespace                  ' ' | '\t'
whitespace   ::=   isWhitespace                        ' ' | '\t' | '\n'
digit10      ::=   isDigit                             '0'..'9'
digit8       ::=                                       '0'..'7'
digit16      ::=                                       '0'..'9' | 'a'..'f' | 'A'..'F'
uric         ::=   IETF-URICs | '\\' | '|' | '#'       'a'..'z' | 'A'..'Z' | '0'..'9'
                                                       | anyof("_$.-;/?:@&=+,!~*'()%\\|#")

The non-Ascii cases are not yet tested in the current implementations.

In the uric production, each '\\' (backslash) character is converted to '/', and each '|' (vertical bar) character is converted to ':'. Therefore, the possible semantic values associated with this production do not include the backslash or vertical bar characters.
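For the Ascii subset, the uric conversion might be sketched as follows. The names URIC_ASCII and uric_value are hypothetical, and raising an error on characters outside the class is an assumption made for the sake of the example.

```python
import string

# The Ascii subset of uric, transcribed from the Character Classes table.
URIC_ASCII = set(string.ascii_letters + string.digits + "_$.-;/?:@&=+,!~*'()%\\|#")


def uric_value(text: str) -> str:
    """Check that every character is a uric, then apply the two conversions."""
    for index, ch in enumerate(text):
        if ch not in URIC_ASCII:
            raise ValueError(f"{ch!r} at index {index} is not a uric character")
    # Backslash becomes '/', vertical bar becomes ':'.
    return text.replace("\\", "/").replace("|", ":")
```

After conversion the result can never contain a backslash or a vertical bar, which matches the statement above about the possible semantic values.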

Token Types

In the following table, the names with initial capitals (Integer, Real64, Char, and String) are the token types. The others are supporting productions.

Name        ::=   Definition                                       Denotes

digit10s    ::=   digit10 ('_'? digit10)*

Integer     ::=     '-'? '0' 'x' digit16 ('_'? digit16)*           Precision-unlimited integer.
                  | '-'? '0' ('_'? digit8)*  # not yet implemented
                  / '-'? digit10s

wholePart   ::=   '-'? digit10s

fraction    ::=   '.' digit10s

exponent    ::=   ('e' | 'E') '-'? digit10s

Real64      ::=     wholePart fraction exponent?                   A real number representable in
                  | wholePart fraction? exponent                   IEEE double precision.

charConst   ::=     '\\' anyof("btnfr\"'\\")
                  | '\\' 'x' digit16*2  # not yet implemented
                  | !'\'' !'"' .

Char        ::=   '\'' (charConst | '"') '\''                      A Unicode character.

String      ::=   '"' (charConst                                   A string of Unicode characters.
                       | '\''
                       | '\\' '\n'
                      )* '"'

In a literal string, a backslash followed by a newline is ignored -- the backslash eats the newline.
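To illustrate how a backslash eats a newline, here is a sketch that computes the value of the text between the double quotes of a String token. The name string_literal_value is hypothetical. The not-yet-implemented \x form is omitted, and rejecting an unrecognized escape is an assumption, since the grammar above could also be read as letting a lone backslash stand for itself.

```python
_SIMPLE_ESCAPES = {
    "b": "\b", "t": "\t", "n": "\n", "f": "\f", "r": "\r",
    '"': '"', "'": "'", "\\": "\\",
}


def string_literal_value(body: str) -> str:
    """Interpret the characters between the quotes of a String token."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '"':
            raise ValueError("unescaped double quote inside a string body")
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("backslash at end of string body")
        escaped = body[i + 1]
        if escaped == "\n":
            pass  # backslash-newline contributes nothing to the value
        elif escaped in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[escaped])
        else:
            # Assumption: unknown escapes are rejected in this sketch.
            raise ValueError(f"unrecognized escape \\{escaped}")
        i += 2
    return "".join(out)
```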

Note that Real64 includes both 0.0 and -0.0. These are distinct, even though they represent the same real number.
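The distinction between the two zeros is visible in any IEEE double implementation. The Python sketch below only shows that the bit patterns differ even though numeric comparison treats the values as equal; it makes no claim about how the E languages themselves compare them. The name same_real64_bits is hypothetical.

```python
import math
import struct


def same_real64_bits(a: float, b: float) -> bool:
    """True when two doubles have identical IEEE 754 bit patterns."""
    return struct.pack(">d", a) == struct.pack(">d", b)


# Numerically equal, yet distinguishable by sign bit.
zeros_compare_equal = (0.0 == -0.0)
zeros_share_bits = same_real64_bits(0.0, -0.0)
sign_of_negative_zero = math.copysign(1.0, -0.0)
```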

Rationale

We allow '_' (underbar) characters within digit sequences so that long digit sequences can be broken up for readability. For example, the number of cents in 1.3 million dollars can be written as "1_300_000_00". (Perl also allows underscores in numeric literals.)
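The underbar rule (at most one '_' between adjacent digits, never leading or trailing) can be checked with a regular expression. The sketch below covers the hex and decimal alternatives of Integer only. The names are hypothetical, the not-yet-implemented octal form is left out, and so a leading zero is read as decimal here, which is an assumption.

```python
import re

DIGIT10S = r"[0-9](?:_?[0-9])*"
DIGIT16S = r"[0-9a-fA-F](?:_?[0-9a-fA-F])*"
INTEGER = re.compile(rf"-?(?:0x{DIGIT16S}|{DIGIT10S})")


def integer_value(token: str) -> int:
    """Return the value of an Integer token (hex or decimal forms only)."""
    if not INTEGER.fullmatch(token):
        raise ValueError(f"not an Integer token: {token!r}")
    sign = -1 if token.startswith("-") else 1
    digits = token.lstrip("-").replace("_", "")
    if digits.startswith("0x"):
        return sign * int(digits[2:], 16)
    # Python ints are arbitrary precision, matching "precision-unlimited".
    return sign * int(digits, 10)
```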

For convenience, we allow but do not require single quotes to be escaped in double quoted literals, and vice versa.

For convenience, we allow multi-line string literals without per-line delimiters, even though reviewers can become confused about what they're looking at. Syntax highlighting SHOULD be used to make literals visibly distinct from non-literal source text during reviews.

 
 