Common Syntactic Elements
For some of the E family of languages
Obligated Parties
The key words "MUST", "MUST NOT",
"REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described
in RFC 2119.
A language spec places requirements on the (authors of)
source text written in that language, and on the actions of the canonical
kind of "language processor", such as a compiler or interpreter,
that is to interpret source text written in that language. (A syntax highlighter
is an example of a non-canonical kind of language processor.) When we
state that the source text MUST (or MUST NOT, SHALL, SHALL NOT, REQUIRED),
unless stated otherwise, there is a corresponding requirement on the canonical
kind of language processor for that language to reject programs that
violate that obligation.
When we state that the source text SHOULD (or SHOULD NOT,
RECOMMENDED, NOT RECOMMENDED), unless stated otherwise, we make a corresponding
recommendation that lint-like tools SHOULD be built and employed during
code reviews, so that code reviewers are aware of all violations of these
recommendations, or of other surprises that may cause them to misread
what they are looking at.
Rationale
The E family
is intended to support secure
programming and adversarial code reviews.
The author of the source text, the reviewers reading it to understand
what it might do, and the owner of the language processor on which it
runs may all be mutually suspicious parties with somewhat divergent
interests. The rules below on what source text MUST and MUST NOT do
are designed to enable these parties to safely cooperate by use of these
source texts. If source-author Alice has a reasonable expectation that
a rule violation by Alice might not be caught by code reviewer Bob or
language processor owner Carol, then Alice may violate that rule, independent
of the wording in standards documents such as this. To avoid confusion,
we only say "MUST", etc., when the rule can be enforced by
the canonical language processor.
The reverse is not the case: If Carol operates her language
processor in ways that violate these rules, no
one is necessarily in a position to catch her. Therefore, the obligations
on her language processor are primarily for her own protection.
Dependencies and Versions
Some of the E
family of languages -- the E
language, Kernel-E, and TermL -- delegate aspects of their specifications
to other specifications, such as this page. This page in turn delegates
to yet other specifications, such as Java, Unicode, and various IETF and
W3C standards. This page currently intends to support the specifications
of the E family as of the upcoming
E 0.9 release (also to be known
as E alpha). For each E
release, the Version Dependencies table to the right specifies the minimal
version of these other standards it depends on. An E
program written to run on a given version of E
SHOULD be compatible with the spec of that version as interpreted in terms
of the minimal versions of the other standards it depends on, or later.
For example, the E
language 0.9 spec depends on Java 1.3.1 or later. Therefore, an E
0.9 program SHOULD be compatible with, for example, the E
0.9 spec interpreted in light of the Java 1.4 spec. Below, we generally
leave out the version numbers of dependent specifications, and try to link
to the latest version of each one. Please consult this Version Dependencies
table for the actual correspondences.
Besides covering E
0.9, we will also explain how we hope to evolve in upcoming E
releases, and discuss the potential compatibility implications.
Bytes to Raw Characters
Source text consists of a "raw sequence" of the
subset of Unicode characters whose code points fit within 16 bits, which
we will here call "Unicode16". When source text is obtained
from a source of octets (such as a file), then these octets are to be
interpreted according to UTF-8 in order to create the raw sequence of
Unicode16 characters.
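For illustration only, the following Java sketch shows one way this step might be performed; the class and method names are invented for this example and are not part of any specification. Java's built-in UTF-8 decoder yields UTF-16 code units, so restricting source text to Unicode16 amounts to rejecting any surrogate code units in the decoded result.

    import java.io.UnsupportedEncodingException;

    public final class RawDecoder {
        // Hypothetical helper: decode a source file's octets into the raw
        // sequence of Unicode16 characters described above. A surrogate code
        // unit would indicate a code point outside Unicode16, so it is rejected.
        static char[] toRawChars(byte[] octets) throws UnsupportedEncodingException {
            String decoded = new String(octets, "UTF-8");
            for (int i = 0; i < decoded.length(); i++) {
                char c = decoded.charAt(i);
                if (c >= '\uD800' && c <= '\uDFFF') {
                    throw new IllegalArgumentException(
                        "code point outside Unicode16 at index " + i);
                }
            }
            return decoded.toCharArray();
        }
    }

Note that Java's String constructor silently substitutes replacement characters for malformed UTF-8; a real processor would presumably want to reject such input instead.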
Rationale & Future Directions
Because we do not yet have any personal experience with
non-Ascii Unicode characters at the time of this writing, we considered
specifying that raw source text for 0.9 be restricted to Ascii, and
therefore that all non-Ascii source characters may only be expressed
using Backslash-u Decoding (see below). However, this seems like too
great a burden to place on non-English-based programmers wishing to
use E.
In the other direction, unless we hear otherwise from
Unicode users, we expect we will wish to eventually support the full
Unicode character set, rather than just Unicode16. However, at the time
of this writing, a Java char is only 16 bits, so larger characters
cannot yet be supported with reasonable effort. We could allow raw characters
to represent UTF-16 elements, with sequences of such elements representing
Unicode characters according to the UTF-16 encoding, but
this destroys much of the point of making chars larger than octets in
the first place. This issue shows up in the semantics of the E
languages as well -- their Strings and chars are defined for now to
be Unicode16 characters, but we hope to eventually allow any Unicode
character.
Note that Java
is considering the UTF-16 direction. If they do, we expect that
E will not follow it there.
Raw to Normalized Characters
Once we have a raw sequence of Unicode16 characters (whether
from UTF-8 decoding or otherwise), the following transformations are then
applied, logically in order, to the raw characters prior to lexical analysis:
- Newline Canonicalization. The three common encodings of newline
-- "\r\n", "\r", and "\n"
-- are all canonicalized to "\n".
- Trailing Whitespace Truncation. Sequences of non-newline whitespace
characters immediately preceding a newline are removed, leaving just
the newline. Similarly, trailing newlines at the end of a compilation
unit are replaced with a single newline.
- Non-Whitespace Control Character Rejection. If the source
text contains control characters other than the whitespace characters
' ' (space), '\t' (tab), and '\n', then it must be statically rejected.
- (Not yet implemented) Non-Canonical Unicode Rejection. If
the source text is not in Unicode
Normalized Form C (NFC) (See also this
W3C recommendation), then it must be statically rejected.
- Backslash-u Decoding. A '\\' (backslash) followed
by 'u' and 4 digit16s (hex digits)
is decoded into the Unicode16 character with that code point. Likewise,
a "\\u{" followed by some number of digit16s
and a "}" is decoded into a Unicode16 character
with that code point. As with the corresponding Java spec, such interpretation
is suppressed if this sequence is preceded by an odd number of backslashes.
If this results in any code points that don't represent Unicode16
characters, then it must be statically rejected.
By "logically in order", we mean the result must
be equivalent to what it would be if each stage were applied to the output
of the previous stage, not including itself. For example, a "\r\t \r\n"
sequence must be transformed into "\n\n" rather than "\n".
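A minimal Java sketch of the first three stages follows; the method name is invented for this illustration, and the later stages (Non-Canonical Unicode Rejection and Backslash-u Decoding) are omitted. Within stage 2, the run of trailing newlines is collapsed as it exists in that stage's input, so that the result matches the example above rather than collapsing to a single "\n".

    // Hypothetical sketch of stages 1-3, applied logically in order.
    // normalize("\r\t \r\n") yields "\n\n", as required above.
    static String normalize(String raw) {
        // 1. Newline Canonicalization: "\r\n" and "\r" both become "\n".
        String s = raw.replace("\r\n", "\n").replace('\r', '\n');
        // 2. Trailing Whitespace Truncation: collapse the run of newlines
        //    already at the end of the unit, then drop spaces and tabs
        //    immediately preceding a newline.
        s = s.replaceAll("\n{2,}\\z", "\n");
        s = s.replaceAll("[ \t]+\n", "\n");
        // 3. Non-Whitespace Control Character Rejection.
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n') {
                throw new IllegalArgumentException("control character at index " + i);
            }
        }
        return s;
    }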
Source positions are in terms of line and column after Newline
Canonicalization and Trailing Whitespace Truncation, but before Backslash-u
Decoding. The first line in a source unit is 1. The first column is 0.
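For illustration only, a small Java helper (the name is made up for this sketch) computing such a position from a 0-based character offset into the normalized, but not yet backslash-u-decoded, text:

    // Hypothetical helper: 1-based line and 0-based column for an offset.
    static int[] positionOf(String normalized, int offset) {
        int line = 1;
        int column = 0;
        for (int i = 0; i < offset; i++) {
            if (normalized.charAt(i) == '\n') {
                line++;
                column = 0;
            } else {
                column++;
            }
        }
        return new int[] { line, column };
    }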
The E family
of languages uses matched brackets to indicate nesting structure to a
language processor, and source text SHOULD use indentation to signal this
nesting structure to the human eye. Further, source text SHOULD avoid including
any tab characters at all. Code reviewers SHOULD be alerted to violations
of these recommendations.
Rationale
Line and column positions are robust in the face of Newline
Canonicalization and Trailing Whitespace Truncation. If such positions
are interpreted (e.g., by an IDE) on the source text after UTF-8 Decoding
but prior to these steps, the results will still be sensible.
Since Ascii (meaning standard 7-bit Ascii) is a subset
of UTF-8, Backslash-u Decoding allows Ascii source text to represent
any sequence of Unicode16 characters.
We introduce the curly-bracketed backslash-u form to prepare
for eventually allowing longer Unicode characters.
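To make the decoding rule concrete, here is an illustrative and unofficial Java sketch covering both the four-digit and curly-bracketed forms; the method name is invented, hex-digit validation is left to NumberFormatException, and this should not be read as the reference implementation.

    static String decodeBackslashU(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) != '\\') {
                out.append(s.charAt(i));
                i++;
                continue;
            }
            // Count the run of backslashes. The escape fires only when the
            // backslash before the 'u' is preceded by an even number of
            // backslashes, i.e. when the run length is odd.
            int runStart = i;
            while (i < s.length() && s.charAt(i) == '\\') {
                i++;
            }
            int run = i - runStart;
            if (run % 2 == 0 || i >= s.length() || s.charAt(i) != 'u') {
                out.append(s, runStart, i);     // suppressed: copy through
                continue;
            }
            out.append(s, runStart, i - 1);     // keep all but the escaping backslash
            int cp;
            if (i + 1 < s.length() && s.charAt(i + 1) == '{') {
                int close = s.indexOf('}', i + 2);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated \\u{...}");
                }
                cp = Integer.parseInt(s.substring(i + 2, close), 16);
                i = close + 1;
            } else {
                if (i + 4 >= s.length()) {
                    throw new IllegalArgumentException("truncated \\u escape");
                }
                cp = Integer.parseInt(s.substring(i + 1, i + 5), 16);
                i += 5;
            }
            if (cp > 0xFFFF) {
                throw new IllegalArgumentException("not a Unicode16 code point: " + cp);
            }
            out.append((char) cp);
        }
        return out.toString();
    }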
Making Sources Wysiwyg
We wish to make sources as wysiwyg as possible, in order
to support adversarial code reviews.
Trailing Whitespace Truncation can only be significant
within quoted literal character sequences, and helps ensure that these
character sequences are wysiwyg -- that their visual depiction is an
accurate portrayal of their contents.
We reject non-NFC-canonical Unicode sequences, rather
than perform NFC canonicalization, to avoid the danger of silently changing
the intended meaning of the source text. When such normalization is
desired, it's easy enough to apply a separate tool to perform such normalization
as an edit to the source text.
Backslash-u Decoding occurs after Non-Canonical Unicode
Rejection so that a normalized source character sequence may be essentially
any sequence of Unicode16 characters, but non-canonical sequences can
only be expressed by the visible use of the backslash-u.
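As a sketch of how the (not yet implemented) rejection check might look, using java.text.Normalizer, which only became available in Java 6, well after the Java versions this page targets:

    import java.text.Normalizer;

    public final class NfcCheck {
        // Hypothetical check: reject, rather than silently normalize,
        // source text that is not already in Normalized Form C.
        static void requireNfc(String sourceText) {
            if (!Normalizer.isNormalized(sourceText, Normalizer.Form.NFC)) {
                throw new IllegalArgumentException(
                    "source text is not in Unicode Normalized Form C");
            }
        }
    }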
These rules by themselves are not sufficient to make sources
wysiwyg.
- The NFC canonicalization rules may leave two different strings
that appear to be the same to the untrained eye, or that appear the same
depending on the choice of font. As such ambiguities are noticed, we recommend
building lint-like tools to spot them and alert reviewers. For
example, we recommend that source text, prior to Backslash-u Decoding,
SHOULD be in the narrower Unicode Normalized Form KC (NFKC), and
that reviewers should check whether this is the case prior to conducting
a review. (However, we only enforce the less strict NFC normalization,
because NFKC is too severe for case sensitive languages.)
- The remaining whitespace difference between ' ' (space) and '\t'
(tab) is not visible. Rather than reject sources that contain harmless
tabs, we defer this issue to the tokenization rules. The tokenization
rules MUST reject tabs where they can be semantically significant,
such as within literal character sequences (i.e., between quote
marks or within DocComments).
Character Classes
Name         ::=   Definition                        Ascii subset
hspace       ::=   !'\n' isWhitespace                ' ' | '\t'
whitespace   ::=   isWhitespace                      ' ' | '\t' | '\n'
digit10      ::=   isDigit                           '0'..'9'
digit8       ::=                                     '0'..'7'
digit16      ::=                                     '0'..'9' | 'a'..'f' | 'A'..'F'
uric         ::=   IETF-URICs | '\\' | '|' | '#'     'a'..'z' | 'A'..'Z' | '0'..'9'
                                                       | anyof("_$.-;/?:@&=+,!~*'()%\\|#")
The non-Ascii cases are not yet tested in the current implementations.
In the uric production, each '\\' (backslash) character
is converted to '/', and each '|' (vertical bar) character is converted
to ':'. Therefore, the possible semantic values associated with this production
do not include the backslash or vertical bar characters.
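For illustration (these helper names are invented here and are not part of any E implementation), the Ascii-subset definitions and the uric conversion can be written as:

    // Ascii-subset forms of a few character classes; the full definitions
    // delegate to the Unicode isWhitespace/isDigit properties.
    static boolean isHspace(char c)  { return c == ' ' || c == '\t'; }
    static boolean isDigit16(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    // In uric text, '\' is read as '/' and '|' as ':', so the semantic value
    // of a URI token never contains a backslash or vertical bar.
    static String convertUric(String uriText) {
        return uriText.replace('\\', '/').replace('|', ':');
    }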
Token Types
In the following table, the bold names with initial capitals are the
token types. The others are supporting productions.
In a literal string, a backslash followed by a newline is ignored --
the backslash eats the newline.
Note that Real64 includes both 0.0 and -0.0. These are distinct, even
though they represent the same real number.
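For instance, the same distinction is visible in Java's IEEE-754 doubles, which Real64 presumably reflects:

    public final class NegativeZeroDemo {
        public static void main(String[] args) {
            double pos = 0.0;
            double neg = -0.0;
            System.out.println(pos == neg);               // true: equal as real numbers
            System.out.println(Double.compare(pos, neg)); // positive: distinct values
            System.out.println(1.0 / pos);                // Infinity
            System.out.println(1.0 / neg);                // -Infinity
        }
    }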
Rationale
We allow '_' (underbar) characters within digit sequences so that long
digit sequences can be broken up for readability. For example, the number
of cents in 1.3 million dollars can be written as "1_300_000_00".
(Is it Perl that also allows this?)
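A tokenizer can support this simply by discarding the underbars before converting the digit sequence; a minimal sketch, with an invented helper name:

    // Strip '_' separators before numeric conversion, so that
    // parseDigits10("1_300_000_00") yields 130000000.
    static long parseDigits10(String digits) {
        return Long.parseLong(digits.replace("_", ""));
    }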
For convenience, we allow but do not require single quotes to be escaped
in double quoted literals, and vice versa.
For convenience, we allow multi-line string literals without per-line
delimiters, even though reviewers can become confused about what they're
looking at. Syntax highlighting SHOULD be used to make literals visibly
distinct from non-literal source text during reviews.