/ data / serial / jhu-paper

Appendix A:
The Data-E Manual

Unserialization as Evaluation

(This is approximately an unabridged form of the Unserialization as Evaluation section presented previously.)

As shown above, unserialization can be thought of, or even implemented as, expression evaluation [ref Rees, XMLEncoder]. A depiction is an expression in some programming language, the unserializer is the eval function, the exit references to be reconnected are free variable references, the values to reconnect them to come from the scope (i.e., environment) provided to eval, and the root of the reconstructed subgraph is the value the expression evaluates to. Serialization is the logically inverse process, in which an uneval function is applied to a root and an unscope, and writes out an expression that, were it evaluated in the corresponding scope, would reconstruct the subgraph.

Data-E is the subset of E used for depicting a subgraph as an expression. Ignoring precedence, it consists of the following productions:

Fundamental Data-E Constructs

`expr ::=`	`literal \| varName \| tempName \| call \| define \| defrec`
`literal ::=`	`LiteralInt \| LiteralFloat64 \| LiteralChar \| LiteralString` `# Each are written as they are in Java. # Examples: 37, 42.3, 'c', "What me worry?"`
`varName ::=`	`Identifier # but not of the form "t__"Digit*` `# These variable names are only used freely. # Example: foo`
`tempName ::=`	`"t__"Digit*` `# These variable names are never used freely. # Example: t__12`
`call ::=`	`expr "." Identifier "(" exprs ")"` `# Example: __makeList.run("foo", 49)`
`define ::=`	`"def" tempName ":=" expr` `# Where expr does not refer to tempName # Example: def t__12 := makePoint.run(3, 5)`
`defrec ::=`	`"def" tempName ":=" expr` `# Where expr may refer to tempName # Example: def t__0 := __makeList.run(1, t__0, 3)`
(The difference between `define` and `defrec` is not apparent in the textual source language, but becomes significant in the other representations of Data-E, as explained below.)
`exprs ::=`	`(expr ("," expr)*)?`

E Syntactic Shorthands generated by `deSrcKit`

Since we use deSrcKit to build the depictions we present for expository purposes, we need to know the shorthands it builds, which are a subset of the shorthands recognized and expanded by E. Going the other way, all E syntactic shorthands, including those below, are recognized by deSrcKit, since it uses the E parser to parse and expand its input.

Any valid E expression that expands only into the above Data-E primitives is a valid Data-E expression with the same meaning. Likewise any valid Data-E expression is a valid E expression with the same meaning.

`expr "(" exprs ")"`	is shorthand for `expr ".run(" exprs ")"` If the message name is left out, it defaults to "`run`". For example, `__makeList("foo", 49)` is shorthand for `__makeList.run("foo", 49)`.
`"[" exprs "]"`	is shorthand for `"__makeList.run(" exprs ")"` Square brackets can be used to express lists. For example, `["foo", 49]` is shorthand for `__makeList.run("foo", 49)`.
`expr "[" exprs "]"`	is shorthand for `expr ".get(" exprs ")"` For example, `vec[i]` is shorthand for `vec.get(i)`.
`"<"Identifier">"`	is shorthand for `Identifier"__uriGetter"` For example, `<file>` is shorthand for `file__uriGetter`
`"<"Identifier":"URIC*">"`	is shorthand for `Identifier"__uriGetter.get(" URIBody ")"` Where the URIBody is a literal string whose value is the sequence of URI characters. As explained earlier in E's URI Expressions, `<file:/foo/bar>` is shorthand for `file__uriGetter.get("/foo/bar")`
`"-" exprs`	is shorthand for `exprs ".negate()"` Allows negative numbers to be written as concisely as literals. For example, `-5` is shorthand for `5.negate()`.

Using several cases together:

? pragma.syntax("0.8")

? def root := [1, root, 1, <import:java.lang.makeStringBuffer>]
# value: [1, <***CYCLE***>, 1, <makeStringBuffer>]

? def surgeon := <elib:serial.makeSurgeon>.withSrcKit("de: ")

? def depiction := surgeon.serialize(root)
# value: "de: def t__0 := [def t__2 := 1,
#                          t__0,
#                          t__2,
#                          <import:java.lang.makeStringBuffer>]"

? surgeon.unserialize(depiction)
# value: [1, <***CYCLE***>, 1, <makeStringBuffer>]

The value of the E expression <import:java.lang.makeStringBuffer> serializes as the Data-E expression import__uriGetter.get("java.lang.makeStringBuffer"), as will be explained in the next chapter. The deSrcKit shows this expression using the URI shorthand.
The def t__2 := 1 is an instance of the define production, since t__2 is not used on its right hand side.
The def t__0 := ... is an instance of the defrec production, since t__0 is used on its right hand side, expressing a cyclic data structure.
The square brackets for forming a list are shorthand for calling __makeList with those same arguments.

For those familiar with Java, Data-E should be mostly familiar, but with a few important differences:

In E, a variable definition is an expression. Like assignment, the value of a definition expression is the value of the right hand side.
null, false, and true are not keywords in E, but rather are variable names in E's universal scope and in Data-E's default scope and unscope. This means an expression can count on them having their normal values, so these don't need to be literals. Likewise for Infinity and NaN, which hold these floating point values that can't be written as literals. The "false" in the first example of Previews of Data-E Serialization above was a variable reference, not a literal, just as it is in E.
Using only the literal, varName, and call productions, we can write Data-E expressions that will evaluate to new tree structures whose leaves are reattached exit points.
In E, a variable is in scope starting from its defining occurrence, left-to-right, until the close-curly that closes the scope box (lexical contour) in which it is defined, and not counting regions of nested scope boxes where it is shadowed by a definition of the same name. Ignoring defrec for a moment, this is a straightforward generalization of behavior familiar from Java and many other block structured languages. In Data-E, since there are no constructs that introduce scope boxes (no constructs with curly brackets), every variable is in scope from its defining occurrence until the end of the top-level expression as a whole.
With the tempName and define productions, we can use Data-E to represent DAGs. For those values that are multiply referenced, we can write out the subexpression for calculating this value at the first position it needs to appear, as the right hand side of a define, capturing its value in a temp variable (def t__2 := 1). Everywhere else this value is needed, we use tempName to reuse this captured value (t__2).
With defrec, we can use Data-E to represent graphs. Unlike other block structured languages, even when the name being defined on the left is used on the right, E still holds strictly to the left-to-right scoping rule. But what value does the name on the right evaluate to, before its actual value has been determined? The answer is an unresolved promise, similar to a logic variable or a future. An unresolved promise is an object reference whose designation has not yet been determined. For every unresolved promise, there's a Resolver used to determine what object the promise designates. Once a promise is resolved, it becomes like any normal reference to the object it designates.

The defrec expression evaluates by defining a promise/Resolver pair, defining the value of tempName to be the promise, evaluating the right hand expression in the scope of this definition, and resolving the promise to the value of the right hand side. This value is also the value of the defrec expression as a whole. At the moment the above list is created, t__0 is still an unresolved promise. Afterwards, once the def t__0 := ... executes, the promise becomes the list itself. (*** unclear. Explain that defrec has right-to-left execution order, while maintaining left-to-right scoping order.)

Most other serialization systems [ref JOSS, XMLEncoder, BOSS, ...] are defined in systems without any kind of postponed references (promises, logic variables, futures), but which still allow user-defined unserialize-time behavior by the objects being unserialized. In any such system, when unserializing a cycle, user-defined behavior may interact with graph neighbors that are not yet fully initialized. In a conventional system this is only a minor source of bugs. But in a graph of mutually suspicious objects this would be a major opportunity for an adversary. By referring to an object only with promises until its fully initialized, we make such bugs safely fail-stop, enabling an adversary in the graph only to mount a denial-of-unserialization attack, which it can trivially do anyway, and which we make no effort or claim to prevent. (*** Need earlier section on security claims, non-claims, scenarios, threat models, etc. *)

Although Data-E and Kernel-E are both subsets of E, neither is a subset of the other, since Kernel-E does not support the defrec expression.

Data-E is a true subset of E, both syntactically and semantically. Since E is a secure object-capability language, the security issues surrounding evaluation of E programs are already well understood. By using a subset of E in this way, we get to leverage this understanding.

Recognizers and Builders

The following table shows all the current representations of Data-E and the kits that support them.

	recognizing	building	Kit Name
	Some Common Names		Kit Name
Data-E source	parsing	pretty-printing	`deSrcKit`
Data-E AST	visiting	expansion	`deASTKit`
DataECode assembly	assembling	disassembling	`deAssemblyKit`
DataECode bytecodes	dispatching	generating	`deBytecodeKit`
Delimited Subgraph	traversing unevaluating	constructing evaluating	`deSubgraphKit`

We have already encountered the first and last row. Like the first row, the middle rows are all depictions -- they contain nothing but portable bits. All the depiction rows are just straightforward engineering, and operate in a canonical way with no policy choices. All expression of policy and mixing of intent occurs on the subgraph row, mostly during recognition.

The second row, Data-E AST, is simply the abstract syntax tree representation corresponding to Data-E source -- it represents the same information in the same abstract way, but optimized for machine rather than human processing.
The third row, DataECode assembly, is the assembly language for the DataECode instruction set. DataECode is a traditional form of reverse-polish instruction set for a stack machine, designed solely as a "compilation" target for Data-E, for the purpose of evaluating such compiled Data-E expressions.
Finally, DataECode bytecodes, represents DataECode for machine rather than human processing. It is designed to be relatively compact, quick to generate, and quick to process. This is the default depiction for production use of serialization. Because it can easily be converted to and from the other forms of depiction, we gain the benefits of an efficient binary form with no real penalty in debuggability.

For all the depiction kits other than deASTKit, what you get out (during recognition) is the same as what you put in (during building). The deASTKit has one special feature of benefit to the readability of the other depictions -- it compacts and simplifies its input a bit, removing unnecessary temporary variables and turning defrecs into defines when it can.

? def cycle := ['a', cycle, 'a', -3]
# value: ['a', <***CYCLE***>, 'a', -3]

? def deSubgraphKit := <elib:serial.deSubgraphKit>
? def deSrcKit := <elib:serial.deSrcKit>

? deSubgraphKit.recognize(cycle, deSrcKit.makeBuilder())
# value: "def t__0 := [def t__2 := \'a\', t__0, t__2, def t__4 := -3]"

In our first preview example, several temporary variables were defined but none were used. Here, all but t__4 are used. deASTKit removes the definition of t__4 and weakens the defrec of t__2 into a define.

? def deASTKit := <elib:serial.deASTKit>
# value: <deASTKit>

? def ast := deSubgraphKit.recognize(cycle,
>                                    deASTKit.makeBuilder())
# value: term`defrec(0,
#                    call(import("__makeList"),
#                         "run",
#                         [define(2, 'a'),
#                          ibid(0),
#                          ibid(2),
#                          -3]))`

The result is a kind of tree structure known as a term tree, but for present purposes may as well be S-Expressions, XML, or any other adequate system for representing trees of symbols. In this case, we can ignore it, as it is just an intermediate representation on the way to the simplification we're interested in:

? deASTKit.recognize(ast, deSrcKit.makeBuilder())
# value: "def t__0 := [def t__2 := \'a\', t__0, t__2, -3]"

This shows the chaining of recognizers and builders, much as one chains Unix filters using pipes. Because deASTKit is particularly useful as a "filter", it provides a convenient wrap(..) method which takes a builder and returns a similar builder, but with deASTKit's simplifications interposed in front of the argument builder:

? deSubgraphKit.recognize(cycle,
>                         deASTKit.wrap(deSrcKit.makeBuilder()))
# value: "def t__0 := [def t__2 := \'a\', t__0, t__2, -3]"

We make a reasonably compact binary serialization as follows:

? def deBytecodeKit := <elib:serial.deBytecodeKit>
# value: <deBytecodeKit>

? (def builder := deASTKit.wrap(deBytecodeKit.makeBuilder())
>  def code := deSubgraphKit.recognize(cycle, builder)
>  code.size())
# value: 34

The variable code now holds a list of 34 bytes. To see what it says, we can disassemble it:

? def deAssemblyKit := <elib:serial.deAssemblyKit>
# value: <deAssemblyKit>

? deBytecodeKit.recognize(code, deAssemblyKit.makeBuilder())
# value: "OP_PROMISE # [t__0, t__1]
#        OP_IMPORT(\"__makeList\")
#        OP_LIT_CHAR(\'a\')
#        OP_DEFINE # t__2
#        OP_IBID(0)
#        OP_IBID(2)
#        OP_LIT_NEGINT(3)
#        OP_CALL(\"run\", 4)
#        OP_DEFREC(1)
#        OP_ROOT
#        "

Instruction

Does

Leaving the Stack as

OP_PROMISE
OP_IMPORT("__makeList")
OP_LIT_CHAR('a')
OP_DEFINE
OP_IBID(0)
OP_IBID(2)
OP_LIT_NEGINT(3)
OP_CALL("run", 4)
OP_DEFREC(1)
OP_ROOT

def [t__0,t__1] := Ref.promise()
push __makeList
push 'a'
def t__2 := tos
push t__0
push t__2
push -3
call run with three args
t__1.resolve(tos)
return pop()

()
()
(__makeList)
(__makeList, 'a')
(__makeList, 'a')
(__makeList, 'a', t__0)
(__makeList, 'a', t__0, 'a')
(__makeList, 'a', t__0, 'a', -3)
(['a', t__0, 'a', -3])
(['a', ['a', ... -3], 'a', -3])
()

or decompile it:

? deBytecodeKit.recognize(code, deSrcKit.makeBuilder())
# value: "def t__0 := [def t__2 := \'a\', t__0, t__2, -3]"

We can unserialize by interpreting these instructions:

? deBytecodeKit.recognize(code, deSubgraphKit.makeBuilder())
# value: ['a', <***CYCLE***>, 'a', -3]

The Event-Based `DEBuilder` API

We see above that we effectively have many converters between different representations of our subgraphs: serialization, unserialization, assembly, disassembly, decompilation, etc. We define each converter by composing a recognizer with a builder for the same reason many compiler suites compose a compiler front-end (lexer, parser, etc) with a compiler back-end (code generator) -- to provide all needed converters with n+m code rather than n*m code. (Or, in our case, 2n rather than n**2.)

We do this with the same trick used by many compilers: we define a reusable API which sits between multiple front ends and multiple back ends [ref gcc]. A front-end recognizes patterns in the input representation, and reports by calling this API. A backend implements this API. In response to being called, it builds an output representation. Our reusable API is what Data-E builders all implement, so it defines the interface type DEBuilder. (For present purposes, we ignore the differences between guards and types.)

With the DEBuilder API in the middle, we can convert from any of these representations to any other by composing any recognizer with any builder. This provides some novel flexibility. For example, sometimes the need for serialization and for unserialization are not separated in space or time, as when we only want a deep-copy-with-differences operation. In this case we can cut out the middle-man, and hook a subgraph recognizer directly to a subgraph builder, without ever creating an intermediate depiction.

DEBuilder is actually a parameterized type defined by the following E code:

def DEBuilderOf(Node :Guard, Root :Guard) :Guard {
    interface DEBuilder {

        to getNodeType() :Guard
        to getRootType() :Guard

        to buildRoot(top :Node) :Root

        to buildLiteral(value :(int | float64 | char | String)) :Node
        to buildImport(varName :String) :Node
        to buildIbid(tempIndex :int) :Node
        to buildCall(rec :Node, verb :String, args :Node[]) :Node

        to buildDefine(rValue :Node) :Tuple[Node, int]
        to buildPromise() :int
        to buildDefrec(resolverIndex :int, rValue :Node) :Node
    }
}

The values that may be given to buildLiteral are any integer, floating point number, character, or bare (unannotated) String. Not all of instances of these can be represented literally in all Data-E representations, for example, the Data-E source text representation. When the builder b is made by the deSrcKit, it copes as follows:

Negative numbers: b.buildLiteral(-5) becomes b.buildCall(5, "negate", [])
Floating point NaN and positive Infinity:
b.buildLiteral(NaN) becomes b.buildImport("NaN")
b.buildLiteral(Infinity) becomes b.buildImport("Infinity")
Floating point negative Infinity combines these cases
b.buildLiteral(-Infinity) becomes b.call(.buildImport("Infinity"), "negate", [])

The scope used for unserialization is assumed to have standard bindings from the names null, false, true, NaN, Infinity to the corresponding scalars. This covers the representation of all E scalars. During serialization, the default unscope also has bindings for these, so some of these cases will typically be handled by the unscope lookup farther below, rather than by a special case in buildLiteral.

Because integers used for crypto can be insufferably long when written in decimal, we have also included __makeInt in the default scope and unscope. For integers over a certain size, the deSrcKit writes them out as an expression of the form:

__makeInt.fromString64("UwFaUVGcc2HtxA...")

where the literal string is a string64 encoding of the integer (six bits per character).

For Data-E to be a practical system, buildLiteral and the DataECode instruction set would need to be enhanced to accept lists of scalars of the same type, in order to support a packed binary representation.

Node and Root are type parameters of the function DEBuilderOf. When this function is called with two actual type arguments, it returns a DEBuilder type defined in terms of those types arguments.

Grammar of Valid Sequences of Data-E Building Calls

DEBuilder must deal well both with sequence-based and with tree-based representations. Were we only building sequences, the Nodes returned by the build methods might not matter, but the sequence of calls would matter. In particular, to easily generate instructions for a stack machine, these methods must be called in postfix order (reverse polish). When generating a tree, the sequence might not matter, but the returned and argument Nodes would. To build a node of an AST, one would first build the children. The Nodes returned from those build calls would be the arguments to the call to build their parent. Therefore, since this interface can be used to generate either a tree or a sequence, its clients must obey both constraints, and the builder can make use of either regularity.

The allowed sequences of calls are described by the following "grammar":

`start ::=`	`expr0 buildRoot(node0)`
`expr ::=`	`literal \| varName \| tempName \| call \| define \| defrec`
`literal ::=`	`buildLiteral(value :(int \| float64 \| char \| String))`
`varName ::=`	`buildImport(varName :String)`
`tempName ::=`	`buildIbid(tempIndex :int)`
`call ::=`	`expr0 expr1..exprN buildCall(node0, verb :String, [node1..nodeN])`
`define ::=`	`expr0 buildDefine(node0)`
`defrec ::=`	`promise0 expr1 buildDefrec(resolver0 :int, node1)`
`promise ::=`	`buildPromise()`

Where nodeN is the value returned by exprN, and resolverN is one more than the integer returned by promiseN. In other words, if promise0 is from a buildPromise() which returned a 37, then resolver0 would be 38. The 37'th temporary variable will hold a promise, and the 38th will hold the corresponding Resolver. It is the responsibility of the builder to maintain a counter of temporary variable indices, and have buildDefine(..) and buildPromise() allocate and return the next sequential number. It is the responsibility of the caller to feed previous results back in as arguments according to this numbering.

Rather than burdening each individual recognizer and builder with the task of checking that its counterparty obeys this grammar, we expect instead to create a transparent validating builder wrapper that both does this checking and forwards these calls to its wrapped builder. We may then code the other recognizers and builders assuming the counterparty is correct. When this assumption in inappropriate, we can interpose the validator.

This kind of event-based interface for generating or consuming trees is in a tradition most prominently represented by SAX -- the" Simple API for XML" [ref SAX events]. If DEBuilder is like SAX, then Data-E AST is like DOM and Data-E Source is like XML surface syntax. For tree structured data, or any syntax expressed in BNF, many of these issues are eternal and perpetually reinvented.

However, SAX -- and most of the APIs that call themselves event-based -- provide only the sequence of calls (both entry and exit events, allowing both prefix and postfix processing), but not the passing of previous results back in as construction arguments. To build a DOM tree from SAX events, the builder must keep track of the stack itself. A closer precedent is provided by parser generators such as yacc [ref], in which the (building) actions happen in postfix order, and each action is provided with the "semantic value" (our "Node") returned by the actions associated with its child productions. A yacc generated parser calling semantic actions corresponds well to our notion of a recognizer calling a builder.

Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.

/ data / serial / jhu-paper

ELib E Language Smart Contracts Related

Download FAQ API Mail Archive Donate

report bug (including invalid html)