/ data / serial / jhu-paper

Deconstructing Serialization Part 1
Introducing Data-E

Lessons of Deconstruction Serialization

The body of each chapter is presented using running code examples, and with detailed enough explanation that you should be able to follow this code. For those that wish only to understand the general lessons and how they may be applied to more conventional serialization systems, see the "Lessons of..." and "Corresponding Concepts in Convention Serialization" sections at the beginning and end of each chapter

The "Deconstruction Serialization" chapter defines our serialization format, Data-E, by subsetting a programming language, E. This move is inessential, but it allows us to more easily understand what's happening.

Many existing systems (Lisps, Smalltalk) often print an object as an expression that, if evaluated, would reconstruct that object. We can understand the data by reusing a subset of our code reading skills. When these printing systems do so reliably, they are often grown into serialization systems, in which the depiction actually is the program it appears to be. We can then understand the semantics of the serialized form by reusing a subset of our understanding of the semantics of the programming language.

(Serialization in Mozart creates the equivalent of a compiled module, providing the semantic correspondence but not the visibility. Perhaps these can be decompiled.)

In these terms, Java has three very different depiction systems.

The printed form of a Java object is simply its response to toString(). While these are often in the form of expressions that would reconstruct the object, often they're not, and there's no general convention.
Java Object Serialization Streams, or JOSS [ref JOSS], defines a complex opaque binary format which few know how to parse, and which is rarely made visible. JOSS is implemented using special powers (native methods for violating encapsulation) not available to other objects, making it too dangerous to use under mutual suspicion. However, JOSS also has an extensive and well thought out set of hooks for customizing serialization and unserialization. The logic of these hooks strikes the balance we need between flexibility and security. The corresponding hooks in Data-E are based on lessons learned from JOSS.
In apparent reaction to the maintenance problems created by opacity and the private access, Java now has an additional serialization framework, the XMLEncoder [ref XMLEncoder]. The XMLEncoder writes depictions which are semantically identical to Java programs. This format combines the semantics of Java with the readability of XML. Like Data-E, this system serializes only by interacting with the public protocol of the original objects, and unserializes by evaluating the depiction-as-program which again only uses public protocols to perform the reconstruction. Unfortunately, the XMLEncoder is designed around the Java Beans conventions, which conflates access with initialization, rendering it's design useless for our present purposes.

To produce a depiction of an object graph, a serializer must somehow obtain a representation of each object adequate for calculating an overall depiction. In the printing frameworks, like Java's toString(), the representation offered is a depiction -- it is already only bits, and each object is responsible for traversing the portion of the subgraph rooted in itself. This is maximally flexible -- it allows an object to claim anything it likes. Used for serialization, this flexibility has fatal security problems, as explained in the next chapter.

In our approach, the serializer must obtain for each object a portrayal -- a representation of that object in terms of live references to other objects. The serializer's traversal proceeds as it obtains a portrayal for each of these "components", etc. A serializer can simply ask an object to provide a portrayal of itself -- the object's self portrait -- or, as we will see, it can derive or substitute its own portrayal of the object as an expression of some other policy objective. However it obtains these portrayals, the resulting depiction is a literal picture only of the graph of these portrayals, rather than the actual graph of objects being serialized.

Concrete Embodiment in E

Although the ideas in this paper should be applicable to any object-capability language and many serialization formats, as previously mentioned, for concreteness, we present the implementation of these ideas in E as applied to Data-E. When an example is shown as an E command line session, like

? pragma.syntax("0.8")

? 2 + 3
# value: 5

then the example also doubles as an executable regression test. By updocing the page containing the examples, you can see whether the system behaves as shown by the example. If you installed E at, for example, "c:/Program Files/erights.org" and placed "c:/Program Files/erights.org/scripts" on your PATH, then in a directory containing the chapters of this paper, at a shell prompt you can type:

$ updoc.e deconstructing.html
directory-path/deconstructing.html:...............................

(Though please see "Security Considerations" before running this or any other Updoc script.)

Each of the dots is a test case that successfully passed, like the above "2 + 3". As you read, if you are curious about how a variation of an example would behave, make a copy of the page, edit appropriately, and updoc it. Or try the examples interactively at a rune command-line:

$ rune
? 2 + 3
# value: 5

?

The rune program starts an E read-eval-print loop. The "?" is the E prompt. To exit rune, type the end-of-file character: Control-D or Control-Z depending on your shell.

In order to have a common point of reference, this paper assumes a basic prior knowledge of Java. JOSS (Java's Object Serialization Streams) [ref JOSS] is occasionally used for comparison, so a prior knowledge of it will help, but is not required.

We assume only that prior knowledge of E explained in the Ode ("Capability-based Financial Instruments") [ref Ode] and that explained in the next section on "E's URI Expressions". E syntactically resembles other C tradition object languages, such as Java, C++, C#, and Python. When the meaning of E code isn't covered by the Ode or the next section, and isn't obvious by analogy with Java, we will explain as we go. For more on E, please see [ref erights.org, Walnut].

E's URI Expressions

An unfamiliar bit of E syntax, needed to understand the examples in this paper, is the URI expression, written as a URI string between angle brackets.

? def f := <file:/foo/bar>
# value: <file:c:/foo/bar>

The protocol name to the left of the URI's colon (file here, but any identifier) is transformed (mangled) into the name of the variable (file__uriGetter) whose value is asked to retrieve the named resource. The characters between the colon and the close angle bracket (which must be legal characters for a URI body), becomes the literal resource name to be looked up ("/foo/bar"). So the above session is a shorthand for the equivalent:

? def f := file__uriGetter.get("/foo/bar")
# value: <file:c:/foo/bar>

Besides accessing the kind of resources normally accessed by URLs, the URI expression is also used as E's module import mechanism. The typical form of import in E is

def name := <import:fully-qualified-name>

where fully-qualified-name is the full name, including package prefix, of an E module or a safe Java class. (In order to make the extensive Java libraries available in E without sacrificing capability security, we must tame them -- determine what subset of their public interface is consistent with capability security principles. As part of this taming process, we declare certain Java classes to be "safe", and therefore generally importable.) Because fully qualified names can be long, they are often factored as follows:

def <packgeName> := <import:package-prefix.*>
...
def name := <packageName:rest-of-path>

The package subtree rooted at "org.erights.e.elib" is provided as a built-in convenience, as if we had already executed

def <elib> := <import:org.erights.e.elib.*>

As a result,

? def deSubgraphKit := <elib:serial.deSubgraphKit>
# value: <deSubgraphKit>

is equivalent to

? def deSubgraphKit :=
>   <import:org.erights.e.elib.serial.deSubgraphKit>
# value: <deSubgraphKit>

Such a package subtree root gives access only to that subtree of the original package tree. Similarly, as will be explained in Manipulating Authority at the Exits, a directory can be used as a uriGetter in order to give access only to a subtree of the file system, as retrieved by names relative to that directory.

Previews of Data-E Serialization

In the Data-E System, we compose a serializer or unserializer from a pair of Data-E Kits, such as the deSubgraphKit already imported above. Each kit knows how to recognize and build a given kind of Data-E representation. The deSubgraphKit is special, in that the representation it traffics in is subgraphs of actual objects. All the other representations are depictions of subgraphs expressed in the Data-E "language". For example, the deSrcKit traffics in representations of Data-E as source strings, written in the Data-E subset of the E source language. In this document, to make clear when we're looking at Data-E rather than E source, all Data-E source strings are shown prefixed with "de: ".

? def deSrcKit := <elib:serial.deSrcKit>
# value: <deSrcKit>

To serialize is to recognize a subgraph and to build a depiction.

? "de: " + deSubgraphKit.recognize([false, 3], deSrcKit.makeBuilder())
# value: "de: def t__0 := [false, def t__2 := 3]"

Each recognize(..) method takes two arguments -- the input representation to recognize (here, the subgraph to traverse), and a builder to call as parts of the representation are recognized, in order to build the output representation (here, a Data-E source string). Each kit provides a recognize(..) method for accepting its form of representation as input, and a makeBuilder() for making a builder to build its form of representation as output. The recognize(..) method returns its argument builder's final output.

The deASTKit manages another depiction of Data-E programs: as Abstract Syntax Trees, as explained in Appendix A: The Data-E Manual. For present purposes, we care only about one feature of this kit, that it simplifies the expression a bit during building; for example, by removing unnecessary temporary variables. Interposing it between between the deSubgraphKit and the deSrcKit, we build our first expository serialize function, serialize_a. (All functions so named are for expository purposes only. See Appendix A for a guide to realistic usage.)

? def deASTKit := <elib:serial.deASTKit>
# value: <deASTKit>

? def serialize_a(root) :String {
>     def ast := deSubgraphKit.recognize(root, deASTKit.makeBuilder())
>     "de: " + deASTKit.recognize(ast, deSrcKit.makeBuilder())
> }
# value: <serialize_a>

? serialize_a([false, 3])
# value: "de: [false, 3]"

To unserialize is to recognize a depiction and to build a subgraph. The matching expository unserialize function follows. Parameters in E are actually patterns, of which the most common is a simple variable name, which defines the variable and binds it to the specimen (the argument passed in). Here we see a pattern for stripping off the additional prefix we added above. This pattern matches a string beginning with "de: ". If the specimen is such a string, then it matches the rest of the string against the pattern following the "@" -- in this case a simple variable name which is bound to the remainder of the string.

? def unserialize_a(`de: @src`) :any {
>     deSrcKit.recognize(src, deSubgraphKit.makeBuilder())
> }
# value: <unserialize_a>

? unserialize_a("de: [false, 3]")
# value: [false, 3]

The result of evaluating the depiction-as-expression is a reconstruction of the original subgraph. The case shown above stays clear of all the problematic cases for serialization: All the objects in the graph being serialized here (a boolean, an integer, and a list) are transparent -- they willingly divulge all their state to their clients through their public protocol. All the objects participating in the above serialization -- the serialize function and the objects being depicted -- seek only literal accuracy; and likewise during the above unserialization. We call this unproblematic starting case literal realism for transparent subgraphs.

A Proper Failure

Objects defined in E are encapsulated by default. If we make an encapsulated object and try to serialize it:

? def capsule {}
# value: <capsule>

? serialize_a(capsule)
# problem: Can't uneval <capsule>

we find our simple serialize function fails, as it must. Usually this failure will be the desired behavior. When it isn't, we have several alternatives.

Unconditional Transparency

In Java or Smalltalk, the methods on class Object define the messages all objects are expected to respond to, and provide default behaviors for those methods. The equivalent in E are the Miranda Methods. To avoid collision with programmer-chosen names, Miranda method names begin with a double underscore.

A serializer asks an object for its self portrait with the __optUncall() message. The corresponding default (Miranda) method returns null, indicating that no self portrait is offered. An object which overrides this to return a triple is unconditionally transparent -- it offers its self portrait to any client, just for the asking. The triple describes a call to be performed during unserialization to create the reconstruction of the original object. We start with a simple example:

? def iAmFive {
>     to __optUncall() :__Portrayal {
>         [2, "add", [3]]
>     }
> }
# value: <iAmFive>

? serialize_a(iAmFive)
# value: "de: 2.add(3)"

The expression after the ":" on the __optUncall() method declares what kinds of values it can return. __Portrayal is predefined to be

  Tuple[any, String, any[]]

which allows any three element list in which the second item is a string and the third item is a list of anything.

The expression "2 + 3" is just a syntactic shorthand for "2.add(3)".

? unserialize_a("de: 2.add(3)")
# value: 5

? unserialize_a("de: 2 + 3")
# value: 5

The object iAmFive is unconditionally transparent, but is not literally, or even usefully, realistic. When we reconstruct an object according to its self-portrait, the result can be quite different from the original.

Named Exit Points

Another approach to dealing with problematic objects is serialization avoidance by making references to these into named exit points. The traversal of a subgraph stops whenever it encounters such references, writing out named exit points instead (the jigsaw plugs shown on Figure 2: The Three Faces). We define matching serialize and unserialize functions customized to treat references to capsule as an exit point.

To create a custom serialize function, we first create a custom unscope -- a table mapping from exit references to names. This is most conveniently done by modifying the default unscope table -- the one used implicitly in the previous examples.

? def unscope_b := deSubgraphKit.getDefaultUnscope().diverge()
? unscope_b[capsule] := "foo"
# value: "foo"

? def recognizer_b := deSubgraphKit.makeRecognizer(null, unscope_b)
# value: <unevaler>

? def serialize_b(root) :String {
>     def ast := recognizer_b.recognize(root, deASTKit.makeBuilder())
>     "de: " + deASTKit.recognize(ast, deSrcKit.makeBuilder())
> }
# value: <serialize_b>

? serialize_b([capsule, 3])
# value: "de: [foo, 3]"

As we see, in the Data-E expression produced by serialization, a named exit reference becomes a free variable reference. Data-E unserialization is expression evaluation -- that subset of E expression evaluation applicable to the Data-E subset of E. The inverse of the unscope is therefore just the conventional notion of a scope (or environment), a mapping from variable names to values. On reconstruction, the named exit points will be reconnected to these values.

? def scope_b := deSubgraphKit.getDefaultScope().diverge()
? scope_b["foo"] := def newCapsule {}
# value: <newCapsule>

? def unserialize_b(`de: @src`) :any {
>     # Just a way of saying eval(src, scope_b)
>     deSrcKit.recognize(src, deSubgraphKit.makeBuilder(scope_b))
> }
# value: <unserialize_b>

? unserialize_b("de: [foo, 3]")
# value: [<newCapsule>, 3]

Each set of exit names together with a mutual understanding about what they may be bound to forms a unique micro-standard data format. Serializers and unserializers must agree on such a micro-standard, and so often come in matched pairs, as above. This agreement still leaves room for separate customization on each side, as with our unserializer's choice to bind a somewhat different object to the name "foo" in the scope, perhaps to adapt to a difference in the unserializer's context.

We name this abstraction the Gordian Surgeon both because it simply answers the question "How do we cut this tangle of references?", and also in honor of Gordie Freeman, the co-inventor, along with myself, of its main security property, unforgeable unscope lookup, as explained in the next chapter.

The Gordian Surgeon

Since the serialize function, unserialize function, scope, unscope, and (as we will see) uncallers list are often manipulated together, as above, we introduce the Gordian Surgeon -- an object that knows how to manipulate and wield these five tools in a coordinated fashion to, in effect, cut a subgraph from a donor context, freeze it, thaw it, and transplant it into a recipient graph. The following session is like that above, except that the same capsule is used for serialization and unserialization -- as this is the common pattern the surgeon makes convenient. The addExit(..) below adds the value-name association to the unscope and adds the inverse name-value association to the scope.

? def makeSurgeon := <elib:serial.makeSurgeon>
? def surgeon := makeSurgeon.withSrcKit("de: ").diverge()

? surgeon.serialize([capsule, 3])
# problem: Can't uneval <capsule>

? surgeon.addExit(capsule, "foo")

? surgeon.serialize([capsule, 3])
# value: "de: [foo, 3]"

? surgeon.unserialize("de: [foo, 3]")
# value: [<capsule>, 3]

Counting "Generations"

Our first realistic example uses both self-portraits (with unconditional transparency) and named exit points:

? def makeGenerationCounter(count :int) :any {
>     def generationCounter {
> 
>         /**
>          * Make my successor with the next larger count
>          */
>         to __optUncall() :__Portrayal {
>             [makeGenerationCounter, "run", [count+1]]
>         }
> 
>         /** similar purpose as Java's .toString() */
>         to __printOn(out :TextWriter) :void {
>             out.print(`<gen $count>`)
>         }
>     }
> }
# value: <makeGenerationCounter>

? def genCounter := makeGenerationCounter(0)
# value: <gen 0>

A generationCounter is an object with a single instance variable, count. The makeGenerationCounter function acts like a constructor -- it makes new generationCounter instances. Above, we make the genCounter instance with a count of zero. Each generationCounter uses the underlined code above to "misrepresent" itself as something that would be made by calling makeGenerationCounter with a count value one greater than its own.

However, this portrayal enables us to serialize a generationCounter only if we can serialize makeGenerationCounter, which is just as encapsulated as our earlier capsule. Since it is stateless, it is plausible to "serialize" it by not serializing it -- by making it into a named exit point.

? surgeon.addExit(makeGenerationCounter, "makeGenerationCounter")

? surgeon.serialize(genCounter)
# value: "de: makeGenerationCounter(1)"

? surgeon.unserialize("de: makeGenerationCounter(1)")
# value: <gen 1>

A reconstructed generationCounter has a count one greater than its original, thereby accumulating a count of the number of serialize / unserialize cycles it has been through since it was born. Rather than seeing the underlined "misrepresentation" as a problem to prohibit, this example shows how this representational freedom is an opportunity.

Unserialization as Evaluation

(This is approximately an abridged presentation of the Unserialization as Evaluation section of Appendix A: The Data-E Manual.)

As shown above, unserialization can be thought of, or even implemented as, expression evaluation [ref Rees, XMLEncoder]. A depiction is an expression in some programming language, the unserializer is the eval function, the exit references to be reconnected are free variable references, the values to reconnect them to come from the scope (i.e., environment) provided to eval, and the root of the reconstructed subgraph is the value the expression evaluates to. Serialization is the logically inverse process, in which an uneval function is applied to a root and an unscope, and writes out an expression that, were it evaluated in the corresponding scope, would reconstruct the subgraph.

Data-E is the subset of E used for depicting a subgraph as an expression. Ignoring precedence, it consists of the following productions:

Fundamental Data-E Constructs

`expr ::=`	`literal \| varName \| tempName \| call \| defexpr`
`literal ::=`	`LiteralInt \| LiteralFloat64 \| LiteralChar \| LiteralString` `# Each are written as they are in Java. # Examples: 37, 42.3, 'c', "What me worry?"`
`varName ::=`	`Identifier # but not of the form "t__"Digit*` `# These variable names are only used freely. # Example: foo`
`tempName ::=`	`"t__"Digit*` `# These variable names are never used freely. # Example: t__12`
`call ::=`	`expr "." Identifier "(" exprs ")"` `# Example: __makeList.run("foo", 49)`
`defexpr ::=`	`"def" tempName ":=" expr` `# Where expr may refer to tempName # Example: def t__0 := __makeList.run(1, t__0, 3)`
`exprs ::=`	`(expr ("," expr)*)?`

E Syntactic Shorthands generated by `deSrcKit`

Since we use deSrcKit to build the depictions we present for expository purposes, we need to know the shorthands it builds, which are a subset of the shorthands recognized and expanded by E. Going the other way, all E syntactic shorthands, including those below, are recognized by deSrcKit, since it uses the E parser to parse and expand its input.

Any valid E expression that expands only into the above Data-E primitives is a valid Data-E expression with the same meaning. Likewise any valid Data-E expression is a valid E expression with the same meaning.

`expr "(" exprs ")"`	is shorthand for `expr ".run(" exprs ")"` If the message name is left out, it defaults to "`run`". For example, `__makeList("foo", 49)` is shorthand for `__makeList.run("foo", 49)`.
`"[" exprs "]"`	is shorthand for `"__makeList.run(" exprs ")"` Square brackets can be used to express lists. For example, `["foo", 49]` is shorthand for `__makeList.run("foo", 49)`.
`expr "[" exprs "]"`	is shorthand for `expr ".get(" exprs ")"` For example, `vec[i]` is shorthand for `vec.get(i)`.
`"<"Identifier">"`	is shorthand for `Identifier"__uriGetter"` For example, `<file>` is shorthand for `file__uriGetter`
`"<"Identifier":"URIC*">"`	is shorthand for `Identifier"__uriGetter.get(" URIBody ")"` Where the URIBody is a literal string whose value is the sequence of URI characters. As explained earlier in E's URI Expressions, `<file:/foo/bar>` is shorthand for `file__uriGetter.get("/foo/bar")`

Using several cases together:

? def root := [1, root, 1, <import:java.lang.makeStringBuffer>]
# value: [1, <***CYCLE***>, 1, <makeStringBuffer>]

? def depiction := surgeon.serialize(root)
# value: "de: def t__0 := [def t__2 := 1,
#                          t__0,
#                          t__2,
#                          <import:java.lang.makeStringBuffer>]"

? surgeon.unserialize(depiction)
# value: [1, <***CYCLE***>, 1, <makeStringBuffer>]

The depiction is shown in the middle following the "de: ". It is written in Data-E and has the following meaning:

The value of the E expression <import:java.lang.makeStringBuffer> serializes as the Data-E expression import__uriGetter.get("java.lang.makeStringBuffer"), as will be explained in the next chapter. The deSrcKit shows this expression using the URI shorthand, which in this case looks like the original E.
The def t__2 := 1 is non-cyclic instance of the defexpr production, since t__2 is not used on its right hand side, even though t__2 is used later in the serialization.
The def t__0 := ... is a cyclic instance of the defexpr production, since t__0 is used on its right hand side, expressing a cyclic data structure.

For those familiar with Java, Data-E should be mostly familiar, but with a few important differences:

In E, a variable definition is an expression. Like assignment, the value of a definition expression is the value of the right hand side.
null, false, and true are not keywords in E, but rather are variable names in E's universal scope and in Data-E's default scope and unscope. This means an expression can count on them having their normal values, so these don't need to be literals. The "false" in the first example of Previews of Data-E Serialization above was a variable reference, not a literal, just as it is in E.
Using only the literal, varName, and call productions, we can write Data-E expressions that will evaluate to new tree structures whose leaves are reattached exit points.
In E, a variable is in scope starting from its defining occurrence, left-to-right, until the close-curly that closes the scope box (lexical contour) in which it is defined, and not counting regions of nested scope boxes where it is shadowed by a definition of the same name. In Data-E, since there are no constructs that introduce scope boxes (i.e., no constructs with curly brackets), every variable is in scope from its defining occurrence until the end of the depiction as a whole.
With the tempName and non-cyclic defexpr productions, we can use Data-E to represent DAGs. For those values that are multiply referenced, we can write out the sub-expression for calculating this value at the first position it needs to appear, as the right hand side of a define, capturing its value in a temp variable (def t__2 := 1). Everywhere else this value is needed, we use tempName to reuse this captured value (t__2).
With a cyclic defexpr, we can use Data-E to represent graphs. Unlike other block structured languages, even when the name being defined on the left is used on the right, E still holds strictly to the left-to-right rule. This may seem strange, since defexpr must execute right-to-left -- the expression on the right must be evaluated to a value before the variable on the right can be defined to hold this value. When this not yet defined but in-scope variable name is used within the expression on the right, what does it evaluate to before its actual value has been determined? The answer is an unresolved promise, similar to a logic variable or a future. An unresolved promise is an object reference whose designation has not yet been determined. At the moment the above list is created by evaluating the "[..]" expression on the right, t__0 is still an unresolved promise.

Once the expression on the right has evaluated to a value, then the promise bound to the variable on the left is resolved to be this value. Once a promise is resolved, it becomes like any normal reference to the object it designates. This value is also the value of the defexpr as a whole. By the time def t__0 := ... finishes evaluating, the promise within the list becomes a direct reference to the list itself.

By using promises to reconstruct cycles, we safely avoid a host of hazards. Most other serialization systems [ref JOSS, XMLEncoder, BOSS, ...] are defined in systems without any kind of delayed references (promises, logic variables, futures), but which still allow user-defined unserialize-time behavior by the objects being unserialized. In any such system, when unserializing a cycle, user-defined behavior may interact with graph neighbors that are not yet fully initialized. Before an object is fully initialized, it may be easily confused. In a conventional system this is only a minor source of bugs. But in a graph of mutually suspicious objects this would be a major opportunity for an adversary. By referring to an object only with promises until its fully initialized, we make such bugs safely fail-stop, enabling an adversary in the graph only to mount a denial-of-unserialization attack, which it can trivially do anyway, and which we make no effort or claim to prevent.

Data-E is a true subset of E, both syntactically and semantically. Since E is a secure object-capability language, the security issues surrounding evaluation of E programs are already well understood. By using a subset of E in this way, we get to leverage this understanding.

Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.

/ data / serial / jhu-paper

ELib E Language Smart Contracts Related

Download FAQ API Mail Archive Donate

report bug (including invalid html)