/ elang / grammar

Quasi-Literals

Why Quasi-Literals?

Consider the ancient C printf and scanf statements:

printf("hello %s world\n", worldState);

as contrasted with the modern

out.print("hello " + worldState + " world\n");

They both say approximately the same thing, but the notation of the first is optimized to make visible the nature of the resulting value: It'll be like the format string but with the %-holes filled in. The second form is instead optimized to make visible the nature of the computation that will produce this value: Calculate these expressions and concatenate these strings together. Unfortunately for my explanation, it doesn't seem to make much difference on this case.

Though less familiar, the point is clearer with Lisp's quasi-quote, since it expands to a normal expression:

`(hello ,worldState world)

which means A list whose first element is the literal symbol 'hello, whose second element is the value of the variable worldState, and whose third element is the literal symbol 'world. Like "%", "," marks a hole to be filled in at runtime. This expression macro-expands to

(cons 'hello (cons worldState '(world)))

which describes the process for making such a list. In the first form, the nature of the list being produced is clear but the means of its production -- two cons operations -- is obscure. In the second form the reverse is true.

I believe this observation about the respective strengths and weaknesses of the two kinds of notations goes back to Alonzo Church, the inventor of Lambda. For most purposes, more literal forms of expression should be preferred when possible, for the same reasons that user interfaces should work by "direct manipulation" where possible. Much of the popularity of modern "scripting" languages -- Perl, Python, Tcl, etc -- is due to their support for quasi-literal expression of text by "interpolation" ($-substitution) and for quasi-literal text pattern matching by regular expressions.

As the Lisp language shows, there's no reason a language can't enable literals or quasi-literals to express data types other than scalars and strings, but few languages do. Lisp provides for the literal and quasi-literal expression of lists. But what about arbitrary user-defined objects? By providing a pluggable quasi-parser framework, each quasi-parser designer can define their own syntax for [quasi-]literally expressing whatever data types they choose. In closing below, we propose a radical future/past extension of this direction.

Quasi-Literal Expressions

A quasi-literal expression (or a quasi-expression) consists of an optional quasi-parser name (defaulting to "simple" if left out), a quasi-quote (also known as backquote, "`"), a string in the grammar accepted by that quasi-parser, and a closing quasi-quote. This string may also have several embedded "$" <identifier> or "$" "{" <E expression> "}" indicating values to be substituted in. The quasi-parser sees the string without the embedded expression source. For example:

foo`some text${bar() + 3}more text$baz`

expands to

foo__quasiParser.valueMaker("some text${0}more text${1}").
      substitute([bar() + 3, baz])

In E, line end normally terminates statements. However, if the line ends with a dangling binary operator, then it is continued on the next line. "." is the binary operator for immediate-call, familiar from other C-tradition oo languages.

The quasi-parser named "foo__quasiParser" is looked up, and is asked to created a ValueMaker from the transformed string. The transformation extracts each expression and leaves behind a "$" "{" <digit>+ "}", ie, a "substitution hole". Since this applies to all quasi-expressions, all quasi-grammars must recognize these substitution holes. The resulting ValueMaker is then invoked with a list of the values of the extracted expressions, which the ValueMaker should "plug into" the corresponding substitution holes, in order to make whatever kind of value the quasi-parser took to be described by the quasi-string. We currently have two quasi-parsers that support quasi-expressions: "simple__quasiParser" and "e__quasiParser".

"simple" Quasi-Expressions for Text

The "simple" quasi-parser is indeed the simplest useful example consistent with these rules. Except for the embedded holes, it takes the input text to just describe itself. Substitution consists of stringification of the value, and string substitution. It corresponds to the "interpolation" familiar from at least Perl, Tcl, and various shells:

? pragma.syntax("0.8")

? `abc${2+3}def`
# value: "abc5def"

The expression expands to

simple__quasiParser.valueMaker("abc${0}def").substitute([2 + 3])

"e" Quasi-Parser for Parse Trees

The "e" quasi-parser is simply E's own parser used as a quasi-parser. It accepts E source-code as its quasi-grammar, and takes it to describe ASTs (which we call parse trees) of the post-expansion Kernel E language:

? def sum := e`x + y`
# value: e`x.add(y)`

? def prod := e`a * $sum * b`
# value: e`a.multiply(x.add(y)).multiply(b)`

In E, "+" and "*" are just syntactic sugar for the "add" and "multiply" messages. (Kernel E parse trees print as e`...` quasi-expressions as well.) Since the substitution is structural, not textual, the fact that "*" binds tighter in E than "+" does not cause confusion. "deriv.e" in the distribution is a symbolic differentiation package written in E using the e quasi-parser so that algebraic expressions can be represented as trees, but expressed in familiar infix-operator notation.

Notice that each time the same quasi-expression is evaluated, the valueMaker(..) argument is the same, though the substitute(..) argument is usually different. Modulo some security issues beyond the scope of this overview, quasi-parsers should therefore cache ValueMakers by transformed string. It can then spend a lot of work "compiling" a ValueMaker so that substitute(..) can go faster.

See also Quasi-Literals and XML, which proposes to use XML DOM trees as the universal parse-tree data structure.

Quasi-Literal Patterns

A quasi-literal pattern (or a quasi-pattern) is also written as an optional quasi-parser name, a quasi-quote, a quasi-string with embedded stuff, and a closing quasi quote. This time the embedded stuff can consist of several

	"$"<identifier>
|	"$" "{" <E expression> "}"
|	"@"<identifier>
|	"@" "{" <E pattern> "}"

You can think of E's grammar as having two parsing contexts: expression and pattern. A quasi-literal is parsed and transformed as a quasi-expression or a quasi-pattern based on its parsing context. The quasi-pattern transformation is more complex:

? "abcde" =~ `a@{x}d@y`
# value: true

? x
# value: "bc"

? y
# value: "e"

As in Perl, "=~" is the pattern match operator. It takes an expression on the left and a pattern on the right. That first line expands to:

("abcde" =~ _4q ? (simple__quasiParser.
                       matchMaker("a@{0}d@{1}").
                       matchBind([], _4q) =~ [x, y]))

"_4q" is an E-expansion-generated temporary variable name. (By introducing a temporary variable, we can expand the pattern by itself into a Kernel-E pattern, and we preserve the source's original left-to-right order, necessary for both scoping and execution order.) The pattern constructs needed to understand the above pattern are:

<identifier>	Always matches, and binds identifier to the specimen. This is the only form of variable definition in E.
<pattern> "?" <expression>	"`?`" is pronounced "such that". First matches <pattern> to the specimen; if it succeeds, evaluates <expression> in the resulting scope. If its value is true, then the such-that pattern as a whole succeeds.
[<pattern>, ...]	Matches a list of the same length by matching each element against the corresponding pattern.

So, reading from the inside out, the quasi-parser named "simple" is asked to produce a MatchMaker from the string "a@{0}d@{1}", which is presumed to be in the quasi-grammar understood by this quasi-parser. (Once again, it pays for the quasi-parser to cache a compiled MatchMaker indexed by this string.) The MatchMaker is then asked to "matchBind(args, specimen)" where the args are extracted from embedded $-expressions, as with the earlier substitute(args). 'args' is here an empty list since there were no embedded $-expressions, and the specimen is here "abcde" by the logic of "?".

(Detail: If the argument to matchMaker contains more than one @-hole with the same index number, then the quasiParser may reject this string by throwing an exception. If the quasiParser accepts this string, then it must treat the first occurrence of an @-hole of a given index as a normal binding @-hole. All successive occurrences match only if the value they would have bound is the same (E's "==") to the value bound by the first occurrence. Note that this case will never be generated by expanding a quasi-literal pattern.)

"simple" Quasi-Patterns for Matching Text

If the matchBind() fails, it returns null, which will not match any list pattern, even the empty list pattern, resulting in the overall match failing. If the matchBind() succeeds, it returns a list with elements extracted from the specimen by @-holes, and put into the list at positions corresponding to the @-numbers. The MatchMaker resulting from the "simple" quasi-parser just does the simplest useful anti-greedy (ascetic?) string match and extract without backtrack consistent with this framework. Except for holes, all characters just represent themselves and must match exactly.

"e" Quasi-Patterns for Matching Trees

As with quasi-expressions, none of this is specific to text. A MatchMaker can accept whatever kind of object it wishes as specimen and extract whatever it wishes as the bindings. As you might guess, the e quasi-parser can also be used to match parse trees and extract subtrees. A portion of deriv.e:

/**
 * What's the derivative of expr wrt var?  Ie, what's "d(expr)/d(var)"?
 */
def deriv(expr, var) :any {

    switch (expr) {

        match e`$var ** @exp` ? (isConst(exp))  {
            e`$exp * $var ** ($exp -1)`
        }
        match e`@a + @b` {
            def da := deriv(a, var)
            def db := deriv(b, var)
            e`$da + $db`

The last 3 lines could also have been written

e`${deriv(a, var)} + $(deriv(b, var)}`

depending on your taste. These two examples show the similarity between the match-bind-substitute styles of programming as practiced by "scripting" languages on text, and that practiced by Prolog on trees of symbols.

"rx" Quasi-Patterns for Perl-5 Regular Expressions

The third quasi-parser currently available in E is named "rx", standing for "Regular eXression". "rx" takes a full Perl-5 regular expression engine obtained from Original Reusable Objects (now Savarese.Org) and wraps it to turn it into a quasi-parser. (This quasi-parser is written in E and is found in the E distribution as "PerlMatchMakerMaker.emaker".) As expected, most of the work is in the treatment of holes. The equivalent of the above string match written using "rx" rather than "simple" would be:

? "abcde" =~ rx`a(@{x}.*?)d(@y.*)`

The expansion is the same of course, except that the "rx" quasi parser is looked up instead of "simple", and the transformed string fed to matchMaker(..) is

"a(@{0}.*?)d(@{1}.*)"

Parentheses play a role in Perl regular expressions related to E's @-holes: they extract the part of the specimen matched against the sub-pattern they contain. To generate a proper Perl regular expression, the "rx" wrapper removes the @-holes leaving

"a(.*?)d(.*)"

but verifies that each @-hole was at the beginning of such a parenthesized group, and remembers which group it is. The (cached) MatchMaker consists of a compiled Perl Pattern object plus the correspondence between paren-group-number and @-number. On a successful match it returns the bindings in a list ordered according to the original @-numbers.

Unfortunately, to do this, "rx" must correctly distinguish between an "(" that starts a Perl-saved-group and one that doesn't. We currently impose some restrictions (reject some valid Perl patterns) to make this recognition easier, but we still don't always get this right. Fortunately, the known bugs here haven't yet bitten anyone on any real examples, but this needs to be fixed. Ideally, the OroMatcher code could be enhanced so "rx" could ask Pattern where it thinks the open parens of save-groups are.

The ORO website also provides a good quick reference to Perl-5 regular expressions, and a link to the official one.

Beware Write-Only Code

I always hated a lot of Perl code for being so hard to read. I took pride in how readable E was by comparison. That is, until I started heavily using Regular expressions. My code became a dense gobbledygook of punctuation characters that reminded several folks of APL. The answer is not to give up on regular expressions. Rather the answer comes with a new Perl-5 feature imported into E courtesy of the ORO folks: The "(?x)" flag.

Put "(?x)" at the beginning of your "rx" pattern and now you can freely embed whitespace and "#" comments in your pattern. Of course, you will then have to escape whitespace and "#" characters that are supposed to be part of the pattern, but this is a small price to pay. A good example of readable code using regular expressions can be found in the old "parseUpdoc.emaker" before and after using "(?x)". In any case, the current updoc parser, in makeUpdocParser.emaker in the source tree, no longer uses the rx__quasiParser. We think it's more readable without it, but YMMV.

Long Term Future Directions: Hypercard

Our long term future takes us back to the conversation with Scott Kim that pointed us in this direction long ago. Hypercard had just come out and was obviously powerful in a way most of had not seen before. Scott had a unique perspective on this power:

Most programming languages have as literal data types things like numbers, characters, and strings. Lisp, in addition, has lists, but all these only express literals using text strings. Why? Because the program itself is a text string. A weird pun is going on: when you put the insertion point of your text editor between the quotes of a literal string you're effectively no longer in a program editor, but in a nested text editor. The text you are editing is -- almost -- simply the text that that literal evaluates to. Whereas normally programming is indirect, here we're operating by direct manipulation -- creating the actual output directly.

Hypercard is normally conceived of as primarily a visual application builder with an embedded silly programming language (Hypertalk). Think instead of a whole Hypercard stack as a mostly literal program. In addition to numbers and strings, the literals you can directly edit include bitmaps, forms, buttons, and menus. You can literally assemble most things, but where you need dynamic behavior, there's a hole in the literal filled in by a Hypertalk script. The program editor for this script is experienced as being nested inside the direct manipulation user interface editor.

It's one of those 15 minute conversations that change your life.

Though there have been many many visual interface builders since Hypercard, I'm not aware of any that have pursued this analogy between partially literal user-interface construction and Lisp quasi-quote. How should E eventually pursue this?

Well, we could get fancy and say that source code should stop being text, and should instead be a graph of objects. Unfortunately, though this approach has paid off in the past, it is becoming increasingly less possible, as the world becomes increasingly more wedded to text. (Overall net progress can obscure many cases of horrifying regress, and the computer world is rife with them.) Fortunately, there's a low road:

The Macintosh Resource Fork and ResEdit was a crude but effective form for direct manipulation editing of an extensible set of non-textual literal data types. They fell short of what we seek in only two regards: 1) The literals weren't "quasi", ie, they were fully literal, rather than having holes to be filled in by more code. 2) They weren't meaningfully embedded within the program text -- not visually, and not scoping-wise. As pure literals, the lack of scope embedding doesn't matter. But once you fix #1 you must make sure the embedded expressions are in their lexical context.

Elmer currently uses the Java Swing plain text editor, a sibling of the Java Swing compound document editor (of which the wysiwyg html editor is a subclass). Java also has a resource mechanism (which I'd guess was inspired by the Macintosh's), and Java can serialize/unserialize object graphs into such resources.

A future E compound document editor could generate and recognize, for example, the expression

reseditor`<resource name> foo=${expr0} bar=${expr1}`

unserialize the resource with this name into (hopefully) a ValueMaker that's also an editable Model, and embed at this place in the program editor a nested editor for this type of Model. Of course, to preserve the quasi-literal illusion, the editor for the Model should seem like the editor for the data type the ValueMaker makes (in response to substitute()), but it would in turn contain embedded program editors at the positions in the Model corresponding to the names "foo" and "bar". Perhaps the literal plane should be shown above the program plane using shadows. The embedded program editors would be shown, in wells using reverse shadows, to be on the same plane (and so the same scope) as the surrounding program. On "Save", the editor would serialize the ValueMaker back into resource land.

Similarly, of course, for serialized MatchMakers.

Besides user interface construction, what would you use this for? How about generating graphics? Right now, you can either write computer graphics code, which is very indirect but very general; or use a drawing program, which works great as long as what you're drawing fits its limitations. What if you're not even trying to produce a single drawing, but rather to produce a program that does data-dependent drawing? Currently you'd be stuck with a mostly indirect way of expressing yourself. It would seem the ability to intimately mix elements expressed at different levels of literalness could let us use the best of both on the same task.

Or (borrowing slightly from the Garden system) how about embedding a graphical state diagram or state chart editor. The transition arcs could have holes containing actions expressed in code that's in the scope of the lexical context containing the embedded state diagram.

Acknowledgement

The quasi-literals of E owe a tremendous debt to Alonzo Church, Lisp quasi-quote, Hypercard and the Hypercard team, Scott Kim, Dean Tribble, and Joule. Thanks!!

Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.

/ elang / grammar

ELib E Language Smart Contracts Related

Download FAQ API Mail Archive Donate

report bug (including invalid html)