I've been experimenting with extensions to S-Expression notation for a while now, and made some new additions that I'm going to test drive for a while in my experimental Lisp-like language "Bang".
The classic restricted S-Expression notation looks like this:
(top level list
    (level 2 list)
    (another list (with more (nested) (lists)))
    (yet another list))
First I re-used naked notation from Nonelang, which uses indentation as a cue to balance parentheses, Python style:
top level list
    level 2 list
    another list
        with more
            nested ;comments are able to wrap single symbols
            lists ;because they are first parsed as symbols, then stripped
    (yet another list) ; classic notation is also supported
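To make the indentation rule concrete, here's a toy Python sketch of how such a naked reader might work. This is my own approximation of the rules described in this post, not Nonelang's actual parser, and the classic parenthesized fallback is omitted:

```python
def parse_naked(src):
    """Toy indentation-based reader: each line becomes a list of its
    symbols, nested under the nearest shallower line; a line holding a
    single symbol stays a bare symbol unless a ';' comment pads it."""
    root = []
    stack = [(-1, root)]                  # (indent, open list)
    collapsible = []                      # singleton lists that may unwrap
    for line in src.splitlines():
        code, sep, _comment = line.partition(';')
        syms = code.split()
        if not syms:
            continue
        indent = len(line) - len(line.lstrip())
        node = list(syms)
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        parent.append(node)
        stack.append((indent, node))
        if len(syms) == 1 and not sep:    # bare symbol, no comment
            collapsible.append((parent, node))
    for parent, node in collapsible:      # strip the helper wrapping from
        if len(node) == 1:                # singletons that gained no children
            parent[parent.index(node)] = node[0]
    return root

print(parse_naked("do\n    print ; wrapped"))   # [['do', ['print']]]
print(parse_naked("do\n    print"))             # [['do', 'print']]
```

The last two calls mirror the (do (print)) versus (do print) distinction discussed below: the comment forces the single symbol to wrap.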
The semicolon ; is used for line comments in Lisp, Scheme, Assembler and, most recently, LLVM, but contemporary programmers know it better as a statement separator, which I wanted to honor.
In a quick survey I did among my Twitter followers, the double-slash // used in C-like languages was the most popular comment token (48%), followed by the hash # (37%), which is traditionally used in scripting languages such as bash, Tcl and Python. I chose to go with #, as // is very useful in Python as the floordiv operator, an operator I want to support. In naked notation, Bang looks very pythonic anyway, so the hash would make the language more familiar to Pythoneers. C-like languages understand # as almost comment-like preprocessor tags (they're not part of the actual AST), so I believe even C users won't be appalled by this choice.
Unfortunately I already use # in Nonelang as an array index operator, like so:
print
    matrix # x # y ; and here's a comment
So for Bang I had to find a suitable replacement. I chose @, mostly because it is spelled as "at" (which is a fitting moniker), is unused in C and belongs to a more esoteric feature in Python (decorators). So with the new array indexing syntax the statement becomes
print
    matrix @ x @ y # and here's a comment
Alright. Now how do we fix the next problem:
do
    print
turns into (do print), not (do (print)), which is what we actually want.
We could just wrap print into (print) locally, but let's try to avoid more parentheses for now, especially when they appear in such an utterly surprising and irregular way in the midst of a parens-free statement block.
Since naked notation does not wrap single symbols on a line, I used to append an empty comment as a hack to put a second symbol in the line and thus get the parser to wrap the statement:
do
    print ; there, now it's wrapped
turns into
(do (print "; there, now it's wrapped\n"))
After stripping comments, we then get the desired (do (print)).
Alas, our good friend ; has been taken from us, and to add insult to injury, the lexer now strips # before the parser even sees it (this is so it plays nicely with my editor's toggle block comment feature, which comments out blocks in a way that would otherwise confuse the parser with weirdly nested comment symbols).
So how do we do it now? I turned ; into a new control character, the statement separator, which operates in both naked and coated notation and wraps values up to the previous ; or the beginning of the scope. This way, (print a; print b; print;) turns into ((print a) (print b) (print)).
Similarly, in naked notation we can now do this:
do
    print x y # is already wrapped as (print x y)
    print x y; # superfluous semicolon has no effect
    print x; print y; # added to (do ...) as (print x) (print y)
    print; # wraps print as (print)
Not only do we have our old feature back, we also get a context-free statement separator that is used in both C-likes and Python.
There's a caveat, though: if trailing values aren't topped off with ;, they're not going to be wrapped. So (print a; print b; print;) turns into ((print a) (print b) (print)), but (print a; print b; print) turns into ((print a) (print b) print)! It sounds like a drawback, but it allows us to do this:
do
    print x; print
        a + b
which turns into (do (print x) (print (a + b))).
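My reading of the ; rule can be modelled in a few lines of Python, treating Bang lists as Python lists and ; as a marker token. This is a sketch, not the real parser, and the splicing of statements into naked parent lists is left out:

```python
def apply_semicolons(items):
    """Wrap values back to the previous ';' (or the start of the list);
    leave a trailing run without a closing ';' unwrapped."""
    out, pending = [], []
    for it in items:
        if it == ';':
            if len(pending) == 1 and isinstance(pending[0], list):
                out.append(pending[0])    # already wrapped: ';' is superfluous
            elif pending:
                out.append(pending)       # wraps even a single symbol
            pending = []
        else:
            pending.append(it)
    out.extend(pending)                   # trailing values stay as-is
    return out

# (print a; print b; print;) -> ((print a) (print b) (print))
print(apply_semicolons(['print', 'a', ';', 'print', 'b', ';', 'print', ';']))
# (print a; print b; print)  -> ((print a) (print b) print)
print(apply_semicolons(['print', 'a', ';', 'print', 'b', ';', 'print']))
```

The "already wrapped" branch is my guess at why a superfluous semicolon after an already-wrapped naked line has no effect.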
Another addition I made, as a sort of "styling utility" for domain-specific language designers, was context-free support for square brackets [] and curly brackets {} as aliases for parentheses (), similar to how Clojure does it, but without inherent semantic meaning. Bracket expressions are all equivalent, but set a style for lists that can be queried in syntax handlers:
(do ((print x) (print y))) # coated
[do {(print x) (print y)}] # styled coated
do { # naked into styled coated
    print x; # make use of new statement character
    print y; # look, it's almost C! ;-)
}
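A reader for the three equivalent bracket pairs might record the opening character as a queryable style, along these lines. This is my sketch of the idea; Bang's real internal representation is surely different:

```python
OPEN = {'(': ')', '[': ']', '{': '}'}
CLOSE = set(OPEN.values())

def parse_styled(tokens):
    """Build the same list for every bracket pair, but remember which
    bracket opened it as a 'style' tag for syntax handlers to query."""
    stack = [('top', [])]
    for t in tokens:
        if t in OPEN:
            stack.append((t, []))
        elif t in CLOSE:
            style, items = stack.pop()
            if style == 'top' or OPEN[style] != t:
                raise SyntaxError(f'unbalanced {t!r}')
            stack[-1][1].append({'style': style, 'items': items})
        else:
            stack[-1][1].append(t)
    return stack[0][1]

print(parse_styled(['[', 'do', '{', 'x', 'y', '}', ']']))
```

All three pairs produce structurally identical lists; only the 'style' tag differs, which is exactly the "no inherent semantic meaning" property described above.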
I also realized that there is zero use for the comma , in Bang, so why not use that one as a delimiter as well? Now [ptr, * ptr, const * ptr] turns into (ptr (* ptr) (const * ptr)). This example also demonstrates the two key differences to the ; delimiter:
- Single values are not wrapped, so (a,b,c,d) is equivalent to (a b c d).
- Trailing values are wrapped, so {a = 1, b = 2, c = 3} is the same as {a = 1, b = 2, c = 3,}, which translates to ((a = 1) (b = 2) (c = 3)).
Furthermore, the comma separator , has lower precedence than the statement separator ;. Take this fictitious example:
# do it naked, because we can.
do # (do
    int x, int y; x = 5, y = 6 # ((int x) (int y)) ((x = 5) (y = 6)))
The reason why (x = 5) (y = 6) is wrapped here despite the missing trailing statement separator is that naked notation takes care of the wrapping.
As an attempt to permit prefix headers for argument lists, I turned the colon : into a special symbol that controls where argument separation begins. For example, (label: a = 1, b = 2, c = 3) now turns into (label : (a = 1) (b = 2) (c = 3)).
Treating : as a separable token also allows us to parse Pythonic blocks like this one:
while x < size: # (while x < size :
    x = x + 1 # (x = x + 1))
Since : separates the bracketless expression from the body, the syntax handler can now easily tokenize the expression into (while (x < size) (x = x + 1)) without having to consider ambiguities.
Dots . are truly special. They're pretty much useless in the traditional Lispy sense, as nobody needs such an important character devoted simply to concatenating lists. But they ended up as one of the very first infix operators I needed in Nonelang, so that I could retire the unwieldy format of e.g. (. object field subfield) in favor of a more modern object.field.subfield.
Because my stance on special characters is that they should be spaced correctly anyway, I never made dots special, so the lexer always worked them into symbols. (a . b.c) would indeed literally parse as (a . b.c), and the syntax handler could only do as well as (a . (b . c)), which is of course complete horse manure.
Therefore, in the new lexer, . is tokenized separately, so that (a . b.c) turns into the proper (a . b . c) right away. Unfortunately this broke pattern expressions like (args ...), which then became (args . . .), so I added another rule which groups successive dots. Now, for example, (a.b..c...d) parses as (a . b .. c ... d), which should also make Lua syntax fans happy, who know .. as the string concatenation operator.
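The dot-grouping rule is easy to express as a tokenizer pattern. A sketch in Python (my approximation of the rule, with a float pattern thrown in to preview the next point):

```python
import re

# Order matters: floats first, then runs of dots, then plain symbols,
# then single brackets.
TOKEN = re.compile(r'\d+\.\d+|\.+|[^\s().]+|[()]')

def tokenize(text):
    """Split source text into tokens; runs of dots become their own
    tokens, so '...' survives as one token instead of three."""
    return TOKEN.findall(text)

print(tokenize('(a . b.c)'))      # ['(', 'a', '.', 'b', '.', 'c', ')']
print(tokenize('a.b..c...d'))     # ['a', '.', 'b', '..', 'c', '...', 'd']
print(tokenize('(args ...)'))     # ['(', 'args', '...', ')']
print(tokenize('x = 3.14'))       # ['x', '=', '3.14']
```

Putting the float alternative first is what keeps 3.14 from being split into 3 . 14, which is exactly the tension with symmetric code transformation discussed next.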
Of course this means I now also have to tokenize floating point numbers, which give a second legitimate reason to use dots in a symbol. I do this for doubles in Nonelang (all other notations like 0x and 0b are done in syntax handlers), and I am always worried because this makes it harder to support symmetric code transformation.
I thought I could do without in Bang, but it has become clear that it's absolutely required. Besides, there are better ways to transform code than parsing the source file, reassembling it and writing it back to disk: I store anchoring data for each token (its start and end position in the file), which can be used to patch the file directly.
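The anchoring idea might look like this in Python. The Token shape here is hypothetical; the post only says that start and end positions are stored per token:

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    start: int    # offset of the first character in the source
    end: int      # offset one past the last character

def patch(src, tok, new_text):
    """Rewrite one token in place using its anchors, leaving the rest of
    the file byte-for-byte untouched (no parse/reassemble round trip)."""
    assert src[tok.start:tok.end] == tok.text, 'stale anchor'
    return src[:tok.start] + new_text + src[tok.end:]

src = 'print\n    matrix @ x @ y\n'
tok = Token('matrix', 10, 16)         # anchors recorded by the lexer
print(patch(src, tok, 'grid'))        # print\n    grid @ x @ y
```

The assert is the payoff of storing anchors: a transformation can verify it is still looking at the token it recorded before touching the file.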
Are we done? I'm not sure. I want to support ' as a second context-free string quote style that may be used in the Python sense (as a simple alternative) or in the C sense (as a char (array) constructor). I'm still considering adding support for Python block strings """, but since our Lispy strings are already multi-line...
; same as "\"they don't\n\ttokenize strings like\n\tthey used to.\""
"\"they don't
tokenize strings like
they used to.\""
...I think an ' alternative is more than enough.
These are all changes I could think of at the Lexer / Parser level. I generally treat the problem of writing the parser as servicing the language designer who wants to invent interesting syntax handlers for S-Expressions, as well as the contemporary language user who prefers to style his expressions for clarity in a syntax he and his editor understand. I try to avoid creating an explosion of supposedly powerful forms that on closer inspection are more obscuring than enlightening.
Lisp purists can easily retreat to classical coated notation and never touch the fancy stuff. Tinkerers can play with the fundamental extensibility of the language. Everyone is happy!