Programming language experiments
This details a project called language_design
that I started September 2015 and worked on sporadically ever since. It was written in Scala.
The initial aim was to build a language that mixed aspects of Scala and Erlang. It would be strongly & statically typed, have actors, and be roughly functional. Functional in the sense that immutability was encouraged and functions were used everywhere, but purity wasn't a big concern.
It had a feature where, given a function fun lines(reader: Reader): List[String]
in scope and an r: Reader
, you could call r.lines()
. The point was to feel like objects (like Java), even the language didn't have any. That was cool.
Actors would be typed, and could implement interfaces. Actors could only be sent messages it could handle. Example: an actor could implement the Sink[T]
interface, handle message Next(item: Int)
, then be used wherever a actor Sink[T]
was needed.
Here's an example of desired syntax from the first commit to the project:
module me.lachlanap.file_average { process Main { handle Run(args: List[String]) = { val out = \Sink { handle Next(item: Int) = { println("Average: " + item) } }; val accumulator = new Accumulator(out); val filename = args.get(0); val reader = new Reader(filename, accumulator); reader.watch(error => println("Error reading file: " + error); crash(error); ); reader ! Execute; // desugared to: reader.!(reader.T.Execute); // then: !(reader, reader.T.Execute); // TODO: Garbage collect dropped processes? } } process Reader(filename: String, out: CloseableSink) { handle Execute = { use flat lang.io; val reader = openReader(filename, charset.UTF_8) with(reader, reader => { reader.lines() .map(line => line.parseInt()) .foreach(parse => parsed match { case None => crash("Failed to parse " + line); case Some(v) => out ! Next(v); }); }); out ! Close; finish(); } } process Accumulator(out: Sink[Int]): CloseableSink[Int] { var total = 0; var count = 0; handle Next(item: Int) = { total += item; count++; } handle Close = { out ! Next(total / count); finish(); } } }
I built my own set of combinator parsers to parse this... and quickly replaced them with FastParse. I managed to parse most of the language (although operator precendence is not a thing).
I didn't get very far with that language. I never got to generics. Or actors. Or data types beyond Strings and Ints.
The language today is very different.
module lang { data Bool = True | False data Unit = Unit crash :: String => String = message => message } module test.i2s { main :: () => String = () => "value: " + intToString(factorial(input)) input :: Int = 10 fib :: Int => Int = in => ifeq_int(in, 0, () => 1, () => { ifeq_int(in, 1, () => 1, () => { fib(in - 1) + fib(in - 2) }) }) intToString :: Int => String = in => if0_str(in, () => "0", () => { div10 = in / 10 digit = in - (div10 * 10) digitStr = digitToString(digit) if0_str(div10, () => digitStr, () => intToString(div10) + digitStr) }) digitToString :: Int => String = in => match(in) { case 0 = "0" case 1 = "1" case 2 = "2" case 3 = "3" case 4 = "4" case 5 = "5" case 6 = "6" case 7 = "7" case 8 = "8" case 9 = "9" } if0_str :: (Int, () => String, () => String) => String = (test, true, false) => ifeq_str(test, 0, true, false) ifeq_str :: (Int, Int, () => String, () => String) => String = (test, expected, true, false) => { match(test - expected) { case 0 = true() case _ = false() } } if0_int :: (Int, () => Int, () => Int) => Int = (test, true, false) => ifeq_int(test, 0, true, false) ifeq_int :: (Int, Int, () => Int, () => Int) => Int = (test, expected, true, false) => { match(test - expected) { case 0 = true() case _ = false() } } factorial :: Int => Int = { go = in => ifeq_int(in, 1, () => 1, () => { in * go(in - 1) }) go } }
This is the language today. It's purely functional. Functions are defined exclusively with lambdas. It has no generics, hence ifeq_str
and ifeq_int
. There is no integer-to-string function, so I implement my own --- and the only comparison operator is the match expression --- booleans aren't a thing yet. The only way to get output is to return the result from main
.
But it actually works. The example above prints:
result :: lang.String = "value: 3628800"
How it works
The source code above is parsed into:
module lang { data Bool = True() | False() data Unit = Unit() crash :: (String) => String = (message) => message } module test.i2s { main :: () => String = () => + "value: " intToString(factorial(input)) input :: Int = 10 fib :: (Int) => Int = (in) => ifeq_int(in, 0, () => 1, () => { ifeq_int(in, 1, () => 1, () => { + fib(- in 1) fib(- in 2) }) }) intToString :: (Int) => String = (in) => if0_str(in, () => "0", () => { div10 = / in 10 digit = - in * div10 10 digitStr = digitToString(digit) if0_str(div10, () => digitStr, () => + intToString(div10) digitStr) }) digitToString :: (Int) => String = (in) => match(in) { case 0 = "0" case 1 = "1" case 2 = "2" case 3 = "3" case 4 = "4" case 5 = "5" case 6 = "6" case 7 = "7" case 8 = "8" case 9 = "9" } if0_str :: (Int, () => String, () => String) => String = (test, true, false) => ifeq_str(test, 0, true, false) ifeq_str :: (Int, Int, () => String, () => String) => String = (test, expected, true, false) => { match(- test expected) { case 0 = true() case _ = false() } } if0_int :: (Int, () => Int, () => Int) => Int = (test, true, false) => ifeq_int(test, 0, true, false) ifeq_int :: (Int, Int, () => Int, () => Int) => Int = (test, expected, true, false) => { match(- test expected) { case 0 = true() case _ = false() } } factorial :: (Int) => Int = { go = (in) => ifeq_int(in, 1, () => 1, () => { * in go(- in 1) }) go } }
Not much difference. Whitespace is canonicalised, and binary operations are printed in prefix notation.
The parsed AST is translated into an Intermediate Representation.
{ lang.crash :: (lang.String) -> lang.String = (message_$1 :: lang.String) => message_$1 test.i2s.main :: () -> lang.String = () => "value: " + test.i2s.intToString(test.i2s.factorial(test.i2s.input)) test.i2s.input :: lang.Int = 10 test.i2s.fib :: (lang.Int) -> lang.Int = (in_$2 :: lang.Int) => test.i2s.ifeq_int(in_$2, 0, () => 1, () => test.i2s.ifeq_int(in_$2, 1, () => 1, () => test.i2s.fib(in_$2 - 1) + test.i2s.fib(in_$2 - 2))) test.i2s.intToString :: (lang.Int) -> lang.String = (in_$5 :: lang.Int) => test.i2s.if0_str(in_$5, () => "0", () => let div10_$6 = in_$5 / 10 digit_$7 = in_$5 - div10_$6 * 10 digitStr_$8 = test.i2s.digitToString(digit_$7) in test.i2s.if0_str(div10_$6, () => digitStr_$8, () => test.i2s.intToString(div10_$6) + digitStr_$8)) test.i2s.digitToString :: (lang.Int) -> lang.String = (in_$10 :: lang.Int) => match(in_$10) { case 0 = "0" case 1 = "1" case 2 = "2" case 3 = "3" case 4 = "4" case 5 = "5" case 6 = "6" case 7 = "7" case 8 = "8" case 9 = "9" } test.i2s.if0_str :: (lang.Int, () -> lang.String, () -> lang.String) -> lang.String = (test_$11 :: lang.Int, true_$12 :: () -> lang.String, false_$13 :: () -> lang.String) => test.i2s.ifeq_str(test_$11, 0, true_$12, false_$13) test.i2s.ifeq_str :: (lang.Int, lang.Int, () -> lang.String, () -> lang.String) -> lang.String = (test_$14 :: lang.Int, expected_$15 :: lang.Int, true_$16 :: () -> lang.String, false_$17 :: () -> lang.String) => match(test_$14 - expected_$15) { case 0 = true_$16() case _ = false_$17() } test.i2s.if0_int :: (lang.Int, () -> lang.Int, () -> lang.Int) -> lang.Int = (test_$19 :: lang.Int, true_$20 :: () -> lang.Int, false_$21 :: () -> lang.Int) => test.i2s.ifeq_int(test_$19, 0, true_$20, false_$21) test.i2s.ifeq_int :: (lang.Int, lang.Int, () -> lang.Int, () -> lang.Int) -> lang.Int = (test_$22 :: lang.Int, expected_$23 :: lang.Int, true_$24 :: () -> lang.Int, false_$25 :: () -> lang.Int) => match(test_$22 - expected_$23) { case 0 = true_$24() case _ = false_$25() } test.i2s.factorial :: (?) -> ? = letrec go_$27 = (in_$28 :: lang.Int) => test.i2s.ifeq_int(in_$28, 1, () => 1, () => in_$28 * go_$27(in_$28 - 1)) in go_$27 }
Now there's some difference. The translator qualifies all identifiers and assigns types to each expression. All the local bindings are addressed by a numeric id, though they still have a label for readability's sake. All the code blocks {x = ...}
have been replaced with let/letrec expressions. They're the same thing, except a let expression is guaranteed to return something.
Next comes the optimiser. The optimiser was my favourite part, and I spent hours writing passes that tried to inline and simplify. It's the most magical and least understood part of the system.
This is the list of passes today:
val RepeatedPasses: List[OptimiserPass] = List( new InlineGlobalBindingsPass, new RemoveDeadGlobalBindingsPass, new OptimiseLocalAliasesPass, new InlineImmediateFunctionsPass, new InlineLocalBindingsPass, new SimplifyUselessArithmeticPass, new InlineImmediateFunctionsPass, new ApplyKnownMatchesPass, new RemoveDeadLocalCodePass, new ExplodeComplexOperationsPass, new FlattenBadlyNestedLetsPass )
-
InlineGlobalBindings: for a given top-level binding, inline all referenced global bindings (except itself) as local declarations. Several iterations of this results in everything ending up in
main
. -
RemoveDeadGlobalBindings: removes any top-level binding not referenced by
main
. -
OptimiseLocalAliases: replaces things like
let x = y in x + x
(wherey
is a variable, not an arbitrary expression) withlet x = y in y + y
(iex
is now unused). It's important for function inlining (actually it's probably redundant considering InlineLocalBindings). -
InlineImmediateFunctions: replaces things like
(\x => x + x)(y)
withlet x = y in x + x
. Depends on InlineLocalBindings to finish the job. -
InlineLocalBindings: replaces things like
let x = 4 + 5 in x + x
with(4 + 5) + (4 + 5)
. -
SimplifyUselessArithmetic: replaces things like
x + 0
withx
. -
ApplyKnownMatches: replaces
match(2) { case 2 => x }
withx
. Tries to simplify a match expression when the input is known at compile-time. - RemoveDeadLocalCode: removes local bindings that aren't referenced.
- ExplodeComplexOperations: here for the awesomeness of generating heaps of temporary variables. It takes a complicated binary, function-call, or match expression and explodes its arguments into a let block.
-
FlattenBadlyNestedLetsPass: here for the aesthetics of debug output. Replaces
let x = let y = a in b in c
with
let y = a in let x = b in c
which gets pretty-printed as
let y = a x = b in c
In effect, flattening the lets.
The list RepeatedPasses
gets run over the IR repeatedly until the IR stops changing or we've hit 5 iterations. In particular, I don't have any limits on inlining, so it hits the maximum frequently.
The program, optimised:
{ test.i2s.main :: () -> lang.String = () => letrec test.i2s.intToString_$42 = (in_$43 :: lang.Int) => match(in_$43) { case 0 = "0" case _ = let _520_$520 = in_$43 / 10 in match(_520_$520) { case 0 = match(in_$43 - in_$43 / 10 * 10) { case 0 = "0" case 1 = "1" case 2 = "2" case 3 = "3" case 4 = "4" case 5 = "5" case 6 = "6" case 7 = "7" case 8 = "8" case 9 = "9" } case _ = test.i2s.intToString_$42(in_$43 / 10) + match(in_$43 - in_$43 / 10 * 10) { case 0 = "0" case 1 = "1" case 2 = "2" case 3 = "3" case 4 = "4" case 5 = "5" case 6 = "6" case 7 = "7" case 8 = "8" case 9 = "9" } } } in letrec go_$327 = (in_$328 :: lang.Int) => let _523_$523 = in_$328 - 1 in match(_523_$523) { case 0 = 1 case _ = in_$328 * go_$327(in_$328 - 1) } in let _524_$524 = go_$327(10) _525_$525 = test.i2s.intToString_$42(_524_$524) in "value: " + _525_$525 }
The program is somewhat shorter, and it's all been inlined into main. The fib
function and other unused code is gone. We're left with three main parts: the recursive intToString
, the recursive go
function from factorial
, and the original main
function.
After optimising, the IR is translated in a highlevel bytecode format. The VM is fully register-based.
{ main = @G0 ===== DATA TYPES ===== ===== STATIC BINDINGS ===== @G0 = @F0 ===== FUNCTIONS ===== @F0 () => (() => lang.String) { LVT { . #0 :: () => lang.String } CODE { 0000 #0 = @F1<> 0001 return #0 :: () => lang.String } } @F1 () => lang.String { LVT { . #test.i2s.intToString :: (lang.Int) => lang.String, . #go :: (lang.Int) => lang.Int, . #_524 :: lang.Int, . #3 :: lang.Int, . #_525 :: lang.String, . #5 :: lang.String, . #6 :: lang.String } CODE { 0000 #test.i2s.intToString = @F2<#test.i2s.intToString> 0001 #go = @F3<#go> 0002 #3 = 10 0003 #_524 = #go(#3) 0004 #_525 = #test.i2s.intToString(#_524) 0005 #5 = "value: " 0006 #6 = #5 + #_525 :: String 0007 return #6 :: lang.String } } @F2 (lang.Int) => lang.String { LVT { c #test.i2s.intToString :: (lang.Int) => lang.String, a #in :: lang.Int, . #2 :: lang.String, . #3 :: lang.Int, . #_520 :: lang.Int, . #5 :: lang.Int, . #6 :: lang.Int, . #7 :: lang.Int, . #8 :: lang.Int, . #9 :: lang.Int, . #10 :: lang.Int, . #11 :: lang.Int, . #12 :: lang.Int, . #13 :: lang.Int, . #14 :: lang.Int, . #15 :: lang.Int, . #16 :: lang.Int, . #17 :: lang.Int, . #18 :: lang.Int, . #19 :: lang.Int, . #20 :: lang.Int, . #21 :: lang.Int, . #22 :: lang.String, . #23 :: lang.Int, . #24 :: lang.Int, . #25 :: lang.Int, . #26 :: lang.Int, . #27 :: lang.Int, . #28 :: lang.Int, . #29 :: lang.Int, . #30 :: lang.String, . #31 :: lang.Int, . #32 :: lang.Int, . #33 :: lang.Int, . #34 :: lang.Int, . #35 :: lang.Int, . #36 :: lang.Int, . #37 :: lang.Int, . #38 :: lang.Int, . #39 :: lang.Int, . #40 :: lang.Int } CODE { 0000 mg37v_case0 #3 = 0 0001 if #in != #3: jump @mg37v_case1 0002 #2 = "0" 0003 jump @mg37v_end 0004 mg37v_case1 #5 = 10 0005 #_520 = #in / #5 0006 mqcdg_case0 #6 = 0 0007 if #_520 != #6: jump @mqcdg_case1 0008 #7 = 10 0009 #8 = #in / #7 0010 #9 = 10 0011 #10 = #8 * #9 0012 #11 = #in - #10 0013 m2p9w_case0 #12 = 0 0014 if #11 != #12: jump @m2p9w_case1 0015 #2 = "0" 0016 jump @m2p9w_end 0017 m2p9w_case1 #13 = 1 0018 if #11 != #13: jump @m2p9w_case2 0019 #2 = "1" 0020 jump @m2p9w_end 0021 m2p9w_case2 #14 = 2 0022 if #11 != #14: jump @m2p9w_case3 0023 #2 = "2" 0024 jump @m2p9w_end 0025 m2p9w_case3 #15 = 3 0026 if #11 != #15: jump @m2p9w_case4 0027 #2 = "3" 0028 jump @m2p9w_end 0029 m2p9w_case4 #16 = 4 0030 if #11 != #16: jump @m2p9w_case5 0031 #2 = "4" 0032 jump @m2p9w_end 0033 m2p9w_case5 #17 = 5 0034 if #11 != #17: jump @m2p9w_case6 0035 #2 = "5" 0036 jump @m2p9w_end 0037 m2p9w_case6 #18 = 6 0038 if #11 != #18: jump @m2p9w_case7 0039 #2 = "6" 0040 jump @m2p9w_end 0041 m2p9w_case7 #19 = 7 0042 if #11 != #19: jump @m2p9w_case8 0043 #2 = "7" 0044 jump @m2p9w_end 0045 m2p9w_case8 #20 = 8 0046 if #11 != #20: jump @m2p9w_case9 0047 #2 = "8" 0048 jump @m2p9w_end 0049 m2p9w_case9 #21 = 9 0050 if #11 != #21: jump @m2p9w_nomatch 0051 #2 = "9" 0052 jump @m2p9w_end 0053 m2p9w_nomatch crash "Match failed" 0054 m2p9w_end jump @mg37v_end 0055 mqcdg_case1 #23 = 10 0056 #24 = #in / #23 0057 #22 = #test.i2s.intToString(#24) 0058 #25 = 10 0059 #26 = #in / #25 0060 #27 = 10 0061 #28 = #26 * #27 0062 #29 = #in - #28 0063 mysu8_case0 #31 = 0 0064 if #29 != #31: jump @mysu8_case1 0065 #30 = "0" 0066 jump @mysu8_end 0067 mysu8_case1 #32 = 1 0068 if #29 != #32: jump @mysu8_case2 0069 #30 = "1" 0070 jump @mysu8_end 0071 mysu8_case2 #33 = 2 0072 if #29 != #33: jump @mysu8_case3 0073 #30 = "2" 0074 jump @mysu8_end 0075 mysu8_case3 #34 = 3 0076 if #29 != #34: jump @mysu8_case4 0077 #30 = "3" 0078 jump @mysu8_end 0079 mysu8_case4 #35 = 4 0080 if #29 != #35: jump @mysu8_case5 0081 #30 = "4" 0082 jump @mysu8_end 0083 mysu8_case5 #36 = 5 0084 if #29 != #36: jump @mysu8_case6 0085 #30 = "5" 0086 jump @mysu8_end 0087 mysu8_case6 #37 = 6 0088 if #29 != #37: jump @mysu8_case7 0089 #30 = "6" 0090 jump @mysu8_end 0091 mysu8_case7 #38 = 7 0092 if #29 != #38: jump @mysu8_case8 0093 #30 = "7" 0094 jump @mysu8_end 0095 mysu8_case8 #39 = 8 0096 if #29 != #39: jump @mysu8_case9 0097 #30 = "8" 0098 jump @mysu8_end 0099 mysu8_case9 #40 = 9 0100 if #29 != #40: jump @mysu8_nomatch 0101 #30 = "9" 0102 jump @mysu8_end 0103 mysu8_nomatch crash "Match failed" 0104 mysu8_end #2 = #22 + #30 :: String 0105 mg37v_end return #2 :: lang.String } } @F3 (lang.Int) => lang.Int { LVT { c #go :: (lang.Int) => lang.Int, a #in :: lang.Int, . #_523 :: lang.Int, . #3 :: lang.Int, . #4 :: lang.Int, . #5 :: lang.Int, . #6 :: lang.Int, . #7 :: lang.Int, . #8 :: lang.Int } CODE { 0000 #3 = 1 0001 #_523 = #in - #3 0002 m8jew_case0 #5 = 0 0003 if #_523 != #5: jump @m8jew_case1 0004 #4 = 1 0005 jump @m8jew_end 0006 m8jew_case1 #7 = 1 0007 #8 = #in - #7 0008 #6 = #go(#8) 0009 #4 = #in * #6 0010 m8jew_end return #4 :: lang.Int } } }
The "bytecode" is a set of instructions grouped by function. Each instruction is pretty-printed in a pseudo-code sort of way so it's not that hard to read, although match expressions compile into massive unwieldy if-then-jump tables.
Each function has a Local Variable Table that details how many registers it will use and their types. Closure and function arguments get passed in through the first few registers. Any function can have closure (most do), so to call one, you must first instantiate it with its closure. The function value can be passed around and will hold onto that closure.
If all user functions are instantiated, how does the runtime call main
? Well, there's the concept of global bindings. A global binding is a lazily initialised value that use a special function (one that doesn't need instantiation) to create its value. When called upon, it calls that function and persists the result. The code that asked for its value (and subsequent askings) will use that result.
In this example, main
is defined to be @G0. @G0 uses @F0 to instantiate the function @F1, and the interpreter invokes the result to run the program.
@F2 is the recursive intToString
, and @F3 is the recursive go
.
And the result is of course:
result :: lang.String = "value: 3628800"
Conclusion
There's a lot more I could say. The project took a lot of different turns to get here, depending on what articles I'd been reading or which language I was fascinated with at the time (right now it's Haskell). There were plenty of dead ends --- I rewrote the typechecker several times because it kept breaking. The latest branch tried a different optimisation technique that failed because I forgot about closure (...and I can't be bothered to handle it. Closure complicates everything).
I occasionally have dreams of the superoptimiser, eg: Why can't the optimiser simplify the factorial program down to "value: 3628800"
?
Answer: recursion is hard to optimise, even if you have a deep CS/mathematics degree (I do not).
I've wanted to write a last stage that targets LLVM bitcode, but I never found the time (or desire).
And of course, I want to implement actors. But they're a lot of work too...
This was a fun project, and I learned a lot. If I ever do any more language design, however, I'll probably scrap the whole thing and start again, without the piled up twists and turns. And in a language less verbose than Scala.