Programming language experiments

Lachlan Phillips

2017-01-18 10:43

This details a project called language_design that I started September 2015 and worked on sporadically ever since. It was written in Scala.

The initial aim was to build a language that mixed aspects of Scala and Erlang. It would be strongly & statically typed, have actors, and be roughly functional. Functional in the sense that immutability was encouraged and functions were used everywhere, but purity wasn't a big concern.

It had a feature where, given a function fun lines(reader: Reader): List[String] in scope and an r: Reader, you could call r.lines(). The point was to feel like objects (like Java), even the language didn't have any. That was cool.

Actors would be typed, and could implement interfaces. Actors could only be sent messages it could handle. Example: an actor could implement the Sink[T] interface, handle message Next(item: Int), then be used wherever a actor Sink[T] was needed.

Here's an example of desired syntax from the first commit to the project:

module me.lachlanap.file_average {
  process Main {
    handle Run(args: List[String]) = {
      val out = \Sink {
        handle Next(item: Int) = {
          println("Average: " + item)
        }
      };

      val accumulator = new Accumulator(out);
      val filename = args.get(0);

      val reader = new Reader(filename, accumulator);
      reader.watch(error =>
        println("Error reading file: " + error);
        crash(error);
      );

      reader ! Execute;
      // desugared to: reader.!(reader.T.Execute);
      //         then: !(reader, reader.T.Execute);

      // TODO: Garbage collect dropped processes?
    }
  }

  process Reader(filename: String, out: CloseableSink) {
    handle Execute = {
      use flat lang.io;
      val reader = openReader(filename, charset.UTF_8)

      with(reader, reader => {
        reader.lines()
              .map(line => line.parseInt())
              .foreach(parse => parsed match {
          case None => crash("Failed to parse " + line);
          case Some(v) => out ! Next(v);
        });
      });

      out ! Close;
      finish();
    }
  }

  process Accumulator(out: Sink[Int]): CloseableSink[Int] {
    var total = 0;
    var count = 0;

    handle Next(item: Int) = {
      total += item;
      count++;
    }

    handle Close = {
      out ! Next(total / count);
      finish();
    }
  }
}

I built my own set of combinator parsers to parse this... and quickly replaced them with FastParse. I managed to parse most of the language (although operator precendence is not a thing).

I didn't get very far with that language. I never got to generics. Or actors. Or data types beyond Strings and Ints.

The language today is very different.

module lang {
  data Bool = True | False
  data Unit = Unit

  crash :: String => String
   = message => message
}

module test.i2s {
  main :: () => String
   = () => "value: " + intToString(factorial(input))

  input :: Int = 10

  fib :: Int => Int
   = in =>
    ifeq_int(in, 0,
      () => 1,
      () => {
        ifeq_int(in, 1,
          () => 1,
          () => {
            fib(in - 1) + fib(in - 2)
          })
      })

  intToString :: Int => String
   = in =>
    if0_str(in,
      () => "0",
      () => {
        div10 = in / 10
        digit = in - (div10 * 10)
        digitStr = digitToString(digit)

        if0_str(div10,
          () => digitStr,
          () => intToString(div10) + digitStr)
      })


  digitToString :: Int => String
   = in => match(in) {
    case 0 = "0"
    case 1 = "1"
    case 2 = "2"
    case 3 = "3"
    case 4 = "4"
    case 5 = "5"
    case 6 = "6"
    case 7 = "7"
    case 8 = "8"
    case 9 = "9"
  }


  if0_str :: (Int, () => String, () => String) => String
   = (test, true, false) => ifeq_str(test, 0, true, false)

  ifeq_str :: (Int, Int, () => String, () => String) => String
   = (test, expected, true, false) => {
    match(test - expected) {
      case 0 = true()
      case _ = false()
    }
  }

  if0_int :: (Int, () => Int, () => Int) => Int
   = (test, true, false) => ifeq_int(test, 0, true, false)

  ifeq_int :: (Int, Int, () => Int, () => Int) => Int
   = (test, expected, true, false) => {
    match(test - expected) {
      case 0 = true()
      case _ = false()
    }
  }

  factorial :: Int => Int
   = {
       go = in =>
              ifeq_int(in, 1,
                 () => 1,
                 () => {
                   in * go(in - 1)
                 })
       go
     }
}

This is the language today. It's purely functional. Functions are defined exclusively with lambdas. It has no generics, hence ifeq_str and ifeq_int. There is no integer-to-string function, so I implement my own --- and the only comparison operator is the match expression --- booleans aren't a thing yet. The only way to get output is to return the result from main.

But it actually works. The example above prints:

result :: lang.String = "value: 3628800"

How it works

The source code above is parsed into:

module lang {
  data Bool = True() | False()
  data Unit = Unit()

  crash :: (String) => String
   = (message) => message
}
module test.i2s {
  main :: () => String
   = () => + "value: " intToString(factorial(input))

  input :: Int
   = 10

  fib :: (Int) => Int
   = (in) => ifeq_int(in, 0, () => 1, () => {
    ifeq_int(in, 1, () => 1, () => {
      + fib(- in 1) fib(- in 2)
    })
  })

  intToString :: (Int) => String
   = (in) => if0_str(in, () => "0", () => {
    div10 = / in 10
    digit = - in * div10 10
    digitStr = digitToString(digit)
    if0_str(div10, () => digitStr, () => + intToString(div10) digitStr)
  })

  digitToString :: (Int) => String
   = (in) => match(in) {
    case 0 = "0"
    case 1 = "1"
    case 2 = "2"
    case 3 = "3"
    case 4 = "4"
    case 5 = "5"
    case 6 = "6"
    case 7 = "7"
    case 8 = "8"
    case 9 = "9"
  }

  if0_str :: (Int, () => String, () => String) => String
   = (test, true, false) => ifeq_str(test, 0, true, false)

  ifeq_str :: (Int, Int, () => String, () => String) => String
   = (test, expected, true, false) => {
    match(- test expected) {
      case 0 = true()
      case _ = false()
    }
  }

  if0_int :: (Int, () => Int, () => Int) => Int
   = (test, true, false) => ifeq_int(test, 0, true, false)

  ifeq_int :: (Int, Int, () => Int, () => Int) => Int
   = (test, expected, true, false) => {
    match(- test expected) {
      case 0 = true()
      case _ = false()
    }
  }

  factorial :: (Int) => Int
   = {
    go = (in) => ifeq_int(in, 1, () => 1, () => {
      * in go(- in 1)
    })

    go
  }
}

Not much difference. Whitespace is canonicalised, and binary operations are printed in prefix notation.

The parsed AST is translated into an Intermediate Representation.

{
  lang.crash :: (lang.String) -> lang.String
   = (message_$1 :: lang.String) => message_$1

  test.i2s.main :: () -> lang.String
   = () => "value: " + test.i2s.intToString(test.i2s.factorial(test.i2s.input))

  test.i2s.input :: lang.Int
   = 10

  test.i2s.fib :: (lang.Int) -> lang.Int
   = (in_$2 :: lang.Int) => test.i2s.ifeq_int(in_$2, 0, () => 1, () => test.i2s.ifeq_int(in_$2, 1, () => 1, () => test.i2s.fib(in_$2 - 1) + test.i2s.fib(in_$2 - 2)))

  test.i2s.intToString :: (lang.Int) -> lang.String
   = (in_$5 :: lang.Int) =>
    test.i2s.if0_str(in_$5, () => "0", () =>
      let
        div10_$6 = in_$5 / 10
        digit_$7 = in_$5 - div10_$6 * 10
        digitStr_$8 = test.i2s.digitToString(digit_$7)
      in
        test.i2s.if0_str(div10_$6, () => digitStr_$8, () => test.i2s.intToString(div10_$6) + digitStr_$8))

  test.i2s.digitToString :: (lang.Int) -> lang.String
   = (in_$10 :: lang.Int) =>
    match(in_$10) {
      case 0 = "0"
      case 1 = "1"
      case 2 = "2"
      case 3 = "3"
      case 4 = "4"
      case 5 = "5"
      case 6 = "6"
      case 7 = "7"
      case 8 = "8"
      case 9 = "9"
    }

  test.i2s.if0_str :: (lang.Int, () -> lang.String, () -> lang.String) -> lang.String
   = (test_$11 :: lang.Int, true_$12 :: () -> lang.String, false_$13 :: () -> lang.String) => test.i2s.ifeq_str(test_$11, 0, true_$12, false_$13)

  test.i2s.ifeq_str :: (lang.Int, lang.Int, () -> lang.String, () -> lang.String) -> lang.String
   = (test_$14 :: lang.Int, expected_$15 :: lang.Int, true_$16 :: () -> lang.String, false_$17 :: () -> lang.String) =>
    match(test_$14 - expected_$15) {
      case 0 = true_$16()
      case _ = false_$17()
    }

  test.i2s.if0_int :: (lang.Int, () -> lang.Int, () -> lang.Int) -> lang.Int
   = (test_$19 :: lang.Int, true_$20 :: () -> lang.Int, false_$21 :: () -> lang.Int) => test.i2s.ifeq_int(test_$19, 0, true_$20, false_$21)

  test.i2s.ifeq_int :: (lang.Int, lang.Int, () -> lang.Int, () -> lang.Int) -> lang.Int
   = (test_$22 :: lang.Int, expected_$23 :: lang.Int, true_$24 :: () -> lang.Int, false_$25 :: () -> lang.Int) =>
    match(test_$22 - expected_$23) {
      case 0 = true_$24()
      case _ = false_$25()
    }

  test.i2s.factorial :: (?) -> ?
   = letrec
    go_$27 = (in_$28 :: lang.Int) => test.i2s.ifeq_int(in_$28, 1, () => 1, () => in_$28 * go_$27(in_$28 - 1))
  in
    go_$27
}

Now there's some difference. The translator qualifies all identifiers and assigns types to each expression. All the local bindings are addressed by a numeric id, though they still have a label for readability's sake. All the code blocks {x = ...} have been replaced with let/letrec expressions. They're the same thing, except a let expression is guaranteed to return something.

Next comes the optimiser. The optimiser was my favourite part, and I spent hours writing passes that tried to inline and simplify. It's the most magical and least understood part of the system.

This is the list of passes today:

val RepeatedPasses: List[OptimiserPass] = List(
  new InlineGlobalBindingsPass,
  new RemoveDeadGlobalBindingsPass,
  new OptimiseLocalAliasesPass,
  new InlineImmediateFunctionsPass,
  new InlineLocalBindingsPass,
  new SimplifyUselessArithmeticPass,
  new InlineImmediateFunctionsPass,
  new ApplyKnownMatchesPass,
  new RemoveDeadLocalCodePass,
  new ExplodeComplexOperationsPass,
  new FlattenBadlyNestedLetsPass
)

InlineGlobalBindings: for a given top-level binding, inline all referenced global bindings (except itself) as local declarations. Several iterations of this results in everything ending up in main.
RemoveDeadGlobalBindings: removes any top-level binding not referenced by main.
OptimiseLocalAliases: replaces things like let x = y in x + x (where y is a variable, not an arbitrary expression) with let x = y in y + y (ie x is now unused). It's important for function inlining (actually it's probably redundant considering InlineLocalBindings).
InlineImmediateFunctions: replaces things like (\x => x + x)(y) with let x = y in x + x. Depends on InlineLocalBindings to finish the job.
InlineLocalBindings: replaces things like let x = 4 + 5 in x + x with (4 + 5) + (4 + 5).
SimplifyUselessArithmetic: replaces things like x + 0 with x.
ApplyKnownMatches: replaces match(2) { case 2 => x } with x. Tries to simplify a match expression when the input is known at compile-time.
RemoveDeadLocalCode: removes local bindings that aren't referenced.
ExplodeComplexOperations: here for the awesomeness of generating heaps of temporary variables. It takes a complicated binary, function-call, or match expression and explodes its arguments into a let block.
FlattenBadlyNestedLetsPass: here for the aesthetics of debug output. Replaces let x = let y = a in b in c

with

let y = a in let x = b in c

which gets pretty-printed as

let y = a x = b in c

In effect, flattening the lets.

The list RepeatedPasses gets run over the IR repeatedly until the IR stops changing or we've hit 5 iterations. In particular, I don't have any limits on inlining, so it hits the maximum frequently.

The program, optimised:

{
  test.i2s.main :: () -> lang.String
   = () =>
    letrec
      test.i2s.intToString_$42 =
        (in_$43 :: lang.Int) =>
          match(in_$43) {
            case 0 = "0"

            case _ = let
              _520_$520 = in_$43 / 10
            in
              match(_520_$520) {
                case 0 = match(in_$43 - in_$43 / 10 * 10) {
                  case 0 = "0"
                  case 1 = "1"
                  case 2 = "2"
                  case 3 = "3"
                  case 4 = "4"
                  case 5 = "5"
                  case 6 = "6"
                  case 7 = "7"
                  case 8 = "8"
                  case 9 = "9"
                }

                case _ = test.i2s.intToString_$42(in_$43 / 10) + match(in_$43 - in_$43 / 10 * 10) {
                  case 0 = "0"
                  case 1 = "1"
                  case 2 = "2"
                  case 3 = "3"
                  case 4 = "4"
                  case 5 = "5"
                  case 6 = "6"
                  case 7 = "7"
                  case 8 = "8"
                  case 9 = "9"
                }
              }
          }
    in
      letrec
        go_$327 =
          (in_$328 :: lang.Int) =>
            let
              _523_$523 = in_$328 - 1
            in
              match(_523_$523) {
                case 0 = 1
                case _ = in_$328 * go_$327(in_$328 - 1)
              }
      in
        let
          _524_$524 = go_$327(10)
          _525_$525 = test.i2s.intToString_$42(_524_$524)
        in
          "value: " + _525_$525
}

The program is somewhat shorter, and it's all been inlined into main. The fib function and other unused code is gone. We're left with three main parts: the recursive intToString, the recursive go function from factorial, and the original main function.

After optimising, the IR is translated in a highlevel bytecode format. The VM is fully register-based.

{
  main = @G0

  ===== DATA TYPES =====

  ===== STATIC BINDINGS =====
  @G0 = @F0

  ===== FUNCTIONS =====
  @F0 () => (() => lang.String) {
    LVT {
      . #0 :: () => lang.String
    }
    CODE {
      0000 #0 = @F1<>
      0001 return #0 :: () => lang.String
    }
  }
  @F1 () => lang.String {
    LVT {
      . #test.i2s.intToString :: (lang.Int) => lang.String,
      . #go :: (lang.Int) => lang.Int,
      . #_524 :: lang.Int,
      . #3 :: lang.Int,
      . #_525 :: lang.String,
      . #5 :: lang.String,
      . #6 :: lang.String
    }
    CODE {
      0000 #test.i2s.intToString = @F2<#test.i2s.intToString>
      0001 #go = @F3<#go>
      0002 #3 = 10
      0003 #_524 = #go(#3)
      0004 #_525 = #test.i2s.intToString(#_524)
      0005 #5 = "value: "
      0006 #6 = #5 + #_525 :: String
      0007 return #6 :: lang.String
    }
  }
  @F2 (lang.Int) => lang.String {
    LVT {
      c #test.i2s.intToString :: (lang.Int) => lang.String,
      a #in :: lang.Int,
      . #2 :: lang.String,
      . #3 :: lang.Int,
      . #_520 :: lang.Int,
      . #5 :: lang.Int,
      . #6 :: lang.Int,
      . #7 :: lang.Int,
      . #8 :: lang.Int,
      . #9 :: lang.Int,
      . #10 :: lang.Int,
      . #11 :: lang.Int,
      . #12 :: lang.Int,
      . #13 :: lang.Int,
      . #14 :: lang.Int,
      . #15 :: lang.Int,
      . #16 :: lang.Int,
      . #17 :: lang.Int,
      . #18 :: lang.Int,
      . #19 :: lang.Int,
      . #20 :: lang.Int,
      . #21 :: lang.Int,
      . #22 :: lang.String,
      . #23 :: lang.Int,
      . #24 :: lang.Int,
      . #25 :: lang.Int,
      . #26 :: lang.Int,
      . #27 :: lang.Int,
      . #28 :: lang.Int,
      . #29 :: lang.Int,
      . #30 :: lang.String,
      . #31 :: lang.Int,
      . #32 :: lang.Int,
      . #33 :: lang.Int,
      . #34 :: lang.Int,
      . #35 :: lang.Int,
      . #36 :: lang.Int,
      . #37 :: lang.Int,
      . #38 :: lang.Int,
      . #39 :: lang.Int,
      . #40 :: lang.Int
    }
    CODE {
      0000 mg37v_case0    #3 = 0
      0001                if #in != #3: jump @mg37v_case1
      0002                #2 = "0"
      0003                jump @mg37v_end
      0004 mg37v_case1    #5 = 10
      0005                #_520 = #in / #5
      0006 mqcdg_case0    #6 = 0
      0007                if #_520 != #6: jump @mqcdg_case1
      0008                #7 = 10
      0009                #8 = #in / #7
      0010                #9 = 10
      0011                #10 = #8 * #9
      0012                #11 = #in - #10
      0013 m2p9w_case0    #12 = 0
      0014                if #11 != #12: jump @m2p9w_case1
      0015                #2 = "0"
      0016                jump @m2p9w_end
      0017 m2p9w_case1    #13 = 1
      0018                if #11 != #13: jump @m2p9w_case2
      0019                #2 = "1"
      0020                jump @m2p9w_end
      0021 m2p9w_case2    #14 = 2
      0022                if #11 != #14: jump @m2p9w_case3
      0023                #2 = "2"
      0024                jump @m2p9w_end
      0025 m2p9w_case3    #15 = 3
      0026                if #11 != #15: jump @m2p9w_case4
      0027                #2 = "3"
      0028                jump @m2p9w_end
      0029 m2p9w_case4    #16 = 4
      0030                if #11 != #16: jump @m2p9w_case5
      0031                #2 = "4"
      0032                jump @m2p9w_end
      0033 m2p9w_case5    #17 = 5
      0034                if #11 != #17: jump @m2p9w_case6
      0035                #2 = "5"
      0036                jump @m2p9w_end
      0037 m2p9w_case6    #18 = 6
      0038                if #11 != #18: jump @m2p9w_case7
      0039                #2 = "6"
      0040                jump @m2p9w_end
      0041 m2p9w_case7    #19 = 7
      0042                if #11 != #19: jump @m2p9w_case8
      0043                #2 = "7"
      0044                jump @m2p9w_end
      0045 m2p9w_case8    #20 = 8
      0046                if #11 != #20: jump @m2p9w_case9
      0047                #2 = "8"
      0048                jump @m2p9w_end
      0049 m2p9w_case9    #21 = 9
      0050                if #11 != #21: jump @m2p9w_nomatch
      0051                #2 = "9"
      0052                jump @m2p9w_end
      0053 m2p9w_nomatch  crash "Match failed"
      0054 m2p9w_end      jump @mg37v_end
      0055 mqcdg_case1    #23 = 10
      0056                #24 = #in / #23
      0057                #22 = #test.i2s.intToString(#24)
      0058                #25 = 10
      0059                #26 = #in / #25
      0060                #27 = 10
      0061                #28 = #26 * #27
      0062                #29 = #in - #28
      0063 mysu8_case0    #31 = 0
      0064                if #29 != #31: jump @mysu8_case1
      0065                #30 = "0"
      0066                jump @mysu8_end
      0067 mysu8_case1    #32 = 1
      0068                if #29 != #32: jump @mysu8_case2
      0069                #30 = "1"
      0070                jump @mysu8_end
      0071 mysu8_case2    #33 = 2
      0072                if #29 != #33: jump @mysu8_case3
      0073                #30 = "2"
      0074                jump @mysu8_end
      0075 mysu8_case3    #34 = 3
      0076                if #29 != #34: jump @mysu8_case4
      0077                #30 = "3"
      0078                jump @mysu8_end
      0079 mysu8_case4    #35 = 4
      0080                if #29 != #35: jump @mysu8_case5
      0081                #30 = "4"
      0082                jump @mysu8_end
      0083 mysu8_case5    #36 = 5
      0084                if #29 != #36: jump @mysu8_case6
      0085                #30 = "5"
      0086                jump @mysu8_end
      0087 mysu8_case6    #37 = 6
      0088                if #29 != #37: jump @mysu8_case7
      0089                #30 = "6"
      0090                jump @mysu8_end
      0091 mysu8_case7    #38 = 7
      0092                if #29 != #38: jump @mysu8_case8
      0093                #30 = "7"
      0094                jump @mysu8_end
      0095 mysu8_case8    #39 = 8
      0096                if #29 != #39: jump @mysu8_case9
      0097                #30 = "8"
      0098                jump @mysu8_end
      0099 mysu8_case9    #40 = 9
      0100                if #29 != #40: jump @mysu8_nomatch
      0101                #30 = "9"
      0102                jump @mysu8_end
      0103 mysu8_nomatch  crash "Match failed"
      0104 mysu8_end      #2 = #22 + #30 :: String
      0105 mg37v_end      return #2 :: lang.String
    }
  }
  @F3 (lang.Int) => lang.Int {
    LVT {
      c #go :: (lang.Int) => lang.Int,
      a #in :: lang.Int,
      . #_523 :: lang.Int,
      . #3 :: lang.Int,
      . #4 :: lang.Int,
      . #5 :: lang.Int,
      . #6 :: lang.Int,
      . #7 :: lang.Int,
      . #8 :: lang.Int
    }
    CODE {
      0000              #3 = 1
      0001              #_523 = #in - #3
      0002 m8jew_case0  #5 = 0
      0003              if #_523 != #5: jump @m8jew_case1
      0004              #4 = 1
      0005              jump @m8jew_end
      0006 m8jew_case1  #7 = 1
      0007              #8 = #in - #7
      0008              #6 = #go(#8)
      0009              #4 = #in * #6
      0010 m8jew_end    return #4 :: lang.Int
    }
  }

}

The "bytecode" is a set of instructions grouped by function. Each instruction is pretty-printed in a pseudo-code sort of way so it's not that hard to read, although match expressions compile into massive unwieldy if-then-jump tables.

Each function has a Local Variable Table that details how many registers it will use and their types. Closure and function arguments get passed in through the first few registers. Any function can have closure (most do), so to call one, you must first instantiate it with its closure. The function value can be passed around and will hold onto that closure.

If all user functions are instantiated, how does the runtime call main? Well, there's the concept of global bindings. A global binding is a lazily initialised value that use a special function (one that doesn't need instantiation) to create its value. When called upon, it calls that function and persists the result. The code that asked for its value (and subsequent askings) will use that result.

In this example, main is defined to be @G0. @G0 uses @F0 to instantiate the function @F1, and the interpreter invokes the result to run the program.

@F2 is the recursive intToString, and @F3 is the recursive go.

And the result is of course:

result :: lang.String = "value: 3628800"

Conclusion

There's a lot more I could say. The project took a lot of different turns to get here, depending on what articles I'd been reading or which language I was fascinated with at the time (right now it's Haskell). There were plenty of dead ends --- I rewrote the typechecker several times because it kept breaking. The latest branch tried a different optimisation technique that failed because I forgot about closure (...and I can't be bothered to handle it. Closure complicates everything).

I occasionally have dreams of the superoptimiser, eg: Why can't the optimiser simplify the factorial program down to "value: 3628800"?

Answer: recursion is hard to optimise, even if you have a deep CS/mathematics degree (I do not).

I've wanted to write a last stage that targets LLVM bitcode, but I never found the time (or desire).

And of course, I want to implement actors. But they're a lot of work too...

This was a fun project, and I learned a lot. If I ever do any more language design, however, I'll probably scrap the whole thing and start again, without the piled up twists and turns. And in a language less verbose than Scala.