Blog of Tom Bentley

Implementing Method Invocation in ceylonc

My main involvement in the Ceylon project has been in the compiler, and within that, one of the things I've worked on is method invocation. So I thought I'd blog about some of the details of the compiler, to show that working on it isn't that hard.

Syntactically, Ceylon has two different ways of invoking a method (and the metamodel will add a third). Positional invocation will be very familiar to a Java programmer. In Ceylon it's conceptually pretty much the same, including support for 'varargs'. In this post I'm going to go into some detail about how support for positional method invocation is implemented in the compiler. I might cover the other syntactic form, named argument invocation, at a later date.

Before we go much further I just need to define some terminology. A method is declared with zero or more parameters, the last of which may be a sequenced parameter ('varargs'):

void foo(Natural n, Integer i, String... strings) {
    // some logic
}

At a (positional) call site the method is supplied with values for each of the parameters in the declaration. These values are usually called the arguments of the method invocation.

How we generate code

In the following sections I'm going to be presenting bits of Ceylon code and the 'equivalent Java' code, but it's important to understand that the Ceylon compiler doesn't actually generate Java source code. It instead constructs an abstract syntax tree (AST) directly using the internal OpenJDK javac API. This AST is then subject to the same Java type checks as normal Java source code, before it gets converted to bytecode. You can read more about the architecture of the Ceylon compiler here.

The major benefit of piggy-backing on javac like this is that we don't have to get into the details of generating correct bytecode, such as worrying about which instruction to jump to. We can stick to higher level concepts that we're more familiar with, while we focus on actually getting something working. In the long term, it would be nice if ceylonc could be self-hosting.

Erasure

Because of the similarity with Java, supporting positional invocation in Ceylon isn't difficult: It boils down to generating a plain Java method invocation. But the two certainly aren't equivalent.

Although notionally in Ceylon 'everything is an object', the compiler is allowed to (and does) optimise some basic types (Natural, Integer, Float, Boolean) to the corresponding Java primitive types (long, long, double and boolean respectively). This means that when you write a Ceylon statement such as

Natural n = 1;

it is transformed into a Java statement like this

final long n = 1;

We call this 'erasure' (yes, I know erasure has another meaning in Java to do with the loss of generic type information, but it's the term we use).

Erasure in itself wouldn't cause a problem for method invocation because the method parameters are subject to erasure just as the method arguments are. However, sometimes we need to 'box' the primitive, just like Java does.

A good example of this is passing a Natural argument to a parameter declared Natural?. To cope with the possibility of being passed a null, the Java method declaration must use a boxed type (Natural from the runtime) as opposed to the Java primitive (long) it would otherwise be erased to. This means it is the compiler's responsibility to box the erased Natural (a Java long) at the call site.

Ceylon uses its own boxing classes in the runtime version of the language module. Each class implements the API of the relevant type. Because Ceylon doesn't use the same classes to box primitives as Java does we can't rely on javac's auto boxing/unboxing support. Performing this boxing correctly and exactly when and where it's needed is where method invocation starts to get a little more complex than simply being 'A Ceylon method invocation is the same as a Java method invocation'.
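To make the call-site boxing concrete, here is a minimal sketch in plain Java. The class BoxedNatural, its instance() factory and the describe() method are hypothetical stand-ins I've made up for illustration; they are not the real runtime classes, and the real generated code may differ.

```java
class BoxingSketch {
    // Ceylon: String describe(Natural? n)
    // The parameter is optional, so it cannot be erased to a Java long.
    static String describe(BoxedNatural n) {
        return n == null ? "nothing" : "natural " + n.longValue();
    }

    public static void main(String[] args) {
        // Ceylon: Natural n = 1;  (erased to a primitive)
        final long n = 1;
        // Ceylon: describe(n);  the compiler has to insert the boxing call itself
        System.out.println(describe(BoxedNatural.instance(n)));
        // Ceylon: describe(null);  no boxing needed
        System.out.println(describe(null));
    }
}

// Hypothetical stand-in for a runtime boxing class (assumption, not the real API).
final class BoxedNatural {
    private final long value;

    private BoxedNatural(long value) {
        this.value = value;
    }

    // Factory that generated code would call to box an erased value.
    static BoxedNatural instance(long value) {
        return new BoxedNatural(value);
    }

    long longValue() {
        return value;
    }
}
```

The point is simply that the wrapping call appears explicitly in the generated tree, because javac would not know how to auto-box a long into a class it has never heard of.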

Varargs

Varargs isn't implemented in terms of Java's varargs support. The reason in this case is that a Ceylon sequence is not the same thing as a Java array. So when someone declares a Ceylon method like this

void varargs<T>(T... ts) {

}

the equivalent Java looks something like this

<T> void varargs(Iterable<T> ts) {

}

Now consider the Ceylon call site

varargs<String>("foo", "bar", "baz");

When compiling this invocation we have to create a concrete instance of the Iterable<T> (using the arguments provided) to pass to the method. This is done using an ArraySequence (an implementation of a Ceylon Sequence in the runtime), so that the generated Java looks something like this

varargs(new ArraySequence("foo", "bar", "baz"));
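For anyone who wants to play with the idea, here is a self-contained sketch of the same shape. SimpleArraySequence is a hypothetical, much-simplified stand-in for the runtime's ArraySequence, and I'm using java.lang.Iterable rather than Ceylon's own Iterable purely so the example compiles on its own. The method returns a count only so there is something observable; the method in the text returns nothing.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class VarargsSketch {
    // Ceylon: void varargs<T>(T... ts)
    // The sequenced parameter becomes a single Iterable parameter.
    static <T> int varargs(Iterable<T> ts) {
        int count = 0;
        for (T ignored : ts) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Ceylon: varargs<String>("foo", "bar", "baz");
        // The call site wraps the arguments in a sequence before calling.
        int n = varargs(new SimpleArraySequence<String>("foo", "bar", "baz"));
        System.out.println(n);
    }
}

// Hypothetical stand-in for the runtime's ArraySequence (assumption).
final class SimpleArraySequence<T> implements Iterable<T> {
    private final List<T> elements;

    @SafeVarargs
    SimpleArraySequence(T... elements) {
        // Copy the array so later changes by the caller are not visible.
        this.elements = Arrays.asList(elements.clone());
    }

    @Override
    public Iterator<T> iterator() {
        return elements.iterator();
    }
}
```

Note that the stand-in's constructor happens to use Java varargs for convenience; that is a detail of the sketch, not a claim about how the real class is constructed.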

Aside: The observant reader will realise that using varargs with erased types creates another boxing problem...

Conclusion

So, after all that explanation, hopefully the source code should make some kind of sense.

None of what I've discussed above should be that hard to understand for anyone who's done much Java programming. I will admit at this point that I deliberately chose something that would be familiar and where the transformations between Ceylon and Java are small. This has allowed me to focus on some of the annoying-but-necessary details that are important to understand if you're going to hack on the compiler.

The take-home message is that you really don't have to know a great deal about compilers, or even the JVM to be able to contribute something genuinely useful.

Note

Since this post was originally written:

  • the Natural type has been removed from ceylon.language.
  • ceylonc has become ceylon compile.

Changed syntax highlighting

I've been writing quite a lot of documentation lately, and my previous decision to require an HTML comment in order to get syntax highlighting of Ceylon code started to look like the wrong choice. Since the vast majority of the code blocks on the site are going to be Ceylon source, and since we're (almost?) always going to want that highlighted I've changed the rules:

  • Now all indented code blocks will be assumed to be Ceylon source code that requires syntax highlighting, unless indicated otherwise.
  • If the source code is not Ceylon, you can use a <!-- lang: foo --> comment (having set up the syntax highlighter and gsub transformer to do highlighting for the language foo).
  • If you don't want any highlighting at all use <!-- lang: none -->

Improved syntax highlighting and deep linking

A couple of improvements to the site generation have been added.

Syntax highlighting

We were previously using literal <pre> blocks in markdown pages so that we could set the CSS necessary for the syntax highlighter. The big downside of this was the need to HTML-escape the code within those <pre> blocks. Using an awestruct 'transformer' it's now possible to use the more natural markdown way of including example code (i.e. using indentation), which means you don't have to worry about using HTML entities for <, & etc.

By default <pre> blocks will not be subject to highlighting. To enable highlighting simply precede the block with an HTML comment hinting at the language being included. For example

    class Foo() {
        void bar() {

        }
    }

It's also easy to have a page with syntax highlighting of multiple languages, by using an appropriate hint.

For those who are interested, the way this works is having a transformer perform a global gsub on the generated output using a suitable regular expression to pick out the code blocks generated by markdown.
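The real transformer is part of the awestruct site build, but the idea can be sketched in a few lines of Java. Everything here is an assumption made for illustration: the exact shape of the hint comment, the <pre><code> markup that markdown is assumed to emit, and the brush: CSS class that the highlighter is assumed to look for.

```java
import java.util.regex.Pattern;

class HighlightTransformer {
    // A hint comment immediately followed by a markdown-generated code block.
    // Group 1 captures the language name from the hint.
    private static final Pattern HINTED_BLOCK =
        Pattern.compile("<!--\\s*lang:\\s*(\\w+)\\s*-->\\s*<pre><code>");

    // Global substitution over the generated page: drop the hint comment and
    // move the language into a (hypothetical) CSS class on the <pre> element.
    static String transform(String html) {
        return HINTED_BLOCK.matcher(html)
                           .replaceAll("<pre class=\"brush: $1\"><code>");
    }

    public static void main(String[] args) {
        String page = "<!-- lang: ceylon -->\n<pre><code>class Foo() {}</code></pre>";
        System.out.println(transform(page));
    }
}
```

Blocks without a preceding hint don't match the pattern, so they pass through untouched, which gives the 'no highlighting by default' behaviour described above.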

Deep linking

Markdown by itself doesn't allow the author to put id attributes or anchors (<a name="...">) in the generated HTML, which is annoying if you want to link to a particular part of the page. You can of course put <a name="..."> elements in yourself, but:

  • It clutters up the nice markdown with HTML.
  • It forces you to ensure the uniqueness of the ids you use.
  • It's tedious.

Using another transformer we now generate id attributes automatically for all <h*> and <p> elements. For headings we use the heading text (suitably munged into an XML NAME). For paragraphs we use the first few words (again suitably normalized, and also worrying about uniqueness). If an element already has an id then it's untouched. If there's a matching id (or name, since they share the same namespace), steps are taken to disambiguate the generated one.
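Here is a rough Java sketch of the munging and disambiguation steps. It is not the transformer's actual code: the normalisation rules, the numeric suffix scheme and the names IdGenerator, reserve and idFor are all assumptions chosen for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class IdGenerator {
    // Every id or anchor name seen so far on the page.
    private final Set<String> used = new HashSet<>();

    // Record an id (or name) that the author wrote explicitly.
    void reserve(String existingId) {
        used.add(existingId);
    }

    // Turn heading text (or the first few words of a paragraph) into a unique id.
    String idFor(String text) {
        String base = text.toLowerCase(Locale.ROOT)
                          .replaceAll("[^a-z0-9]+", "-")  // runs of other characters become one dash
                          .replaceAll("^-+|-+$", "");     // trim leading and trailing dashes
        if (base.isEmpty()) {
            base = "section";
        } else if (Character.isDigit(base.charAt(0))) {
            base = "id-" + base;                          // an XML NAME cannot start with a digit
        }
        // Disambiguate by appending -2, -3, ... until the id is unused.
        String candidate = base;
        int suffix = 2;
        while (!used.add(candidate)) {
            candidate = base + "-" + suffix++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        IdGenerator ids = new IdGenerator();
        System.out.println(ids.idFor("Deep linking"));
        System.out.println(ids.idFor("Deep linking"));
    }
}
```

Because the id is derived from the text, the sketch also makes the fragility obvious: edit the heading and the id changes with it.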

The long and the short of this is that we can now easily and accurately link into the middle of documents. The caveat is that you should be aware that changing the first few words of a paragraph (or the text of a heading) is likely to break any links which point to it. To prevent such broken links you can always use an explicit <a name=""> at the relevant point.