Chapter 2. Lexical structure

Every Ceylon source file is a sequence of Unicode characters. Lexical analysis of the character stream, according to the grammar specified in this chapter, results in a stream of tokens. These tokens form the input of the parser grammar defined in the later chapters of this specification. The Ceylon lexer is able to completely tokenize a character stream in a single pass.

2.1. Whitespace

Whitespace is composed of strings of Unicode SPACE, CHARACTER TABULATION, FORM FEED (FF), LINE FEED (LF) and CARRIAGE RETURN (CR) characters.

Whitespace: " " | Tab | Formfeed | Newline | CarriageReturn
Tab: "\{CHARACTER TABULATION}"
Formfeed: "\{FORM FEED (FF)}"
Newline: "\{LINE FEED (LF)}"
CarriageReturn: "\{CARRIAGE RETURN (CR)}"

Outside of a comment, string literal, or single quoted literal, whitespace acts as a token separator and is immediately discarded by the lexer. Whitespace is not used as a statement separator.

Source text is divided into lines by line-terminating character sequences. The following Unicode character sequences terminate a line:

  • LINE FEED (LF),

  • CARRIAGE RETURN (CR), and

  • CARRIAGE RETURN (CR) followed by LINE FEED (LF).
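
As an illustration only, the three line terminators listed above can be recognized with a single alternation, provided the two-character sequence is tried first. The sketch is written in Python, which is not part of this specification, and the name split_lines is hypothetical.

```python
import re

# CR LF must come before a lone CR in the alternation; otherwise "\r\n"
# would be counted as two separate line terminators.
LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")

def split_lines(source):
    """Split source text into lines, dropping the line terminators."""
    return LINE_TERMINATOR.split(source)
```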

2.2. Comments

There are two kinds of comments:

  • a multiline comment begins with /* and extends until */, and

  • an end-of-line comment begins with // or #! and extends until the next line-terminating character sequence.

Both kinds of comments can be nested.

LineComment: ("//"|"#!") ~(Newline | CarriageReturn)* (CarriageReturn Newline | CarriageReturn | Newline)?
MultilineComment: "/*" (MultilineCommentCharacter | MultilineComment)* "*/"
MultilineCommentCharacter: ~("/"|"*") | ("/" ~"*") => "/" | ("*" ~"/") => "*"

The following examples are legal comments:

//this comment stops at the end of the line
/*
   but this is a comment that spans
   multiple lines
*/
#!/usr/bin/ceylon

Comments are treated as whitespace by both the compiler and documentation compiler. Comments may act as token separators, but their content is immediately discarded by the lexer and they are not visible to the parser.
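
Because the MultilineComment production is recursive, a scanner has to count nesting depth rather than stop at the first */. The following Python sketch shows one way to do that; the function name skip_multiline_comment is hypothetical and the code is not taken from any Ceylon implementation.

```python
def skip_multiline_comment(text, start):
    """Return the index just past the multiline comment that opens at text[start]."""
    if not text.startswith("/*", start):
        raise ValueError("no multiline comment at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # a nested comment opens
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # the innermost open comment closes
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated multiline comment")
```

Each nested /* needs its own */, so the text "/* /* */" is reported as unterminated.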

2.3. Identifiers and keywords

Identifiers may contain letters, digits and underscores.

LowercaseCharacter: LowercaseLetter | "_"
UppercaseCharacter: UppercaseLetter
IdentifierCharacter: LowercaseCharacter | UppercaseCharacter | Number

The lexer classifies Unicode uppercase letters, lowercase letters, and numeric characters depending on the general category of the character as defined by the Unicode standard.

  • A LowercaseLetter is any character whose general category is Ll or any character whose general category is Lo or Lm which has the property Other_Lowercase.

  • An UppercaseLetter is any character whose general category is Lu or Lt, or any character whose general category is Lo or Lm which does not have the property Other_Lowercase.

  • A Number is any character whose general category is Nd, Nl, or No.
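
The classification rules above can be sketched in Python with the standard unicodedata module. That module does not expose the Other_Lowercase property, so the sketch takes it from a small hand-written sample set; the set is deliberately incomplete and, like the name classify, is an assumption made for illustration.

```python
import unicodedata

# Illustrative, incomplete sample of characters that have Other_Lowercase.
# A real lexer would consult the full Unicode property data instead.
OTHER_LOWERCASE_SAMPLE = {"\u00aa", "\u00ba"}  # feminine / masculine ordinal indicator

def classify(ch, other_lowercase=OTHER_LOWERCASE_SAMPLE):
    """Return 'lowercase', 'uppercase', 'number', or None for one character."""
    category = unicodedata.category(ch)
    if category == "Ll":
        return "lowercase"
    if category in ("Lu", "Lt"):
        return "uppercase"
    if category in ("Lo", "Lm"):
        # Uncased letters count as uppercase unless they have Other_Lowercase.
        return "lowercase" if ch in other_lowercase else "uppercase"
    if category in ("Nd", "Nl", "No"):
        return "number"
    return None  # includes "_", which the grammar handles separately
```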

All identifiers are case sensitive: Person and person are two different legal identifiers.

The lexer distinguishes identifiers which begin with an initial uppercase character from identifiers which begin with an initial lowercase character or underscore. Additionally, an identifier may be qualified using the prefix \i or \I to disambiguate it from a reserved word or to explicitly specify whether it should be considered an initial uppercase or initial lowercase identifier.

LIdentifier: LowercaseCharacter IdentifierCharacter* | "\i" IdentifierCharacter+
UIdentifier: UppercaseCharacter IdentifierCharacter* | "\I" IdentifierCharacter+

The following examples are legal identifiers:

Person
name
personName
_id
x2
\I_id
\Iobject
\iObject
\iclass

The prefix \I or \i is not considered part of the identifier name. Therefore, \iperson is just an initial lowercase identifier named person and \Iperson is an initial uppercase identifier named person.
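
The effect of the \i and \I prefixes can be sketched as follows. This Python fragment is an ASCII-only approximation (the grammar above is defined over Unicode categories), it does not check for reserved words, and the name classify_identifier is hypothetical.

```python
import re

# ASCII-only approximation of LIdentifier and UIdentifier.
IDENTIFIER = re.compile(
    r"(?:\\(?P<prefix>[iI])(?P<escaped>[A-Za-z0-9_]+)"
    r"|(?P<plain>[A-Za-z_][A-Za-z0-9_]*))\Z"
)

def classify_identifier(token):
    """Return (kind, name); kind is 'LIdentifier' or 'UIdentifier'."""
    m = IDENTIFIER.match(token)
    if m is None:
        raise ValueError("not an identifier: " + token)
    if m.group("prefix"):
        # The prefix decides the kind and is not part of the name.
        kind = "LIdentifier" if m.group("prefix") == "i" else "UIdentifier"
        return kind, m.group("escaped")
    name = m.group("plain")
    # Without a prefix, the first character decides; "_" counts as lowercase.
    kind = "UIdentifier" if name[0].isupper() else "LIdentifier"
    return kind, name
```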

The following reserved words are not legal identifier names unless they appear escaped using \i or \I:

assembly module package import alias class interface object given value assign void function new of extends satisfies abstracts in out return break continue throw assert dynamic if else switch case for while try catch finally then let this outer super is exists nonempty

Note: assembly and abstracts are reserved for possible use in a future release of the language, for declaration of assemblies and lower bound type constraints respectively.
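
A lexer could apply the reserved word rule with a simple set lookup, as in the Python sketch below. The function name usable_as_identifier is hypothetical, and the function only looks at the escape prefix and the reserved word list; it does not validate the identifier's characters.

```python
RESERVED = frozenset("""
assembly module package import alias class interface object given value
assign void function new of extends satisfies abstracts in out return break
continue throw assert dynamic if else switch case for while try catch finally
then let this outer super is exists nonempty
""".split())

def usable_as_identifier(token):
    """True if the raw token text may serve as an identifier name."""
    if token.startswith(("\\i", "\\I")):
        # Escaped: reserved words are allowed, but a name must follow the prefix.
        return len(token) > 2
    return token not in RESERVED
```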

2.4. Literals

A literal is a single token that represents a Unicode character, a character string, or a numeric value.

2.4.1. Numeric literals

An integer literal may be expressed in decimal, hexadecimal, or binary notation:

IntegerLiteral: DecimalLiteral | HexLiteral | BinLiteral

A decimal literal has a list of digits and an optional magnitude:

DecimalLiteral: Digits Magnitude?

Hexadecimal literals are prefixed by #:

HexLiteral: "#" HexDigits

Binary literals are prefixed by $:

BinLiteral: "$" BinDigits

A floating point literal is distinguished by the presence of a decimal point or fractional magnitude:

FloatLiteral: NormalFloatLiteral | ShortcutFloatLiteral

Most floating point literals have a list of digits including a decimal point, and an optional exponent or magnitude.

NormalFloatLiteral: Digits "." FractionalDigits (Exponent | Magnitude | FractionalMagnitude)?

The decimal point is optional if a fractional magnitude is specified.

ShortcutFloatLiteral: Digits FractionalMagnitude

Decimal digits may be separated into groups of three using an underscore.

Digits: Digit+ | Digit{1..3} ("_" Digit{3})+
FractionalDigits: Digit+ | (Digit{3} "_")+ Digit{1..3} 

Hexadecimal or binary digits may be separated into groups of four using an underscore. Hexadecimal digits may even be separated into groups of two.

HexDigits: HexDigit+ | HexDigit{1..4} ("_" HexDigit{4})+ | HexDigit{1..2} ("_" HexDigit{2})+
BinDigits: BinDigit+ | BinDigit{1..4} ("_" BinDigit{4})+

A digit is a decimal, hexadecimal, or binary digit.

Digit: "0".."9"
HexDigit: "0".."9" | "A".."F" | "a".."f"
BinDigit: "0"|"1"

A floating point literal may include either an exponent (for scientific notation) or a magnitude (an SI unit prefix). A decimal integer literal may include a magnitude.

Exponent: ("E"|"e") ("+"|"-")? Digit+
Magnitude: "k" | "M" | "G" | "T" | "P"
FractionalMagnitude: "m" | "u" | "n" | "p" | "f"

The magnitude of a numeric literal is interpreted as follows:

  • k means e+3,

  • M means e+6,

  • G means e+9,

  • T means e+12,

  • P means e+15,

  • m means e-3,

  • u means e-6,

  • n means e-9,

  • p means e-12, and

  • f means e-15.
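
The interpretation of magnitudes can be sketched as a table lookup followed by a decimal shift. The Python fragment below assumes its input is already a well-formed decimal literal (it performs no validation) and uses the hypothetical name literal_value. It computes with Decimal to keep the result exact; how a Ceylon implementation actually represents the value is not addressed here.

```python
from decimal import Decimal

MAGNITUDE_EXPONENT = {
    "k": 3, "M": 6, "G": 9, "T": 12, "P": 15,
    "m": -3, "u": -6, "n": -9, "p": -12, "f": -15,
}

def literal_value(text):
    """Return the exact value of a decimal literal with an optional magnitude."""
    digits = text.replace("_", "")      # digit group separators carry no meaning
    exponent = 0
    if digits[-1] in MAGNITUDE_EXPONENT:
        exponent = MAGNITUDE_EXPONENT[digits[-1]]
        digits = digits[:-1]
    return Decimal(digits).scaleb(exponent)  # shift by the power of ten
```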

The following examples are legal numeric literals:

69
6.9
0.999e-10
1.0E2
10000
1_000_000
12_345.678_9
1.5k
12M
2.34p
5u
$1010_0101
#D00D
#FF_FF_FF

The following are not valid numeric literals:

.33  //Error: floating point literals may not begin with a decimal point
1.  //Error: floating point literals may not end with a decimal point
99E+3  //Error: floating point literals with an exponent must contain a decimal point
12_34  //Error: decimal digit groups must be of length three
#FF.00  //Error: floating point numbers may not be expressed in hexadecimal notation
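
The numeric literal productions translate fairly directly into regular expressions. The Python sketch below is one possible transcription, not the lexer's actual implementation; it matches a whole token at a time, takes every group of binary digits to consist of binary digits, and uses the hypothetical name literal_kind.

```python
import re

DIGITS = r"(?:[0-9]{1,3}(?:_[0-9]{3})+|[0-9]+)"
FRACTIONAL_DIGITS = r"(?:(?:[0-9]{3}_)+[0-9]{1,3}|[0-9]+)"
HEX_DIGITS = (r"(?:[0-9A-Fa-f]{1,4}(?:_[0-9A-Fa-f]{4})+"
              r"|[0-9A-Fa-f]{1,2}(?:_[0-9A-Fa-f]{2})+"
              r"|[0-9A-Fa-f]+)")
BIN_DIGITS = r"(?:[01]{1,4}(?:_[01]{4})+|[01]+)"
EXPONENT = r"[Ee][+-]?[0-9]+"

# IntegerLiteral: decimal with optional magnitude, "#" hex, or "$" binary.
INTEGER = re.compile(rf"(?:{DIGITS}[kMGTP]?|#{HEX_DIGITS}|\${BIN_DIGITS})\Z")
# FloatLiteral: NormalFloatLiteral or ShortcutFloatLiteral.
FLOAT = re.compile(
    rf"(?:{DIGITS}\.{FRACTIONAL_DIGITS}(?:{EXPONENT}|[kMGTPmunpf])?"
    rf"|{DIGITS}[munpf])\Z"
)

def literal_kind(text):
    """Return 'integer', 'float', or None if text is not a numeric literal."""
    if INTEGER.match(text):
        return "integer"
    if FLOAT.match(text):
        return "float"
    return None
```

Applied to the examples in this section, the sketch accepts every legal literal and rejects each of the five invalid ones.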

2.4.2. Character literals

A single character literal consists of a Unicode character, inside single quotes.

CharacterLiteral: "'" Character "'"
Character: ~("'" | "\") | EscapeSequence

A character may be identified by an escape sequence. Every escape sequence begins with a backslash. An escape sequence is replaced by its corresponding Unicode character during lexical analysis.

EscapeSequence: "\" (SingleCharacterEscape | "{" CharacterCode "}")
SingleCharacterEscape: "b" | "t" | "n" | "f" | "r" | "e" | "\" | """ | "'" | "`" | "0"

The single-character escape sequences have their traditional interpretations as Unicode characters:

  • \b means BACKSPACE,

  • \t means CHARACTER TABULATION,

  • \n means LINE FEED (LF),

  • \f means FORM FEED (FF),

  • \r means CARRIAGE RETURN (CR),

  • \e means ESCAPE,

  • \\, \`, \', and \" mean REVERSE SOLIDUS, GRAVE ACCENT, APOSTROPHE, and QUOTATION MARK, respectively, and, finally

  • \0 means NULL.

A Unicode codepoint escape is a two-, four-, or six-digit hexadecimal literal representing an integer in the range 0 to 10FFFF, or a Unicode character name, surrounded by braces, and means the Unicode character with the specified codepoint or character name.

CharacterCode: "#" ( HexDigit{2} | HexDigit{4} | HexDigit{6} ) | UnicodeCharacterName

Legal Unicode character names are defined by the Unicode specification.

The following are legal character literals:

'A'
'#'
' '
'\n'
'\{#212B}'
'\{ALCHEMICAL SYMBOL FOR GOLD}'
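
Escape sequence replacement can be sketched in Python as a single left-to-right substitution. The name decode_escapes is hypothetical. Character names are resolved with Python's unicodedata.lookup, whose set of known names depends on the Unicode version bundled with the interpreter, and escaped line breaks in string literals are not handled here.

```python
import re
import unicodedata

SINGLE_CHARACTER_ESCAPES = {
    "b": "\b", "t": "\t", "n": "\n", "f": "\f", "r": "\r", "e": "\x1b",
    "\\": "\\", '"': '"', "'": "'", "`": "`", "0": "\0",
}

# A backslash followed by {#hex} (2, 4 or 6 digits), {NAME}, or one character.
ESCAPE = re.compile(
    r"\\(?:\{#(?P<hex>(?:[0-9A-Fa-f]{2}){1,3})\}|\{(?P<name>[^}]+)\}|(?P<single>.))"
)

def decode_escapes(text):
    """Replace each escape sequence in text with the character it denotes."""
    def replace(m):
        if m.group("hex"):
            code = int(m.group("hex"), 16)
            if code > 0x10FFFF:
                raise ValueError("codepoint out of range")
            return chr(code)
        if m.group("name"):
            return unicodedata.lookup(m.group("name"))  # KeyError if unknown
        single = m.group("single")
        if single not in SINGLE_CHARACTER_ESCAPES:
            raise ValueError("illegal escape sequence: \\" + single)
        return SINGLE_CHARACTER_ESCAPES[single]
    return ESCAPE.sub(replace, text)
```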

2.4.3. String literals

A character string literal is a sequence of Unicode characters, inside double quotes.

StringLiteral: """ StringCharacter* """
StringCharacter: ~( "\" | """ | "`" ) | "`" ~"`" | EscapeSequence | EscapedBreak

A string literal may contain escape sequences. An escape sequence is replaced by its corresponding Unicode character during lexical analysis.

A line-terminating character sequence may be escaped with a backslash, in which case the escaped line termination is removed from the string literal during lexical analysis.

EscapedBreak: "\" (CarriageReturn Newline | CarriageReturn | Newline)
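
Removing escaped line breaks takes a little care, because a backslash that is itself escaped must not be paired with a following line terminator. The Python sketch below consumes backslash pairs from left to right and deletes only those whose second element is a line terminator; the name remove_escaped_breaks is hypothetical.

```python
import re

# Either an escaped line terminator (group 1) or any other backslash pair.
# Other pairs are matched only so that they are skipped as a unit.
BACKSLASH_PAIR = re.compile(r"\\(?:(\r\n|\r|\n)|.)")

def remove_escaped_breaks(raw):
    """Delete every escaped line terminator; leave other escapes untouched."""
    return BACKSLASH_PAIR.sub(lambda m: "" if m.group(1) else m.group(0), raw)
```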

A sequence of two backticks is used to delimit an interpolated expression embedded in a string template.

StringStart: """ StringCharacter* "``"
StringMid: "``" StringCharacter* "``"
StringEnd: "``" StringCharacter* """

A verbatim string is a character sequence delimited by a sequence of three double quotes. Verbatim strings do not contain escape sequences or interpolated expressions, so every character occurring inside the verbatim string is interpreted literally.

VerbatimStringLiteral: """"" VerbatimCharacter* """""
VerbatimCharacter: ~""" | """ ~""" | """ """ ~"""
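
The VerbatimCharacter production allows one or two consecutive double quotes inside the string as long as a third does not follow immediately. A direct transcription into a Python regular expression might look like this; the name verbatim_content is hypothetical.

```python
import re

# Opening delimiter, then any run of: a non-quote character, one quote
# followed by a non-quote, or two quotes followed by a non-quote; then
# the closing delimiter.
VERBATIM = re.compile(r'"""((?:[^"]|"[^"]|""[^"])*)"""')

def verbatim_content(source, start=0):
    """Return the content of the verbatim string starting at source[start], or None."""
    m = VERBATIM.match(source, start)
    return None if m is None else m.group(1)
```

No escape processing is applied to the result, so a backslash in the content stays a backslash.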

The following are legal strings:

"Hello!"
"\{#00E5}ngstr\{#00F6}ms"
" \t\n\f\r,;:"
"\{POLICE CAR} \{TROLLEYBUS} \{WOMAN WITH BUNNY EARS}"
"""This program prints "hello world" to the console."""

The column in which the first character of a string literal occurs, excluding the opening quote characters, is called the initial column of the string literal. Every following line of a multiline string literal must contain whitespace up to the initial column. That is, if the string contents begin at the nth character in a line of text, the following lines must start with n-1 whitespace characters. This required whitespace is removed from the string literal during lexical analysis.
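
Removal of the required whitespace can be sketched as follows. In this Python fragment, indent is the number of characters that precede the string contents on the first line, so each following line must begin with that many whitespace characters. The function name and its parameters are hypothetical, and joining the result with LINE FEED is a simplification.

```python
WHITESPACE = " \t\f"

def strip_required_whitespace(first_line, following_lines, indent):
    """Return the string contents with the required leading whitespace removed.

    first_line: the contents on the line holding the opening quote.
    following_lines: the remaining lines of the literal, unmodified.
    indent: how many characters precede the contents on the first line.
    """
    stripped = [first_line]
    for line in following_lines:
        prefix = line[:indent]
        if len(prefix) < indent or any(c not in WHITESPACE for c in prefix):
            raise ValueError("line is not indented to the initial column")
        stripped.append(line[indent:])  # whitespace beyond the column is kept
    return "\n".join(stripped)
```

For a literal whose contents start after eleven characters of source text, a second line indented by thirteen spaces keeps its last two spaces.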

2.5. Operators and delimiters

The following character sequences are operators and/or punctuation:

, ; ... { } ( ) [ ] ` ? . ?. *. = => + - * / % ^ ** ++ -- .. : -> ! && || ~ & | === == != < > <= >= <=> += -= /= *= %= |= &= ~= ||= &&=

Certain symbols serve dual or multiple purposes in the grammar.
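
Several of the operators listed above are prefixes of others (for example <, <= and <=>). This chapter does not state how such overlaps are resolved. The Python sketch below assumes the conventional longest-match rule, which may differ from what an actual Ceylon lexer does in particular contexts. The name match_operator is hypothetical.

```python
OPERATORS = (
    ", ; ... { } ( ) [ ] ` ? . ?. *. = => + - * / % ^ ** ++ -- .. : -> ! "
    "&& || ~ & | === == != < > <= >= <=> += -= /= *= %= |= &= ~= ||= &&="
).split()

# Trying longer operators first implements longest match (an assumption).
BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)

def match_operator(text, pos=0):
    """Return the longest operator that starts at text[pos], or None."""
    for op in BY_LENGTH:
        if text.startswith(op, pos):
            return op
    return None
```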