// you’re reading...

ActionScript 3

Tokenizing ActionScript 3 for fun and profit

Be careful what you wish for.  Seriously.  Back in university, I thought compilers and programming languages were the coolest things ever.  Well, at least in the Computer Science department.  And for the last few years, I’ve been working on tooling for ActionScript 3 focusing on, as luck would have it, languages and compilers.  While AS3 might be a pretty good language to code in, it’s a horrifically  bad language to write tooling for.  The language does things that make compiler folks swear profusely while curling up in a ball and sobbing.  Or maybe it’s just me.

Anyway, I’m rewriting an existing ActionScript 3 tokenizer (or lexer if you prefer)  for the new project I’m working on (it might be this Falcon thing that Deepa mentioned at MAX).  The old tokenizer basically built a list of tokens for the entire document that we’re parsing, and then fed that list to the AS3 parser and produced a syntax tree.

This approach has always made me a bit uncomfortable.  If we’re going to tokenize something like UIComponent we’re talking about roughly 100,000  tokens stored in a list. This incurs object creation cost (granted our tokens are small), memory overheard, and GC cost as all of these tokens are eventually collected.  But, there was a very good reason that we did this.

As I said before(here, here and here), ActionScript 3 is a bit of a mess when it comes to syntax.  There are enough ambiguous constructs to  make your head spin, and to make my life a bit harder.  To add further complication, we use an LL(*) parser-generator to build our parser and it, unsurprisingly, does not like ambiguities in the grammar.  Because of all of these ambiguities, we had to do a lot of processing on the tokenizer level to produce a set of tokens sane enough for our parser to handle properly.

I keep mentioning ambiguity, so let me actually talk about one.  AS3 has standard keywords, and then it has syntactic keywords that exist as keywords based on context, and as identifiers elsewhere.  So, the keyword “get” is only a keyword if it’s preceded by the “function” keyword and if the next token is an identifier.  If the next token is an open parenthesis, then it’s not longer the “get” keyword but an identifier named “get“.

//it's a getter
public function get data():Object
//it's a function named "get"
public function get():Object

In the old tokenizer, we built the entire list of tokens first so we could run a normalization pass over them. This ensured we resolved any ambiguities that the parser wouldn’t like.  Granted, I’m simplifying things a bit, but that’s the general idea.  And it did the job quickly, but like almost everything in software, left room for improvement.

So my goal was to cut down on object creation (through a token pool) and to avoid building a gigantic list of tokens (by supporting a traditional iterator to the tokens).

So in order to solve the problem illustrated above (and much harder ones) in the new tokenizer, I’m stealing a bit of standard parser behavior.  Now, when we build a token from source, we check a few things.  Most notably, we check to see if we’re a token type that can’t be known without first checking our surrounding context.   Looking back is easy, it’s looking forward that takes a bit of work.  The obvious answer (through weird for a tokenizer) is to support token lookahead.

When we encounter the “get” keyword, we check backwards to see if the last token was the “function” keyword.  If it was, we then look forward to see if we’re an identifier or something else.

So the new tokenizer actually supports LA(n) where n is the number of potential tokens that can we created from the source buffer.  In most cases, we only need LA(1) but for cases such as fully-qualified custom namespaces preceding functions of variables, we need to seek quite a bit forward:

//fully qualified namespace decorating a function
my.qualified.ns.named.bob function hello()
//valid expression above a function
my.qualified.ns.named.bob
function hello()

My initial performance results are very promising.  My test data is the entire Flex framework (reduced to one file).  We build a full parse tree, and then repeat at least four times.  So far, the new tokenizer consumes 1/5 the RAM and adds about 75,000 lines a second to our overall parsing time.  You’ll just have to deal with slightly vague numbers…maybe it will make the surprise better.

And yes, I realize that there are approximately 5 people who will find this interesting.  For everyone else, this is exactly why I usually don’t go into detail about my job.

Discussion

8 comments for “Tokenizing ActionScript 3 for fun and profit”

  1. Well i must be part of the 5 people then cause i found it pretty interesting, even if compilers and tokens ain’t my daily concern :) .
    The results sound good so far, i really wish there was a “code-check-as-you-type” just like in Java for Flex / AS3 dev, and i guess tokenizing faster is one step in that direction.

    Keep up the good work!
    Fabien

    Posted by Adobe Flex Tutorial | November 14, 2010, 2:49 pm
  2. [...] This post was mentioned on Twitter by Eric Snowden, Eric Snowden. Eric Snowden said: Tokenizing ActionScript 3 for fun and profit http://goo.gl/fb/0t0K5 [...]

    Posted by Tweets that mention Marking Occurrences | Tokenizing ActionScript 3 for fun and profit -- Topsy.com | November 14, 2010, 3:34 pm
  3. Damn! Don’t be so hard on yourself! You came up with a really great solution.

    Also, I’m interested in those posts too, its the reason why I’m subscribed to your blog.

    Posted by Subb | November 14, 2010, 11:39 pm
  4. Hey David,

    don’t stop writing these posts. They are very interesting!

    Posted by Sven | November 15, 2010, 1:06 am
  5. Same as Fabien, very interesting note. Compiling is about hitting F11 key to me, but this adds to my general understanding.
    I suppose reducing the memory consumption also means less OoME while compiling?
    Anyway, thanks !

    Posted by Nicolas | November 15, 2010, 2:30 am
  6. Good to know we are improving on compilation time front. thank you :)

    Posted by Aditya Kumar Pandey | November 21, 2010, 5:03 am
  7. Man you made my day I was looking for something like this blog and post for a long long time, please write more about this problems, especialy about flex compiler, about syntax parsing, lexers, token types, ast and more, maybe you find this not interesting only because you are doing it everyday ;)
    I’m working with flex in my work but sometimes after work I try to write some lexer / parser or read about it. The more deeper you go with the topic, the much more I appriciate.
    Thank You.

    Posted by Michal Szczepanski | January 20, 2011, 3:53 pm

Post a comment