2014/06/01

Fun with antlr, undertow, and d3.js for musics

It was a Friday night and I had watched all the episodes of Silicon Valley. I'd been itching to build something fun with Antlr4 since I learned about it. It's a pretty awesome Lexer/Parser generator tool written by Terence Parr, who's been quietly powering the Java parsers of the world with Antlr and a smile on his face. If you're not familiar with the tool I recommend the walk-through he did that's up on Vimeo. I've had the pleasure of using this tool on other projects and wanted to do something silly with it.

(tldr here)

Enter Busta Rhymes

Stuck in mid-90's rap for some time now I thought it'd be interesting to see if I could write a simple grammar to parse the lyrics of a Busta song. I wasn't going to do much but visualize it so I didn't reach for any standard NLTK out there. The grammar I came up with was simple:

grammar Myne;

fragment COMMA_: ',' ;

WORD: [a-zA-Z0-9\-'']+ ;

PUNCTUATION: [!?.]+ ;

NEWLINE: '\r'? '\n' ;
COMMA: COMMA_ ;

IGNORE:
    ( '(' ~')'+ ')'
    | '[' ~']'+ ']'
    | [\t "]+
    ) -> skip
;

song:
    NEWLINE* stanza+
;

stanza:
    verse+ NEWLINE*
;

verse:
    sentence+ NEWLINE?
;

sentence:
    (WORD COMMA?)+ PUNCTUATION?
;

If you're familiar at all with regular expressions the syntax is pretty accessible. I'm a regex nerd from way back (Owl book v1) so I took to it with gusto.

Basically UPPERCASED things are token definitions. lowercased things are parser rules. A stream of bytes gets analysed and turned into a stream of TOKENS. Then a stream of TOKENS gets parsed into a parse tree with grammar rules. I created an IGNORE token definition that basically skips anything between () or [] or non-critical whitespace. And the fragment definition here just makes it so my commas are still labeled with a COMMA token in the parse tree (since the COMMA def only matches a single char, antlr will just use that char as the token name in the parse tree output. I wanted the token name to be there too.).

Save this to a file and run this through antlr via:


java -jar antlr-4.2.2-complete.jar Myne.g4

and it outputs some cryptic looking Java class definitions.

-rw-r--r--   1 jason  staff    48B Jun  1 18:53 Myne.tokens
-rw-r--r--   1 jason  staff   2.3K Jun  1 18:53 MyneBaseListener.java
-rw-r--r--   1 jason  staff   3.1K Jun  1 18:53 MyneLexer.java
-rw-r--r--   1 jason  staff    48B Jun  1 18:53 MyneLexer.tokens
-rw-r--r--   1 jason  staff   1.5K Jun  1 18:53 MyneListener.java
-rw-r--r--   1 jason  staff    10K Jun  1 18:53 MyneParser.java

You can peek under the hood and see a gnarly state machine and some seriously optimized and compressed data. I am new to this part so I'll leave the explanation to core devs, but it's not as scary as I'd expect from generated code.

Now we've got code that can parse lyrics according to my grammar. A nice way to test this is with antlrworks2. It's a pretty simple ide that can produce a nice parse tree png of your grammar given some input.


./antlrworks2 --cachedir /tmp

Open my grammar file, hit Run -> run in testrig, point it at my lyrics.txt file and select 'song' as my top-most grammar rule, and bada-bing:

I've got a mile-wide png of my parse tree. Cool! Now let's integrate it into a Java service for integrated good rhymes.

I picked undertow as a java REST service framework. There's a bunch but it did pretty well in the Techempower Benchmarks (~900k/reqs/sec on bare hardware and ~50k/reqs/sec on AWS m1.larges) and it's got a pretty easy-to-grok style that's similar to Go's http handlers and I suppose many others. I pulled the undertow json hello world example from the techempower benchmark repo and modified it for my project. Was easy to get up and running and may actually be fast too!

The handler code is pretty straightforward but the meat happens here:

try {

    MyneLexer lexer = new MyneLexer(new ANTLRInputStream(song));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    MyneParser parser = new MyneParser(tokens);

    parser.setErrorHandler(new BailErrorStrategy());

    ParseTree tree = parser.song();

    Node new_tree = Node.BuildTree(tree, parser);

    exchange.getResponseHeaders().put(Headers.CONTENT_TYPE, "application/json; encoding=UTF-8");

    exchange.getResponseSender().send(
            ByteBuffer.wrap(
                    objectMapper.writeValueAsBytes(new_tree)
            )
    );

} catch (Exception e) {

    e.printStackTrace();

    exchange.setResponseCode(400);
    exchange.getResponseSender().send("Unable to parse dat shit");
    return;

}

First I create a Lexer with the InputStream wrapper around my song (a posted String). Next create a token stream. Then create a parser. It's not actually done anything yet. This is just the setup to what happens next. I call parser.song() and it spits out a parse tree. It's at this point that things will fail if they do and what's neat is that all my parser grammar rules are methods on my parser object. So calling song() is saying, "parse the whole thing, top to bottom" since song is my top-most grammar rule.

Next I take my parse tree and all it's rich data and convert it into a friendlier, less rich object tree using a hand-rolled method called Node.BuildTree. I also ignore things like newlines since they appear in the tree for structural purposes and I don't care about them in the output.

Then I drop my tree into my json serializer (fasterxml-jackson was the one from the hello world code so I stuck with it), and it squirts out a nice json parse tree.

{
   "name":"song",
   "value":"",
   "children":[
      {
         "name":"stanza",
         "value":"",
         "children":[
            {
               "name":"verse",
               "value":"",
               "children":[
                  {
                     "name":"sentence",
                     "value":"",
                     "children":[
                        {
                           "name":"WORD",
                           "value":"Hit",
                           "children":[]
                        },
                        {
                           "name":"WORD",
                           "value":"you",
                           "children":[]
                        },
                        {
                           "name":"WORD",
                           "value":"with",
                           "children":[]
                        },
                        {
                           "name":"WORD",
                           "value":"no",
                           "children":[]
                        },
                        {
                           "name":"WORD",
                           "value":"delayin",
                           "children":[]
                        },

Awesome! Now to do something cool with it.

d3.js is a powerful framework for generating visualized data. It's something I reach for on a occasion and definitely refer to for inspiration. I wanted to keep this a single weekend project so I scanned through the examples for a nice compliment to my parse tree and found a circular tree output. The authors of d3.js and the example code made things easy to co-opt for my purposes and my json tree output matched the expected input. The hardest part of all of this was getting the page to look the way I wanted. I literally spent a few hours playing around with page layouts and didn't come to the final output until after a night's sleep.

I made the lyrics section a text area that you can type in and it makes calls back to the antlr/undertow service on keyup and resize. Feel free to paste your fav lyrics in or anything else. I started playing around with a haiku parser too, but syllables...

It was a fun weekend project and still had enough time to see a few movies in-between. With all the frameworks out there it's so easy to argue over style and forget that this stuff saves us time to have lives with. I'm heading to yoga now. Cali has infected me with it's culture of leisure. :)

Project Source