Fair warning: this article is quite technical. I'm not an expert in language analysis; I'm just a programmer with a problem to fix.

Extending strings in Localization

Think about the last time you played an online game. A competitive FPS, for example. The match ends, there’s a winner and the game displays:

Player xXxKillATonxXx wins the match

How could the game developers know in advance that Mr. xXxKillATonxXx would be playing? That's either string concatenation or string substitution, and game devs sometimes opt for the latter. This means that somewhere in the game's source we'll have something like this (let's not get into the discussion of whether this is a good solution):

ID_WINNING_GAME = "Player {playerName} wins the match"

See it? That's hell for QA. If you have, say, 18 different languages to localize, you need to be sure that those curly braces match, that the variable name "playerName" is spelled correctly in every language, and so on and so forth. That's a reasonably easy problem to solve using regexes, but what happens when the UI team goes really wild and allows something like this:
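As a sketch of that first, regex-based check (the class and method names here are my own inventions, not from any real project): collect the {placeholder} names in the source string and in each translation, then compare the sets.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class PlaceholderCheck
{
    // Extracts the names found inside {curly braces}, e.g. "playerName"
    static HashSet<string> Placeholders(string s) =>
        Regex.Matches(s, @"\{([A-Za-z][A-Za-z0-9]*)\}")
             .Select(m => m.Groups[1].Value)
             .ToHashSet();

    // A translation is consistent if it uses exactly the same placeholders
    public static bool SamePlaceholders(string source, string translation) =>
        Placeholders(source).SetEquals(Placeholders(translation));
}
```

So `SamePlaceholders("Player {playerName} wins the match", "El jugador {playerName} gana la partida")` passes, while a translation with a typo like `{playeName}` is flagged. This works fine right up until the tags arrive.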

ID_WINNING_GAME = "<red>Player</red> {playerName} <blue>wins <italic>the</italic> match</blue>"

… well, in that case you no longer have a simple string: you have a DSL, which is a much more complex problem to solve. And, from the QA perspective, it's harder to track.

So at this point we have a combination of variables and tags that can be nested indefinitely, in a process that is incredibly error prone and very difficult to check by eyeballing strings. It will also end in broken strings on screen at runtime, and that's a risk for certification.

And don't forget that, due to grammar, different languages may place the tags in different positions, and perhaps in different orders. The only rule is that every localized version must have the same tags and the same structure (in terms of tag nesting) as the source language.

Parsing DSLs: enter Pidgin

Facing that problem I had two alternatives: either hand-write a recursive parser that would chew through the strings and tokenize them properly, or use a more formal approach, in this case through Pidgin. The documentation of this library is pretty good, and the test and samples folders contain a plethora of small snippets that you can use right away.

So let's dig into this problem a little. For simplicity, I'm going to reduce the scope to single-format strings that can be nested as deeply as we want. Let's begin with the basics and consume innocent strings:

Parser<char, string> Fluff = Token(c => c != '<' && c != '>').ManyString();

Simple enough, right? A call to Parse with that parser will consume anything that doesn't contain < or > and will be flagged as Success. On top of that, Fluff also accepts empty strings.
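For instance (assuming the usual `using Pidgin;` plus `using static Pidgin.Parser;` and `using static Pidgin.Parser<char>;` that all the snippets in this post rely on), a quick check looks like this:

```csharp
var ok = Fluff.Parse("Player wins the match");
// ok.Success == true, ok.Value == "Player wins the match"

var empty = Fluff.Parse("");
// also Success: ManyString() happily matches zero characters
```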

We can make our lives a little simpler by adding a bunch of small parsers:

Parser<char, string> LT = String("<");
Parser<char, string> GT = String(">");
Parser<char, string> Slash = String("/");
Parser<char, Unit> LTSlash = LT.Then(Whitespaces).Then(Slash).Then(Return(Unit.Value));

So we have the basics of the language right there: LTs, GTs, slashes … all the components. Let's aim for something more complex, the tag identifier, where we impose that the first character has to be a letter, in glorious LINQ style:

Parser<char, string> Identifier = from first in Token(char.IsLetter)	// "Token" makes this parser return the parsed content
                                  from rest in Token(char.IsLetterOrDigit).ManyString()
                                  select first + rest;
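A couple of quick sanity checks (again, just how I'd poke at it in a scratch test):

```csharp
var plain = Identifier.Parse("red");    // Success, Value == "red"
var mixed = Identifier.Parse("h1");     // Success, Value == "h1"
var bad   = Identifier.Parse("1red");   // fails: the first character must be a letter
```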

… and now we're ready to consume a full string that starts with a format marker and ends with the closing of that format marker. Something like this will do:

Parser<char, Tag> FormatTag = from opening in LT
                              from formatLabel in Identifier.Between( SkipWhitespaces )
                              from closing in GT
                              from body in Fluff // !!! Attention here
                              from closingTokenOpener in LTSlash
                              from closingLabel in Identifier.Between( SkipWhitespaces )
                              from closingTokenCloser in GT
                              where ( formatLabel == closingLabel ) // we ensure that we're closing the correct tag
                              select new Tag( formatLabel, body ); // Let's imagine that you have this defined

If we're lucky and the string we need to parse is surrounded by a single format marker, this piece of code will take care of it and return a Tag object that we'll be able to compare and consume later.
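To see the happy path, and the where clause doing its job (a sketch; error reporting omitted):

```csharp
var result = FormatTag.Parse("<red>Player</red>");
// result.Success == true; result.Value is a Tag with label "red" and body "Player"

var mismatched = FormatTag.Parse("<red>Player</blue>");
// fails: the where clause rejects a closing label that differs from the opening one
```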

But that's not quite what we want to solve: we should replace that call to Fluff with something that can consume more tags embedded in the string. We also need to handle a string that starts and ends with normal text and happens to have a tag in the middle. Let's do that now:

Parser<char, Tag> tagParser = from preFluff in Fluff
                              from items in Try( FormatTag )
                              from postFluff in Fluff
                              select items;

See that Try modifier? That's what enables the parser to backtrack in case of failure: in essence you don't "lose" the input, and other rules can still be applied. Incredibly useful. But we still can't consume several of these rules in a row; let's fix that now:

Parser<char, IEnumerable<Tag>> stringParser =
    OneOf( Try( tagParser.AtLeastOnce() ),
           Try( Fluff.ThenReturn( null as IEnumerable<Tag> ) ) );

That needs some unpacking:

OneOf accepts a sequence of parsers and tries them in order from left to right; as soon as one of them consumes input, that one is selected, otherwise the whole thing fails. In this case we're trying to parse either a tag or simple, innocent text.
AtLeastOnce executes the previous parser one or more times and accumulates the output into an enumerable container.
ThenReturn lets you return whatever you want once a parser has completed successfully; in this case we need to change the output of Fluff from string to IEnumerable<Tag>. In the end, the goal is not to know what the string contains but to ensure that the structure stays the same between languages.

So, going back to our FormatTag parser, we need to tweak it a little:

Parser<char, Tag> FormatTag = from opening in LT
                              from formatLabel in Identifier.Between( SkipWhitespaces )
                              from closing in GT
                              from body in Rec(() => stringParser) // <<<<<<<< Rec defers evaluation, breaking the circular reference between FormatTag and stringParser
                              from closingTokenOpener in LTSlash
                              from closingLabel in Identifier.Between( SkipWhitespaces )
                              from closingTokenCloser in GT
                              where ( formatLabel == closingLabel )
                              select new Tag( formatLabel, body );

And there we have it: nested strings, embedded indefinitely, with memory as the only limiting factor in this solution.

This is, of course, an incomplete solution. But it covers the main points of the grammar in place: recursion and tag verification.
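For completeness, here is one possible shape for the Tag class that the snippets assume, plus the structural comparison that captures the localization rule from earlier: two languages match if they carry the same tags with the same nesting, regardless of the text around them. Both are my own sketch, not part of Pidgin.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class Tag
{
    public string Label { get; }
    public IReadOnlyList<Tag> Children { get; }

    // Leaf constructor: a tag wrapping plain text. The text itself is
    // irrelevant for structure checks, so it's discarded here.
    public Tag(string label, string body)
        : this(label, null) { }

    // Recursive constructor: a tag wrapping nested tags (the stringParser
    // version of the body, which may be null for pure-text bodies).
    public Tag(string label, IEnumerable<Tag> children)
    {
        Label = label;
        Children = (children ?? Enumerable.Empty<Tag>()).ToList();
    }

    // Same label, same number of children, and same structure in every child
    public bool SameStructure(Tag other) =>
        Label == other.Label &&
        Children.Count == other.Children.Count &&
        Children.Zip(other.Children, (a, b) => a.SameStructure(b)).All(x => x);
}
```

With this in place, QA tooling can parse the source string and each localized string and simply compare the resulting trees, which is the whole point of the exercise.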

Some lessons

  1. Recursive grammars become incredibly complex to parse. Using TDD is a must.
  2. Chop, chop, chop your problem. Every parser should do the absolute minimum; combining cases is the shortest route to failure and headaches.
  3. Test for End(). Sometimes the strings are empty, or you want to check that you've consumed the whole input.
  4. OneOf + Try is a pattern of its own. The library might have something more compact, but with my current knowledge, I like to use it.
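On point 3, if I remember the API right, Pidgin ships an End parser that succeeds only at the end of the input; chaining it with Before is how I check that a string was fully consumed:

```csharp
// Succeeds only if stringParser consumes the entire input;
// trailing garbage turns the whole parse into a failure.
var whole = stringParser.Before(End).Parse("<red>Player</red> wins");
```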

Not data driven, but flexible enough

One of my few regrets with this solution is that it's not completely data-driven. Other two-step solutions would have been more flexible. Imagine a grammar description in an external file that is compiled at runtime into an in-memory parser that you can use as you please. That would have been way cooler, but also more complex, at least with my current knowledge of these libraries and technologies.