I just wrote a C# parser for toki pona. It was an ad hoc grammar, meaning I didn’t write a PEG or YACC or other formal description of the grammar and then process it into a parser (i.e. I didn’t use or write a compiler compiler). Why? Because I kept feeling like a compiler compiler is aimed at the problem of translating one language into machine code. Also, I didn’t get a comp-sci degree, so I don’t actually follow how compiler compilers work.
From what I understand, I wrote an “order of operations” parser: it takes a paragraph, chops it at sentence breaks to create a list of sentences, then chops each sentence at the predicate marker (“li”) to create subject/predicate arrays, and so on. This handles about 90% of toki pona’s parsing needs.
Then other things came up and I stopped moving the C# parser forward. Now I’m learning C++, and I keep thinking about how I could use C++ to do toki pona parsing. When I wrote the C# parser, I decided to favor powerful and expressive parsing over fast or general parsing (i.e. being able to parse any language with a formal grammar). For example, if you have mixed English/toki pona texts, you can do dictionary lookups to determine which words are English. Dictionary lookups are computationally slow. But C++ is supposed to be very fast, supposedly 4x or more faster than C#, depending on the workload.
After I wrote my parser, I wrote up some lessons learned.
1) Natural language processing is about processing discrete arrays of tokens. It superficially looks like string processing, but string processing means dealing with irrelevant whitespace, punctuation, capitalization, and other crap that just gets in the way of the higher-level concepts. Still, you need to do with arrays of tokens the same things you do with substrings. Finding a substring? Actually, I need to find a sub-token-list. Doing a string replacement? Actually, I need to do a token replacement. Surprisingly, arrays and lists don’t normally support the full range of operations that string processing provides.
2) C++ allows for fast cache pre-fetching if you stick to memory-contiguous data structures. In other words, if the words have to be looked up all over memory, you get the speed of main memory. But if the data moves through the pre-fetch cache, you get the speed of the cache, which is on the order of 100x faster. In C#, I have no idea what the memory layout of my data is, so I can’t exploit this. But since all my data is one adjacent stream of data, in C++ I should be able to exploit this.
3) Perf optimization in C# was limited to looking at which routines used the most time. In my case, it was substring checks, which got faster after I told C# to stop taking international considerations into account when doing substring checks (i.e. ordinal instead of culture-aware comparisons); the .NET framework was doing extra work just in case the string was German or Chinese. My other attempts to improve perf made no impact; memoization, for example, changed nothing.
4) I know that stream processing is more efficient than the alternatives (i.e. the CPU does an operation on one element of a stream, moves to the next, and so on, as opposed to, say, walking a tree). My C# code encouraged data structures that aren’t streams. C++’s string library seems to encourage treating all strings as streams, i.e. more like C#’s StringBuilder.
5) My C# code was very toki pona centric. By the end, I could see which parts of the library could have been used by any conlang project.
6) The C# parser didn’t have the concept of a runtime. When I speak English, the runtime is my human brain: I hear a cake recipe and it moves me to make a cake. I almost created a runtime that represented a sort of database of sentences that could be queried, but I didn’t get far because I never succeeded in making a template matcher.
7) Speaking of templates, templates were not first-class citizens. Imagine that the grammar had this seemingly unnecessary set of rules:
jan pona ==> jan-pona (the collocation translates to “friend” when it comes time to use the parse tree in the “runtime”, which in my case was just text colorization and English glossing)
mi pilin e ni: S ==> complex-sentence. The other rules of the grammar can generate this pattern, but it looks like we humans use these templates as first-class citizens: we use them too often and too predictably to imagine that we think up the template on the spot. The template has slots, just like a complex verb in other languages, e.g.
mi [pronoun modifier] [li] pilin [adverb] e ni: S ==> complex-sentence.
So here are my C++ goals:
1) An API that works with arrays of tokens as expressively as the string API works with char arrays.
2) Immutable data structures.
3) Relies on knowledge: not just grammar rules, but long lists of collocations, English dictionary lookups, and other mini databases.
4) Has a runtime, and every utterance can manipulate the runtime (i.e. a statement stores the sentence in the knowledge base; special statements remove a sentence from the knowledge base, update a sentence, or retrieve sentences matching a similar template).
5) Still supports colorization scenarios, DOM-like APIs, etc.
6) Works for two languages; I chose Esperanto and toki pona, only because they are well documented.
7) Community-corpus-driven standards for “done” and “correctness”, i.e. it works because it works with typical community texts.
8) Will not try to deal with the problem of incorrect texts. (Doing parsing transformations that turn invalid text into the intended text is still too hard.)