Enumerating all sentences

So I had an idea based on a few other ideas.

What sort of mantra is worth reciting?
The number pi has nonrepeating digits. If pi is a normal number (widely believed, though unproven), then if you looked long enough you would eventually find the digits that encode your name in ASCII. You would also find the digits that encode your picture as a JPG, a GIF and a PNG file. You would also find the string of pi digits that encodes every movie ever made, in every format, both encrypted and unencrypted, and so on.
There is a way to enumerate all rational numbers.

So, I’m thinking about how one would create an enumeration of all possible toki pona sentences, excluding the uninteresting ones with endlessly repeating sections. That enumeration of all sentences would contain the biographies of everyone you know and the answers to all the questions you ever had. It would also include lies and slander, though mostly it would be gibberish.

So let’s start enumerating!

Word li Word. There are 125*125 of these.
W li W [Prep W]. There are 125*125*6*125 of these, about 11.7 million. At 2 seconds per sentence, it would take a bit under a year to chant them all.


Sentences can be simple or compound, S, or S la S.
Sentences must contain a subject, a verb phrase, optionally some direct objects and optionally up to six different prepositional phrases.
Phrases can optionally have modifiers or pi chains.

So the whole of toki pona could be a chain of decisions starting with S and running until the maximum phrase size is reached. Enumerating systematically would result in a lot of similar sentences (ni li a. ni li pona. ni li soweli. ad nauseam). Enumerating them stochastically would be more interesting to read. Now if we could map digits of pi to the choices in building up a sentence (i.e. compound or not, transitive or not, with prep phrase or not), then we could get a list of sentences that would eventually cover all possible toki pona sentences.
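
Here is a minimal C++ sketch of that digit-driven idea. The ten-word lexicon, the six prepositions and the decision scheme are stand-ins (the real language has ~125 words and a bigger decision tree), and I hard-code some digits of pi rather than computing them:

    #include <iostream>
    #include <string>
    #include <vector>

    // Placeholder lexicon; the real toki pona list has ~125 words.
    const std::vector<std::string> words = {"jan", "soweli", "pona", "ilo", "moku",
                                            "kili", "tomo", "telo", "suli", "lili"};
    const std::vector<std::string> preps = {"kepeken", "sama", "lon",
                                            "tawa", "tan", "poka"};

    // First digits of pi; each digit drives one decision or one word choice.
    const std::string piDigits = "314159265358979323846264338327950288419716939937";
    size_t pos = 0;
    int nextDigit() { return piDigits[pos++ % piDigits.size()] - '0'; }

    // Build one sentence by walking the decision chain described above.
    std::string sentence() {
        std::string s = words[nextDigit() % words.size()] + " li " +
                        words[nextDigit() % words.size()];
        if (nextDigit() % 2) // transitive or not
            s += " e " + words[nextDigit() % words.size()];
        if (nextDigit() % 2) // prepositional phrase or not
            s += " " + preps[nextDigit() % preps.size()] + " " +
                 words[nextDigit() % words.size()];
        return s + ".";
    }

    int main() {
        for (int i = 0; i < 5; ++i)
            std::cout << sentence() << "\n";
    }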

Posted in machine assisted conlanging, toki pona | Leave a comment

C++ for Conlanging

I just wrote a C# parser for toki pona. It used an ad hoc grammar, meaning I didn’t write a PEG or YACC or other formal description of the grammar and then process it into a parser (I didn’t use or write a compiler compiler). Why? Because I kept feeling like a compiler compiler is aimed at the problem of translating one language to machine code. Also, I didn’t get a comp-sci degree, so I don’t actually follow how compiler compilers work.

From what I understand, I wrote an “order of operations” parser, which takes a paragraph, chops it at sentence breaks creating a list of sentences, then chops each sentence at the predicate marker creating subject/predicate arrays, and so on. This works for 90% of toki pona’s parsing needs.
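
In C++ the first two passes might look something like this sketch (not my actual C# code; the shape is the point):

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Pass 1: chop a paragraph into sentences at the period.
    std::vector<std::string> sentences(const std::string& paragraph) {
        std::vector<std::string> out;
        std::stringstream ss(paragraph);
        std::string piece;
        while (std::getline(ss, piece, '.'))
            if (!piece.empty()) out.push_back(piece);
        return out;
    }

    int main() {
        for (const std::string& s : sentences("jan li moku e kili. soweli li lape.")) {
            // Pass 2: chop each sentence at the predicate marker "li".
            size_t li = s.find(" li ");
            if (li == std::string::npos) continue;
            std::cout << "subject:   [" << s.substr(0, li) << "]\n"
                      << "predicate: [" << s.substr(li + 4) << "]\n";
        }
    }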

Then other things came up and I stopped moving the C# parser forward. Now I’m learning C++, and mostly I keep thinking of how I could use C++ to do toki pona parsing. When I wrote the C# parser, I decided to favor powerful and expressive parsing over fast or general parsing (i.e. able to parse any language with a formal grammar). For example, if you have mixed English/toki pona texts, you can do dictionary lookups to determine which words are English. Dictionary lookups are computationally very slow. But C++ is supposed to be very fast, like 4x or more faster than C#.

After I wrote my parser, I wrote up some lessons learned.

1) Natural language processing is about processing discrete arrays of tokens. It superficially looks like string processing, but string processing means dealing with irrelevant white space, punctuation, capitalization and other crap that just gets in the way of dealing with higher level concepts. You need to be able to do the same things you do with substrings, but with arrays of tokens. Finding a substring? Actually, I need to find a sub-token-list. Need to do a string replacement? Actually, I need to do a token replacement. Surprisingly, arrays and lists don’t normally support the full range of operations that string processing provides, as sketched below.
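
For example, C++ already has the find half of this: std::search is string::find generalized to any sequence. Here is a sketch, with std::string standing in for a richer token type, of sub-token-list find and replace:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    using Tokens = std::vector<std::string>;

    // The token analogue of string::find.
    Tokens::iterator findSub(Tokens& haystack, const Tokens& needle) {
        return std::search(haystack.begin(), haystack.end(),
                           needle.begin(), needle.end());
    }

    // The token analogue of a string replacement (first occurrence only).
    bool replaceSub(Tokens& haystack, const Tokens& from, const Tokens& to) {
        auto it = findSub(haystack, from);
        if (it == haystack.end()) return false;
        it = haystack.erase(it, it + from.size());
        haystack.insert(it, to.begin(), to.end());
        return true;
    }

    int main() {
        Tokens text = {"jan", "pona", "li", "moku"};
        replaceSub(text, {"jan", "pona"}, {"jan-pona"}); // fuse the collocation
        for (const auto& t : text) std::cout << t << " "; // jan-pona li moku
        std::cout << "\n";
    }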

2) C++ allows for fast cache pre-fetching if you stick to using memory-aligned, contiguous data structures. In other words, if the words have to be looked up all over memory, you get the speed of main memory. But if the data moves through the pre-fetch cache, you get the speed of the cache, which is something like 100x (1000x?) faster. In C#, I have no idea what the memory layout of my data is, so I can’t exploit this. But since all my data is an adjacent stream of data, in C++ I should be able to exploit this.

3) Perf optimization in C# was limited to looking at which routines used the most time. In my case, it was substring checks, which got faster after I told C# to stop taking international considerations into account when doing substring checks – the .NET framework was doing extra work just in case the string was German or Chinese. My other attempts to improve perf, such as memoization, made no impact.

4) I know that stream processing is more efficient than the alternatives (i.e. your data structure is a stream that the CPU does an operation on, then moves to the next, and so on, as opposed to, say, a tree). My C# code encouraged using data structures that aren’t streams. C++’s string library seems to encourage treating all strings as streams, i.e. more like C# StringBuilders.

5) My C# code works great on a server. But if I want to give it away, I’d have to create a REST API. What would be even better is if I could give away a Javascript library. Then people could use it along with their favorite web framework, be it Python Flask, Ruby on Rails, or what have you. As it happens, reasonably efficient C++-to-Javascript cross compiling has appeared on the scene.

6) My C# code was very toki pona centric. At the end, I could see which parts of the library could have been used by any conlang project.

7) The C# parser didn’t have the concept of a runtime. When I speak English, the runtime is my human brain. I hear a cake recipe and that moves me to make a cake. I almost created a runtime that represented a sort of database of sentences that could be queried, but didn’t get far because I didn’t succeed in making a template matcher.

8) Speaking of templates, templates were not first-class citizens. Imagine that the grammar had this seemingly unnecessary set of rules:

jan pona ==> jan-pona (the collocation translates to friend when it comes time to use the parse tree in the “runtime”, which in my case was just text colorization and English glossing)
mi pilin e ni: S => complex-sentence. The other rules of the grammar can generate this pattern, but it looks like we humans use these templates as first-class citizens – we use them too often and too predictably to imagine that we thought up the template on the spot. The template has slots, just like a complex verb in other languages, e.g.
mi [pronoun modifier] [li] pilin [adverb] e ni: S => complex-sentence.
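
A sketch of what a first-class template might look like: a pattern of literal tokens plus named slots, with a matcher that binds the slots. The [S] slot syntax and all the names are my own invention, and this toy only handles one trailing slot:

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Tokens = std::vector<std::string>;

    // A pattern element is either a literal token or a named slot like "[S]".
    bool isSlot(const std::string& t) { return t.front() == '[' && t.back() == ']'; }

    // Match patterns such as "mi pilin e ni: [S]"; a final slot swallows the
    // rest of the input. Interior slots are left out of this sketch.
    bool matchTemplate(const Tokens& pattern, const Tokens& input,
                       std::map<std::string, Tokens>& bindings) {
        size_t i = 0;
        for (size_t p = 0; p < pattern.size(); ++p) {
            if (isSlot(pattern[p])) {
                if (p + 1 == pattern.size()) {
                    bindings[pattern[p]] = Tokens(input.begin() + i, input.end());
                    return true;
                }
                return false;
            }
            if (i >= input.size() || input[i++] != pattern[p]) return false;
        }
        return i == input.size();
    }

    int main() {
        Tokens pattern = {"mi", "pilin", "e", "ni:", "[S]"};
        Tokens input = {"mi", "pilin", "e", "ni:", "jan", "li", "pona"};
        std::map<std::string, Tokens> b;
        if (matchTemplate(pattern, input, b))
            for (const auto& t : b["[S]"]) std::cout << t << " "; // jan li pona
        std::cout << "\n";
    }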

So here are my C++ goals:
1) An API that works with arrays of tokens as expressively as the string API works with char arrays.
2) Templates.
3) Immutable data structures.
4) Relies on knowledge (i.e. not just grammar rules, but long lists of collocations, English dictionary lookups, i.e. mini databases).
5) Has a runtime, and all utterances can manipulate the runtime (i.e. statement == store this sentence in the knowledge base; special statements ==> remove this sentence from the knowledge base, update a sentence, retrieve sentences of a similar template).
6) Still supports colorization scenarios, DOM-like APIs, etc.
7) Works for two languages. I chose Esperanto and toki pona, only because they are well documented.
8) Community-corpus-driven standards for “done” and “correctness”, i.e. it works because it works with typical community texts.
9) Will not try to deal with the problem of incorrect texts. (Doing parsing transformations that turn invalid text into the intended text is still too hard.)

Posted in machine assisted conlanging, toki pona | Comments Off

Meanings of Small in the area of Languages

1) Small lexicon. Incompletely described languages also have small lexicons; Klingon falls into this category. The lexicon can grow: Esperanto on day 365 was a small language. Something like a century later, it is a large-lexicon language.

2) Closed lexicons. All (?) languages exhibit the feature where some classes of words are closed, e.g. prepositions in English – you can’t make up your own. Proper nouns in English need only follow the phonotactic rules; make ‘em up all day. If a lexicon is small and closed, then there are still new lexemes, but they will be made of recognizable parts. It’s sort of like: you can’t use new ingredients, and if you make a new recipe, you have to show the recipe. The recipe could still be incoherent.

3) Small distance to your native tongue. This is what really makes a language easy. A condialect would be the easiest. The maligned relex is small in the sense that you really just need a lexicon and the rules for (possibly mechanically) mapping grammar from one language to the other.

4) Small phonetic inventory. This doesn’t make a language especially easy, though; cf. Hawaiian, with its long words of repeated vowels.

5) Small syntax. Regularization reduces size in two ways: irregular morphology can be looked at either as lexical syntax (a new word, say, for each form of a certain tense) or as a complex set of rules with exceptions and exceptions to the exceptions. However, one of the magic things about syntax is that a small number of rules can, in the right hands, make a massive maximal sentence with enough complexity to be hard to read, and sometimes one more rule would make certain areas of complexity go away. This is essentially the story of the evolution of modern computer programming languages.

6) Small number of speakers. Just for completeness.

7) The speakers themselves are small. Okay, now I’m just being silly.

Posted in Uncategorized | 1 Comment

Valency in Lojban and toki pona

Verbs have valency, which is how many “arguments” they take. For example, intransitive means no arguments (or only one, the subject); transitive means one (or two, subject and object). So a typical lojban verb (gismu) works like this:

x1 leaves from x2 via route x3.

This compares to toki pona, which has 7 slots, marked either by e or by a preposition:

mi moku
e kili
kepeken ilo
sama soweli
lon tomo mi
tawa pilin pona
tan wile moku mi
poka jan pona mi

I eat an (object), with a (tool), like (similar object), at (a location), for (some goal or benefit), from (some place or causal reason), with (a collaborator). That is it. There are no more slots. This compares to English, where we have many more slots via a much larger list of prepositions.

The toki pona verb always follows the same pattern; the lojban one has different meanings for different slots depending on the verb (gismu).
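
That sameness means one flat data structure can hold any toki pona predicate. A sketch (the field names are mine):

    #include <string>
    #include <vector>

    // One member per marker: e plus the six prepositions. Unlike a lojban
    // gismu, every verb gets the same layout, so one struct covers them all.
    struct Predicate {
        std::string verb;                  // moku
        std::vector<std::string> e;        // direct objects
        std::vector<std::string> kepeken;  // instrument
        std::vector<std::string> sama;     // similar object
        std::vector<std::string> lon;      // location
        std::vector<std::string> tawa;     // goal, beneficiary
        std::vector<std::string> tan;      // origin, causal reason
        std::vector<std::string> poka;     // collaborator
    };

    int main() {
        Predicate p;                 // mi moku e kili kepeken ilo
        p.verb = "moku";
        p.e = {"kili"};
        p.kepeken = {"ilo"};
    }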

Of course the toki pona phrases can be re-arranged, except for the e phrases, which must come first. Also, routinely, the prepositional phrases can be modifiers of a single content word, which doesn’t have an analogue in lojban, AFAIK (which isn’t much).

Also, another observation: if there are only 6 slots, marked by a particle on the head verb, wouldn’t these turn into case markers within a single generation of human use?

Posted in lojban, toki pona | 1 Comment

Dictionaries for toki pona

I read about the dictionary making for Algonquin, a highly synthetic language with few unbound morphemes. Everything of interest is a bound morpheme. Full words necessarily drag a lot of other cruft along with them, as if a dictionary had a definition for unsympathetically but the word sympathetic wasn’t allowed to be a stand-alone word.

Surprisingly, toki pona is like that. toki pona has compound words, which, if you are a grumpy cat, you can call collocations (words that commonly appear together), or just call compound words – because they behave rather like two-stem words in languages with bound morphemes. Beyond that, we have “templates”.

Noun Phrases (content phrase)
jan pona. This is a perfect compound word. It takes modifiers, resists splitting, and it has two “slots”– stuff goes before it and after it.

These phrases have little internal structure. These are useful for machine parsing, the traditional dictionary just works. You could look up words by their head word and life is beautiful.

kin la. == really. Also a good compound word, it has two slots, you can put more la phrases before, a sentence after and that is it.

Verb Phrase
Verb phrases are closer to templates because the head verb is one word.
kama sona. This isn’t a perfect compound word; it has three slots: [0] kama [1] sona [2]. The head verb is still kama, and you can add modals before kama, and negation, intensification and adverbs after it. Stuff after sona describes sona, not the kama sona phrase.

Templates are a lousy fit for a traditional dictionary. The head word could be in a variety of places. Sometimes the template doesn’t rely on any specific word, e.g.

x li y tawa z. x feels y towards z. (y is usually pona or ike; x is usually an agent.)

I don’t even know where to put that in dictionary alphabetical order. I feel like I’m back in Algonquin again.

mun li pimeja e suno. eclipse. This almost doesn’t feel like a template anymore. To use it in a sentence requires extensive rework. It has at least 4 template points, not counting all the optional things available to the maximal sentence.

Other patterns.
kule lon palisa luka. Fingernail polish. This is also a template with significant internal structure.

Keep the templates separate from untemplated definitions.
Be explicit about the slots in templates.
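
In code, that advice might turn into an entry type that keeps the two kinds apart explicitly. A sketch with made-up field names, using an underscore to mark a slot:

    #include <string>
    #include <vector>

    // A dictionary entry is either a plain compound (fixed tokens) or a
    // template with explicit, ordered slots marked "_".
    struct Entry {
        enum Kind { Compound, Template } kind;
        std::vector<std::string> parts;
        std::string gloss;
    };

    int main() {
        Entry friend_ = {Entry::Compound, {"jan", "pona"}, "friend"};
        Entry eclipse = {Entry::Template,
                         {"_", "li", "pimeja", "e", "_"}, "x eclipses y"};
        (void)friend_; (void)eclipse;
    }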

Unrelated advice:
Be wary of unwarranted glosses and translations.
jan Sonja said telo is sauce, so I guess it is.
If I say telo means rocket fuel, it’s an unwarranted translation unless there is some text to set that up.

Posted in lexicography, machine assisted conlanging, Virginian Algonquian | Comments Off

Paragraphs and machine assisted conlang work

When you are teaching a machine to do something with a language, you get a surprising set of challenges that are not mentioned much in your traditional reference grammar.

On the internet, people post texts and line breaks are unreliable indicators of paragraphs. White space can appear for a variety of reasons.

e.g. Line break in middle of sentence– exceedingly common.
jan ilo li wile

e.g. Double space in middle of sentence– enough to make a double space an unlikely paragraph mark.
jan ilo li wile


e.g. Line break is actually HTML. Sometimes HTML p tags aren’t really paragraphs.
jan ilo li wile
<br>

e.g. Tab starting a paragraph is actually a few spaces. Sometimes those spaces disappear and sometimes they are just spaces.
   jan ilo li wile pali.

Some possible strategies:
1) Explicit paragraph marker: for example, four dashes centered on a page, like the divider you see in some novels between “scenes”.
2) Assume a double space is a paragraph. This is wrong a lot of the time.
3) Synthetic paragraphs. Apply rules such as: any sentence ending in ni: is in the same paragraph as the following sentence; any vocative followed by a sentence is in the same paragraph; quoted text initiates a new paragraph. However, I suspect this would be a lot of work and would fail, resulting in too many synthetic paragraphs that ‘consume’ the entire text.
4) Ignore the problem and turn everything into a series of sentences, or one huge single paragraph.
5) Two parsing modes: Strict and Loosey-Goosey. In strict mode, paragraphs are started by tabs. In Loosey-Goosey mode, tabs and blank lines are assumed to be paragraph breaks, and it is just accepted that this will be wrong a lot of the time (see the sketch below).
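
A sketch of that last option, with the mode as a parameter; only the tab and blank-line heuristics mentioned above are implemented:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    enum class Mode { Strict, LooseyGoosey };

    // Strict: only a tab-indented line starts a new paragraph.
    // Loosey-Goosey: tabs and blank lines both break paragraphs, accepting
    // that this will be wrong a lot of the time.
    std::vector<std::string> paragraphs(const std::string& text, Mode mode) {
        std::vector<std::string> out{""};
        std::istringstream in(text);
        std::string line;
        while (std::getline(in, line)) {
            bool breaks = !line.empty() && line[0] == '\t';
            if (mode == Mode::LooseyGoosey && line.empty()) breaks = true;
            if (breaks && !out.back().empty()) out.push_back("");
            if (!line.empty()) out.back() += line + " ";
        }
        return out;
    }

    int main() {
        std::string text = "jan ilo li wile\npali.\n\njan ilo li lape.\n";
        for (const auto& p : paragraphs(text, Mode::LooseyGoosey))
            std::cout << "[" << p << "]\n";
    }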

Posted in machine assisted conlanging, toki pona | Comments Off

Inferences from toki pona utterances

So I’m noodling on the idea of knowledge representation with toki pona again.

If I say: jan li laso. And then ask: seme li laso? Then it should be easy enough to look up jan (that is, to bind seme to jan).
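
A minimal sketch of that lookup step, with the knowledge base as nothing fancier than a predicate-to-subject map (the representation is a placeholder, not a design):

    #include <iostream>
    #include <map>
    #include <string>

    // Statements "X li Y" are stored as Y -> X; the question "seme li Y?"
    // binds seme to every stored X with that predicate.
    int main() {
        std::multimap<std::string, std::string> facts;
        facts.insert({"laso", "jan"});   // jan li laso.
        facts.insert({"laso", "ilo"});   // ilo li laso.

        std::string query = "laso";      // seme li laso?
        auto range = facts.equal_range(query);
        for (auto it = range.first; it != range.second; ++it)
            std::cout << it->second << " li " << query << ".\n";
    }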

But if the structure is different but equivalent, then I need to generate the equivalences, or inferences in the case where the derived sentence has less information:

ni li ilo jan.
jan li jo e ilo. (possession)
ilo li jan. (adjective.)
ilo jan li ni. (equivalence)

jan li moku e kili.
jan li moku. (direct objects are optional)
!= moku li jan. (can’t reverse order)
? kili li moku. (the fruit is eaten?)

ni li ilo laso jan.
ilo li laso li jan. (Both adjectives, unlikely reading here.)
ilo li laso. jan li jo e ilo laso. (adjective & possession)

ni li ilo jan laso.
jan-ilo li laso. (compound with adjective)
ilo li jan li laso. (unlikely reading)

jan li soweli tomo e soweli.
tenpo pini la soweli li soweli tomo ala.
tenpo ni la soweli li soweli tomo.

ni li ilo pi jan Mato
* jan li Mato.
jan Mato li jo e ilo. (possession)
ilo li jan Mato. (modifier)
ilo pi jan Mato li ni. (equivalence)

jan lon ma Mewika li jan pali lon tomo pali.
jan li lon ma Mewika.
jan li jan.
jan li pali. pali li lon tomo pali.

mi moku e kili tan ma Asija.
mi moku.
mi moku. moku ni li tan ma Asija. (what is the scope of the PP?)
kili li tan ma Asija. (what is the scope of the PP?)

Posted in machine assisted conlanging, toki pona | 1 Comment

The Heart of Esperanto

So I was wondering what a parser for Esperanto would look like if I only used the 16 rules.

I re-read them and quickly decided that, at the time of the 16 rules, the bulk of the language specification must have been in the dictionary and sample texts. Zamenhof didn’t know how to write a formal grammar; he’d have had to live another 50-75 years before formal grammars were in the popular imagination.

One fascinating and Na’vi-like feature of Esperanto is that modifiers can lead or follow the noun they modify. A lorem ipsum generator could help test whether these possibilities are workable. I suspect not – in a maximal phrase you wouldn’t be able to coordinate modifiers with what is being modified. I could be wrong, so let’s write a parser and find out.

Stems. Esperanto has something like 800 stems. This is a small lie of Esperanto, because with borrowing this has since turned into 8000+.

Words. Words are prefixes plus one or more stems plus derivational suffixes plus grammatical suffixes, which include part of speech suffixes.

Phrases. Phrases are noun phrases, verb phrases or prepositional phrases. I’m guessing there are also subordinate clauses.

Sentences. Sentences, it appears, are unordered collections of phrases. This is a little lie of Esperanto, because in practice people follow an order rigid enough to make the accusative unnecessary. Sentences can contain other sentences.

There is more to it, but I think I can write a mini-parser with just the above.
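
As a taste of the word layer, here is a sketch of stripping the grammatical suffixes. The endings themselves (-o, -a, -e, -j, -n, and the verb tenses) come straight from the 16 rules; the function shape and tag names are mine, and prefixes and derivational suffixes are ignored:

    #include <iostream>
    #include <string>

    // Peel the grammatical endings off an Esperanto word, outermost first:
    // accusative -n, plural -j, then the part-of-speech ending.
    std::string analyze(std::string w) {
        std::string tags;
        if (!w.empty() && w.back() == 'n') { w.pop_back(); tags = " +acc"; }
        if (!w.empty() && w.back() == 'j') { w.pop_back(); tags = " +plur" + tags; }
        for (const std::string& end : {"as", "is", "os", "us"})
            if (w.size() > 2 && w.compare(w.size() - 2, 2, end) == 0)
                return w.substr(0, w.size() - 2) + " = verb(" + end + ")" + tags;
        if (!w.empty()) {
            char c = w.back();
            std::string stem = w.substr(0, w.size() - 1);
            if (c == 'o') return stem + " = noun" + tags;
            if (c == 'a') return stem + " = adjective" + tags;
            if (c == 'e') return stem + " = adverb" + tags;
            if (c == 'i' || c == 'u') return stem + " = verb" + tags;
        }
        return w + " = particle" + tags;
    }

    int main() {
        for (const std::string& w : {"hundoj", "belan", "kuras", "rapide"})
            std::cout << w << " -> " << analyze(w) << "\n";
    }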

Posted in Esperanto | Comments Off

Math and logic subsystems in languages

So I can open a novel and, as part of being a human raised in an English-speaking community, I pretty much understand everything. I can open a textbook on Calculus or Logic, and while I can read the whole thing in English – there are even awkward but grammatically correct ways to read off the formulae – I’m not going to understand it just because I know English. I think this is some pretty conservative evidence that math and logic are not really natural languages; they are more like a foreign language embedded in a natural language.

So I was trying to deal with conjunctions in toki pona. Sometimes they are made unnecessary by the “chain pattern” – one similar structure after another implies “and”. Sometimes they act as discourse connectors, tagging a sentence with “or” or “but”. Those two forms of logic are effortless to parse (except when people ignore the chain pattern and try to explicitly add “and” words). Finally we get these monsters:

1) jan li suli. (simple, no “and”)
2) jan li suli li laso. (chain pattern, one right after another implies “and”)
3) jan li suli en laso. (different structures imply different meaning, maybe the qualities are mixed, like blue and red can be mixed)
4) jan li suli taso mute.
5) jan li suli anu mute.
6) jan li jan suli en mute.
7) jan li jan suli anu mute.
8) jan li jan pi suli en mute.
9) jan li suli en mute anu soweli. (mixed and, or, but)

3, 4, and 5 imply that you can “and”/“or”/“but” qualities without a head, so modifier phrases would be something like the structure below.

To parse all of the above, 1-8, you need a data structure that looks like this, and it will lead to some monstrous maximal forms:

Head modifier (optional)
(Maybe a pi, depending on whether you are a predicate or the modifier of a headed phrase)
Ands: en + modifiers — repeated
Ors: anu + modifiers — repeated
Buts: taso + modifiers — repeated

And maximally something like:

jan li suli en mute taso lili taso laso anu soweli anu waso. (Grouped)
jan li taso lili en suli anu soweli en mute taso laso anu waso. (Jumbled up.)

How to parse this? I have no idea; it reads like a logic puzzle, and you’d have to introduce a foreign logic system to do something with it. It looks syntactically valid. So I’m thinking my parser should represent a modifier chain as above, but make no claims about what it means. It parses one way, and if someone (ha! unlikely) ever decided to implement a logic subsystem, they could take this parse and then transform it into all the possible meanings, truth tables and so on.
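
Concretely, the parser output is just the grouped surface structure, something like this sketch (the names are mine; no semantics attached):

    #include <string>
    #include <vector>

    // One parse, no claims about meaning: a chain is kept as runs of
    // modifiers, one run per connective, in surface order.
    struct ModifierChain {
        std::string head;             // optional head, e.g. "jan"
        struct Group {
            std::string connective;   // "", "en", "anu", or "taso"
            std::vector<std::string> modifiers;
        };
        std::vector<Group> groups;
    };

    int main() {
        // jan li suli en mute anu soweli. (example 9 above)
        ModifierChain c;
        c.groups = {{"", {"suli"}}, {"en", {"mute"}}, {"anu", {"soweli"}}};
        (void)c;
    }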

But for these applications, we don’t care:

grammar check – it’s valid syntax.
glossing – it glosses to English, and is equally ambiguous and unintelligible in English.
syntax highlighting – you only need to recognize an “and”/“or”/“but” sequence to color the text; you don’t need to know what it means or settle on a single parse tree.
chat bot – a chat bot would never explore these corners of the universe of representable meanings that toki pona can represent.

Other Observations.

1) * jan li kepeken ilo en kepeken soweli. (Don’t use en to combine prep phrases.)
2) */? jan li tawa en kama. (Don’t use en when you can use li – but if this were a modifier chain in a predicate sentence, then it’s probably okay.)
3) * jan li kepeken ilo anu kepeken soweli. (Don’t “or” prep phrases.)
4) * jan li moku e ilo anu e soweli. Don’t use both anu and e; don’t use both taso and e. [Update: changed to moku because kepeken has had some recent POS confusion from toki pona version pu.]
5) */? ante jan li kepeken e ilo. Don’t use anything but anu or taso as a tag-conjunction.
6) * en jan li kepeken ilo. Don’t start a sentence with en. (En is implied, although it would have made for a nice audible sentence demarcation.)
7) ? waso pi laso en pimeja li pona tawa mi. This is really hard to parse. “And”-ing modifiers in the subject slot is only sometimes distinguishable from mistakes and from “and”-ing subjects.

Posted in machine assisted conlanging, toki pona | 2 Comments

Observations While Editing Toki Pona

1) People create incorrect proper modifiers constantly.
2) People give up trying to use proper modifiers and just use their native language for all proper modifiers.
3) li gets dropped for 3rd person subjects.
4) People essentially invent new constructions all the time. (It’s easier to enforce a ban on new words than to enforce a ban on new grammatical constructions.)
5) The complexities of parsing punctuation, ellipsis, fragments, asides/parentheticals, diglossia, direct speech, numbers, and “theater script” text (e.g. person1: “…” person2: “…”) can be as great as the complexities of the core grammar (sentence, subject, object, modifier, etc.). When there is no guidance, people just make stuff up.
6) People will do diglossia and mix in dates, times and numbers from a foreign language when the language spec doesn’t include them (or discourages using them!).
7) No one wants to punctuate poetry, which makes finding sentence terminations incredibly difficult.

Posted in machine assisted conlanging, toki pona | Comments Off

Abusing the word morphology as a pragmatic matter

So I have a syntax that relies on sentence templates like this:

[subject] li [verb phrase] li [verb phrase] [Prep Phrase] [Prep Phrase].

Phrases are made of words, and phrases have their slots. Then along come negation, emphasis, and uncertainty (with a desire for clarification, also known as questions). These can be slotted in in many places, which makes them feel like modifiers (adjectives and adverbs). But they can be several words (e.g. anu seme, x ala x), and they resist taking modifiers themselves, e.g. the following:

? jan li wile kin pi suli mute
? jan li wile ala pi suli mute
? jan li wile ala mute wile

I’m going to treat these as morphology. They have a different scope (or a different type of scoping rule) – usually one word, though they can have a larger scope. And they can be inserted (almost) anywhere, in more places than adjectives. Negation has an uncertain scope, but usually one word.

jan ala pi pali mute. Scope of 1 word
jan pi pali ala mute. Scope of 1 word.
jan pi pali mute ala. Ambiguous scope, but probably read as scoped to mute.

? jan ala pi pali ala mute ala. Slots in everywhere and means something in each slot.
jan laso pi pali laso mute laso. laso doesn’t actually imply anything different in the 2nd half. The heavy work load is still blue.

The ability to put ala into many places in a modifier chain implies that ala behaves like morphology – it works on a single word, not on a phrase.

jan pona kin.
jan ala pona.
jan pona ala.
jan pali ala
jan pali kin. Emphasis on pali. (Or on jan pali)
jan kin pali. Emphasis on jan. (And not on pali.)


jan suli walo == jan walo suli. Order didn’t matter (assuming that jan suli and jan walo don’t have special meanings the way jan pona does).

Anyhow, I’m not sure how to state it, but these feel like they should be modeled as a feature of a single word (and maybe also as a feature of phrases).

I think this differs from, say, a phrasal modifier, e.g. pi telo nasa, in that pi telo nasa modifies certain phrases of a certain sort and individual words, while ala can modify just about any word, even prepositions, verbs and modals. kin likewise can. And X anu X can be applied to any word in the verb phrase, maybe to prepositions.

They also mess up the template, because now the template has a lot of slots.

[subject(ala/kin/anu seme)] li [verb phrase(ala/kin/anu seme)] li [verb phrase(ala/kin/anu seme)] [Prep Phrase(ala/kin/anu seme)] [Prep Phrase(ala/kin/anu seme)].

It fits into so many slots that it is better to think of it as a way of transforming a word. So we get something that looks like morphology in a largely analytic language. And this should make the code for the parser easier to write (rather than adding numerous ala/kin/anu seme flags to each phrase), as in the sketch below.
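
In code, the flags live on the word type itself. A sketch:

    #include <string>
    #include <vector>

    // Treat ala/kin/anu seme as features of a single word rather than as
    // separate phrase nodes: morphology on an analytic word.
    struct Word {
        std::string lemma;
        bool negated = false;      // ala
        bool emphasized = false;   // kin
        bool questioned = false;   // anu seme, x ala x
    };

    int main() {
        // jan pi pali ala mute: ala scopes to the single word pali.
        std::vector<Word> phrase = {{"jan"}, {"pali", true}, {"mute"}};
        (void)phrase;
    }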

Posted in toki pona | 1 Comment

toki pona minutiae

These are all unsettled, controversial ideas in toki pona.

1) kule lon palisa li pona tawa mi.

2) kule pi lon palisa li pona tawa mi.

Prepositional phrases as modifiers. These are most obvious when they appear in the subject. When they appear at the end of a sentence, you could always argue that the phrase modifies the whole sentence, or whatever the reader thinks makes sense. Almost no one does 2). I plan to write the parser to treat prepositional phrases as exchangeable with any tail, single-word modifier (i.e. they are modifiers that come last).

1) ni li lon insa poka.

2) ni li lon insa poka suli.

3) ni li lon insa pi poka suli.

4) jan li tawa en tan ma suli.

Complex prepositions. People act like they can be joined, and’ed and or’ed. Some argue that the tail of a complex preposition is actually a noun, but then no one adds the pi for its modifiers, as in 3). And in 4) it is impossible not to read it as a complex preposition. I plan to parse preps as potentially complex, as in 2) and 3).

1) 543 = 5 (hundred) 4 (tens) 3 (ones)

A decimal system is ten digits and some place names, for example a place name every three places. The choice of words isn’t so important (words for commonly measured things, e.g. tenpo, or approximate values, e.g. suli, mute, lili, are bad choices). I plan to implement numbers as parameterizable templates, as sketched below.
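
A sketch of what parameterizable means here: the digit words and the place words are arguments, so swapping in better word choices later costs nothing. The English placeholders below are obviously not a proposal for the actual toki pona words:

    #include <iostream>
    #include <string>
    #include <vector>

    // Render n with one digit word plus one place word per position,
    // e.g. 543 -> "5 (hundred) 4 (tens) 3 (ones)".
    std::string render(int n, const std::vector<std::string>& digits,
                       const std::vector<std::string>& places) {
        std::vector<int> ds;
        for (; n > 0; n /= 10) ds.push_back(n % 10);
        std::string out;
        for (int i = static_cast<int>(ds.size()) - 1; i >= 0; --i) {
            out += digits[ds[i]] + " " + places[i];
            if (i) out += " ";
        }
        return out;
    }

    int main() {
        std::vector<std::string> digits = {"0", "1", "2", "3", "4",
                                           "5", "6", "7", "8", "9"};
        std::vector<std::string> places = {"(ones)", "(tens)", "(hundred)"};
        std::cout << render(543, digits, places) << "\n";
    }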

1) jan li toki e kalama Wakawaka.

2) jan li toki e Wakawaka.

3) mi kute e Tinton li pini lape.

4) soweli li Mijawu li Mijawu.

5) Mijawu!

Onomatopoeia is so rare that the particular answer should not matter much. Forcing these to be proper modifiers sort of works, except that in 5) it’s pedantic to make such an exclamation a modifier, and in 2) it’s probably redundant to mention again that this is a sound, human sound, or animal sound (kalama, nimi, mu). These should be allowed to stand bare, but possibly punctuated to distinguish them from accidental unheaded proper modifiers.

1) jan li kepeken ilo kepeken soweli.

2) jan li kepeken ilo en soweli.

toki pona has these things that look like chains linked by particles. Sometimes the pattern seems to fight against recursive patterns; sometimes it seems to give up and just do recursion. For example, x en y en z can be read as an unordered list of elements. This is a very simple data structure, and it fits with the design goals of a simple language. a pi b c en a2 pi b2 c2 is something of a tree structure, at least. There are chains joined by chains.

The prepositional phrases are oddballs, because if you string them together, the particle is … blank, e.g. jan li sama soweli kepeken ilo. The chain of prep phrases can be analyzed either as having a particle with six forms, or as having a blank particle with headed prepositional phrases, or possibly the conjunctions are ordinary chains. People seem to use 2) above, i.e. prepositional phrases are not chains like en or pi. Anyhow, 1) and 2) are semantically equal, and there isn’t a compelling reason to make 2) illegal – for one thing, banning it makes texts clunkier and implies that a string of prep phrases should group together, e.g. sama A sama B kepeken X kepeken Y. I sort of want to make sure you can treat prepositional phrases as chains (it’s sort of elegant to treat all the phrases as the same data structure), but I won’t write my parser to expect them to be treated as chains.

Posted in toki pona | 3 Comments

Object Oriented Conlanging

If you don’t write code, you may feel like the wrong audience.

So recently I’ve been on a kick of learning about object-oriented library writing, especially for the most basic types. I notice that MSDN’s advice for writing basic types, plus the exercise of writing a library, acts as a sort of model-exploration exercise that leads to surprising discoveries.

Why a Conlang?
Surely natural language processing libraries exist for real languages like French and English? Yeah, they do, but French and English are so mind-bogglingly complex that writing code to parse them is a task that has befuddled developers for decades. Even established conlangs are much smaller, even those that weren’t initially designed to be small or limited in their complexity.

Also, during the creation of a conlang, out of necessity, it is small, if only because the vast majority hasn’t been imagined yet.

One more reason: you may overly restrict yourself by considering only data structures that are peculiar to English and French, when one possible goal of a conlang is to create a human communication system that goes beyond what can be done in well-known natural languages. (This contrasts with the naturalistic-conlang goal of creating a reference grammar that fools professional linguists – a goal that puts some limits on breaking new ground into new techniques for communication strategies.)

Basic Types in a Conlang
Words. The basic type should be a token, not a string. Strings are not atomic enough.
Morphology. The language I am using (toki pona) as my model doesn’t have any morphology to speak of.
Phrases. Here is where I discovered the most. It turned out to be helpful to have many specialized classes that can exploit information beyond what a formal grammar might know. For example, some words are likely agents, and those agents can own things, be the agent of a transitive verb, and so on.
Forgotten Structures. We almost never formally deal with punctuation, quoted text, diglossia, numbers, dates, animal sounds and so on. When you work with real sample texts, you encounter these issues immediately and realize that they can’t be ignored and deserve to be treated as concepts on par in importance with subject, object and prepositional phrase.
Alternative parsing. A machine parser will yield only one parse, automatically, for free, because you don’t have any other choice. However, the formal grammar will suggest that for any given text many parsings are legal. I think letting the machine pretend there is only one parsing is pragmatic – it forces the language creator to decide how much ambiguity is tolerable.

Basic Applications of a Conlang Library
Word Processing. If you can parse it, it is potentially valid syntactically. A prerequisite to parsing is dictionary lookup, so if you can parse, you can also support spell-check.

Lorem Ipsum Generation. If you can turn text into a data structure, you can turn data structures into text, and you can do it randomly. This will illustrate which sorts of structures are possible, validating both the language design and the parser.

Knowledge Representation. I wrote about this recently, but in short, if you can generate random sentences, you can take a sentence template and generate complete sentences using a data table, say a phone book. And with a parser, you can turn natural language questions into queries and commands to get data back from a phone book.

Concordance. I would almost suggest that existing concordance tools are good enough and don’t need to be language-specific.

Posted in machine assisted conlanging | Comments Off

Corpus linguistics and little sins

Soooo, this toki pona project of mine. I’m parsing community-generated texts, and I’ve got incompatible goals. On one hand, I want a parser that will work: if you type English, it should blow up, and if you type beautiful, conservative toki pona, it should parse it. But if it is sloppy toki pona, I don’t really want to make a big deal of it. So you forgot to capitalize, forgot a period, forgot a closing quote, forgot the li, added a period instead of a comma before a li, and so on. I’m not going to add new rules to try to deal with these. So at the moment, I normalize them; I just fix them, as in the sketch below.
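
The normalization pass is just dumb text repair before parsing. A sketch covering two of the sins above (a period instead of a comma before li, and a missing final period); everything else would be more rules of the same shape:

    #include <cctype>
    #include <iostream>
    #include <string>

    std::string normalize(std::string s) {
        // Repair "X. li Y" back to "X, li Y".
        const std::string bad = ". li ", good = ", li ";
        for (size_t p; (p = s.find(bad)) != std::string::npos; )
            s.replace(p, bad.size(), good);
        // Trim trailing spaces and restore a forgotten final period.
        while (!s.empty() && std::isspace(static_cast<unsigned char>(s.back())))
            s.pop_back();
        if (!s.empty() && s.back() != '.' && s.back() != '?' && s.back() != '!')
            s += '.';
        return s;
    }

    int main() {
        std::cout << normalize("ona li moku. li lape ") << "\n";
        // ona li moku, li lape.
    }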

Then there is this:

meli li tawa en tan lon palisa.

Ignoring what it might mean, it’s a compound prepositional phrase, just like English “No smoking in or around the school.” It could have been written:

meli li tawa lon palisa li tan lon palisa kin.

But that would just sound pedantic. Some stuff is sort of borderline.

How do you deal with noises?

jan li owi. => The guy said ouch!

I could fix it to:

jan li mu owi. => The guy made a noise like “ouch”.

But that looks pedantic and I don’t like the word mu all that much. It sounds too much like a cow and you have to add another word to indicate the actual sound.

How do you deal with defective names?

nimi mi li nimi ‘jan Laowi’

Fixing them would make parsing easier, but would get in the way of communication, especially if a name is already well known.

Posted in toki pona | 2 Comments

Edge cases of grammar

These things will make your grammar more complicated, but you can expect them to show up in any community-generated corpus, like, immediately.

Onomatopoeia. Fart noises and the like. In English, they get italicized and I’m sure someone has written a paper on what part of speech they are.

Direct speech. Thought you didn’t have embedded sentences? Now you do.

Fragments. If a sentence is cut off, either on purpose or accident, grammatically what can you do with it? If understanding a sentence is a process of parsing by applying syntactical rules, what are the rules for dealing with fragments?

Diglossia. If you mix English and another language, what are the rules for swapping parts out?

Errors. This one stumps me. When computers parse, if one word is off, the computer completely fails; it can’t do anything with that program. But for human speech, if ten things are wrong, we apply a set of syntax-like rules to fix it up and we don’t even notice. Sound absurd? This is essentially how modems worked, with error correction and checksums. If you’ve ever used an application called ReSharper, it does a similar thing for programming languages: it uses static analysis to find syntax mistakes and suggest corrections.

Punctuation. Think you don’t need punctuation? Think again– a simple grammar can yield dozens of alternative parsings. Punctuation brings that down to a manageable level. If it needs to be audible punctuation like lojban, that’s another story.

Compound words (and neologisms). Think you only have 1000 words? Almost immediately, phrasal compound words will appear. Interestingly, if you treat them as compound words, the parser does better glosses and you have fewer alternative (wrong) parsings. toki pona and Klingon both have this issue, since both languages have a fixed number of bound and unbound morphemes (one by design, one by community choice). The alternative of just imagining all these common word pairings to be “ad hoc” phrases is really just dishonest.

I’m running into all of these issues when trying to machine-parse a toki pona corpus – that language only has about 10 rules in its formal grammar. But my parser just keeps getting more and more lines of code to deal with issues like the above.

Posted in conlang design, conlang use | Comments Off