Dictionaries for toki pona

I read about dictionary making for Algonquian, a highly synthetic language with few unbound morphemes. Everything of interest is a bound morpheme. Full words necessarily drag along a lot of other cruft, as if a dictionary had a definition for unsympathetically but sympathetic wasn’t allowed to be a standalone word.

Surprisingly, toki pona is like that. toki pona has compound words. If you are a grumpy cat, you can call them collocations (words that commonly appear together), or you can just call them compound words, because they behave rather like two-stem words in languages with bound morphemes. Beyond that, we have “templates”.

Noun Phrases (content phrase)
jan pona. This is a perfect compound word. It takes modifiers, resists splitting, and it has two “slots”– stuff goes before it and after it.

These phrases have little internal structure. They are useful for machine parsing, and the traditional dictionary just works: you can look up words by their head word and life is beautiful.

kin la. == really. Also a good compound word, it has two slots, you can put more la phrases before, a sentence after and that is it.

Verb Phrase
Verb phrases are closer to templates because the head verb is one word.
kama sona. This isn’t a perfect compound word; it has three slots: [0] kama [1] sona [2]. The head verb is still kama, and you can add modals before it and negation, intensification and adverbs after it. Stuff after sona describes sona, not the kama sona phrase.

Templates are a lousy fit for a traditional dictionary. The head word could be in a variety of places. Sometimes the template doesn’t rely on any specific word, e.g.

x li y tawa z. z feels y towards x. (where y is usually pona or ike, z is usually an agent.)

I don’t even know where to put that in dictionary alphabetical order. I feel like I’m back in Algonquian again.

Sentences
mun li pimeja e suno. eclipse. This almost doesn’t feel like a template anymore. Using it in a sentence requires extensive rework. It has at least 4 template points, not counting all the optional things available to the maximal sentence.

Other patterns.
kule lon palisa luka. Fingernail polish. This is also a template with significant internal structure.

Advice:
Keep the templates separate from untemplated definitions.
Be explicit about the slots in templates.
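
A minimal sketch of what that advice could look like in code, assuming a dictionary kept as plain Python data. The class name, field names and the `fill` helper are all hypothetical; the phrases and glosses are the ones from this post.

```python
from dataclasses import dataclass

# Untemplated entries: ordinary lookups, sorted by head word.
PLAIN_ENTRIES = {
    "jan pona": "friend",
    "kin la": "really",
}

@dataclass(frozen=True)
class Template:
    """A dictionary entry whose slots are named explicitly (hypothetical design)."""
    pattern: str  # toki pona pattern with named slots
    gloss: str    # English gloss that reuses the same slot names

    def fill(self, **slots: str) -> str:
        # Substitute each named slot into the toki pona pattern.
        return self.pattern.format(**slots)

# Templates live in their own list because they have no single head word to sort by.
TEMPLATES = [
    Template("{x} li {y} tawa {z}.", "{z} feels {y} towards {x}."),
]

print(TEMPLATES[0].fill(x="soweli", y="pona", z="mi"))  # soweli li pona tawa mi.
```

Keeping the two tables apart means the alphabetical lookup never has to guess where a template belongs.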

Unrelated advice:
Be wary of unwarranted glosses and translations.
jan Sonja said telo is sauce, so I guess it is.
If I say telo means rocket fuel, it’s an unwarranted translation unless there is some text to set that up.

Posted in lexicography, machine assisted conlanging, Virginian Algonquian

Paragraphs and machine assisted conlang work

When you are teaching a machine to do something with a language, you get a surprising set of challenges that are not mentioned much in your traditional reference grammar.

On the internet, people post texts and line breaks are unreliable indicators of paragraphs. White space can appear for a variety of reasons.

e.g. Line break in middle of sentence– exceedingly common.
jan ilo li wile
pali.

e.g. Double space in middle of sentence– enough to make a double space an unlikely paragraph mark.
jan ilo li wile

pali.

e.g. Line break is actually HTML. Sometimes HTML p tags aren’t really paragraphs.
jan ilo li wile
< br >
pali.

e.g. Tab starting a paragraph is actually a few spaces. Sometimes those spaces disappear and sometimes they are just spaces.
   jan ilo li wile pali.

Solutions.
Explicit paragraph marker, for example, four dashes centered on a page, like the divider you see in some novels between “scenes”.
Assume double space is a paragraph. This is wrong a lot of the time.
Synthetic paragraphs. Apply rules such as this: Any sentence ending in ni: is in the same paragraph as the following sentence. Any vocative followed by a sentence is in the same paragraph. Quoted text initiates a new paragraph. However, I suspect this would be a lot of work and would fail, resulting in too many synthetic paragraphs that ‘consume’ the entire text.
Ignore the problem and turn everything into a series of sentences, or a huge single paragraph.
Two parsing modes. Strict and Loosey-Goosey. In strict mode, paragraphs are started by tabs. In Loosey-Goosey mode, tabs and blank lines are assumed to be paragraph breaks and it is just accepted that this will be wrong a lot of the time.
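
Here is a rough sketch of the two-mode idea, not a finished implementation. The function name `split_paragraphs` and its `strict` flag are my own inventions for illustration; the rules are just the ones described above.

```python
def split_paragraphs(text: str, strict: bool = True) -> list[str]:
    """Split raw text into paragraphs.

    strict: only a line that starts with a tab begins a new paragraph.
    loose:  a leading tab or a preceding blank line begins a new paragraph.
    """
    paragraphs: list[list[str]] = []
    after_blank = True  # the start of the text counts as a break
    for line in text.splitlines():
        if not line.strip():
            after_blank = True
            continue
        starts_new = line.startswith("\t")
        if not strict:
            starts_new = starts_new or after_blank
        if starts_new or not paragraphs:
            paragraphs.append([])
        paragraphs[-1].append(line.strip())
        after_blank = False
    # Re-join the lines of each paragraph with single spaces.
    return [" ".join(p) for p in paragraphs]

sample = "\tjan ilo li wile\n\npali.\n\tjan li pali."
print(split_paragraphs(sample, strict=True))   # ['jan ilo li wile pali.', 'jan li pali.']
print(split_paragraphs(sample, strict=False))  # ['jan ilo li wile', 'pali.', 'jan li pali.']
```

The loose result shows exactly the failure mode accepted above: a stray blank line in the middle of a sentence becomes a paragraph break.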

Posted in machine assisted conlanging, toki pona

Inferences from toki pona utterances

So I’m noodling again on the idea of knowledge representation with toki pona.

If I say: jan li laso. And then ask: seme li laso? Then it should be easy enough to look up jan. (That is to bind seme to jan)

But if the structure is different but equivalent, then I need to generate the equivalencies or inferences (in the case where the derived sentence has less information).

ni li ilo jan.
jan li jo e ilo. (possession)
ilo li jan. (adjective.)
ilo jan li ni. (equivalence)

jan li moku e kili.
jan li moku. (direct objects are optional)
!= moku li jan. (can’t reverse order)
? kili li moku. (the fruit is eaten?)

ni li ilo laso jan.
ilo li laso li jan. (Both adjectives, unlikely reading here.)
ilo li laso. jan li jo e ilo laso. (adjective & possession)

ni li ilo jan laso.
jan-ilo li laso. (compound with adjective)
ilo li jan li laso. (unlikely reading)

jan li soweli tomo e soweli.
tenpo pini la soweli li soweli tomo ala.
tenpo ni la soweli li soweli tomo.

ni li ilo pi jan Mato
* jan li Mato.
jan Mato li jo e ilo. (possession)
ilo li jan Mato. (modifier)
ilo pi jan Mato li ni. (equivalence)

jan lon ma Mewika li jan pali lon tomo pali.
jan li lon ma Mewika.
jan li jan.
jan li pali. pali li lon tomo pali.

mi moku e kili tan ma Asija.
mi moku.
mi moku. moku ni li tan ma Asija. (what is the scope of the PP?)
kili li tan ma Asija. (what is the scope of the PP?)
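
As a sketch of how one of these inferences could be generated and then queried, here is a toy that only understands the flat pattern "SUBJECT li PREDICATE [e OBJECT]". The function names are hypothetical, and sentences without li (such as mi moku) are deliberately out of scope.

```python
def parse_simple(sentence: str) -> dict[str, list[str]]:
    """Split 'SUBJECT li PREDICATE [e OBJECT]' into its parts."""
    words = sentence.rstrip(".?!").split()
    li = words.index("li")  # raises ValueError if there is no li
    rest = words[li + 1:]
    if "e" in rest:
        e = rest.index("e")
        predicate, obj = rest[:e], rest[e + 1:]
    else:
        predicate, obj = rest, []
    return {"subject": words[:li], "predicate": predicate, "object": obj}

def drop_object(sentence: str) -> str:
    """Inference: direct objects are optional, so 'x li v e y' implies 'x li v'."""
    p = parse_simple(sentence)
    return " ".join(p["subject"] + ["li"] + p["predicate"]) + "."

def answer(facts: list[str], question: str) -> list[str]:
    """Bind seme in the subject slot to the subject of any matching fact."""
    q = parse_simple(question)
    if q["subject"] != ["seme"]:
        return []
    found = []
    for fact in facts:
        f = parse_simple(fact)
        if f["predicate"] == q["predicate"] and f["object"] == q["object"]:
            found.append(" ".join(f["subject"]))
    return found

facts = ["jan li laso.", "jan li moku e kili."]
print(answer(facts, "seme li moku?"))  # [] because the stored fact has an object
facts.append(drop_object(facts[1]))    # adds the derived fact "jan li moku."
print(answer(facts, "seme li moku?"))  # ['jan']
```

The lookup only succeeds after the derived sentence is added, which is the point: structurally different but equivalent sentences have to be generated before binding can find them.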

Posted in machine assisted conlanging, toki pona

The Heart of Esperanto

So I was wondering what a parser for Esperanto would look like if I only used the 16 rules.

I re-read them and quickly decided that actually, at the time of the 16 rules, the bulk of the language specification must have been in the dictionary and sample texts. Zamenhof didn’t know how to write a formal grammar; he’d probably have to live another like 50-75 years before formal grammars were in the popular imagination.

One fascinating and Na’vi-like feature of Esperanto is that modifiers can lead or follow the noun they modify. A lorem ipsum generator could help test if these possibilities are workable. I suspect not: in a maximal phrase you wouldn’t be able to coordinate modifiers with what is being modified. I could be wrong, so let’s write a parser and find out.

Stems. Esperanto has like 800 stems. This is a small lie of Esperanto because, with borrowing, this has since turned into 8000+.

Words. Words are prefixes plus one or more stems plus derivational suffixes plus grammatical suffixes, which include part of speech suffixes.

Phrases. Phrases are noun phrases, verb phrases or prepositional phrases. I’m guessing there are also subordinate clauses.

Sentences. Sentences, it appears, are unordered collections of phrases. This is a little lie of Esperanto, because people in practice follow an order rigid enough to make the accusative unnecessary. Sentences can contain other sentences.

There is more to it, but I think I can write a mini-parser with just the above.
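
As a first hedged step toward that mini-parser, here is a sketch that peels the grammatical endings off a single word: -o noun, -a adjective, -e adverb, -j plural, -n accusative, and the six verb endings. The function name and the shape of the result are my own assumptions. It has no stem list, so closed-class words such as la or mi will be misanalyzed unless a lexicon filters them out first.

```python
VERB_ENDINGS = {"as": "present", "is": "past", "os": "future",
                "us": "conditional", "u": "imperative", "i": "infinitive"}
POS_ENDINGS = {"o": "noun", "a": "adjective", "e": "adverb"}

def analyze(word: str) -> dict:
    """Peel grammatical suffixes off one regular Esperanto word."""
    w = word.lower()
    accusative = w.endswith("n")
    if accusative:
        w = w[:-1]
    plural = w.endswith("j")
    if plural:
        w = w[:-1]
    # Verbs never take -j or -n, so only try verb endings on bare words.
    if not (plural or accusative):
        for ending, form in VERB_ENDINGS.items():
            if len(w) > len(ending) and w.endswith(ending):
                return {"stem": w[:-len(ending)], "pos": "verb", "form": form,
                        "plural": False, "accusative": False}
    for ending, pos in POS_ENDINGS.items():
        if len(w) > 1 and w.endswith(ending):
            return {"stem": w[:-1], "pos": pos, "form": None,
                    "plural": plural, "accusative": accusative}
    # Nothing matched: hand the word back untouched.
    return {"stem": word.lower(), "pos": "unknown", "form": None,
            "plural": False, "accusative": False}

print(analyze("hundojn"))  # noun, stem 'hund', plural and accusative
print(analyze("kuras"))    # verb, stem 'kur', present
```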

Posted in Esperanto

Math and logic subsystems in languages

So I can open a novel, and as part of being a human raised in an English speaking community, I pretty much understand everything. I can open a textbook on Calculus or Logic and while I can read the whole thing in English (there are even awkward but grammatically correct ways to read off the formulae), I’m not going to understand it just because I know English. I think this is some pretty conservative evidence that math and logic are not really natural languages; they are more like a foreign language embedded into a natural language.

So I was trying to deal with conjunctions in toki pona. Sometimes they are made unnecessary by the “chain pattern”: one similar structure after another implies “and”. Sometimes they act as discourse connectors, tagging a sentence with “or” or “but”. Those two forms of logic are effortless to parse (except when people ignore the chain pattern and try to explicitly add “and” words). Finally we get these monsters:

1) jan li suli. (simple, no “and”)
2) jan li suli li laso. (chain pattern, one right after another implies “and”)
3) jan li suli en laso. (different structures imply different meaning, maybe the qualities are mixed, like blue and red can be mixed)
4) jan li suli taso mute.
5) jan li suli anu mute.
6) jan li jan suli en mute.
7) jan li jan suli anu mute.
8) jan li jan pi suli en mute.
9) jan li suli en mute anu soweli. (mixed and, or, but)

3, 4, and 5 imply that you can “and”/“or”/“but” qualities without a head.

To parse all of the above, 1-8, you need a data structure that looks like this, and it will lead to some monstrous maximal forms:

Head modifier (optional)
(Maybe a pi, depends on if you are predicate or modifier of a headed phrase)
Ands: en + modifiers — repeated
Ors: anu + modifiers — repeated
buts: taso + modifiers — repeated

And maximally something like:

jan li suli en mute taso lili taso laso anu soweli anu waso. (Grouped)
jan li taso lili en suli anu soweli en mute taso laso anu waso. (Jumbled up.)

How to parse this? I have no idea, it reads like a logic puzzle and you’d have to introduce a foreign logic system to do something with it. It looks syntactically valid. So I’m thinking my parser should represent a modifier chain as above, but make no claims about what it means. So it parses one way, and if someone (ha! unlikely) ever decided to implement a logic subsystem, they could take this parse and then transform it into all the possible meanings, truth tables and so on.
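
A minimal sketch of that “parse it, but make no claims about meaning” structure, assuming the three buckets listed earlier. The names `ModifierChain` and `parse_chain` are hypothetical, and pi is not handled.

```python
from dataclasses import dataclass, field

# Which bucket each conjunction opens.
CONJUNCTIONS = {"en": "ands", "anu": "ors", "taso": "buts"}

@dataclass
class ModifierChain:
    head: list[str] = field(default_factory=list)        # words before any conjunction
    ands: list[list[str]] = field(default_factory=list)  # one entry per "en ..." group
    ors: list[list[str]] = field(default_factory=list)   # one entry per "anu ..." group
    buts: list[list[str]] = field(default_factory=list)  # one entry per "taso ..." group

def parse_chain(words: list[str]) -> ModifierChain:
    """Sort modifiers into buckets by the conjunction that introduces them."""
    chain = ModifierChain()
    current = chain.head
    for word in words:
        if word in CONJUNCTIONS:
            current = []  # start a new group in the matching bucket
            getattr(chain, CONJUNCTIONS[word]).append(current)
        else:
            current.append(word)
    return chain

grouped = parse_chain("suli en mute taso lili taso laso anu soweli anu waso".split())
jumbled = parse_chain("taso lili en suli anu soweli en mute taso laso anu waso".split())
print(grouped)
print(jumbled)
```

Both maximal examples land in the same three buckets, which is all the parser promises. The relative order of the conjunctions is thrown away, so a logic subsystem that cared about order would need a richer structure.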

But for these applications, we don’t care:

grammar check– it’s valid syntax.
glossing– It glosses to English, and is equally ambiguous and unintelligible in English.
syntax highlighting– you only need to recognize an “and”/”or”/”but” sequence to color the text, you don’t need to know what it means or parse it as just one parse tree.
chat bot– A chat bot would never explore these corners of possible meaning in the universe of representable meanings that toki pona can represent.

Other Observations.

1) * jan li kepeken ilo en kepeken soweli. (Don’t use en to combine prep phrases)
2) */? jan li tawa en kama. (Don’t use en when you can use li, but if this was a modifier chain and a predicate sentence, then it’s probably okay)
3) * jan li kepeken ilo anu kepeken soweli. (Don’t “or” prep phrases)
4) * jan li moku e ilo anu e soweli. Don’t use both anu and e, don’t use both taso and e. [Update: changed to moku because kepeken has had some recent POS confusion from toki pona version pu.]
5) */? ante jan li kepeken e ilo. Don’t use anything but anu or taso as a tag-conjunction.
6) * en jan li kepeken ilo. Don’t start sentence with en. (En is implied, although it would have made for a nice audible sentence demarcation)
7) ? waso pi laso en pimeja li pona tawa mi. This is really hard to parse. “and”ing modifiers in the subject slot is only sometimes distinguishable from mistakes and “and”ing subjects.

Posted in machine assisted conlanging, toki pona

Observations While Editing Toki Pona

1) People create incorrect proper modifiers constantly.
2) People give up trying to use proper modifiers and just use their native language for all proper modifiers.
3) li gets dropped for 3rd person subjects.
4) People essentially invent new constructions all the time. (It’s easier to enforce a ban on new words than it is to enforce a ban on new grammatical constructions.)
5) The complexities of parsing punctuation, ellipsis, fragments, asides/parentheticals, diglossia, direct speech, numbers, and “theater script” text, e.g. person1: “…” person2: “…”, can be as great as the complexities of the core grammar (sentence, subject, object, modifier, etc.). When there is no guidance, people just make stuff up.
6) People will do diglossia and mix in dates, times, numbers from a foreign language when the language spec doesn’t include them (or discourages using them!)
7) No one wants to punctuate poetry, which makes finding sentence termination incredibly difficult.

Posted in machine assisted conlanging, toki pona

Abusing the word morphology as a pragmatic matter

So I have a syntax that relies on sentence templates like this:

[subject] li [verb phrase] li [verb phrase] [Prep Phrase] [Prep Phrase].

Phrases are made of words and phrases have their slots. Then along come negation, emphasis, and uncertainty (with a desire for clarification, also known as questions). These can be slotted in in many places, which makes them feel like modifiers (adjectives and adverbs). But they can be several words (e.g. anu seme, x ala x) and they resist taking modifiers themselves, e.g. the following:

? jan li wile kin pi suli mute
? jan li wile ala pi suli mute
? jan li wile ala mute wile

I’m going to treat these as morphology. They have a different scope (or different type of scoping rules), usually one word. (They can have a larger scope.) And they can be inserted (almost) anywhere, in more places than adjectives. Negation has an uncertain scope, but usually one word.

jan ala pi pali mute. Scope of 1 word
jan pi pali ala mute. Scope of 1 word.
jan pi pali mute ala. Ambiguous scope, but probably read as scoped to mute.

? jan ala pi pali ala mute ala. Slots in everywhere and means something in each slot.
jan laso pi pali laso mute laso. laso doesn’t actually imply anything different in the 2nd half. The heavy work load is still blue.

The ability to put ala into many places in a modifier chain implies that ala behaves like morphology: it works on a single word, not on a phrase.

jan pona kin.
jan ala pona.
jan pona ala.
jan pali ala
jan pali kin. Emphasis on pali. (Or on jan pali)
jan kin pali. Emphasis on jan. (And not on pali.)

But

jan suli walo == jan walo suli. Order doesn’t matter (assuming that jan suli and jan walo don’t have special meanings the way jan pona does).

Anyhow, not sure how to state it, but these feel like they should be modeled as a feature of a single word. (And maybe also as a feature of phrases).

I think this differs from, say, a phrasal modifier, e.g. pi telo nasa, in that pi telo nasa modifies phrases of a certain sort and individual words, while ala can modify just about any word, even prepositions, verbs, modals. kin likewise can. And X ala X can be applied to any word in the verb phrase, maybe to prepositions.

They also mess up the template, because now the template has a lot of slots.

[subject(ala/kin/anu seme)] li [verb phrase(ala/kin/anu seme)] li [verb phrase(ala/kin/anu seme)] [Prep Phrase(ala/kin/anu seme)] [Prep Phrase(ala/kin/anu seme)].

It fits into so many slots that it is better to think of it as a way of transforming a word. So we get something that looks like morphology in a largely analytic language. And this should make the code for the parser easier to write (rather than adding numerous ala/kin/anu seme flags to each phrase).
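
A sketch of what “a feature of a single word” might look like in code. The `Word` class and `attach_features` are hypothetical names, and this toy only covers a trailing ala or kin scoped to the one word before it; x ala x questions and anu seme are left out.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    negated: bool = False     # the word was followed by ala
    emphasized: bool = False  # the word was followed by kin

def attach_features(tokens: list[str]) -> list[Word]:
    """Fold ala and kin into the preceding word instead of giving them slots."""
    words: list[Word] = []
    for token in tokens:
        if token == "ala" and words:
            words[-1].negated = True
        elif token == "kin" and words:
            words[-1].emphasized = True
        else:
            words.append(Word(token))
    return words

for phrase in ("jan ala pi pali mute", "jan pi pali ala mute", "jan kin pali"):
    print(attach_features(phrase.split()))
```

The phrase templates never see ala or kin at all, so they keep their original slot count.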

Posted in toki pona

toki pona minutia

These are all unsettled, controversial ideas in toki pona.

1) kule lon palisa li pona tawa mi.

2) kule pi lon palisa li pona tawa mi.

Prepositional phrases as modifiers. These are most obvious when they appear in the subject. When they appear at the end of a sentence, then you could always argue that the phrase modifies the whole sentence or whatever the reader thinks makes sense. Almost no one does 2). I plan to write the parser to treat prepositional phrases as exchangeable with any tail, single word modifier (i.e. they are modifiers that come last).

1) ni li lon insa poka.

2) ni li lon insa poka suli.

3) ni li lon insa pi poka suli.

4) jan li tawa en tan ma suli.

Complex prepositions. People act like they can be joined, and’ed and or’ed. Some argue that the tail of a complex preposition is actually a noun, but then no one adds the pi for its modifiers, like in 3). And in 4) it is impossible not to look at it as a complex preposition. I plan to parse preps as potentially complex as in 2) and 3).

1) 543 = 5 (hundred) 4 (tens) 3 (ones)

A decimal system is ten digits and some place names, for example, a place name every three places. The choice of words isn’t so important (though words for commonly measured things, e.g. tenpo, or for approximate values, e.g. suli, mute, lili, are bad choices). I plan to implement numbers as parameterizable templates.
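
A sketch of what a parameterizable number template might look like. The digit words and the group name here are placeholders (d0 to d9 and K), not a proposal for real toki pona words, and `render_number` is a hypothetical helper.

```python
def render_number(n: int, digits: list[str], group_names: list[str]) -> str:
    """Spell a non-negative integer positionally.

    digits:      ten words for 0-9 (the vocabulary is a parameter).
    group_names: a name for each group of three places, lowest group first;
                 an empty string leaves that group unnamed.
    """
    if n < 0 or len(digits) != 10:
        raise ValueError("need a non-negative number and exactly ten digit words")
    # Split the decimal string into groups of three, starting from the right.
    groups = []
    s = str(n)
    while s:
        groups.append(s[-3:])
        s = s[:-3]
    parts = []
    for index in reversed(range(len(groups))):
        words = [digits[int(ch)] for ch in groups[index]]
        name = group_names[index] if index < len(group_names) else ""
        parts.append("-".join(words + ([name] if name else [])))
    return " ".join(parts)

PLACEHOLDER_DIGITS = [f"d{i}" for i in range(10)]  # stand-ins, not real words
print(render_number(543, PLACEHOLDER_DIGITS, [""]))        # d5-d4-d3
print(render_number(2015, PLACEHOLDER_DIGITS, ["", "K"]))  # d2-K d0-d1-d5
```

Swapping the vocabulary means passing a different list; the positional logic never changes.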

1) jan li toki e kalama Wakawaka.

2) jan li toki e Wakawaka.

3) mi kute e Tinton li pini lape.

4) soweli li Mijawu li Mijawu.

5) Mijawu!

Onomatopoeia is so rare that the particular answer should not matter much. Forcing these to be proper modifiers sort of works, except that in 5) it is pedantic to make such an exclamation a modifier, and in 2) it is probably redundant to mention again that this is a sound, human sound, or animal sound (kalama, nimi, mu). These should be allowed to stand bare, but possibly punctuated to distinguish them from accidental unheaded proper modifiers.

1) jan li kepeken ilo kepeken soweli.

2) jan li kepeken ilo en soweli.

toki pona has these things that look like chains linked by particles. Sometimes the pattern seems to fight against recursive patterns, sometimes it seems to give up and just do recursion. For example, x en y en z can be read as an unordered list of elements. This is a very simple data structure; it fits with the design goals of a simple language. a pi b c en a2 pi b2 c2 is something of a tree structure at least. There are chains joined by chains. The prepositional phrases are oddballs because if you string them together, the particle is … blank, e.g. jan li sama soweli kepeken ilo. The chain of prep phrases can be analyzed as having a particle with six forms, or as having a blank particle with each prepositional phrase headed, or possibly conjunctions are ordinary chains. People seem to use 2) above, i.e. prepositional phrases are not chains like en or pi. Anyhow, 1) and 2) are semantically equal and there isn’t a compelling reason to make 2) illegal. For one, banning it makes texts clunkier and implies that a string of prep phrases should group together, e.g. sama A sama B kepeken X kepeken Y. I sort of want to make sure you can treat prepositional phrases as chains (it’s sort of elegant to treat all the phrases as the same data structure), but I won’t write my parser to expect them to be treated as chains.

Posted in toki pona

Object Oriented Conlanging

If you don’t write code, you may feel like the wrong audience.

So recently I’ve been on a kick of learning about object oriented library writing, especially for the most basic types. I notice that MSDN’s advice for writing basic types and the exercise of writing a library acts sort of as a model exploration exercise that leads to surprising discoveries.

Why a Conlang?
Surely natural language processing libraries exist for real languages like French and English? Yeah, they do, but French and English are so mind-bogglingly complex that writing code to parse them is a task that has befuddled developers for decades. Even established conlangs are much smaller, even those that weren’t initially designed to be small or limited in their complexity.

Also, during the creation of a conlang, out of necessity, it is small, if only because the vast majority hasn’t been imagined yet.

One more reason: because you may overly restrict yourself by considering datastructures that are peculiar to English and French when one possible goal of a conlang is to create a human communication system that goes beyond what can be done in well known natural languages (this compares to the naturalistic-conlang goal of creating a reference grammar that fools professional linguists– a goal that will put some limits on breaking new ground into new techniques for communication strategies)

Basic Types in a Conlang
Words. The basic type should be a token, not a string. Strings are not atomic enough.
Morphology. The language I am using (toki pona) as my model doesn’t have any morphology to speak of.
Phrases. Here is where I discovered the most. It turned out to be helpful to have many specialized classes that can exploit information that goes beyond what a formal grammar might know. For example, some words are likely agents, and those agents can own things, they can be the agent of a transitive verb, and so on.
Forgotten Structures. We almost never formally deal with punctuation, quoted text, diglossia, numbers, dates, animal sounds and so on. When you work with real sample texts, you encounter these issues immediately and realize that they can’t be ignored and deserve to be treated in terms of importance as concepts on par with subject, object and prepositional phrase.
Alternative parsing. A machine parser will yield only one parse, automatically, for free because you don’t have any other choice. However, the formal grammar will suggest that for any given text many parsings are legal. I think that letting the machine pretend there is only one parsing is pragmatic– it forces the language creator to decide how much ambiguity is tolerable.

Basic Applications of a Conlang Library
Word Processing. If you can parse it, it is potentially valid syntactically. A prerequisite to parsing is dictionary lookup, so if you can parse, you can also support spell-check.

Lorem Ipsum Generation. If you can turn text into a data structure, you can turn data structures into text and you can do it randomly. This will illustrate which sort of structures are possible, both validating the language design and the
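
A toy version of the idea, assuming a tiny hand-picked vocabulary and one sentence shape (subject li verb, with an optional e object). The word lists and function names are illustrative only; a real generator would walk the parser’s own data structures.

```python
import random

NOUNS = ["jan", "soweli", "ilo", "kili", "tomo"]
MODIFIERS = ["pona", "suli", "lili", "laso", "mute"]
VERBS = ["moku", "pali", "jo", "kute"]

def noun_phrase(rng: random.Random) -> list[str]:
    """A head noun followed by zero to two distinct modifiers."""
    return [rng.choice(NOUNS)] + rng.sample(MODIFIERS, rng.randint(0, 2))

def sentence(rng: random.Random) -> str:
    """SUBJECT li VERB, with a direct object about half of the time."""
    words = noun_phrase(rng) + ["li", rng.choice(VERBS)]
    if rng.random() < 0.5:
        words += ["e"] + noun_phrase(rng)
    return " ".join(words) + "."

rng = random.Random(0)  # seeded so a run can be repeated
print(" ".join(sentence(rng) for _ in range(3)))
```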

Knowledge Representation. I wrote about this recently, but in short, if you can generate random sentences, you can take a sentence template and generate complete sentences using a data table, say a phone book. And with a parser, you can turn natural language questions into queries and commands to get data back from a phone book.

Concordance. I would almost suggest that existing concordance tools are good enough and don’t need to be language specific.

Posted in machine assisted conlanging

Corpus linguistics and little sins

Soooo, this toki pona project of mine. I’m parsing community generated texts. I’ve got incompatible goals. On one hand, I want a parser that will work. So if you type English, it should blow up. But if you type toki pona, it should parse it if it is beautiful, conservative toki pona. But if it is sloppy toki pona, I don’t really want to make a big deal of it. So you forgot to capitalize, forgot a period, forgot a closing quote, forgot the li, added a period instead of a comma before a li, and so on. I’m not going to add new rules to try to deal with these. So at the moment, I normalize them. I just fix them.

Then there is this:

meli li tawa en tan lon palisa.

Ignoring what it might mean, it’s a compound prepositional phrase, just like English, “No smoking in or around the school.” It could have been written:

meli li tawa lon palisa li tan lon palisa kin.

But that would just sound pedantic. Some stuff is sort of borderline.

How do you deal with noises?

jan li owi. => The guy said ouch!

I could fix it to:

jan li mu owi. => The guy made a noise like ouch.

But that looks pedantic and I don’t like the word mu all that much. It sounds too much like a cow and you have to add another word to indicate the actual sound.

How do you deal with defective names?

nimi mi li nimi ‘jan Laowi’

Fixing them would make parsing easier, but would get in the way of communication, especially if a name is already well known.

Posted in toki pona

Edge cases of grammar

These things will make your grammar more complicated, but you can expect them to show up in any community generated corpus, like immediately.

Onomatopoeia. Fart noises and the like. In English, they get italicized and I’m sure someone has written a paper on what part of speech they are.

Direct speech. Thought you didn’t have embedded sentences? Now you do.

Fragments. If a sentence is cut off, either on purpose or accident, grammatically what can you do with it? If understanding a sentence is a process of parsing by applying syntactical rules, what are the rules for dealing with fragments?

Diglossia. If you mix English and another language, what are the rules for swapping parts out?

Errors. This one stumps me. When computers parse, if one word is off, the computer completely fails. It can’t do anything with that program. But for human speech, if ten things are wrong, we apply a set of syntax-like rules to fix it up and we don’t even notice. Sound absurd? This is essentially how modems worked with error correcting and check sums. If you’ve ever used an application called ReSharper, it does a similar thing for programming languages. It uses static analysis to find syntax mistakes and suggest corrections.

Punctuation. Think you don’t need punctuation? Think again– a simple grammar can yield dozens of alternative parsings. Punctuation brings that down to a manageable level. If it needs to be audible punctuation like lojban, that’s another story.

Compound words. (and neologisms) Think you only have 1000 words? Almost immediately, phrasal compound words will appear. Interestingly, if you treat them as compound words, the parser does better glosses and you have fewer alternative (wrong) parsings. Toki pona & Klingon have this issue– since both languages have a fixed number of bound & unbound morphemes (one by design, one by community choice). The alternative of just imagining all these common word pairings to be “ad hoc” phrases is really just dishonest.
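
A sketch of treating common pairings as compound words during parsing, using a greedy longest-match over a small hypothetical lexicon (jan pona, jan pi sijelo pona and ilo sona, which are the compounds I gloss on this blog as friend, doctor and computer). The function name is an assumption.

```python
# Known phrasal compounds, keyed by their word sequence.
COMPOUNDS = {
    ("jan", "pona"): "friend",
    ("jan", "pi", "sijelo", "pona"): "doctor",
    ("ilo", "sona"): "computer",
}
MAX_LEN = max(len(key) for key in COMPOUNDS)

def group_compounds(words: list[str]) -> list[str]:
    """Join the longest known compound starting at each position with dashes."""
    out: list[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for size in range(min(MAX_LEN, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + size])
            if candidate in COMPOUNDS:
                out.append("-".join(candidate))
                i += size
                break
        else:
            out.append(words[i])  # no compound starts here
            i += 1
    return out

print(group_compounds("jan li tawa jan pi sijelo pona".split()))
# ['jan', 'li', 'tawa', 'jan-pi-sijelo-pona']
```

This cannot tell an intended compound from an ad hoc phrase that happens to use the same words, so it trades some wrong joins for fewer alternative parsings.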

I’m running into all of these issues when trying to machine parse a toki pona corpus. That language only has about 10 rules in the formal grammar, but my parser just keeps getting more and more lines of code to deal with issues like the above.

Posted in conlang design, conlang use

Punctuating toki pona - Community Proposal

toki pona has a few constraints, without which the game of toki pona is rendered silly. Foremost is that there are only about 125 or so morphemes (fairly non-bound). This isn’t as much of a problem and I think tp community proposals can stick to this.

The next idea is that numbers, dates and so on are lacking, as if this were the language of an ancient tribe (despite missing a fully formed system of naming plants, animals and extended family relations). This is problematic for working with data on computers. Numbers and Dates are basic types, without them certain computer experiments are harder than necessary.

I’m writing a parser and I need to make a few modifications to make tp easily parsable. My parser does a two pass parse– 1st phase I normalize the text & make best efforts to add punctuation as described below. It is difficult and error prone. If I didn’t have to do this normalization, the parser would parse more text on the 1st try and get better glosses on the 1st try.

Summary
1) Phrasal compounds are joined with dashes. jan-pona. jan-pi-sijelo-pona.
2) Prefix numbers with #, e.g. #wan. If it is a two word number, it is hyphenated, e.g. #wan-tu
3) Direct quotes are in << >>. e.g. jan li toki e << toki! >> (everyone else appears to use English convention of single or double quotes, but I need an escape character, see below)
4) Prepositional phrases must start with a comma, e.g. mi li, lon ma ni. jan li moku, kepeken ilo.
5) Non toki pona text is escaped with double quotes. mi toki kepeken toki “English”
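
To show how much the five conventions help a machine, here is a sketch of a tokenizer that recognizes each of them with one regular expression. The token names are my own labels, not part of the proposal, and it only covers the cases in the summary.

```python
import re

TOKEN_RE = re.compile(r'''
    (?P<quote><<.*?>>)                          # 3) direct speech in << >>
  | (?P<foreign>["“][^"”]*["”])                 # 5) escaped non toki pona text
  | (?P<number>\#[A-Za-z]+(?:-[A-Za-z]+)*)      # 2) numbers such as #wan-tu
  | (?P<compound>[A-Za-z]+(?:-[A-Za-z]+)+)      # 1) compounds such as jan-pona
  | (?P<comma>,)                                # 4) start of a prepositional phrase
  | (?P<word>[A-Za-z]+)                         # any other word
  | (?P<punct>[.!?:])                           # sentence punctuation
''', re.VERBOSE)

def tokenize(text: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs; whitespace between tokens is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

for kind, token in tokenize('jan-pona li toki e << toki! >>, lon ma "English" #wan-tu.'):
    print(kind, token)
```

Every token announces its own kind, so the parser never has to guess whether lon starts a prepositional phrase or whether English is a misspelled toki pona word.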

We have compound words. We pretend we don’t, but we do. These are lexemes, phrasal compound words. Compound words are joined by -’s

jan-pona = friend.
jan-pi-sijelo-pona = doctor.

Why? Because you can’t accurately machine gloss jan pona to friend. Why should we pretend that jan-pona is anything but a phrasal compound and gloss it as good person, healthy person, friend, etc.? Without hyphens, I have to gloss using a list of alternatives. With hyphens, I can dispense with a list of alternatives and home in on a single gloss.

Unhyphenated.
jan li ike li tawa jan pi sijelo pona li kama jan pona.

Hyphenated.
jan li ike li tawa jan-pi-sijelo-pona li kama jan pona.

We have “rovers”/syntactical infix. I don’t know what these are really called.

jan-mute-pi-sijelo-pona = doctors.
jan-pi-sijelo-pona-mute = doctors.

We need numbers. They shall be words prefixed by #
#ala
#wan
#tu

#luka

I will have to look up 3,4,6,7,8,9 from the forum. I know there are many proposals, I’ll look for community ones and then I plan to implement the ones that are base 10, don’t introduce new words, positional and reasonably efficient, e.g. no worse than English in expressing large numbers.

Some numbers are legacy numbers with some degree of officialness and will have to be supported.

#wan-tu-tu = 4
#luka-luka = 10
#MMLW = 20+20+5+1

But I don’t recommend using legacy numbers if you are trying to communicate.

Watch this space!

We need direct quotes. They shall be wrapped in << >> (or the « » if you can find those keys on the keyboard)

jan li toki e << mi jo e soweli! >>
He said, “I have a dog.”

I hope I don’t regret this choice because < and > mean something in HTML and might cause problems in some content management systems. Oh well.

Anything in direct quotes markers is syntactically a content word.

We need commas.
People currently add commas before or after la, but actually we don’t need them there. I have no opinion about what people do there. Also I have no opinion about commas in pi-phrases.

mi pali, kepeken ilo sona, lon tomo, pali tawa mani.
I work with computers in the office for money.

When there is nothing to distinguish a preposition from a content word, it is valid to parse every word after pali as a string of adverbs:

mi pali(kepeken ilo sona lon tomo pali tawa mani).

Humans can realize that is unlikely, but a machine can’t. Humans can parse invalid toki pona and realize that someone is mixing Russian and English and toki pona rules and, with some effort, realize the intended correct toki pona. This sort of parsing is a huge effort to implement. On the other hand, commas make parsing mechanically effortless.

We need an escape character
The corpus texts are full of mixed language material, from accidents in transliteration to people just trying to communicate. After transliterating to toki pona, normally the original is unrecognizable– it might as well be a completely new word. So toki pona texts that interact with the real world, will need to have foreign text. And that text should be in double quotes.

nimi mi li “Matthew Martin” li jan Mato.

Anything in double quotes syntactically is a content word.

Dates
The current date system is something like

tenpo suno wan, mun wan, sike suno wan = 1/1/1

You can find some variant of this on the wikia for toki pona. It uses legacy numbers and is too cumbersome for anyone to want to use it.

I’m going to recommend this format: y-m-d
S1-M1-T1
Because it will be easier to sort.

Also, for this to work, numbers have to be reasonably efficient and be able to cope with numbers from 1 to 2015.

Watch this space!

Posted in machine assisted conlanging, toki pona

Robot Languages

By the way, it looks like Dothraki has a published spec. Now on to other topics.

As someone with the facial expressions of a robot, I’ve always been partial to robots and some of my earliest attempts at programming were to create chat bots and AI. I failed, of course. But now I have some ideas on how to make it work.

Our human brains have some sort of knowledge representation system, it turns our stage– the world around us, into facts represented by neurons linked by axons and dendrites, which chatter using neurochemicals. We lack a technology to accurately and usefully use a neurological model to represent reality. But, hey, we got other ways to represent reality. For example, we use documents and relational databases to keep track of inventory and the business activities of all large businesses and government in the world.

Normally, when this needs to be communicated, we use protocols like HTTP to send (often technology independent) serializations of database records across a wire. We then use UIs and data binding to turn this into human consumable materials.

But let’s get back to robots. Robots are machines that would want to be like people, and thus use a natural language. That means they could possibly deal with people directly. But English is hard, so maybe a conlang or restricted version of English would be better.

Representations of reality:
Name – Phone Number
Joe – 555-1234
Jane – 444-5678

If this was toki pona, we could serialize this as:
nanpa pi jan Joe li 555 – 1234.

By some complicated system of equalities, we could work out that this is the same as:
jan Joe li jo e nanpa ni: 555 -1234

If the robot heard a sentence, it would attempt to use deserialization & equality checks to transform the utterance into a known data type:

jan Mato li jo e nanpa ni: 111 -8989 ==> Mato – 111 -8989
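
A minimal sketch of that round trip. It assumes only the two sentence shapes shown above and seven digit numbers written as 3 digits, dash, 4 digits. The function names are hypothetical. The "complicated system of equalities" is reduced here to the crudest possible version: two sentences are equal if they deserialize to the same record.

```python
import re

# One pattern per known sentence shape; both yield (name, number).
PATTERNS = [
    re.compile(r'^nanpa pi jan (\w+) li ([\d\s\-–]+?)\.?$'),
    re.compile(r'^jan (\w+) li jo e nanpa ni: ([\d\s\-–]+?)\.?$'),
]

def normalize_number(raw):
    """Drop spaces and dashes, then rewrite as 3 digits, dash, the rest."""
    digits = re.sub(r'\D', '', raw)
    return f"{digits[:3]}-{digits[3:]}"

def deserialize(sentence):
    """Return a (name, number) record, or None if no pattern fits."""
    for pattern in PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            return (match.group(1), normalize_number(match.group(2)))
    return None

def serialize(name, number):
    """Turn a record back into the first sentence shape."""
    return f"nanpa pi jan {name} li {number}."

record = deserialize("jan Mato li jo e nanpa ni: 111 -8989")
same = deserialize("nanpa pi jan Joe li 555 – 1234.") == deserialize(
    "jan Joe li jo e nanpa ni: 555 -1234")
print(record, same)
```

Every new way of saying the same fact costs one more pattern, which is why the real version of this would be a lot of work.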

A lojban style processor could also answer utility questions like,

nanpa Jane li 444-5678 la ona li toki tawa mi.
If Jane’s number is 444-5678, then she’s talking to me.

And the robot would respond, after binding & processing pronouns:
jan Jane li toki tawa sina.
Indeed, Jane is talking to you.
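
A rough sketch of that exchange, assuming the phone book from the table above and exactly one sentence shape ("nanpa X li N la ona li VERB tawa mi"). The names `PHONE_BOOK` and `answer` are hypothetical. The pronoun binding is hard-wired: ona is bound to the person named in the condition, and mi becomes sina in the reply.

```python
import re

# Assumed contents of the robot's phone book.
PHONE_BOOK = {"Joe": "555-1234", "Jane": "444-5678"}

CONDITIONAL_RE = re.compile(
    r'^nanpa (?:pi jan )?(\w+) li ([\d-]+) la ona li (.+?) tawa mi\.?$'
)

def answer(sentence, phone_book):
    """Confirm the consequent if the condition matches the phone book."""
    match = CONDITIONAL_RE.match(sentence.strip())
    if match is None:
        return None  # not a sentence shape this sketch understands
    name, number, verb = match.groups()
    if phone_book.get(name) != number:
        return None  # condition not met, so nothing can be confirmed
    # Bind "ona" to the named person and flip "mi" to "sina".
    return f"jan {name} li {verb} tawa sina."

reply = answer("nanpa Jane li 444-5678 la ona li toki tawa mi.", PHONE_BOOK)
no_reply = answer("nanpa Jane li 000-0000 la ona li toki tawa mi.", PHONE_BOOK)
print(reply, no_reply)
```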

Or utility questions might involve common computer tricks like, “how many digits are in Jane’s phone number? What is the sum of the digits in Jane’s phone number?” A human actually excels at this arbitrary discussion, whereas a robot has to be programmed for each exchange of that sort.
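
Each of those tricks is trivial once someone has written it down, which is the point: the robot needs one function per trick. A small sketch, with hypothetical function names and Jane's number taken from the table above:

```python
def digits_of(number):
    """Return the decimal digits of a phone number, ignoring dashes and spaces."""
    return [int(ch) for ch in number if ch in "0123456789"]

def digit_count(number):
    """How many digits are in the number?"""
    return len(digits_of(number))

def digit_sum(number):
    """What is the sum of the digits in the number?"""
    return sum(digits_of(number))

jane = "444-5678"
print(digit_count(jane), digit_sum(jane))  # 7 38
```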

Pronouns seem like something that would be really, really hard for a computer. If my computer only had a knowledge representation system for the phone book, it would need to know who is a person, who is capable of having a phone number and so on. People excel at common sense, modern code doesn’t. Databases rely on unique, one-off names, and variables that could be bound to anything are used only in limited scopes, so that each binds to just one thing at a time.

Next is the chat bot problem.

Chat bots respond to whatever you ask. Usually it’s modeled as a command. But human languages only sometimes use commands.

If Jane’s number is X, then she’s talking to me. (Implied, asking for confirmation)
I know Jane. (Implied, asking for additional information about Jane, e.g. Oh, you do? I know her too, her number is X)

Another thing a chat bot should be able to do is serialize things into something that is suitable for saying over the phone. Most code dumps text to the screen, often in a grid format. A good robot would be able to tell a story in a way that takes into account attention span. A bad robot would read all 5000 phone numbers. A smart robot would say, after reading two, “and so on” or “do you want me to keep going or are you looking for someone in particular?”
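
A minimal sketch of the smart robot, assuming the phone book is a list of (name, number) pairs. The function name `read_aloud` and the limit of two entries are illustrative choices, not a rule.

```python
def read_aloud(entries, limit=2):
    """Speak at most `limit` entries, then offer to continue instead of droning on."""
    spoken = [f"{name}, {number}" for name, number in entries[:limit]]
    if len(entries) > limit:
        spoken.append(
            "and so on. Do you want me to keep going, "
            "or are you looking for someone in particular?"
        )
    return "; ".join(spoken)

phone_book = [("Joe", "555-1234"), ("Jane", "444-5678"), ("Mato", "111-8989")]
speech = read_aloud(phone_book)
print(speech)
```

The bad robot is the same function with the limit set to the length of the list.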

State– some of the best chat bots are sadly stateless. They don’t incorporate anything you say into their base of knowledge. Some do, but it’s kind of wonky– they just remember that after saying “Good day” people usually just repeat “Good day”.

A good robot takes all utterances and converts them into a system of knowledge.

My phone book robot, if I said:

mi jo e soweli.

Would interpret that as asking the database to create a new table like so:

who – inventory
jan Mato – soweli

And if two minutes later I asked:

mi jo e seme?

The robot should be able to look it up even though 10 minutes ago, this robot only knew phone numbers.

This is the flip side of serialization– turning language back into the knowledge representations system.
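
A rough sketch of that behavior using Python's built-in sqlite3 module. Everything here is an assumption made for illustration: the robot already knows that mi is jan Mato, it understands only "mi VERB e THING." and "mi VERB e seme?", it names the table after the verb, and the replies ("mi sona", "mi sona ala") are my own choice of wording.

```python
import re
import sqlite3

SPEAKER = "jan Mato"  # assumed: the robot already knows who "mi" is

STATEMENT_RE = re.compile(r'^mi (\w+) e (\w+)\.$')
QUESTION_RE = re.compile(r'^mi (\w+) e seme\?$')

def hear(db, utterance):
    """Store a statement as a row, or answer a question from stored rows."""
    question = QUESTION_RE.match(utterance)
    if question:
        verb = question.group(1)
        known = db.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (verb,),
        ).fetchone()
        if not known:
            return "mi sona ala."  # no table for this verb yet
        rows = db.execute(
            f'SELECT what FROM "{verb}" WHERE who = ?', (SPEAKER,)
        ).fetchall()
        if not rows:
            return "mi sona ala."
        return f"sina {verb} e {' e '.join(row[0] for row in rows)}."
    statement = STATEMENT_RE.match(utterance)
    if statement:
        verb, thing = statement.groups()
        # The verb matched \w+ only, so it is safe to use as a quoted table name.
        db.execute(f'CREATE TABLE IF NOT EXISTS "{verb}" (who TEXT, what TEXT)')
        db.execute(f'INSERT INTO "{verb}" VALUES (?, ?)', (SPEAKER, thing))
        return "mi sona."
    return None

db = sqlite3.connect(":memory:")
print(hear(db, "mi jo e seme?"))    # nothing known yet
print(hear(db, "mi jo e soweli."))  # creates the table and inserts a row
print(hear(db, "mi jo e seme?"))    # now answered from the new table
```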

Anyhow, this has been done before. MS SQL had a natural English processor; it was probably similar to what I have described, although I bet it only dealt with turning English into SELECT statements and maybe turning the tables of data into English sentences. Turning English into tables that can be queried again is probably hard.

A tp fact database would rely heavily on equality tests:

mi jo e soweli lon tomo mi.
Does this factually contain the following?
mi jo e soweli. Yes.
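
A naive sketch of such a containment test. It assumes a fact is a core clause followed by zero or more prepositional phrases, and that the words in `PREPOSITIONS` are always prepositions (in real toki pona they can also be verbs or content words, so this is only a first approximation). A fact contains a claim when the cores match and the claim adds no phrase the fact lacks.

```python
# Assumed preposition list; these words have other uses in real toki pona.
PREPOSITIONS = {"lon", "tawa", "kepeken", "tan", "sama"}

def split_fact(sentence):
    """Split a sentence into (core words, set of prepositional phrases)."""
    core, phrases, current = [], [], None
    for word in sentence.rstrip(".").split():
        if word in PREPOSITIONS:
            current = [word]          # start a new prepositional phrase
            phrases.append(current)
        elif current is not None:
            current.append(word)      # still inside the current phrase
        else:
            core.append(word)         # still inside the core clause
    return tuple(core), frozenset(tuple(p) for p in phrases)

def contains(fact, claim):
    """True when `claim` is `fact` with some prepositional phrases dropped."""
    fact_core, fact_phrases = split_fact(fact)
    claim_core, claim_phrases = split_fact(claim)
    return fact_core == claim_core and claim_phrases <= fact_phrases

print(contains("mi jo e soweli lon tomo mi.", "mi jo e soweli."))  # True
print(contains("mi jo e soweli.", "mi jo e soweli lon tomo mi."))  # False
```

Dropping a phrase is only safe for positive statements. With ala in the core, the shorter sentence no longer follows from the longer one, so a real version would need to handle negation separately.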

Anyhow, hopefully personal life will allow the free time to write such a thing. So to recap:

Knowledge representation system: E.g. relational tables.
Serialization system: E.g. turns rows and tables into sentences
Deserialization system: Creates tables and binds utterances to a table, then inserts 1 or more rows.
Persistence: All commands, factual or otherwise, become part of the system of knowledge.
Query language: Questions, or statements that prompt retrieving information and serializing it back to the interlocutor.
Utility: Processing tasks that are not really related to retrieving and updating a representation of knowledge. For example, answering if at least 3 people in the phone book have names starting with “G”.
Equality and transformations: Natural languages can serialize into many equivalent forms.


The 5 Word Language

Finally, I sort of have the time to do a quick review of the 5 word language. I’m a big fan of small languages (as in small vocab, small number of rules, small number of morphemes).

This is really, 5 morphemes. I think all small languages, in practice have about 2000 lexemes (set phrases that behave as words that you just have to memorize). But no need to quibble about the number of morphemes, this looks like it really is 5 morphemes. That is small.

What is totally awesome about this, is it is 2 morphemes short of what you can memorize in a single view–short term memory is about 7 (plus or minus 2) items.

The vocab is laid out in a grid and the entire grid is used. This compares to toki pona, which ignores the diagonal– i.e. reduplication in tp doesn’t mean anything.

I think logotome is a real word… shoot me if it isn’t… and the logotome of a language is the set of all possible words (or lexemes) that the phonotactic system lets you create. Toki pona’s logotome is huge– even with the small alphabet and CV(CV)(N) structure, you have like 10000s of possible short words. A five word language has 25 two word phrases.
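
The grid is small enough to enumerate. In this sketch the five morphemes are placeholders written as the digits 1 to 5; the actual symbols of the language are not given here, so treat them as an assumption.

```python
from itertools import product

# Placeholder morphemes; the real language's symbols may differ.
MORPHEMES = ["1", "2", "3", "4", "5"]

# Every ordered pair of morphemes is one cell of the 5 x 5 grid.
pairs = ["".join(pair) for pair in product(MORPHEMES, repeat=2)]

# The diagonal holds the reduplicated pairs (11, 22, ...).
diagonal = [p for p in pairs if p[0] == p[1]]
off_diagonal = [p for p in pairs if p[0] != p[1]]

print(len(pairs), len(diagonal), len(off_diagonal))  # 25 5 20
```

A language that ignored the diagonal, the way toki pona ignores reduplication, would be left with 20 of the 25 cells.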

Good & Potential Applications
This might be a useful conlang creation technique. Create a dozen small 5 word conlangs, then turn the best of those into a large language. If the large language was compatible with the small one, then you’d have a conlang with a core that someone could learn before they lose interest, which I suspect is about 2 maybe 3 hours.

I like the idea of using numbers– it has an obvious application for text messaging on phones (there is a (failed) app for that! It’s an emoji type conlang). If I were to learn/create a dozen of these languages though, I’d worry about having to remember that 42 means one thing in the 5 word language and another thing in the 7 word language. Already I constantly mix up telo, which in toki pona means water and in Russian means body. I read somewhere that when you hear a word that is the same in 3 languages (like, say, chocolate or tea), the area of your brain in charge of that word for each language lights up. So cross language interference may be a real thing to consider when making small languages that draw on the same phonotactics (or logotactics– I don’t see if the 5 word language has a spoken format).

Domain specific languages. A domain specific language is an idea borrowed from software development– where you create a mini-language to deal with a specific topic. Then the language can be optimized for talking about that specific topic. It’s like an extreme version of slang and technical jargon, which can feel like a mini language. Except English jargon still will follow English grammar and syntax. In the lexicon, words are available for diabetes and God, so I figure this language makes it easier to talk about those topics. And if you want to talk about something else that doesn’t suit the language’s lexicon, create another 5 word language!

Areas for Improvement
Like many combinatorial languages (i.e. a fixed set of morphemes that are combined in all possible combinations), this particular description doesn’t say much about grammar. Is the grammar isolating? Do we have bound morphology (i.e. do any of the morphemes only occur in a fixed relationship to other morphemes, e.g. do we have a tense suffix)? What are the basic sentence patterns? I.e. is it SVO or OVS, do we have prepositions or postpositions, do we branch left (like Japanese), right (like English) (Did I get that backwards?) or mixed (like toki pona)? Do we have part of speech rules or are all words content words? Are some words “semantically bleached”, so that they don’t mean much on their own but mean something in a sentence? Examples from English: the, to, in, of, have, going (future), etc.


Undeveloped Public Domain Conlangs

Barsoomian is public domain, unless it is the most recent movie version.

But here is one that I just noticed: Parrot, from Doctor Dolittle:

“Ka-ka-oi-ee, fee-fee”
“Is the porridge hot yet?”

Oh, boy. How shall we do an interlinear gloss for that?

I’m going to guess ka-ka-oi-ee is a compound word meaning porridge and fee-fee means ‘now’ and reduplication means a tag question. Since reduplication indicates something you don’t know, then ka-ka would be “mystery” as in mystery-meat. So “mystery-food, now-now?”

And the dog speaks a constructed sign language.

And I’m going to guess that after page two the author lost interest in actually describing the fake languages. A pity.

But hey! It’s public domain. You are allowed to fully develop the language and sell it. Go forth and do so!
