Why we don’t give a crap about languages once English has won

This is why we don’t care about foreign languages and why when we do, Na’vi is just as good as the next, in fact better.

If you can think of an easy way for me to get linguistic samples from dying languages that I can analyze in my home, then by all means, I will learn a dying language. If learning Korean will provide me with a large but close-knit community of people, easily and readily accessible and willing to share, whom I know share the same exact interests as me, down to films and books, then by all means, I will work harder at learning Korean. Many of us have this as just another language tacked onto the pile: Prrton knows, what is it, seven? Wm knows at least five. And I personally tire of dealing with all of the irregularities of Greek and Latin, the inaccessibility of Japanese orthography, and the lack of immediate community for German and French. I would love to learn a dying language. Tell me how, and I’ll start today.

Richard Littauer (http://www.slate.com/id/2248683/)

How to learn a language, for example Na’vi

Solve these problems and you’ll learn Na’vi (and they have nothing to do with grammar and vocabulary):

  • Who to talk to?
  • How to find the time?
  • How not to forget to do something with it every day?
  • How to keep the motivation going for the long run?

Find the Community

Sign up for a variety of community sites and know how to use them.

Mailing lists are for announcements and messages that are being ignored on forums. Mailing lists are not good for posting translations for critique, and only so-so for question-and-answer. When the traffic on forums dies out, mailing lists are better than forums regardless of message content because everyone gets the message on a mailing list. On a forum, only people who think to visit the site that day see the message.

Forums are good for posting questions and answers and translations for critique. Translations for critique tend to be a good deal of work for both reader and writer. Forums let the message stand until someone motivated can get to it. On mailing lists, on the other hand, messages deferred become messages ignored.

Blogs are good for messages that are huge, or controversial, or have no particular audience. Mailing lists and forums can’t deal with a fifteen-page essay. Controversial messages can generate flame wars and can ruin mailing lists and forums; a blog is more indirect. Blogs also tend to hang around forever, so your article will reach its audience in five years when someone interested in the reconstruction of proto-Na’vi pops into existence.

Wikis are good for moving the community forward, especially if the language is new.  If there isn’t enough content for a language, the language will die out.  Collaborative document writing lets a scattered group of people generate enough interesting content for the community to want to learn the language.

Miniblogging is good for doing a little bit in the language every day without feeling overburdened by the need to catch up on the mailing list and unread forum posts. Twitter is one example.

Meetups are essential, too,  because they are so efficient at identifying the people in your community who are also interested in the same language.  Internet communities subconsciously know there is a difference between online and offline communities or else the phrase “in real life” (IRL) wouldn’t be needed.

Motivation

In a natural language, you have the advantage of being able to tap into pop media, like manga, bandes dessinées, music, etc. Na’vi almost has that, being a language embedded in a pop culture movie, but there is very little. I hope that whoever writes the first large work in Na’vi picks something approachable like Winnie the Pooh, instead of Hamlet, which is half unreadable in the original.

Motivation is again related to community.  If you have online and IRL friends in the language community, keeping those contacts alive is reason enough to keep working on the language.

The last big reason for learning a language is because it is a language.  Every language is a little laboratory for testing out ideas about how we communicate and finding out who we are.

Na’vi Pronunciation, the Wikipedia way

We have the IPA for Na’vi. So let’s get cracking.

ì: First, what is a high front i? That is the i with the grave accent: ì. Wikipedia has two sound files and sample words for a dozen languages. It is the i in English “bit” or Swedish “sill” (herring).

ä: What is a low front a? That is the a with the umlaut: ä. Wikipedia has one sound file. It is like the “a” in “cat” or Swedish “päron” (pear).

Pseudo-vowels ll and rr. I’m not getting a lot of hits on pseudo-vowels; normally these are called syllabic consonants. The ll is kind of like the final l in English “bottle” or French “table”, which is an example of a syllabic l. Syllabic r lets Czech do sentences like “Strč prst skrz krk”, which has 4 syllabic r’s.

Ejectives. px, tx, kx … okay, I’m just going to point you to the sound files. It’s like p, t and k with a bottled-up, unusually forceful puff of air. Just remember back to your K’ekchi lessons in high school. You don’t? OK, listen to the k’a (bitter) file. And Lakota has all three, the p’, t’, and k’. In Lakota they sound more “clicky”. My cat was afraid while I practiced these aloud.

Flapping those r’s. Also called an alveolar tap. Sounds to me like you start to trill an r, but only do one “trill”. Mind-bogglingly, it is also the sound of the “tt” in “better” or “latter” for US, Australian and NZ speakers.

Well, more later.

So you have a successful constructed language?

I wrote this article thinking about what direction the Na’vi language community is likely to go in the next few months. Maybe Paul Frommer will notice this article and give us some pointers on internet words, neologisms, what is canon, what are ways to signal politeness in an email and what the official a to z ASCII alternative is for the two diacritics.

The Internet. The audience of a constructed language is the global community of internet users. If we count the number of people who will learn enough to write a sentence, this audience numbers in the 100s, maybe 1000s. All other audiences, such as local face-to-face communities, number in the tens. A functioning constructed language will need words for internet, mailing list, email, computer, spam, flame, moderator, troll, community, bulletin board.

Neologism. The first flame wars will be about neologisms. Does the constructed language encourage mass importation of loan words like Esperanto? Or does it encourage building up words from internal sources? If you build a word, what does it mean? Is the community allowed to coin new words that follow the general phonetic structure? Or to coin idioms, which may be unintelligible without an explicit definition? Or must the community stick to transparent compound words and nominalizers, like “speak device” or “it-speaks (n)” for “phone”?

The Canon. The next flame wars will be resolved by appeal to authority. If the canon is missing or ill defined, then reasonably intelligent people can’t win arguments. A good language designer will let the community know when and what text is archaic. Eventually, this leads to language boards and formal approval processes for blessing community-generated innovations; Láadan and the KLI come to mind. In the world of natural languages, with a few exceptions like the “standardization” movements, language boards are often ignored by the speaking community. Versions 2+ of the constructed language should note that successful language design pre-empts the language community’s innovations. Once the language community has adopted an innovation, it is better to adopt it than to stamp it out. Alot of English teachers will attest to the power of the language community to coin new words and resist prescriptions to undo it, e.g. a lot vs alot.

The Conculture. In the case of the Na’vi the conculture is richly defined. The translation of a sentence will make sense in the context of the Na’vi ethos and customs. But when it comes time to wish someone a Merry Christmas, we have to remember that the culture is also the international community of internet users. A sentence will and should also mean what it would mean in a hypothetical in-person convention with attendees from every corner of the world. For example, if I were at a convention I might know to call the Japanese attendees “san” or “sir” regardless of what language I’m speaking. In human-to-human interaction, you are still in your own culture no matter what language you are speaking and so is your interlocutor. Another example: the US President would not be referred to as “his majesty” even if that is what the Na’vi might say. (Actually I don’t know what they would call their top leader.) The US President would have to be referred to with the honorific that the Na’vi use for addressing equals, or we’d have to use an English loan word. A language designer can design the grammar and prescribe a system of honorifics, but the culture of constructed language fans will remain rigidly the same.

Polite words. When the community transitions from speaking in English to your constructed language, the initial messages may sound rude and overly forthright. Many of the flame wars could be prevented in the first place if the language has means for being polite to strangers. If your language has lots of polite words for children, friends and family, it won’t help. It needs to have polite words to deal with strangers. In particular, you need to be able to give praise, to signal that you think the other person is just as good as or smarter than you, and to signal that you’re uncertain about what you’re saying. In English, if we’re using new and unfamiliar jargon, we’ll make a face and use intonation to show we are uncertain whether to call our boss “Dr”, “Professor” or “Mr”. On the web, it will need to be visible.

Diacritics and fonts. I have never seen an entirely non-roman font catch on in a constructed language community. The diacritics are lead anchors around the neck of any constructed language that uses them. Other communities typically use a postposed x to mean, “this letter would have a diacritic, but damn it, I am a language fan, not a computer wizard, and I can’t find that key on my keyboard.”
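Here is what such an x-convention could look like for the two Na’vi diacritics, as a minimal sketch only. The mapping (ì to ix, ä to ax) and the function name x_convention are my own assumptions for illustration; nothing here is an official transliteration, and the glottal stop mark is left untouched.

```python
import unicodedata

# Hypothetical x-convention; NOT an official Na'vi transliteration.
X_FORMS = {"ì": "ix", "ä": "ax", "Ì": "Ix", "Ä": "Ax"}

def x_convention(text: str) -> str:
    """Replace each diacritic letter with its base letter plus a trailing x."""
    # Normalize first so that "i" + combining grave is treated like "ì".
    composed = unicodedata.normalize("NFC", text)
    return "".join(X_FORMS.get(ch, ch) for ch in composed)

for word in ["kaltxì", "slä", "nari"]:
    print(word, "->", x_convention(word))
```

One wrinkle with this particular choice: Na’vi already uses x in px, tx and kx, so a reader has to know that x after a vowel means a diacritic and x after p, t or k means an ejective.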

What can we say from a wordlist?

(word list here)  … and Unofficial yahoo mailing list here

EDIT: Ah, man, this has become like one of the most famous posts on my blog and it’s sub-par. It was written before the movie was released but after the word list had been released, and I was wondering how much of the language could be “pre-constructed” from just the info in the word list. Some of the speculation proved to be wrong. The wizard(s) that maintain(s) the Wikipedia article is pretty impressive, though, and has received kind words from Dr. Frommer.

This is an interesting question, not just because it is still a few days before the Avatar movie shows and we don’t have a grammar.  This is the exact situation a lot of mostly dead languages are in.  Just before the language dies out, a linguist writes down a word list and doesn’t have the time to learn the language.  So the language dies leaving only a word list.  If there are enough similar languages, the grammar can be reconstructed, the same way Virginian Algonquian was reconstructed for the recent movie the New World.

We can’t do that with Na’vi.  We don’t have any speakers on the planet not covered by NDA (non disclosure agreements) at the moment.

We do have some clitics and function words that have been identified in the glossary, we know a bit about the handling of loan words so we can start on extrapolating the transliteration rules, we can see they count in octal, we have polite language! Can’t see anything about inflection, and the verb is a mystery. Not a lot of adjectives, so they must have grammaticalizations to turn nouns or verbs into adjectives.
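A quick sanity check on the octal claim, as a minimal sketch. The word list glosses tsìvol as thirty-two (octal: 40) and vofu as sixteen, which comes out to a round 20 in base 8. The helper name to_octal is mine, not anything from the glossary.

```python
def to_octal(n: int) -> str:
    """Return the base-8 digits of a non-negative integer."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 8))  # lowest base-8 digit first
        n //= 8
    return "".join(reversed(digits))

# Glosses taken from the word list; both are round numbers in base 8.
for word, value in [("vofu", 16), ("tsìvol", 32)]:
    print(word, value, "-> octal", to_octal(value))
```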

Only 2 diacritics. 0 would be better, but 2 is okay.

(note: the “:” is just a colon, it isn’t part of any Na’vi word)

Word Generation Strategies

spe’e : capture
spe’etu : captive  … anyone else think that +tu is a nominalizer?
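One rough way to hunt for more candidates like +tu is to look for pairs in the word list where one word is another word plus a few extra letters. This is a minimal sketch of that idea; suffix_candidates is a made-up name, the sample words are copied from the list, and plain string matching like this says nothing about whether a candidate is a real morpheme.

```python
def suffix_candidates(words):
    """Map each candidate suffix to the (stem, derived) pairs that show it."""
    vocab = set(words)
    found = {}
    for derived in vocab:
        for stem in vocab:
            # A pair qualifies when the longer word starts with the shorter one.
            if derived != stem and derived.startswith(stem):
                suffix = derived[len(stem):]
                found.setdefault(suffix, []).append((stem, derived))
    return found

sample = ["spe’e", "spe’etu", "tsmuk", "tsmuktu", "taron", "taronyu", "eyk", "eyktan"]
for suffix, pairs in sorted(suffix_candidates(sample).items()):
    print("+" + suffix, sorted(pairs))
```

On this tiny sample, +tu turns up twice, which is at least consistent with the nominalizer guess.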

Loan Words

ìnglìsì : English
kunsìp: gunship

Useless Words, at least until we match them up with Earthly equivalents:

Yerik: hexapede

Words that will be immediately broadened

au: drum (made of skin)   * The language community will use this word for all drums. Nit pickers will decry this.

waytelem: songchord

Function & Possible Clitics

a: which, that
fa: with (by means of)
fpi: for the sake of
ftu: from (direction)
ka: across
mì: in
na: like, as
ne: to (direction)

ta: from (various uses)
ta’em: from above
sì: and

ulte: and
fu: or

slä: but
tengfya: as (= same way as)
tengkrr: while (= same time as)

to: than; comparative mark

Proforms, Questions, Demonstratives, Negatives

oe: I
oeng: we two (you and I)
ohe: I (deferential or ceremonial form)
po: he, she
poan: he
poe: she

nga: you

ngenga: you (honorific form)

ayfo: they

ayoe: we (exclusive)
ayoeng: we (inclusive)
mefo:  they (those two)

moe: we two (exclusive) — Duals all over the place, cool.

upe: what (thing)
fì’u: this (thing)
fìfya: this way, like this  – Very rich in demonstratives (this/that/etc).  Is this an Esperanto influence?
fìkem: this (action)
fìpo: this one (person or thing)
fìtseng(e): here, this place
fyape: how

hu: with (accompaniment)
ìlä (also: ilä): by, via, following

kawkrr: never
kawtu: no one
ke: not
ke’u: nothing
kea: no (before a noun)
kehe:  no
kempe: what (action)
pe: what (before a noun)
pefya: how
pehem: what (action)
pehrr: when
pelun: why
pesu: who
peu: what (thing)

tsa’u: that (thing)

tsakem: that (action)
tsakrr: then, at that time
tsat: that (as object)  (Ah! accusative language)
tsatseng: there, that place
tsatu: that person

Verbs

lu: be (am, is, are)

Na’vi word list and link to 1st book

UPDATE. This is the most popular post on my blog. You probably got here from Wikipedia. The Na’vi community is almost entirely on learnnavi.org, where you can find better resources. This particular page isn’t the best online Na’vi word list; it was merely the 2nd. Also, don’t forget the Na’vi Wikibooks pages, which is where the Wikipedia article went after it grew too large.

This came from someone who copied the word list from the Avatar Guide:
Buy book here: Avatar Guide
Forum of first poster Dec 9, 09 : Avatar Forum

’ampi
touch
’aw
one
’awpo
one individual
’awsiteng
together
’awve
first
’e’al
worst
’eko
attack
’ekong
beat (rhythmic)
’engeng
level
’eveng
child
’evi
kid (affectionate form of “child”)
’ì’awn
remain, stay
’ìheyu
spiral
’ìnglìsì
English language
’it
bit, small amount
’itan
son
’ite
daughter
’ok
remembrance
’ong
unfold, blossom
’upe
what (thing)
’upxare
message
a
which, that
äie
vision
alak
ready
alìm
far away, at a distance
Amhul
child’s name
Änsìt
child’s name
apxa
large
atan
light
atokirina
atokirina, seeds of the great tree
ätxäle
request
atxkxe
land
au
drum (made of skin)
aungia
sign, omen
ayfo
they
aylaru
to the others (contraction of aylaheru)
aynga
you
ayoe
we (exclusive)
ayoeng
we (inclusive)

E

ean
blue
eltu
brain
eltu si
pay attention, quit goofing off
eltungawng
brainworm
emza’u
pass (a test)
eo
before, in front of
eyk
lead
eyktan
leader
Eyaye
Plant species (Warbonnet)
Eywa
“World spirit” – guiding force and deity (equivalent to Gaia); Eywa PN
Eywa ngahu
good-bye, Eywa (be with you)

F

fa
with (by means of)
faheu
smell
faketuan
“Alien” (i.e. non-Na’vi)
fì’u
this (thing)
fìfya
this way, like this
fìkem
this (action)
fìpo
this one (person or thing)
fìtseng(e)
here, this place
fkarut
peel
fkeu
mighty
fko
one
fmawn
news, something to report
fmetok
test
fmi
try
fnu
quiet (= be quiet)
fo
they
fpak
hold off, suspend action
fpe’
send
fpeio
challenge (ceremonial)
fpi
for the sake of
fpom
well-being; peace
fpxafaw
medusa (animal)
fpxäkìm
enter
fra’u
everything
frapo
everyone
ftang
stop
fte
so that
fteke
so that not, lest
ftia
study
ftu
from (direction)
ftue
easy
ftxey
choose
fu
or
fya’o
path, way
fyape
how
fyawìntxu
guide

H

ha
so (or “in that case”)
hapxì
part
hasey
done, finished
hawnu
protect, shelter
hì’i
small
hiyìk
funny (strange)
hrrap
danger
hu
with (accompaniment)
hufwe
wind
hum
leave, depart

I

i’en
stringed instrument
Iknimaya
Thundering rocks, Stairway to Heaven (floating mountains)
ikran
mountain banshee
ìlä (also: ilä)
by, via, following
ioang
animal, beast
irayo
thank you
iveh k’nivi s’dir
baby carrier/sling

K

ka
across
kä
go
kaltxì
hello
kämakto
ride out
kame
see (spiritual sense)
kangay
valid
karyu
teacher
kato
rhythm
kawkrr
never
kawng
bad, evil
kawtu
no one
ke
not
ke’u
nothing
kea
no (before a noun)
kehe
no
kelku
home
kelku si
live, dwell
Kelutrel
Hometree (tree for living in)
kempe
what (action)
kenong
model, represent, exemplify
kerusey
dead
ketuwong
alien
kewong
alien
keye’ung
insanity
kì’ong
species of fruit or vegetable
kifkey
world (physical, solid)
kìm
spin
kin
need
kinä
seven
kinam
leg
kinamtil
knee
kìng
thread
kip
among
kìte’e
service
kìyevame
good-bye, see you again soon
kllfriyo’
be responsible
kllkulat
dig up
kllkxem
stand
kllpxìltu
territory
kllte
ground
krr
time
krrnekx
take/consume time
krrpe
when
kunsìp
gunship
kurakx
drive out
kxam
middle, midpoint
kxamtseng
center (or place in the middle)
kxangangang
boom
kxanì
forbidden
kxawm
perhaps, maybe
kxener
species of fruit or vegetable
kxetse
tail

L

lahe
other
lam
seem, appear
lapo
other one (person or thing)
latem
change
law
clear, certain
lehrrap
dangerous
Lenay’ga
Slinger (predatory creature)
lertu
colleague
lì’u
word
lok
close to
lonu
release, let go
lrrtok
smile
lu
be (am, is, are)
lumpe
why

M

makto
ride
mawey
calm
mefo
they (those two)
menari
eyes (two)
meuia
honor
meyp
weak
mì
in
mikyun
ear
mìn
turn
mìso
away (position)
mllte
agree
moe
we two (exclusive)
mokri
voice
muiä
proper, fair, right
mun’i
cut
mune
two
munge
take, bring
muntxa
mated

N

na
like, as
na’rìng
forest
Na’vi
the People (name for themselves)
nang
particle for surprise or exclamation
nantang
viperwolf
nari
eye
nari si
watch out, be careful
nawm
great, noble
ne
to (direction)
ne kllte!
get down! (literally, “to the ground!”)
ne’ìm
back (direction)
nekx
burn, consume
neto
away (direction)
neu
want
nga
you
ngawng
worm
ngay
true
ngenga
you (honorific form)
ngop
create
nì’aw
only
nì’awtu
alone (as one person)
nì’awve
first
nì’it
small amount, a bit
nì’ul
more
niä
grab
nìayoeng
like us (pronounced nayweng)
nìftue
easily
nìftxavang
passionately, with all heart
nìhawng
too, excessively
nìltsan
well
nìmun
again
nìn
look at
nìngay
truly
nìtam
enough
nìtut
continually
nìtxan
much
nìwin
fast
nìwotx
all (of), in toto, completely
nulkrr
longer (time)
nume
learn

O

oe
I
oeng
we two (you and I)
ohe
I (deferential or ceremonial form)
olo’
clan
olo’eyktan
clan leader
Omatikaya
Blue Flute Clan
omum
know
ontu
nose

P

pähem
arrive
pak
particle for disparagement
palulukan
Dry Mouth Bringer of Fear (Thanator)
pa’li
Direhorse
pam
sound
pamtseo
music
pänutìng
promise (a thing to someone)
pätsì
badge
pawm
ask
pe
what (before a noun)
pefya
how
pehem
what (action)
pehrr
when
pelun
why
peng
tell
peseng
where
pesu
who
peu
what (thing)
pey
wait
pizayu
ancestor
plltxe
speak
po
he, she
poan
he
poe
she
pongu
group of people, party
pxan
worthy
pxasul
fresh, appealing as food
pxay
many
pxel
like, as
pxi
sharp
pxun
arm
pxuntil
elbow

R

rä’ä
do not
ral
meaning
ralpeng
interpret
Ralu
child’s name
rawke
alarm cry
renu
pattern
rey
live
rikx
move, shift position
rim
yellow
rina’
seed
riti
stingbat
rol
sing
ronsem
mind
rutxe
please

S

sa’nok
mother
san
saying; quote
sa’nu
mommy
sänume
teaching, instruction
sat
that (after ftu only)
sempu
daddy
sempul
father
set
now
sevin
pretty
seyri
lip
seze
blue flower
si
do, make
sì
and
sìk
unquote
sìlronsem
clever (thing)
sìltsan
good
ska’a
destroy
skxawng
moron
slä
but
slu
become
sngä’i
begin, start
sngä’ikrr
beginning, start time
snumìna
dim (of a person)
som
hot
spe’e
capture
spe’etu
captive
srak(e)
marker for yes-no questions
srane
yes
sreu
dance
srung
help, assistance
steftxaw
examine
stum
almost
sutx
track, lock up
swaw
moment
swirä
creature
swizaw
arrow
swok
sacred
swotu
sacred place
syaw
call

T

ta
from (various uses)
ta’em
from above
täftxu
weave
täftxuyu
weaver
takuk
strike
talioang
sturmbeest
tam
suffice, “do”
tangek
trunk (of a tree)
tanhì
star
taron
hunt
taronyu
hunter
taw
sky
tawng
dive
tawng
duck
Tawtute
Sky Person
te
particle used in full names
telem
cord
tengfya
as (= same way as)
tengkrr
while (= same time as)
terkup
die
teswotìng
grant
teya
full
teylu
beetle larva(e)
tìfmetok
test
tìftang
stopping
tìhawnu
protection
tìkawng
evil
tìkenong
example
tìkin
need
til
joint, hinge
tìng
give
tìng mikyun
listen (usually pronounced: tìm mikyun)
tìng nari
look (usually pronounced: tìn nari)
tìngay
truth
tìran
walk
tirea
spirit
tireafya’o
spirit path
tireaioang
spirit animal
tìrey
life
tìrol
song
tìtxur
strength
to
than; comparative mark
tok
be at, occupy a space
toktor
doctor
tokx
body
tompa
rain
toruk
last shadow (great leonopteryx)
trr
day
tsa’u
that (thing)
Tsahaylu
bond (neural connection)
Tsahìk
matriarch
tsakem
that (action)
tsakrr
then, at that time
tsam
war
tsampongu
war party
tsamsiyu
warrior
tsap’alute
apology
tsat
that (as object)
tsatseng
there, that place
tsatu
that person
tsawke
sun
tsawl
big (in stature)
tse’a
see (physical sense)
tseng(e)
place
tsengpe
where
tseo
art
tsìng
four
tsìvol
thirty-two (octal: 40)
tsko
bow (weapon)
tsko swizaw
bow and arrow
tskxe
rock, stone
tskxekeng
training, exercise
tslam
understand
tsleng
false
tslolam
Got it. I understand.
tsmuk, tsmuktu
sibling
tsmukan
brother
tsmuke
sister
tsnì
that
tspang
kill
tsranten
matter, be of import
tsteu
brave
tsun
can, be able
tswayon
fly
tsyal
wing
tukru
spear
tul
run
tung
allow
tupe
who
tute
person
tutean
male (person)
tutee
female (person)
txan
great (in quantity); much
txantslusam
wise, much-knowing
txe’lan
heart
txele
matter (subject)
txen
awake
txep
fire
txey
halt
txìm
butt, rear end
txìng
leave, abandon
txo
if
txoa
forgiveness
txokefyaw
if not, or else
txon
night
txopu
fear
txum
poison
txur
strong

U

ulte
and
ultxa
meeting
unil
dream
Uniltaron
Dream Hunt
uniltìrantokx
avatar; dreamwalker body
uniltìranyu
dreamwalker
utral
tree
Utral Aymokriyä
Tree of Voices

V

virä
spread, proliferate
vofu
sixteen
vrrtep
demon
vul
branch (of a tree)

W

way
song
waytelem
songchord
wintxu
show
wrrpa
outside
wutso
dinner, served meal

Y

ye’rìn
soon
Yerik
hexapede
yey
straight
yìm
bind
yol
long (of time)
yom
eat
yomtìng
feed
yur
wash

Z

za’ärìp
pull
za’u
come
zamunge
bring
zekwä
finger
zene
must
zìsìt
year
zong
save
zongtseng
safe place, refuge
zoplo
offence, insult

Language Pre-construction

NEW. I just created the unofficial Na’vi language mailing list on Yahoo. Please consider joining if you have any interest in deciphering the Na’vi language.

I’m reading The Horse, the Wheel, and Language, which is about PIE (and who doesn’t like PIE?), and I realized that if we can reconstruct PIE, we should be able to pre-construct Na’vi, the as-yet-unreleased language. We have about 10 words of Na’vi, or fewer words than were used to reconstruct Virginian Algonquian. We don’t have any related language, although all attested human languages are potential sources; any new language is likely to rehash something that’s already been done before. Using the existing 10 words, we can probably construct an alphabet. We know that words are going to be 2 to 15 symbols long, will have a similar phonetic distribution to the samples, will follow the same internal phonetic rules, etc.

What’s the alphabet? Identification of the full alphabet might be the first challenge. If the 10 sample words didn’t cover the whole alphabet, we could at best generate the odds that the language has more sounds by comparing it against other common human languages. Human languages have between 11 and 112 phonemes, with a mid-range of 61! I’d be worried about calculating an average because an average of the number of phonemes in human languages today will be heavily biased by survivors and correlations between families (e.g. all Romance languages will have a similar number of phonemes). Languages end up with phoneme distributions that maximize the distance between vowels (i.e., a language wouldn’t have only i, e, y because they all sound similar). So if the sample attested language had clusters (too many fricatives, too few stops) and few phonemes, we could infer that we were missing letters.

Finding the Uniform Distributions and the contingent distributions. I’ve been working on what a word generator would look like. I figure it would follow some sort of chain process: the first letter follows one distribution, and each later letter follows a distribution that depends on the letters before it. Each distribution gives the odds of each letter of the alphabet being the next letter, all odds summing to 1 of course.

1st letter: a .05, b .03, c .02 … etc. for each letter of the alphabet.

2nd letter when the 1st is a: a 0, b .1, c .2 … etc.

3rd letter when the first two are ab: a .02, b 0, c 0 … etc.

For an alphabet of 20 letters and words 5 letters long, this distribution has 20^5 parameters (3.2 million), although most of these are going to be zeros.  Hmm.  Maybe this needs to be constrained in advance.

So if we have say three categories of sounds, 2 kinds of consonants and vowels, we would have a distribution like:

1st letter : voiced consonant : 30%, unvoiced consonant : 65%, vowel 5%

2nd letter if 1st letter is voiced consonant: voiced consonant : 50%, unvoiced consonant : 5%, vowel 45%

The above distributions would have 3^5 =243 parameters.

If distributions have a short memory (i.e. the odds of a letter appearing depend only on the last 2 letters) and the distributions are re-used on each syllable break, then we have only a few parameters. For example, if the maximum syllable is 4 letters, after which any new letters have the same distribution as if a new word was starting, we’d only have 3^4 = 81 parameters to estimate.
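The parameter counts quoted so far are all the same piece of arithmetic: the number of symbol classes raised to the number of positions. A tiny sketch to double-check them (table_size is just a name I picked):

```python
def table_size(classes: int, length: int) -> int:
    """Entries in a table holding one probability per sequence of `length` symbols."""
    return classes ** length

print(table_size(20, 5))  # 20-letter alphabet, 5-letter words
print(table_size(3, 5))   # 3 sound categories, 5-letter words
print(table_size(3, 4))   # 3 sound categories, syllables of at most 4 letters
```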

I’ll leave as an exercise to the reader the search for additional constraints to get the number of parameters down to less than the number of data points.

Speaking of data points, how many data points do we have with a word list of 10 words? We have about 50 letters, not enough to even get the probability distribution for the alphabet. If we had 20 unique letters, then we’d have 20 probabilities to estimate, or about 2 data points per parameter. The confidence intervals would be huge. We could probably estimate the distribution of vowels vs consonants or fricatives vs sibilants better than the odds of each particular letter, so if we had 30% fricatives, we’d divide 30% by the number of fricatives to get the odds of ‘f’.
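To put a rough number on “huge”, here is a minimal sketch using the textbook normal-approximation (Wald) interval for a proportion. That approximation is itself shaky at sample sizes this small, so treat the output as an illustration of scale only. The 5% letter frequency is an assumed example value, and wald_interval is my own helper name.

```python
import math

def wald_interval(p_hat: float, n: int, z: float = 1.96):
    """Approximate 95% interval for a proportion, clipped to [0, 1]."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

letters_observed = 50   # about 50 letters in a 10-word list
parameters = 20         # one probability per letter of a 20-letter alphabet
print("data points per parameter:", letters_observed / parameters)

# Assumed example: a letter that makes up 5% of the sample.
low, high = wald_interval(0.05, letters_observed)
print("interval for a 5% letter:", round(low, 3), "to", round(high, 3))
```

The interval runs from zero to more than double the estimate, so with this little data we can’t even say whether the letter is rare or fairly common.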

Another possible pattern. The above phonetic model doesn’t take into consideration “in-between-ness”. So if I have a word generator that generates just the consonants, then there would be odds for vowels falling in between certain combinations. This likewise has lots of parameters. If there are 20 consonants and 5 vowels, then there will be 20 × 20 = 400 consonant pairs. That’s 400 × 5 = 2000 probabilities, though admittedly many of them are 0.
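A minimal sketch of how the in-between counts could be tallied from a word list, assuming single letters stand in for sounds (so tx or ng would be miscounted as two consonants) and assuming the vowel set below. The sample words come from the word list; between_counts is a made-up name.

```python
from collections import Counter, defaultdict

VOWELS = set("aeiouìä")  # assumed vowel letters; everything else counts as a consonant

def between_counts(words):
    """Count which single vowel falls between each pair of consonants."""
    counts = defaultdict(Counter)
    for word in words:
        # Slide a three-letter window over the word.
        for left, mid, right in zip(word, word[1:], word[2:]):
            if left not in VOWELS and mid in VOWELS and right not in VOWELS:
                counts[(left, right)][mid] += 1
    return counts

sample = ["taron", "tute", "nari"]
for pair, vowels in sorted(between_counts(sample).items()):
    print(pair, dict(vowels))

# Size of the full table for 20 consonants and 5 vowels.
print(20 * 20 * 5)
```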

Illegal Combinations. If the first letter is a, then the 2nd letter will follow distribution Y, and so on.  If certain probabilities fall below a threshold, round down to zero because it means it is an illegal phonetic combination–some sort of truncated distribution.  I’d say on account of data being short, we’d first assume that unobserved combinations were illegal.  The type and the parameters of the distributions could be estimated from available data using jackknifing and bootstrapping.  Or simulation.  I guess the series of probability distributions that was most likely to generate the sample data would be the probability distribution for generating new words.
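Here is a minimal sketch of the simplest version of this idea: a chain where each letter depends only on the previous letter, and any letter-to-letter transition never seen in the sample is treated as illegal (probability zero). It treats each written letter as a sound, which is a simplification, and all function names and the sample word choice are mine. The training words are taken from the word list.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # markers for word start and word end

def train_bigrams(words):
    """Count letter-to-letter transitions, including word start and end."""
    counts = defaultdict(Counter)
    for word in words:
        symbols = [START] + list(word) + [END]
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[prev][nxt] += 1
    return counts

def is_legal(word, counts):
    """True if every transition in the word was observed in training."""
    symbols = [START] + list(word) + [END]
    return all(counts.get(p, {}).get(n, 0) > 0
               for p, n in zip(symbols, symbols[1:]))

def generate(counts, rng, max_len=15):
    """Walk the chain from START until END is drawn or max_len is reached."""
    letters, prev = [], START
    while len(letters) < max_len:
        options = counts[prev]
        # Unobserved transitions have no entry, so they can never be drawn.
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        letters.append(nxt)
        prev = nxt
    return "".join(letters)

SAMPLE = ["taron", "tute", "tsam", "nari", "kelku",
          "utral", "atan", "nume", "rol", "kame"]
counts = train_bigrams(SAMPLE)
rng = random.Random(0)  # fixed seed so runs are repeatable
print([generate(counts, rng) for _ in range(5)])
```

A word cut off at max_len may end on a letter that never ended a sample word, so a stricter generator would throw those away.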

Stability. The resulting set of words would then have to go through a stability test. Certain sound changes in languages are universal, such as the loss of large consonant clusters and maybe things like Grimm’s Law (g–>k–>h–>0). Words that repeatedly drop out from the vocabulary after a simulated test of time would have to be discarded as too unstable.

Root words.  All of the attested words probably have root words in them or are compound words, but there won’t be an easy way to say what the root is. Still, out of 1000s of randomly generated words, the group that has chunks of the attested words will have a higher hit rate than the other words.  There of course has to be an upper limit to this– at most a few dozen words can be generated this way.
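A minimal sketch of the “chunks of attested words” test, assuming a chunk means any run of three consecutive letters. The threshold of three and the name shared_chunks are my own choices, and the candidate word is invented for the example.

```python
def shared_chunks(candidate, attested, min_len=3):
    """Return the chunks of attested words that also appear in the candidate."""
    chunks = set()
    for word in attested:
        # Every run of min_len consecutive letters in the attested word.
        for i in range(len(word) - min_len + 1):
            chunk = word[i:i + min_len]
            if chunk in candidate:
                chunks.add(chunk)
    return chunks

attested = ["taron", "kip"]
print(sorted(shared_chunks("taronkip", attested)))  # invented candidate word
print(sorted(shared_chunks("zzz", attested)))
```

Generated words could then be ranked by how many chunks they share, with the top few dozen kept as possible compounds.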

Distance between words. The other constraint would be distance between words. Languages end up with lots of minimal pairs, so minimal pairs are okay. But beyond a certain threshold, I think the language would either prohibit word pairs or have to introduce stress and tone. For example, if the generating distributions created an extremely narrow set of words, then we can infer the language is tonal. An example would be Chinese, which has a small number of syllables that are given tones to generate more words.
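Counting minimal pairs in a generated vocabulary is easy to sketch. This version compares written letters position by position, so it treats digraphs as two letters and only catches same-length pairs; both are simplifications. The sample words are from the word list and the function names are mine.

```python
from itertools import combinations

def is_minimal_pair(a: str, b: str) -> bool:
    """True if the words have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def minimal_pairs(words):
    """List every minimal pair among the distinct words, in sorted order."""
    unique = sorted(set(words))
    return [(a, b) for a, b in combinations(unique, 2) if is_minimal_pair(a, b)]

sample = ["txan", "txon", "txen", "pam", "tam", "lam", "utral"]
for pair in minimal_pairs(sample):
    print(pair)
```

The ratio of minimal pairs to total words is one candidate for the threshold mentioned above, though where to set it is anyone’s guess.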

Vowel and consonant harmony. The tongue is lazy. On my favorite podcast, PC Mag Radio, the announcers keep saying AppScout as if it were Abscout. I’m sure it has something to do with the distance the tongue has to move to get from ‘A’ to ‘P’ and back to ‘S’. Interestingly, AppScout is a portmanteau created by the artificial constraint of someone needing an available domain name, so unsurprisingly it is an illegal English word that is so unstable it has already changed a consonant in speech in only a few years.