Hi shokan,
This is certainly one way to support different languages - but in reality, the limitation runs much deeper than just training a model.
To support a new language, you need:
- a version of almond-tokenizer that supports that language
- a Genie construct template pack for that language
- translations for all Almond libraries (thingtalk, almond-dialog-agent, thingengine-core)
- translations of Thingpedia metadata (slot-filling questions, canonical forms, confirmation strings)
- translations of Thingpedia primitive templates (dataset.tt files)
- a minimal set of string value sets that include at least tt:location, tt:word, tt:short_free_text and tt:long_free_text
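As an illustration of the primitive-template translation work, a translated dataset.tt entry might look roughly like the following sketch (the @com.twitter function and the Italian utterances here are hypothetical examples, not actual Thingpedia content):

```tt
dataset @com.twitter language "it" {
  // hypothetical Italian primitive templates for a "post" action
  action (p_status : String) := @com.twitter.post(status = p_status)
  #_[utterances=["twitta ${p_status}",
                 "pubblica ${p_status} su twitter"]];
}
```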
Optionally, you might also want
- special purpose postprocessing / augmentation in Genie
- translations for the Almond frontends
- skill-specific string value sets
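To give a concrete idea of what "special purpose postprocessing" means, here is a minimal sketch (not Genie's actual implementation) of language-specific detokenization - one of the hooks a new language typically needs, since Chinese joins tokens without spaces while most European languages use spaces and attach punctuation:

```python
def detokenize(tokens, language):
    """Join a token sequence back into a natural sentence.

    Illustrative only: real postprocessing in Genie handles many more
    cases (quotes, contractions, spacing around entities, etc.).
    """
    if language in ('zh-cn', 'zh-tw'):
        # Chinese is written without spaces between words
        return ''.join(tokens)
    out = []
    for tok in tokens:
        if tok in ',.!?;:' and out:
            out[-1] += tok  # attach punctuation to the previous word
        else:
            out.append(tok)
    return ' '.join(out)
```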
Once all the required pieces are in place (with at least one skill) we’ll be happy to train a model and deploy it on the public server.
The current status is:
For Chinese (Simplified + Traditional):
- tokenizer is done
- Genie construct templates are available in a branch
- we have translated a number of dataset.tt files, but we have not uploaded them yet
- we have not translated the rest of the Thingpedia metadata
For Italian:
- tokenizer is done
- we are working on Genie construct templates (as low priority)
- we don’t have any translation of Thingpedia yet (only one skill for testing)
For other languages supported by Stanford CoreNLP (Arabic, French, German, Spanish, Russian, Swedish, Danish):
- tokenizer could use CoreNLP, but number and time normalization would need to be provided by a separate package (e.g. HeidelTime)
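To sketch the kind of number normalization the tokenizer must provide on top of CoreNLP (purely illustrative - this is not the almond-tokenizer code, and real support, e.g. via HeidelTime for time expressions, is far more involved), here is a toy normalizer for small Spanish cardinals:

```python
# Hypothetical example: map spelled-out Spanish cardinals to numeric
# placeholders, in the spirit of the NUMBER_* entities the tokenizer emits.
ES_UNITS = {'cero': 0, 'uno': 1, 'dos': 2, 'tres': 3, 'cuatro': 4,
            'cinco': 5, 'seis': 6, 'siete': 7, 'ocho': 8, 'nueve': 9,
            'diez': 10}

def normalize_numbers(tokens):
    """Replace spelled-out number words with NUMBER_n placeholders and
    return the placeholder-to-value mapping alongside the tokens."""
    out, entities = [], {}
    for tok in tokens:
        if tok in ES_UNITS:
            placeholder = f'NUMBER_{len(entities)}'
            entities[placeholder] = ES_UNITS[tok]
            out.append(placeholder)
        else:
            out.append(tok)
    return out, entities
```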
For all other languages, work needs to be done from scratch.