Genie Community Forum

Other languages

Hi community!
I wanted to clarify how to use skills in non-English languages.

As far as I understand, to use other languages:

  1. one has to deploy the “Web+NLP” version of Almond,
  2. train a custom NLP model that translates text in the custom language to ThingTalk code,
  3. submit skills with examples in that custom language to the Public ThingPedia,
  4. link to the NLP server.

Questions:
- Did I understand correctly how to deploy for a custom language (the steps above)?
- Will a skill in a custom language be accepted to the Public ThingPedia (which is completely in English now, I guess)?
- Or are there any plans to diversify the Public ThingPedia by language?

Hi shokan,

This is certainly one way to support different languages, but in reality the limitation runs much deeper than just training a model.

To support a new language, you need:

  • a version of almond-tokenizer that supports that language
  • a Genie construct template pack for that language
  • translations for all Almond libraries (thingtalk, almond-dialog-agent, thingengine-core)
  • translations of Thingpedia metadata (slot-filling questions, canonical forms, confirmation strings)
  • translations of Thingpedia primitive templates (dataset.tt files)
  • a minimal set of string value sets that include at least tt:location, tt:word, tt:short_free_text and tt:long_free_text

Optionally, you might also want

  • special purpose postprocessing / augmentation in Genie
  • translations for the Almond frontends
  • skill-specific string value sets

Once all the required pieces are in place (with at least one skill) we’ll be happy to train a model and deploy it on the public server.
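
One rough way to track progress against the required list above is to write it down as a checklist. This is only an illustrative sketch: the piece names and the `missing_pieces` / `ready_for_training` helpers are made up for this post and are not part of Genie or Almond.

```python
# Hypothetical checklist of the required pieces listed above.
# The identifiers are illustrative labels, not real Genie/Almond package names.
REQUIRED_PIECES = [
    "tokenizer",
    "construct_templates",
    "library_translations",
    "thingpedia_metadata",
    "primitive_templates",
    "string_value_sets",
]


def missing_pieces(done):
    """Return the required pieces not yet in `done`, in checklist order."""
    done = set(done)
    unknown = done - set(REQUIRED_PIECES)
    if unknown:
        raise ValueError(f"unknown pieces: {sorted(unknown)}")
    return [piece for piece in REQUIRED_PIECES if piece not in done]


def ready_for_training(done):
    """A language is ready once nothing on the required list is missing."""
    return not missing_pieces(done)


if __name__ == "__main__":
    # Example: a language where only the tokenizer exists so far.
    print(missing_pieces({"tokenizer"}))
```

The optional items (postprocessing, frontend translations, skill-specific value sets) are deliberately left out, since they do not block training a model.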

The current status is:

For Chinese (Simplified + Traditional):

  • tokenizer is done
  • Genie construct templates are available in a branch
  • we have translations of a number of dataset.tt files, but we have not uploaded them
  • we have not translated the rest of the Thingpedia metadata

For Italian:

  • tokenizer is done
  • we are working on Genie construct templates (as low priority)
  • we don’t have any translation of Thingpedia yet (only one skill for testing)

For other languages supported by Stanford CoreNLP (Arabic, French, German, Spanish, Russian, Swedish, Danish):

  • tokenizer could use CoreNLP, but number and time normalization would need to be provided by a separate package (e.g. HeidelTime)
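
To give an idea of what "number normalization" means here, below is a toy sketch for French. It is not how almond-tokenizer, CoreNLP or HeidelTime work; the lookup table and the `normalize_numbers` function are assumptions made up for illustration, and a real normalizer also has to handle compound numbers, ordinals, dates and times.

```python
# Toy illustration of number normalization for one language (French).
# The small lookup table is an assumption for this example only.
FRENCH_NUMBERS = {
    "un": 1, "deux": 2, "trois": 3, "quatre": 4, "cinq": 5,
    "six": 6, "sept": 7, "huit": 8, "neuf": 9, "dix": 10,
}


def normalize_numbers(sentence):
    """Lowercase, split on whitespace, and replace known number words with digits.

    Known limitation: "un" is also the indefinite article, so this naive
    lookup rewrites it to "1" even when it is not a number.
    """
    tokens = sentence.lower().split()
    return [str(FRENCH_NUMBERS.get(token, token)) for token in tokens]


if __name__ == "__main__":
    print(normalize_numbers("baisse le volume de trois"))
```

The "un" ambiguity noted in the docstring is a small example of why this step needs language-specific work rather than a plain dictionary.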

For all other languages, work needs to be done from scratch.

Any update on this? I really appreciate how Genie works natively with Home Assistant, but I am unhappy that it only supports English, as we are native French speakers…

Hi @vincen,

We don’t yet support other languages, but I can tell you that our infrastructure has grown a lot since this was asked in 2019, and we’re very close. We will need community involvement from developers who are native speakers, though!

In particular, one needs to:

  1. Add support for the language to Genie. This includes things like identifying and normalizing numbers, dates, and times, handling singular/plural, grammatical correction, etc. Currently supported: English, Italian, Mandarin Chinese. See the examples at genie-toolkit/italian.ts at master · stanford-oval/genie-toolkit · GitHub and genie-toolkit/italian.ts at master · stanford-oval/genie-toolkit · GitHub
  2. Fully translate the core Genie templates and the basic skills. We use standard Gettext tools; you can take a look at the PO files at genie-toolkit/po at master · stanford-oval/genie-toolkit · GitHub
  3. Translate all the skills at GitHub - stanford-oval/thingpedia-common-devices: Thingpedia interface code for commonly used devices
  4. Train a natural language understanding model for the target language. This is mainly compute time, but will require some amount of dialogue data for validation.
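
For anyone picking up the translation step, here is a minimal sketch of how one might list the strings in a PO file that still need translating. It is a simplified reader written for this post, not a Genie tool: it only understands single-line `msgid` / `msgstr` pairs, and the sample entries are invented rather than taken from the genie-toolkit PO files.

```python
import re

# Matches single-line entries such as: msgid "Hello"
_ENTRY = re.compile(r'^(msgid|msgstr)\s+"(.*)"\s*$')


def untranslated(po_text):
    """Return the msgids whose msgstr is empty.

    Simplified on purpose: plural forms and msgctxt are ignored, the header
    entry (empty msgid) is skipped, and a translation written as a
    multi-line string would be misreported as missing.
    """
    missing = []
    current_id = None
    for line in po_text.splitlines():
        match = _ENTRY.match(line.strip())
        if not match:
            continue
        keyword, value = match.groups()
        if keyword == "msgid":
            current_id = value
        elif current_id:  # msgstr belonging to a non-header entry
            if value == "":
                missing.append(current_id)
            current_id = None
    return missing


# Invented sample content, for illustration only.
SAMPLE = '''
msgid ""
msgstr ""

msgid "Hello"
msgstr "Bonjour"

msgid "What is the weather?"
msgstr ""
'''

if __name__ == "__main__":
    print(untranslated(SAMPLE))
```

For real work the standard Gettext tools are the better choice (for example `msgattrib --untranslated`), since they handle the full PO syntax.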

If anyone would like to help, I’ll be happy to guide them in the right direction!