Hi shokan,
This is certainly one way to support different languages - but in reality, the limitation runs much deeper than just training a model.
To support a new language, you need:
- a version of almond-tokenizer that supports that language
- a Genie construct template pack for that language
- translations for all Almond libraries (thingtalk, almond-dialog-agent, thingengine-core)
- translations of Thingpedia metadata (slot-filling questions, canonical forms, confirmation strings)
- translations of Thingpedia primitive templates (dataset.tt files)
- a minimal set of string value sets that include at least tt:location, tt:word, tt:short_free_text and tt:long_free_text
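As an illustration of the primitive-template translation work, a translated dataset.tt entry might look roughly like the following sketch (the @com.twitter function and the Italian utterances here are hypothetical examples, not actual Thingpedia content):

```tt
dataset @com.twitter language "it" {
  // hypothetical Italian primitive templates for a "post" action
  action (p_status : String) := @com.twitter.post(status = p_status)
  #_[utterances=["twitta ${p_status}",
                 "pubblica ${p_status} su twitter"]];
}
```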
Optionally, you might also want
- special purpose postprocessing / augmentation in Genie
- translations for the Almond frontends
- skill-specific string value sets
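To give a concrete idea of what "special purpose postprocessing" means, here is a minimal sketch (not Genie's actual implementation) of language-specific detokenization - one of the hooks a new language typically needs, since Chinese joins tokens without spaces while most European languages use spaces and attach punctuation:

```python
def detokenize(tokens, language):
    """Join a token sequence back into a natural sentence.

    Illustrative only: real postprocessing in Genie handles many more
    cases (quotes, contractions, spacing around entities, etc.).
    """
    if language in ('zh-cn', 'zh-tw'):
        # Chinese is written without spaces between words
        return ''.join(tokens)
    out = []
    for tok in tokens:
        if tok in ',.!?;:' and out:
            out[-1] += tok  # attach punctuation to the previous word
        else:
            out.append(tok)
    return ' '.join(out)
```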
Once all the required pieces are in place (with at least one skill) we’ll be happy to train a model and deploy it on the public server.
The current status is:
For Chinese (Simplified + Traditional):
- tokenizer is done
- Genie construct templates are available in a branch
- we have translated a number of dataset.tt files, but we have not uploaded them yet
- we have not translated the rest of the Thingpedia metadata
For Italian:
- tokenizer is done
- we are working on Genie construct templates (as low priority)
- we don’t have any translation of Thingpedia yet (only one skill for testing)
For other languages supported by Stanford CoreNLP (Arabic, French, German, Spanish, Russian, Swedish, Danish):
- tokenizer could use CoreNLP, but number and time normalization would need to be provided by a separate package (e.g. HeidelTime)
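To sketch the kind of number normalization the tokenizer must provide on top of CoreNLP (purely illustrative - this is not the almond-tokenizer code, and real support, e.g. via HeidelTime for time expressions, is far more involved), here is a toy normalizer for small Spanish cardinals:

```python
# Hypothetical example: map spelled-out Spanish cardinals to numeric
# placeholders, in the spirit of the NUMBER_* entities the tokenizer emits.
ES_UNITS = {'cero': 0, 'uno': 1, 'dos': 2, 'tres': 3, 'cuatro': 4,
            'cinco': 5, 'seis': 6, 'siete': 7, 'ocho': 8, 'nueve': 9,
            'diez': 10}

def normalize_numbers(tokens):
    """Replace spelled-out number words with NUMBER_n placeholders and
    return the placeholder-to-value mapping alongside the tokens."""
    out, entities = [], {}
    for tok in tokens:
        if tok in ES_UNITS:
            placeholder = f'NUMBER_{len(entities)}'
            entities[placeholder] = ES_UNITS[tok]
            out.append(placeholder)
        else:
            out.append(tok)
    return out, entities
```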
For all other languages, work needs to be done from scratch.