Genie Community Forum

How to enroll a Thingpedia device for another language

Hello, Almond developers.

I am a graduate student in Korea.
I learned about the Almond community through an introduction from Mehrad.
My colleague and I are trying to train GenieNLP on a Korean (KR) dataset and test Almond with the resulting model.
We are building the KR dataset from scratch, and we have some questions about it.
The questions below are not simple to answer, but we would appreciate your help so that we can join this project.

  1. How can we make a new Thingpedia device using the API of a Korean (non-English) website?

We are trying to write manifest.tt and dataset.tt for a new Thingpedia device. I saw how to make an English version of a device in Almond Dogs. Can you tell us how to write the .tt files from scratch for a Korean device?

  2. How can we enroll a new Thingpedia device on a local server?

We found that enrolling a new device is possible on your dev Almond server, but we couldn't find the same feature in local versions of almond-server.
(Once we have a new model, we want to test it in Almond.)

  3. How can we make the files needed to build the KR dataset?

We are currently following the page below to build the KR training dataset.
https://github.com/stanford-oval/genie-toolkit/blob/master/doc/tutorial-basic.md
According to that page, creating a new device requires files such as thingpedia.tt, entities.json, etc.
Can you help us make those files and explain the process?

Thanks!

  1. How can we make a new Thingpedia device using the API of a Korean (non-English) website?

The easiest way is to take the English version of a manifest file and translate all the translatable annotations (the ones marked with #_[ instead of #[).
The Thingpedia and ThingTalk documentation should also have additional pointers on the syntax for classes (manifest.tt files) and datasets (dataset.tt).
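
To see how much there is to translate, you can list every #_[ annotation in a manifest. The snippet below is only a rough sketch: the regular expression is a simplification rather than a real ThingTalk parser, and the class name and sample manifest are hypothetical, made up for illustration.

```python
import re

# Matches the start of a translatable annotation such as `#_[canonical=`.
# Plain `#[...]` annotations have no underscore and are not matched.
TRANSLATABLE = re.compile(r"#_\[\s*(\w+)\s*=")


def find_translatable(source: str) -> list[tuple[int, str]]:
    """Return (line number, annotation name) for each `#_[` annotation."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in TRANSLATABLE.finditer(line):
            found.append((lineno, match.group(1)))
    return found


# Hypothetical, simplified manifest fragment (not a real Thingpedia device).
SAMPLE = '''class @com.example.dog
#_[name="Dog API"]
#_[description="Random dog pictures"]
#[version=1] {
  query get(out picture_url: Entity(tt:picture))
  #_[canonical="dog picture"]
  #[doc="returns one picture"];
}
'''

if __name__ == "__main__":
    for lineno, name in find_translatable(SAMPLE):
        print(f"line {lineno}: {name}")
```

Every line it reports carries a string that needs a Korean version; the #[ annotations (version, doc here) stay as they are.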

  2. How can we enroll a new Thingpedia device on a local server?

You should follow the instructions in the Thingpedia testing guide. Basically, you make a folder containing a subfolder with your Thingpedia device, with the subfolder named as the Thingpedia device ID, and then you point almond-server to your folder.

In fact, the easiest approach is to start from the thingpedia-common-devices repository, which is already set up that way and also has the Genie Makefiles to generate the dataset and train the model.
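
As a rough picture of the folder layout described above (one subfolder per device, named after the device ID), here is a small sketch that builds an example folder and lists what is in it. The device ID com.example.kr_weather is hypothetical, and the two .tt files are only the ones mentioned in this thread; a real device may need more files, so treat the testing guide as the authority.

```python
from pathlib import Path
import tempfile


def make_example(root: Path) -> Path:
    """Create a folder holding one device subfolder named after its ID."""
    device = root / "com.example.kr_weather"  # hypothetical device ID
    device.mkdir(parents=True)
    (device / "manifest.tt").write_text("// class definition goes here\n", encoding="utf-8")
    (device / "dataset.tt").write_text("// example commands go here\n", encoding="utf-8")
    return root


def list_devices(developer_dir: Path) -> dict[str, list[str]]:
    """Map each device ID (subfolder name) to the .tt files found in it."""
    devices = {}
    for sub in sorted(developer_dir.iterdir()):
        if sub.is_dir():
            devices[sub.name] = sorted(p.name for p in sub.glob("*.tt"))
    return devices


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = make_example(Path(tmp))
        for device_id, files in list_devices(root).items():
            print(device_id, files)
```

The folder passed to list_devices is the one you would point almond-server to.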

  3. How can we make the files needed to build the KR dataset?

thingpedia.tt is the concatenation of the manifest.tt files of all the skills that you want to build a model for. Similarly, dataset.tt is the concatenation of the dataset.tt files of the individual skills.
You can download entities.json from https://thingpedia.stanford.edu/thingpedia/api/v3/entities/all
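
The concatenation step can be sketched in a few lines. This assumes the per-device folder layout from the previous answer; the function name and the blank line placed between files are my own choices, not something Genie requires.

```python
from pathlib import Path
import tempfile


def concatenate(skill_dirs: list[Path], filename: str) -> str:
    """Join the given file from every skill folder, in the order listed."""
    parts = []
    for skill in skill_dirs:
        text = (skill / filename).read_text(encoding="utf-8")
        parts.append(text.rstrip("\n"))
    # One blank line between skills keeps the combined file readable.
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        skills = []
        for name in ("com.example.a", "com.example.b"):  # hypothetical IDs
            (root / name).mkdir()
            (root / name / "manifest.tt").write_text(f"class @{name} {{}}\n", encoding="utf-8")
            skills.append(root / name)
        (root / "thingpedia.tt").write_text(concatenate(skills, "manifest.tt"), encoding="utf-8")
        print((root / "thingpedia.tt").read_text(encoding="utf-8"))
```

Calling concatenate(skills, "dataset.tt") gives you the combined dataset.tt in the same way.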

Finally, you need parameter-dataset.tsv. You will need to write that one yourselves, because you need the Korean version. The format is a TSV file mapping a #[string_values] identifier to the path of a file with the actual parameter values. You can start from the English version, which you can download using thingpedia-cli. If you use the Genie Makefiles in thingpedia-common-devices or genie-toolkit, the English version is prepared automatically.
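
While writing the Korean file, a quick check that every listed value file exists can save some debugging. The sketch below assumes the identifier and the path are the last two tab-separated fields on each line and ignores any fields before them; that column layout is my assumption, so compare it with the English file before relying on it. The identifiers and paths in the usage example are hypothetical.

```python
from pathlib import Path
import tempfile


def read_parameter_index(tsv_text: str) -> dict[str, str]:
    """Map each identifier to its file path (assumed: the last two columns)."""
    index = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least two tab-separated fields: {line!r}")
        identifier, path = fields[-2], fields[-1]
        index[identifier] = path
    return index


def missing_files(index: dict[str, str], base_dir: Path) -> list[str]:
    """Return the identifiers whose value file does not exist under base_dir."""
    return sorted(i for i, p in index.items() if not (base_dir / p).is_file())


if __name__ == "__main__":
    # Hypothetical identifiers and paths, for illustration only.
    tsv = "com.example:city_name\tparams/city_name.tsv\ncom.example:food\tparams/food.tsv\n"
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "params").mkdir()
        (base / "params" / "city_name.tsv").write_text("서울\n", encoding="utf-8")
        print(missing_files(read_parameter_index(tsv), base))
```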

The big piece, though, will not be the skill-specific files; it will be the domain-independent templates. In Genie, those are at https://github.com/stanford-oval/genie-toolkit/tree/next/languages/thingtalk
You basically need to take all the template files under “en”, and translate them to Korean.
We have some work in progress to make the translation a bit less painful (extracting all translatable strings using the gettext workflow). If you want to help us finish that work and make Genie more translatable, that would be wonderful!

For best results, you will also need a language-specific module in Genie. Its job is to do things like split text into words, recognize and parse times, dates, and numbers written in words, convert words to plural and past tense (for languages where that is a thing), and a few more things. If you don't have the module, you get the default implementation, which splits every ideographic character, recognizes only digits for times/dates/numbers, and doesn't do any inflection.
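
To give a feel for what that default behaviour means in practice, here is a rough illustration. It is not Genie's actual tokenizer: the regular expression, the function name, and the choice to treat only Han characters as ideographic are all my assumptions.

```python
import re

# One token per Han ideograph, one per run of ASCII digits,
# and one per run of any other non-space characters.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[0-9]+|[^\s0-9\u4e00-\u9fff]+")


def tokenize(text: str) -> list[str]:
    """Split text the naive way: no word segmentation, no inflection."""
    return TOKEN.findall(text)


def numbers(tokens: list[str]) -> list[int]:
    """Only digit tokens count as numbers; numbers spelled in words are missed."""
    return [int(t) for t in tokens if t.isascii() and t.isdigit()]


if __name__ == "__main__":
    for sentence in ("오후 3시에 알람", "明日 7時"):
        tokens = tokenize(sentence)
        print(tokens, numbers(tokens))
```

In this sketch the Korean counter and particle stay glued together after the digit, and a number written out in words would not be recognized at all. Those are the gaps a Korean language module would fill.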