Almond Tokenizer is at https://github.com/stanford-oval/almond-tokenizer
Its purpose is to preprocess user input, identifying numbers, dates, times, and other entities; it uses Stanford CoreNLP for this.
The easiest way to run it is through Docker:

```bash
docker run -p 8888:8888 stanfordoval/almond-tokenizer
```
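Once the container is running, clients talk to the tokenizer over a plain TCP socket on port 8888. The sketch below shows one way to query it from Node.js; the newline-delimited JSON format and the field names used here are assumptions, so check the almond-tokenizer README for the exact protocol.

```js
// Minimal sketch of a tokenizer client. The protocol details (one JSON
// request per line, and the req/utterance/languageTag fields) are
// assumptions; verify against the almond-tokenizer README.
const net = require('net');

const socket = net.createConnection(8888, 'localhost', () => {
    socket.write(JSON.stringify({
        req: 1,                          // request id, echoed in the response
        utterance: 'wake me up at 7am',  // raw user input to preprocess
        languageTag: 'en'                // language of the utterance
    }) + '\n');
});

socket.on('data', (data) => {
    // The response should contain the tokens and recognized entities
    // (numbers, dates, times, etc.).
    console.log(JSON.parse(data.toString()));
    socket.end();
});
```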
The NLP server is part of Almond Cloud. You run it as:

```bash
almond-cloud run-nlp --port ...
```

or, if running inside a git checkout:

```bash
node ./main.js run-nlp --port ...
```
As with other parts of Almond Cloud, we offer systemd .service files and example Kubernetes manifests to help you get started, but site-specific customization is often needed.
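For reference, a minimal systemd unit for the NLP server might look like the sketch below; the binary path, user, and port are placeholders to adapt to your installation.

```ini
[Unit]
Description=Almond Cloud NLP server
After=network.target

[Service]
Type=simple
User=almond-cloud
# Path and port are placeholders; adjust to your installation.
ExecStart=/usr/bin/almond-cloud run-nlp --port 8400
Restart=on-failure

[Install]
WantedBy=multi-user.target
```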
In particular, for the NLP server you need to specify the URL (`NL_SERVER_URL` in the config) and the path to the directory where the models are stored (`NL_MODEL_DIR`). The latter can be a local path or an Amazon S3 URL.
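For example, assuming your configuration lives in a JavaScript config module (the host name, path, and bucket below are placeholders):

```js
// Placeholders only: substitute your own host name, path, or bucket.
module.exports.NL_SERVER_URL = 'https://nlp.example.com';

// Either a local path...
module.exports.NL_MODEL_DIR = '/srv/almond/models';
// ...or an Amazon S3 URL:
// module.exports.NL_MODEL_DIR = 's3://my-almond-bucket/models';
```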
If you plan to train your own model, the model directory needs to be accessible from the machine running the training server. This can be accomplished using S3, using a network file system like NFS or SMB, or by setting up password-less SSH/rsync and using a `file://` URL with the hostname of the NLP machine as the model directory.
The training server also needs access to the `FILE_STORAGE_DIR` path (or S3 URL), which is shared with the frontend server. This one cannot use rsync; only S3 or NFS/SMB will work.
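Putting the two together, a training machine's configuration might include something like the following sketch (host names, paths, and bucket names are placeholders):

```js
// With password-less SSH/rsync set up, the model directory can point at
// the NLP machine by hostname:
module.exports.NL_MODEL_DIR = 'file://nlp.example.com/srv/almond/models';

// FILE_STORAGE_DIR cannot use rsync; it must be an S3 URL or a path on
// a shared NFS/SMB mount:
module.exports.FILE_STORAGE_DIR = 's3://my-almond-bucket/storage';
```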
(If you’re on AWS, S3 is by far the easiest option; NFS (Amazon EFS) is also easy but quite a bit more expensive. On Azure, SMB (Azure Files) is the best option. Everywhere else, NFS is easiest if you have used it before.)
If you don’t train your own model, you can get by with a pre-trained model downloaded from https://almond.stanford.edu/luinet/models (log in to see the “Download” button).
This works only if your Thingpedia does not diverge too much from the public one; otherwise, accuracy will be very poor, as the NLP server will discard most of the neural network’s predictions.