Genie Community Forum

PulseAudio event sequence

Can someone please post a network diagram for pulseaudio events such that an extension might have access to the recorded audio and the recognized text?

Hi @jsalsman,

I’m not sure what you’re asking. PulseAudio is a low-level audio library used on Linux - it knows nothing of recognition or text.

We use PulseAudio as the audio backend on the Almond Home Server and Almond Desktop (GNOME) platforms. You can find the code to activate based on speech at https://github.com/stanford-oval/almond-server/blob/master/service/speech_handler.js and the actual recognizer (using the MS Speech API) at https://github.com/stanford-oval/almond-server/blob/master/service/speech_recognizer.js.
After recognition, the text goes to the AssistantDispatcher. The AssistantDispatcher is platform-specific (linked is the home server version), and it identifies, in a platform-specific way, the current conversation to pass the command to. A conversation is an instance of the almond-dialog-agent library, which calls the NLP server and handles dialog management, execution, and replies.

There are no hooks to get raw audio access at the moment, nor are there hooks to get the recognized text (although you are of course welcome to fork and/or extend the core Almond code).
Maybe if you clarify what your use case is, I can see whether it is already supported, or what the best way to support it would be.
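
To give a concrete idea of what a fork could do, here is a rough sketch of a speech handler that re-emits both the raw audio frames and the final transcript as events an extension could subscribe to. None of these names (HookableSpeechHandler, feedAudio, and so on) exist in almond-server today; this is only an illustration of where such hooks would sit in the pipeline described above.

```js
// Purely illustrative: almond-server has no such hooks today.
// The idea is that a forked speech handler would re-emit the raw PulseAudio
// frames and the final transcript as events, so an extension (for example a
// pronunciation/intelligibility scorer) can subscribe to both.
const { EventEmitter } = require('events');

class HookableSpeechHandler extends EventEmitter {
    constructor(recognizer, dispatcher) {
        super();
        this._recognizer = recognizer;   // wraps the cloud speech API
        this._dispatcher = dispatcher;   // the platform's AssistantDispatcher
    }

    // Called by the audio backend with each captured PCM buffer
    feedAudio(pcmBuffer) {
        this.emit('audio-frame', pcmBuffer);     // raw sound for extensions
        this._recognizer.feed(pcmBuffer);        // normal recognition path
    }

    // Called when the recognizer returns a final transcript
    onRecognized(text) {
        this.emit('recognized-text', text);      // transcript for extensions
        this._dispatcher.handleCommand(text);    // normal command path
    }
}

// An extension would then do something like:
//   handler.on('audio-frame', buf => scorer.addAudio(buf));
//   handler.on('recognized-text', text => scorer.assess(text));
module.exports = HookableSpeechHandler;
```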


Thank you, Giovanni, I really appreciate the help. The use case is to populate an API such as https://www.speechace.co/api_sample/, but based on authentic listener intelligibility models instead of speech recognition probability scores, as shown in https://www.docdroid.net/gvbP0Jc/paslides.pdf, using the recognized text and recorded audio, and to provide audio (not HTML) remedial feedback, [splitting post due to the two-link limit…]

… and perhaps the kind of visual feedback shown at https://blog.google/products/search/how-do-you-pronounce-quokka-practice-search/

I will take a look at your links and will probably get back to you later in the week.

Best regards, -Jim

This seems very interesting! And although at the UX level it’s a natural fit for a virtual assistant, I am not sure our platform is in the best position to support this at the moment.
Perhaps it would be best to develop this as a temporary fork of Almond, to reuse the conversation/cross-platform UI without the NLU/dialog parts.

Note, though, that we’re in the process of refactoring our speech handling code to switch to the MS Speech SDK. @euirim has been working on this, and might be able to help.
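
For reference, one-shot recognition with the JavaScript flavour of the SDK looks roughly like the sketch below. The subscription key, region, and WAV file are placeholders, and the real integration will have to feed audio from our PulseAudio backend instead, so take it only as an outline of the API shape.

```js
// Minimal sketch of one-shot recognition with the MS Speech SDK for JavaScript.
// The subscription key, region, and 'sample.wav' are placeholders.
const fs = require('fs');
const sdk = require('microsoft-cognitiveservices-speech-sdk');

const speechConfig = sdk.SpeechConfig.fromSubscription('<subscription-key>', '<region>');
const audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync('sample.wav'));
const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

recognizer.recognizeOnceAsync(
    result => {
        if (result.reason === sdk.ResultReason.RecognizedSpeech)
            console.log('Recognized:', result.text);
        recognizer.close();
    },
    err => {
        console.error('Recognition error:', err);
        recognizer.close();
    }
);
```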


@gcampax @euirim I will try to get a working Mozilla Common Voice enhancement (https://wiki.mozilla.org/WeeklyUpdates/2019-12-09#Speakers) soon, enabling intelligibility remediation for single recordings at something like https://repl.it/@jsalsman/recorder.
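
The capture side of that recorder is basically the standard browser MediaRecorder pattern, roughly as sketched below; the /score endpoint is just a placeholder for whatever scoring/remediation service ends up behind it.

```js
// Sketch: capture one utterance in the browser with MediaRecorder, then POST
// the blob for intelligibility scoring. The '/score' endpoint is a placeholder.
async function recordOnce(durationMs = 5000) {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const recorder = new MediaRecorder(stream);
    const chunks = [];

    recorder.ondataavailable = e => chunks.push(e.data);
    const stopped = new Promise(resolve => { recorder.onstop = resolve; });

    recorder.start();
    setTimeout(() => recorder.stop(), durationMs);
    await stopped;

    stream.getTracks().forEach(t => t.stop());               // release the microphone
    const blob = new Blob(chunks, { type: recorder.mimeType });

    await fetch('/score', { method: 'POST', body: blob });   // placeholder upload
    return blob;
}
```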

Sure, well let us know if we can help in any way, and good luck with your project! :slight_smile:

I could really use some help with the enhancement proposal enabling open-source transcription collection: https://discourse.mozilla.org/t/can-db-vote-be-a-boolean-union-with-a-utf-8-string/45941. @gcampax, please use your judgement on how best to support it, and let Euirim and me know how you think we can best support it too.