Vosk on-device Speech-to-text

Since I started using GrapheneOS, a deGoogled Android build, I’ve missed several services you typically get from Apple or Google. One of those core services is Speech-to-Text, which really speeds up note taking, writing text messages, and so on.

I’ve been using a very crude Vosk keyboard on Android to fill the gap. I’d love to improve on that project, but today I’m interested in getting this functionality into Gnome, my desktop of choice on Ubuntu Linux. This is not meant to be a tutorial; it’s more of a journal entry.

Documentation for Gnome extensions is scant. Here is what I could find:

Here is a great YouTube playlist for getting more familiar with creating Gnome extensions.

Damn, don’t you hate it when you don’t save your work? I just lost a bunch of it. DOH!

Let’s see, where was I? I was astonished to find that, in general, the Gnome extensions scene is not super active.

Development is a little rough. I have to switch from Wayland to X11, because that makes reloading extensions a little easier: on Wayland, you have to log out and back in for extensions to refresh. Yikes.

Here’s a directory of existing extensions: https://extensions.gnome.org/

I like to learn from other people’s code, so I installed this extension, which lets you manage your system clipboard: https://github.com/Tudmotu/gnome-shell-extension-clipboard-indicator

I haven’t found anything preinstalled for managing extensions, which seems like something that should be readily available in “Settings”. 😮‍💨

Anyway, I started this at 9am and hoped to have something working by noon, but time is dwindling. I just spent some time creating an icon in Figma. No matter what I do, the “TXT” in the icon is still hard to see. I may ditch it and just use the mic, but I’ll leave it in for now. I hope Gnome supports SVG, which might render a little nicer. Let’s move on; we have some functionality to create.

Golly, documentation is THIN for gnome extensions.

I’m simply trying to get a button in the tray that changes color when clicked. Also, reloading extensions is still a CHORE: I have to log out, then log back into Gnome each time. Tedious.

I found a solution to that here: https://www.reddit.com/r/gnome/comments/eb4pn9/how_do_i_reload_a_gnome_shell_extension_during/

I’m using a reload.sh script that starts a nested Gnome Shell session, which naturally loads all the extensions fresh:

dbus-run-session -- gnome-shell --nested --wayland

My SVG isn’t looking great in there, though. I may have to use a ready-made system icon.

As you can see, the icon is squished, and it doesn’t change color when clicked.

I’ve got the icon working now, but there’s still a styling issue: the icon seems a little small.

I’ve been messing with getting Vosk working properly. I’ve tried a few of the suggested methods, but I’m having a lot of trouble accessing my microphone from Node.js with the ‘mic’ library.

I’m currently leaning towards running Vosk as a Docker service, using the following docker-compose.yml:

version: '3'

services:
  vosk:
    image: alphacep/kaldi-en
    ports:
      - "2700:2700"
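Once the container is up, a quick way to check that something is actually listening on the mapped port is a plain TCP connection attempt. Here is a minimal sketch using only the Python standard library. `port_open` is a hypothetical helper name, and a successful connection only proves the port is open, not that the Vosk server behind it is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection tries every address the host resolves to
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    # 2700 is the port mapped in the docker-compose.yml above
    print("Vosk port open:", port_open("localhost", 2700))
```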

So far, only one of the test scripts I’ve tried actually works:

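#!/usr/bin/env python3
# Streams a WAV file (mono, 16-bit PCM) to the Vosk server and prints each reply.
# Usage: ./test.py file.wav

import asyncio
import sys
import wave

import websockets


async def run_test(uri, path):
    async with websockets.connect(uri) as websocket:
        with wave.open(path, "rb") as wf:
            # First message: tell the server the sample rate of the audio
            await websocket.send('{ "config" : { "sample_rate" : %d } }' % wf.getframerate())
            buffer_size = int(wf.getframerate() * 0.2)  # 0.2 seconds of audio
            while True:
                data = wf.readframes(buffer_size)
                if len(data) == 0:
                    break
                await websocket.send(data)
                # The server answers every chunk with a partial or final result
                print(await websocket.recv())

        # Signal end of stream and print the final result
        await websocket.send('{"eof" : 1}')
        print(await websocket.recv())


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: %s file.wav" % sys.argv[0])
    asyncio.run(run_test("ws://localhost:2700", sys.argv[1]))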

The problem is that it streams a .wav file instead of opening the microphone and transcribing live audio.
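For next time, here is a rough sketch of what a microphone version could look like. It assumes the `sounddevice` package (PortAudio bindings) and the `websockets` package are installed, and it follows the same protocol as the WAV script above: a config message with the sample rate, 0.2 second chunks of raw 16-bit PCM with one reply per chunk, then an eof message. It also assumes the server replies with JSON carrying a `partial` key for in-progress hypotheses and a `text` key for finalized ones. The helper names and the fixed 10-second recording window are hypothetical choices to keep the sketch short.

```python
import asyncio
import json
from typing import Optional

CHUNK_SECONDS = 0.2   # same chunk length as the WAV test script
RECORD_SECONDS = 10   # hypothetical fixed recording window


def config_message(sample_rate: int) -> str:
    """First message sent to the server: the sample rate of the audio."""
    return json.dumps({"config": {"sample_rate": sample_rate}})


def chunk_frames(sample_rate: int, seconds: float = CHUNK_SECONDS) -> int:
    """Number of audio frames in one chunk (16000 Hz * 0.2 s = 3200)."""
    return round(sample_rate * seconds)


def final_text(reply: str) -> Optional[str]:
    """Return finalized text from a server reply, or None for partials/empty."""
    text = json.loads(reply).get("text")
    return text or None


async def stream_microphone(uri: str = "ws://localhost:2700") -> None:
    # Third-party imports live here so the helpers above work without them.
    import sounddevice as sd
    import websockets

    # Use the default input device's native rate and report it to the server.
    sample_rate = int(sd.query_devices(None, "input")["default_samplerate"])
    loop = asyncio.get_running_loop()
    audio: asyncio.Queue = asyncio.Queue()

    def on_audio(indata, frames, time, status):
        # Called on PortAudio's thread: hand the raw bytes to the event loop.
        loop.call_soon_threadsafe(audio.put_nowait, bytes(indata))

    async with websockets.connect(uri) as ws:
        await ws.send(config_message(sample_rate))
        # The stream starts on entering the block and stops on leaving it
        with sd.RawInputStream(samplerate=sample_rate,
                               blocksize=chunk_frames(sample_rate),
                               channels=1, dtype="int16",
                               callback=on_audio):
            for _ in range(round(RECORD_SECONDS / CHUNK_SECONDS)):
                await ws.send(await audio.get())
                # One reply per chunk, same as the WAV script
                text = final_text(await ws.recv())
                if text:
                    print(text)
        # Ask the server to flush whatever it still has buffered
        await ws.send('{"eof" : 1}')
        text = final_text(await ws.recv())
        if text:
            print(text)
```

To try it, call `asyncio.run(stream_microphone())` with the container running. Feeding an `asyncio.Queue` through `loop.call_soon_threadsafe` keeps the PortAudio callback thread from touching the event loop directly, which is the usual way to bridge a callback-based audio API into asyncio.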

That’s enough for today. I’ll pick this project back up at some point.