Grengle #1: generating code-switching audio with Facebook MMS

I previously mentioned my attempts to learn Greek. I haven't yet talked about my experiments with Grozer (and I won't right now, but they involve flash cards, clozes, and code-switching text). The big problem I've had with flash cards and other techniques is that they require my full attention and take time out of the day. The result is that I eventually drop them when something more important (or interesting) comes along.

So I've been wondering for a while whether there is a way to generate audio that stays comprehensible. I have a long commute to work every day, and I tend to use it for listening to the Bible in English. Is there a way to listen to the Bible in English+Greek instead: a code-switching audiobook that over time becomes more and more Greek?

Grengle is my set of experiments to attempt that.

The idea is to have a list of words I "know", and a handful of words that I'm "learning". Then as I listen to a passage, the words I "know" or am "learning" will be played in Greek, and the rest in English. Over time, the system will automatically move the "learning" words into the "known" category, and pick new words to learn.
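Here is a minimal sketch of how that bookkeeping might look. It is not the actual Grengle code: the class name, the exposure-count rule for promotion, and both thresholds are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vocabulary:
    """Tracks which Greek words are 'known' and which are being 'learned'."""
    known: set[str] = field(default_factory=set)
    learning: dict[str, int] = field(default_factory=dict)  # word -> times heard
    promote_after: int = 20  # assumed threshold, not from the real system
    learning_slots: int = 5  # assumed size of the "handful"

    def in_greek(self, word: str) -> bool:
        """True if this word should be played in Greek rather than English."""
        return word in self.known or word in self.learning

    def heard(self, word: str) -> None:
        """Record one exposure; promote the word once it has been heard enough."""
        if word in self.learning:
            self.learning[word] += 1
            if self.learning[word] >= self.promote_after:
                del self.learning[word]
                self.known.add(word)

    def refill(self, candidates: list[str]) -> None:
        """Top up the learning set from candidates (e.g. ordered by frequency)."""
        for word in candidates:
            if len(self.learning) >= self.learning_slots:
                break
            if not self.in_greek(word):
                self.learning[word] = 0
```

Counting exposures is only one possible promotion rule; something closer to spaced repetition could be swapped in without changing the rest.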

The first thing I tried was to get espeak to read Greek and English for me. I've been using espeak to read chapter headings for the Bible for a while now, and though it sounds like a robot, I can understand it. Greek, on the other hand, was terrible. Robotic Greek was beyond me, and so this experiment reached a dead end.

Recently, Facebook AI Research put out a set of models that could (amongst other things) do text-to-speech for over a thousand languages, and Greek was on the list. I finally got around to giving it a go, and once I got it installed, it generated some audio that at first seemed promising.

Then I plugged it into the Grengle interleaver engine, which slices a text into short phrases like the following:

GRK Καί λέγει αυτοῖς. ουκ οίδατε τήν
ENG parable
GRK ταύτην, καί πῶς πάσας τάς
ENG parables will you understand?
ENG sowing
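The slicing step can be sketched roughly as follows. This is an illustration rather than the real interleaver: it assumes the text is already available as one-to-one aligned English/Greek tokens (the hypothetical `Token` type below), which glosses over word-order differences between the two languages.

```python
from itertools import groupby
from typing import NamedTuple


class Token(NamedTuple):
    """One aligned unit of text (hypothetical structure, assumed for this sketch)."""
    english: str
    greek: str


def interleave(tokens, greek_words):
    """Group consecutive tokens by language into (tag, phrase) pairs."""
    def tag(tok):
        return "GRK" if tok.greek in greek_words else "ENG"

    phrases = []
    for lang, run in groupby(tokens, key=tag):
        # Within a run every token shares a language, so pick that side of each pair.
        words = [t.greek if lang == "GRK" else t.english for t in run]
        phrases.append((lang, " ".join(words)))
    return phrases
```

Here `greek_words` would be the union of the "known" and "learning" words, so each run of adjacent same-language words becomes one phrase.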

The engine then used the model to render a bunch of audio files meant to be glued together.
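For the gluing step, a sketch using only Python's standard `wave` module might look like this. It assumes each phrase has already been rendered to its own WAV file; the function name and the strict format check are my own choices here, not part of the engine described above.

```python
import wave


def concatenate_wavs(paths, out_path):
    """Join WAV files end to end; all inputs must share one audio format."""
    fmt = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as w:
            this_fmt = (w.getnchannels(), w.getsampwidth(), w.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                # Mixing sample rates would make some clips play too fast or slow.
                raise ValueError(f"{path} has format {this_fmt}, expected {fmt}")
            frames.append(w.readframes(w.getnframes()))
    if fmt is None:
        raise ValueError("no input files given")
    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for chunk in frames:
            out.writeframes(chunk)
```

The format check matters most if the English and Greek clips come from different models or tools, since a mismatch would otherwise be silently written out at the wrong speed.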

The result was disappointing. There were a few dealbreaking issues:

  • Most importantly, the Greek speech was way too fast for me.
  • The Greek audio seemed to miss words occasionally, e.g. a lone "ό" like the one above (which, yes, wouldn't normally occur in speech).
  • The English audio was rather muffled.

Thus, this second experiment failed.

I have a few ideas of how else I could get an interleaved audio system working:

  • Generate the text and record it myself. Since the words that should be played in Greek vs English are constantly changing, this defeats the purpose of getting a model to read it for me.
  • Attempt to train a model myself (possibly a model that can handle both languages at once). I've never done this before.
  • Give up on the code-switching idea. Chop up some existing recordings into verse snippets, and play each one twice, first in English then in Greek (or vice versa). However, I suspect this will be too large a gap to get good Comprehensible Input at the word level.