0064 - making audio versions of my posts

I always feel weird when I'm about to do a "technical post", because it's so different from the sort of thing I usually write about. Which is strange, considering I live immersed in this tech world and AI stuff.

...

For a long while now I've wanted to add audio versions of my posts. At first I entertained the idea of reading them myself, but eventually dismissed it because I just didn't have the time. The next obvious idea was to leverage some of the new (seemingly impressive) advances in AI audio generation (specifically for speech).

Initially this was a no-go because the only decent speech generation models were behind paywalls, and I didn't really want it enough to pay for it. But around a year or so ago we started seeing the release of many open-source speech generation models that rival proprietary ones.

Since then, I've been working on and off on a side project called bumblebee-tts. It's just a simple wrapper around various TTS (Text To Speech) models that takes some written text plus a reference voice sample and produces an audio clip of that text, read in a voice that sounds more or less like the provided sample.

My wrapper takes care of preprocessing and chunking the input to improve the output of the audio generation, and it's been basically unchanged for half a year or so now. The main reason I hadn't used it yet was that model quality, while decent, wasn't quite good enough. However, that changed recently when Alibaba released Qwen3-TTS, which is indeed really good! Some days ago I incorporated it into bumblebee and wired everything up, so now my posts have audio versions! It didn't feel ethically correct to use a random person's voice, so I used a voice clip of myself. People close to me say it doesn't sound like me at all, but I don't know, to me it sounds quite similar!
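To make the chunking idea a bit more concrete, here is a minimal sketch of the general technique. It is not the actual bumblebee-tts code: the function name `chunk_text` is made up for illustration, and it assumes the limit is counted in characters and that splitting on sentence-ending punctuation is good enough. It greedily packs whole sentences into chunks so the model never receives a sentence cut in half.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper for illustration only. A single sentence longer than
    max_chars is kept whole rather than being cut mid-sentence.
    """
    # Rough heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the TTS model separately and the resulting clips joined back together in order.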

Even though the generated audio is good, there are still some weird things, like the pitch of the voice sometimes changing to a deeper register and then fluctuating back. I've seen this happen especially in longer passages. But what really bothers me most is that it doesn't always get the correct inflection. Sometimes it reads a passage that I imagined was happy in a somber tone, while other times it reads a serious sentence in a chirpy tone. Overall I'd say it does a good job, maybe getting it correct around 80% of the time. This space is evolving so quickly that I'm sure these gaps will soon be negligible.

This current post likely doesn't have an audio version yet (depending on when you're reading it), so if you want to check out how all of this looks, you can go to an older post like this one that does have an audio player at the top. I think the player code ended up being quite neat, so if you're interested, feel free to check it out here. (The implementation on the website was really easy thanks to the awesome wavesurfer.js library.)

While adding the audio, I stumbled (by accident; I don't remember exactly where) on a description of how podcast RSS feeds work. It turns out they're basically the same as the normal RSS feed I was already using for my posts, but each entry has an extra enclosure element that points to an audio asset related to that entry. The cool thing about this is that by adding this enclosure element, my existing feed can now be dropped into any podcast player, and it will be registered as a podcast! I guess I'm a podcaster now 😅? Though I suspect very few people will want to listen to my posts as such. Still, it's nice to at least have the option.
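For the curious, here is roughly what that looks like. This is a sketch and not the code that generates my feed: the helper name `add_enclosure`, the example URL, and the file size are all made up. In RSS 2.0 the enclosure element carries three attributes: `url`, `length` (the file size in bytes), and `type` (the MIME type). I'm assuming `audio/x-m4a` for .m4a files here; some feeds use `audio/mp4` instead.

```python
import xml.etree.ElementTree as ET


def add_enclosure(item: ET.Element, url: str, size_bytes: int,
                  mime_type: str = "audio/x-m4a") -> ET.Element:
    """Attach an RSS 2.0 <enclosure> to a feed <item> (hypothetical helper)."""
    return ET.SubElement(
        item,
        "enclosure",
        {"url": url, "length": str(size_bytes), "type": mime_type},
    )


# Build a bare-bones feed entry and attach the audio file to it.
item = ET.Element("item")
ET.SubElement(item, "title").text = "0064 - making audio versions of my posts"
# Placeholder URL and size, not real values from the feed.
add_enclosure(item, "https://example.com/audio/0064.m4a", 1234567)

print(ET.tostring(item, encoding="unicode"))
```

Podcast directories usually expect some extra feed-level metadata on top of this, so treat it as the bare minimum rather than a complete podcast feed.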

...

And yes, now all of this raises the question: "why?" I'm not really that sure, to be honest. Initially, I just thought these TTS models were cool and wanted to play with them since I could run them on my local GPU. But then I guess the thing just evolved. I found myself wondering, "I have this whole pipeline in place, it's easy to run locally, it's fast, so why not use it?" It's the classic example of someone realizing they can do something before stopping to think whether they should, or even asking why for that matter.

But well. For now I'll leave them be. Design-wise, I really like how the player turned out, but at the same time I feel it occupies quite a bit of space at the top. I also feel sort of bad about using the electricity necessary to create these audio files (though it probably isn't all that much). I know some visually impaired users might benefit from them (which would be awesome), but I also feel it's a sort of vanity on my side?

Anyway... ethical quandaries can be dealt with later. For now, I wanted to give a brief primer on how you can generate this audio yourself.

For the moment, my tool does require you to be comfortable at a terminal (though if folks are interested, I might try to package it up as a normal desktop app).

To get started, clone the repo and follow the setup instructions in the README.

Then you'll need to "source" an audio segment of ten or so seconds of you reading some text. I just used my phone's voice recorder, though I did have to record it a couple of times. The models are very sensitive to sample quality, so try to do it in a quiet space (this is a must) and use a good microphone if you have one. I saved my sample as a .wav file. The other thing you'll need is the actual Markdown file you want to create an audio version of.
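If you want to double-check that your sample really is around ten seconds long, here is a small sketch using only Python's standard library. It is not part of bumblebee-tts, and the function name and the file path in the usage comment are placeholders. Note that the `wave` module only reads uncompressed PCM .wav files, so a compressed recording straight from a phone may need converting first.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the length of a PCM .wav file in seconds (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames / frames per second.
        return wav_file.getnframes() / wav_file.getframerate()


# Usage (placeholder path, replace with your own sample):
# print(f"{wav_duration_seconds('my_voice_sample.wav'):.1f} seconds")
```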

Once the prerequisites are out of the way, you can generate an audio file for the Markdown file of your choice by using the following command, which takes around 10 minutes to run on my PC for a ~1k-word file. It will generate an output .m4a file in the output/ folder.

Be sure to replace all the <> placeholders in the command below!

bumblebee-tts generate \
	"<here put the path to your input markdown file>" \
	"<here put the path to your voice clip sample>" \
	--ref-text "<this is the text you read when you did your sample audio clip>" \
	--chunk-size "300" \
	--no-cleanup \
	--crossfade "0" \
	--blog-post \
	--tts-engine "qwen3-tts" \
	--cfg-weight 1.8 \
	--m4a

That's it! If you try it out, I would love to hear how it goes :)

Also, disclaimer: I've only tried this with English text. I don't know how well the conversion works for other languages.