Can you hear me now?

For a while now, my posts have been available as audio files in a podcast format through Odiogo. This service reads my RSS feed and converts each post to an audio file through a Text To Speech (TTS) API. There are a few problems with it, though. First, the MP3 files are not attached to my individual posts on WordPress. Second, they like to insert their own little intro before they start reading my post aloud. Third, there is a timing issue: I never know when the audio version will appear. Finally, the page where people subscribe to the podcast is not all that pleasant to look at.
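To make the "RSS in, speech out" idea a little more concrete, here is a minimal Python sketch of the first half: pulling post titles and plain text out of an RSS 2.0 feed so they can be handed to a TTS engine. This is only my guess at the general shape of such a pipeline, not how Odiogo actually does it. The sample feed, the `posts_from_rss` name, and the choice of the `description` element are all assumptions; a real feed may carry the full post text in a different element.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical, hand-written feed used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Can you hear me now?</title>
      <description>&lt;p&gt;For a while now, my posts have been &lt;em&gt;audio&lt;/em&gt; files.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>"""


def posts_from_rss(feed_xml):
    """Return a list of (title, plain_text) pairs, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext("description", default="")
        # Crude tag stripping: fine for a sketch, not for arbitrary HTML.
        text = html.unescape(re.sub(r"<[^>]+>", " ", body))
        # Collapse the runs of whitespace left behind by the removed tags.
        posts.append((title, " ".join(text.split())))
    return posts


if __name__ == "__main__":
    for title, text in posts_from_rss(SAMPLE_FEED):
        print(title, "->", text)
```

Each plain-text string that comes out of this could then be saved to a file and passed to whichever speech engine ends up doing the reading.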

Since I have been looking at automation lately, I started to look at what it would take to create my own audio files. The first thing that I stumbled upon was Microsoft’s Speech API, otherwise known as SAPI. It is available on my computer as a .NET library called System.Speech. It didn’t take me long to figure out how to use the Speech Synthesizer and start creating WAV files. I did run into a bit of a problem, though: there is only one voice available, named “Microsoft Anna”. I hunted for others on the internet, but nothing seemed to be recognized by the speech synthesizer. Most voices appear to be for SAPI 4.0 and are not compatible with the latest version. A familiar voice, “Microsoft Sam”, was not included in the latest version due to security problems. Many people who are looking for a male voice seem to be upset about this.

Download Source Code and Binary Files

I started looking into alternatives. I even considered using some old 8-bit allophones that I had played around with a couple of years ago and donated to the vault at SL3B, but they would have been too robotic for what I needed. I started looking into Linux solutions and found eSpeak. eSpeak is a lightweight voice synthesizer that converts text to phonemes and then plays the specific audio for each one. It’s multilingual and understandable, but right on the edge of sounding like a robot. It has a handy command line utility that makes it simple to read text files and save them as WAV files.
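For anyone who wants to script that command line utility, here is a minimal Python sketch. It assumes the `espeak` binary is installed and on the PATH. The function names, file names, and default values are my own placeholders; the only eSpeak options used are `-f` (read text from a file), `-w` (write a WAV file instead of playing the audio), `-v` (voice), and `-s` (speed in words per minute).

```python
import shutil
import subprocess


def build_espeak_command(text_path, wav_path, voice="en", words_per_minute=160):
    """Build the argument list for an eSpeak run that writes a WAV file."""
    return [
        "espeak",
        "-v", voice,                  # voice / language to use
        "-s", str(words_per_minute),  # speaking rate
        "-f", text_path,              # text file to read
        "-w", wav_path,               # WAV file to write
    ]


def text_file_to_wav(text_path, wav_path, **options):
    """Run eSpeak on a text file; raises if eSpeak is missing or exits with an error."""
    if shutil.which("espeak") is None:
        raise FileNotFoundError("espeak was not found on the PATH")
    subprocess.run(build_espeak_command(text_path, wav_path, **options), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    text_file_to_wav("post.txt", "post.wav")
```

Keeping the command construction separate from the call that runs it makes it easy to check the arguments without having eSpeak installed.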

I started looking into services and found NeoSpeech. This service offers a few options. For TTS on-demand, you can paste your text into a form, preview the audio, and purchase it if you like the results. You’ll get some credits to try the service out for yourself. The next option is a TTS web service that developers can call over the internet. There are different pricing plans available, including a free one that comes with a few limitations (500 words per call, 100,000 words per month) and embedded advertisements. They also offer the engine itself for single-user applications, as well as a voice server.
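Those free-plan limits mean a long post would have to be split across several calls. Here is a rough Python sketch of that bookkeeping. It does not call NeoSpeech at all, and I am assuming words are counted by splitting on whitespace, which may not match how the service counts them. The function names are hypothetical.

```python
def split_into_calls(text, max_words=500):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def fits_monthly_quota(texts, monthly_limit=100_000):
    """Return True if the combined word count of texts is within the monthly limit."""
    return sum(len(text.split()) for text in texts) <= monthly_limit


if __name__ == "__main__":
    post = "word " * 1200  # stand-in for a 1,200-word blog post
    chunks = split_into_calls(post)
    print(len(chunks), "calls needed")
    print("within monthly quota:", fits_monthly_quota(chunks))
```

This version cuts purely on word count, so a chunk can end in the middle of a sentence. Breaking on sentence or paragraph boundaries would probably sound better when the audio pieces are joined back together.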

One of the remarkable things that I found on NeoSpeech was that Paul was there. Paul is the voice that Odiogo is using to convert my RSS feed into podcasts. I also found Paul available for download as a free trial, or as the full version for $35. The site (at free downloads place) claims that the server for the trial version is offline. With my luck, the SAPI5 voice engine will probably be incompatible with my operating system. Either that, or this is a fly-by-night operation that tries to scam people out of their money.

Another idea that I’m having is to use MorphVOX to augment the voice. Although I can make SAPI speak directly to the default audio device, which MorphVOX controls, I am uncertain whether MorphVOX can also save the result to the file system directly. What I need to determine is whether MorphVOX exposes a library or has a console application that can apply voice filters to audio files and then write the output to a new file.

With the research that I’m doing into these kinds of things, I have two licensing issues to consider. The first is for personal use, to post to my blog on a daily basis. The second would be the potential to create an automated service later on for the masses. For now, I’m just concentrating on me.

2 Responses to Can you hear me now?

  1. I used text-to-speech in a RL project a few years ago and ended up with Cepstral. They had the best voices back then.

  2. Thanks Peter. I’ll check them out.
