Was curious if Google's text-to-speech API might be good enough for generating audio versions of stories on the fly. Google has offered traditional computer voices for a while, but last year made available its premium WaveNet voices, which are trained using audio recorded from human speakers and are purportedly capable of mimicking natural-sounding inflection and rhythm.
Pretty good...but I honestly can't tell the difference between the standard voice and the WaveNet version, at least when it comes to intonation and inflection. The first 2 grafs of this NYT story, roughly 85 words/560 characters, took less than 2 seconds to process. The result in both cases is a 37-second audio file.
- The MP3 audio for Google Cloud's text-to-speech:
  - en-us-Standard-B.mp3 (the "B" standard male voice)
  - en-us-Wavenet-B.mp3 (the "B" WaveNet male voice)
- For comparison's sake, here's AWS's Polly, attempting the same text in the voice of "Matthew". And for funsies, here's "Justin", if you want your NYT articles read by what sounds like a 10-year-old boy.
- The sample text.synthesize POST request for the Wavenet-B voice.
- The sample response, which is basically just a JSON struct containing a single string of the base64-encoded audio (which is what the API returns by default).
The text input is the first 2 paragraphs of the story currently on the NYT's homepage, As McKinsey Sells Advice, Its Hedge Fund May Have a Stake in the Outcome (~85 words, ~560 characters):
The sins of Valeant Pharmaceuticals are well known. Instead of spending to develop new drugs, Valeant bought out other drugmakers, then increased prices of lifesaving medicines by as much as 5,785 percent. Patients had no choice but to pay.
Valeant’s chief executive, J. Michael Pearson, was hauled into a 2016 Senate hearing and verbally thrashed by lawmakers. “It’s using patients as hostages. It’s immoral,” said Claire McCaskill, then the Democratic senator from Missouri. One executive went to prison for fraud. The company’s share price collapsed.
Google offers about 60 voices, male and female, including 28 WaveNet voices that cover English and several European and Asian languages. The cost for WaveNet is $16 per 1 million characters, which is 4x the price of a standard voice. If you create a Google Cloud Platform account, the first million characters per month are free.
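To put those prices in perspective, here's a rough back-of-the-envelope sketch of what the ~560-character sample would cost. The function name `estimate_cost` is just something made up for illustration, the standard-voice price is derived from the "4x" figure above, and the calculation ignores the free monthly allowance entirely.

```python
# Prices quoted in this post, in dollars per 1 million characters.
WAVENET_PER_MILLION = 16.00
STANDARD_PER_MILLION = WAVENET_PER_MILLION / 4  # "4x the price of a standard voice"


def estimate_cost(num_chars, price_per_million):
    """Return the cost in dollars, ignoring the free monthly allowance."""
    return num_chars * price_per_million / 1_000_000


sample_chars = 560  # approximate length of the two-paragraph sample
wavenet_cost = estimate_cost(sample_chars, WAVENET_PER_MILLION)
standard_cost = estimate_cost(sample_chars, STANDARD_PER_MILLION)
print(f"WaveNet:  ${wavenet_cost:.5f}")
print(f"Standard: ${standard_cost:.5f}")
```

Either way, the sample comes out to less than a cent, so the price gap only starts to matter at much larger volumes.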
The v1 API itself is pretty straightforward. You use the text.synthesize POST method, which you can try in the GCP interactive console here.
POST https://texttospeech.googleapis.com/v1/text:synthesize?fields=audioContent&key={YOUR_API_KEY}
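For anyone who'd rather script the call than use the console, here's a minimal sketch using only Python's standard library. The request body fields (`input`, `voice`, `audioConfig`) are my understanding of the v1 REST format, so double-check them against the official reference; the helper names `build_payload` and `synthesize` and the `response.json` output filename are assumptions made for this example.

```python
import json
from urllib import request

API_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_payload(text, voice_name="en-US-Wavenet-B", language_code="en-US"):
    """Return the JSON body for a text:synthesize call as a dict."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def synthesize(text, api_key, outfile="response.json"):
    """POST the text to the API and save the raw JSON response to outfile."""
    url = f"{API_URL}?fields=audioContent&key={api_key}"
    body = json.dumps(build_payload(text)).encode("utf-8")
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    # Network call: needs a valid API key to succeed.
    with request.urlopen(req) as resp:
        raw = resp.read()
    with open(outfile, "wb") as f:
        f.write(raw)
    return outfile
```

Calling something like `synthesize("The sins of Valeant...", api_key="YOUR_API_KEY")` should leave the JSON response on disk. Swapping `voice_name` to `"en-US-Standard-B"` would request the standard voice instead.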
If you've downloaded the JSON response as response.json, you can deserialize it in Python like this:
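from base64 import b64decode
import json
from pathlib import Path

INFILE = 'response.json'

# The response is a JSON object whose audioContent field holds base64 text
data = json.loads(Path(INFILE).read_text())
# Decode the base64 string back into raw MP3 bytes
audio = b64decode(data['audioContent'])
# Write the bytes out as a playable file
Path('audio.mp3').write_bytes(audio)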
You can try the API for yourself without a Google dev account, I think, by going to https://cloud.google.com/text-to-speech/ and scrolling about halfway down the page.
Amazon has its own text-to-speech service, named "Polly". I didn't bother trying to use the API programmatically because it's easy enough to copy and paste text into Polly's landing page. Polly is definitely more robotic-sounding than Google's WaveNet. And in this small sample text, it's less accurate on the proper nouns, e.g. pronouncing "Valeant" as VAIL-e-ent, though that's less surprising than the fact that WaveNet somehow "knows" Valeant's correct pronunciation ("valiant"). Polly charges $4.00 per million characters, which is the same as Google's standard (i.e. non-WaveNet) voices, but as I mentioned above, I had a very hard time telling the difference between the premium WaveNet voice and its standard version.