Watson Text to Speech – The Costs of Personalization


I tend to write these blog posts to share interesting things that I have learned while working with our customers. Just this past week I have had two or three blog-worthy events happen, so I hope to publish these posts at a brisker pace in the coming months.

This week I worked with a customer that uses the Watson Text to Speech service for short utterances: things like street names, addresses, and city names. They told me that they had no idea how they were being charged for the service.

This particular customer has a focus on producing a positive customer experience. No tinny, mechanical voice for this customer!! They are tweaking the speaking voice and customizing it, using SSML (Speech Synthesis Markup Language) to modify and “humanize” the synthesized speech from the Watson Text to Speech (TTS) service. You can modify everything from the emotion used in the generated speech (called expressive SSML) to more basic things like the pitch and glottal tension (and yes, I had to look up the definition of glottal tension). The typical curl call that they use looks similar to this:

curl -X POST -u apikey:***************************** --header "Content-Type: application/json" --header "Accept: audio/wav" --data "{\"text\":\"<speak><voice-transformation type=\\\"Custom\\\" breathiness=\\\"35%\\\" pitch=\\\"-80%\\\" pitch_range=\\\"60%\\\" glottal_tension=\\\"-40%\\\" >$text</voice-transformation></speak>\"}" --output "$finalFile" "https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize?voice=en-US_MichaelVoice"

So this curl command synthesizes some text (held in the $text variable) with breathiness set to 35%, pitch at -80%, the pitch range set to 60%, and a glottal tension of -40%. I’m sure that someone played with these values before settling on this combination. It’s a great way to customize the sound and the tone of your automated speaking responses.
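Hand-escaping all of those quotes inside a shell string is easy to get wrong. One option is to build the JSON body in a small script instead. Here is a minimal sketch using only the Python standard library. The helper name build_payload is my own invention, the attribute values are copied from the curl call above, and nothing here talks to the service.

```python
import json
from xml.sax.saxutils import escape, quoteattr

# Voice settings copied from the curl example in this post.
VOICE_ATTRS = {
    "type": "Custom",
    "breathiness": "35%",
    "pitch": "-80%",
    "pitch_range": "60%",
    "glottal_tension": "-40%",
}


def build_payload(text: str) -> str:
    """Return the JSON request body for one utterance (hypothetical helper)."""
    # quoteattr() adds the surrounding quotes; escape() protects &, < and >
    # in the spoken text so it cannot break the SSML markup.
    attrs = " ".join(f"{name}={quoteattr(value)}" for name, value in VOICE_ATTRS.items())
    ssml = f"<speak><voice-transformation {attrs}>{escape(text)}</voice-transformation></speak>"
    # json.dumps() takes care of escaping the double quotes for JSON.
    return json.dumps({"text": ssml})


print(build_payload("9 Marine Drive, Round Rock, Texas 78681"))
```

You could write that output to a file and hand it to curl with --data @yourfile.json rather than escaping everything inline.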

How Does This Impact Cost?

The cost of doing something like this will vary, and this is where I learned how some small changes can have a HUGE impact on the costs associated with your Watson solution. The basic price for using the Watson TTS service is $0.02 per thousand characters. There are a couple of things to keep in mind here. Whitespace is NOT counted, so only count the non-whitespace characters. Also, remember that the voice customizations, and everything else between the “<speak>” and the “</speak>” tags, are included in this count.
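If you want to check a payload yourself, here is a minimal Python sketch of the counting rule described above. The function names are my own, and the price constant is simply the $0.02 figure quoted in this post, so treat it as an assumption and check the pricing for your own plan.

```python
PRICE_PER_THOUSAND_CHARS = 0.02  # USD, the rate quoted in this post (assumption)


def billable_chars(payload: str) -> int:
    """Count every non-whitespace character, markup included."""
    return sum(1 for ch in payload if not ch.isspace())


def cost_usd(char_count: int) -> float:
    """Cost of synthesizing char_count billable characters."""
    return char_count / 1000 * PRICE_PER_THOUSAND_CHARS


sample = "<speak>9 Marine Drive, Round Rock, Texas 78681</speak>"
count = billable_chars(sample)
print(count, cost_usd(count))
```

Note that the tags count toward the total just like the spoken text does.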

Now let’s assume that the text being converted includes a home address, something like, “9 Marine Drive, Round Rock, Texas 78681”. Let’s also assume that the user is being referred to by name. There will also be some other text (a meter reading, a service interruption, etc.) informing the end user of something about to happen near their home address. We want to figure out the monthly cost if we estimate that we’ll build and issue 100,000 of these notices in a month. A sample utterance might look like this:

“This message is for Dan Toczala.  We are informing you of a service interruption tomorrow morning at 9 Marine Drive, Round Rock, Texas 78681.  Please call us at 1-800-123-4567 if you have questions.”

Breaking It Down

Your application can look up the customer name and address, and build this entire text string for each individual event, and then submit each one to the Watson Text To Speech service.  Your typical call would look like this:

curl -X POST -u apikey:***************************** --header "Content-Type: application/json" --header "Accept: audio/wav" --data "{\"text\":\"<speak><voice-transformation type=\\\"Custom\\\" breathiness=\\\"35%\\\" pitch=\\\"-80%\\\" pitch_range=\\\"60%\\\" glottal_tension=\\\"-40%\\\" >This message is for Dan Toczala.  We are informing you of a service interruption tomorrow morning at 9 Marine Drive, Round Rock, Texas 78681.  Please call us at 1-800-123-4567 if you have questions.</voice-transformation></speak>\"}" --output "$finalFile" "https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize?voice=en-US_MichaelVoice"

For the purposes of this discussion, we’re going to focus on just the “payload”, the part in the data section of the curl command, because that is the part that determines your costs. So this chunk:

<speak><voice-transformation type=\\\"Custom\\\" breathiness=\\\"35%\\\" pitch=\\\"-80%\\\" pitch_range=\\\"60%\\\" glottal_tension=\\\"-40%\\\" >This message is for Dan Toczala.  We are informing you of a service interruption tomorrow morning at 9 Marine Drive, Round Rock, Texas 78681.  Please call us at 1-800-123-4567 if you have questions.</voice-transformation></speak>\

Now in this example, we count ALL non-whitespace characters inside of the quotes.  We have 336 non-whitespace characters.  Multiply that by 100,000 notices in a month, and I get a rate of 33,600,000 characters a month.  Apply the TTS cost of $0.02 per thousand characters, and you get a final monthly cost of $672.

Now let’s see what happens if we change the way that we think about this.  What if we quit customizing so much of the voice?  Then we would end up with something looking like this:

<speak><voice-transformation>This message is for Dan Toczala.  We are informing you of a service interruption tomorrow morning at 9 Marine Drive, Round Rock, Texas 78681.  Please call us at 1-800-123-4567 if you have questions.</voice-transformation></speak>\

So for the non-customized example, we have 225 non-whitespace characters. Multiply that by 100,000 notices in a month, and I get a rate of 22,500,000 characters a month. Apply the TTS cost of $0.02 per thousand characters, and you get a final monthly cost of $450. Customizing the voice could be looked at as a cheap way to have an impact on customer satisfaction (it’s only $222 a month), or a really expensive way to do this (it’s about 49% more expensive than the basic conversion). Remember, it all depends on how you want to look at things. I suggest focusing on your problem and the overall costs of your solution.
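The arithmetic behind that comparison is easy to script. This is just a sketch: the character counts, the 100,000 notices, and the $0.02 rate are the figures used in this post, and the function name is mine.

```python
PRICE_PER_THOUSAND_CHARS = 0.02  # USD, rate quoted in this post (assumption)
NOTICES_PER_MONTH = 100_000      # estimate used in this post


def monthly_cost(chars_per_message: int, messages: int = NOTICES_PER_MONTH) -> float:
    """Monthly cost in USD, rounded to the cent."""
    total_chars = chars_per_message * messages
    return round(total_chars / 1000 * PRICE_PER_THOUSAND_CHARS, 2)


basic = monthly_cost(225)  # no voice customization
full = monthly_cost(336)   # fully customized voice
extra = round(full - basic, 2)
pct = round((full - basic) / basic * 100, 1)

print(f"basic=${basic:.2f} full=${full:.2f} extra=${extra:.2f} (+{pct}%)")
# prints: basic=$450.00 full=$672.00 extra=$222.00 (+49.3%)
```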

Now let’s look at a final example. In this example, we’ll keep our customized voice, but we’ll stop converting the same text over and over again. What if our message was built in a way that minimized what needed to be converted each time? What if we converted the standard part of the message once, and converted only the personalized part for each customer? So we could do this for each customer:

<speak><voice-transformation type=\\\"Custom\\\" breathiness=\\\"35%\\\" pitch=\\\"-80%\\\" pitch_range=\\\"60%\\\" glottal_tension=\\\"-40%\\\" >This message is for Dan Toczala, who resides at 9 Marine Drive, Round Rock, Texas 78681</voice-transformation></speak>\

And then follow that with this “standard” section which we would only need to convert once (for a one time cost of fractions of a cent):

<speak><voice-transformation type=\\\"Custom\\\" breathiness=\\\"35%\\\" pitch=\\\"-80%\\\" pitch_range=\\\"60%\\\" glottal_tension=\\\"-40%\\\" >We are informing you of a service interruption tomorrow morning. Please call us at 1-800-123-4567 if you have any questions.</voice-transformation></speak>\

So for the modified script example, the per-customer section has 244 non-whitespace characters. Multiply that by 100,000 notices in a month, and I get a rate of 24,400,000 characters a month. Apply the TTS cost of $0.02 per thousand characters, and you get a final monthly cost of $488.
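To see how little the one-time “standard” section adds, here is a rough sketch of the split-message math. The 244 figure comes from this post. The 277 figure for the standard section is my own hand count, done the same way as the other payloads, so treat it as approximate.

```python
PRICE_PER_THOUSAND_CHARS = 0.02  # USD, rate quoted in this post (assumption)
NOTICES_PER_MONTH = 100_000      # estimate used in this post

PER_CUSTOMER_CHARS = 244  # personalized section, converted for every notice
STANDARD_CHARS = 277      # shared section, converted once (approximate hand count)


def cost_usd(char_count: int) -> float:
    """Cost of synthesizing char_count billable characters."""
    return char_count / 1000 * PRICE_PER_THOUSAND_CHARS


per_customer_monthly = round(cost_usd(PER_CUSTOMER_CHARS * NOTICES_PER_MONTH), 2)
standard_once = cost_usd(STANDARD_CHARS)  # paid a single time, not per notice
total_first_month = per_customer_monthly + standard_once

print(f"per-customer: ${per_customer_monthly:.2f} a month")
print(f"standard section, one time: ${standard_once:.5f}")
print(f"first month total: ${total_first_month:.2f}")
```

The one-time piece comes to well under a cent, which is why it does not show up in the monthly numbers.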

Final Conclusions

So let’s look at all of these options together:

Approach                  Characters / Msg.   Characters / Month   Monthly Cost   % Change
Basic                     225                 22,500,000           $450           0%
Full Customization        336                 33,600,000           $672           49.3%
Modified Customization    244                 24,400,000           $488           8.4%

Looking at things in this way helped us make a rational decision on what things really cost, and helped us look at ways we could maximize our impact and minimize our costs.

P.S. For those of you who were patient enough to read through this entire article, you can save yourself even more by removing the <speak> and </speak> tags. These are assumed by the Watson Text To Speech service, so you can omit them and save yourself 15 characters per message. For the purposes of this example, that would reduce the cost of each of the above approaches by $30 a month.
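For the curious, that $30 figure falls out of the same arithmetic. A quick sketch, again using the rate and message volume assumed throughout this post:

```python
PRICE_PER_THOUSAND_CHARS = 0.02  # USD, rate quoted in this post (assumption)
NOTICES_PER_MONTH = 100_000      # estimate used in this post

# Characters saved per message by leaving out the wrapper tags.
tag_chars = len("<speak>") + len("</speak>")

monthly_savings = round(tag_chars * NOTICES_PER_MONTH / 1000 * PRICE_PER_THOUSAND_CHARS, 2)

print(f"{tag_chars} characters per message, ${monthly_savings:.2f} a month")
# prints: 15 characters per message, $30.00 a month
```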
