
Machine Learning services in AWS (part 2)

In the previous post we started an overview of Machine Learning and Artificial Intelligence services in AWS, including Amazon SageMaker and Amazon Rekognition. In this one we will take a look at Amazon Polly, Amazon Translate, Amazon Transcribe, Amazon Comprehend and Amazon Textract.


Amazon Polly

Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk and build entirely new categories of speech-enabled products. Polly's Text-to-Speech (TTS) service uses advanced deep learning technologies to synthesize natural-sounding human speech. Amazon Polly supports 31 languages and 9 different voices (the number of voices varies by language).


In addition to Standard TTS voices, Amazon Polly offers Neural Text-to-Speech (NTTS) voices that deliver advanced improvements in speech quality through a new machine learning approach. Polly’s Neural TTS technology also supports a Newscaster speaking style that is tailored to news narration use cases.


Several output formats are available, such as MP3, OGG, PCM and Speech Marks, at different sample rates (8000 Hz, 16000 Hz, 22050 Hz, 24000 Hz).
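
To make the format and sample rate options more concrete, here is a minimal sketch of how a request could be prepared with boto3 (the AWS SDK for Python). The helper names `build_polly_request` and `synthesize_to_file` are our own, the voice "Joanna" is just an example, and the tighter sample rate limit for PCM is an assumption taken from the Polly API reference rather than from this post. Speech Marks are left out to keep the sketch short.

```python
# Sample rates per output format. MP3 and OGG follow the list in the text;
# the narrower PCM set is an assumption based on the Polly API reference.
VALID_SAMPLE_RATES = {
    "mp3": {"8000", "16000", "22050", "24000"},
    "ogg_vorbis": {"8000", "16000", "22050", "24000"},
    "pcm": {"8000", "16000"},
}


def build_polly_request(text, voice_id="Joanna", output_format="mp3",
                        sample_rate="22050"):
    """Return keyword arguments for boto3's polly.synthesize_speech()."""
    if output_format not in VALID_SAMPLE_RATES:
        raise ValueError(f"unsupported output format: {output_format}")
    if sample_rate not in VALID_SAMPLE_RATES[output_format]:
        raise ValueError(f"{sample_rate} Hz is not valid for {output_format}")
    return {
        "Text": text,
        "VoiceId": voice_id,
        "OutputFormat": output_format,
        "SampleRate": sample_rate,  # the API expects a string, not an int
    }


def synthesize_to_file(path, **request):
    """Call Amazon Polly and save the audio stream (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**request)
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

With credentials configured, something like `synthesize_to_file("hello.mp3", **build_polly_request("Hello from Polly"))` should produce a playable file.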


The Amazon Polly web console contains just a couple of small tabs where you can test the service.


You can also use Amazon Polly to generate speech from documents marked up with Speech Synthesis Markup Language (SSML). Using SSML-enhanced text gives you additional control over how Amazon Polly generates speech from the text you provide.


For example, you can include a long pause within your text, or change the speech rate or pitch (example below).

<speak>
     Mary had a little lamb <break time="2s"/>Whose fleece was white as snow.
</speak>

Other options include:

  • using phonetic pronunciation

  • using the Newscaster speaking style

  • including breathing sounds

  • emphasizing specific words or phrases (example below)

<speak>
     I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>

  • whispering (example below)

<speak>
     When any voice is made to whisper,
     <amazon:effect name="whispered">
          <prosody rate="-10%">the sound is slower and quieter than normal speech</prosody>
     </amazon:effect>
</speak>
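
If you generate SSML from user-supplied text, it helps to build it programmatically so that characters like `&` and `<` are escaped. The sketch below is one possible way to do that in Python. The helper names (`pause`, `emphasize`, `whisper`, `speak`) are our own and not part of any AWS library.

```python
from xml.sax.saxutils import escape


def pause(seconds):
    """Return an SSML <break> tag for a pause of the given length."""
    return f'<break time="{seconds}s"/>'


def emphasize(text, level="strong"):
    """Wrap text in an SSML <emphasis> tag, escaping special characters."""
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def whisper(text):
    """Wrap text in Amazon's whispered effect, escaping special characters."""
    return f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'


def speak(*parts):
    """Join already-built SSML fragments into a <speak> document."""
    return "<speak>" + "".join(parts) + "</speak>"


# Rebuild the first example from this post.
ssml = speak(
    escape("Mary had a little lamb "),
    pause(2),
    escape("Whose fleece was white as snow."),
)
print(ssml)
```

When sending such a string to Polly through boto3, the `synthesize_speech` call needs `TextType="ssml"` so that the tags are interpreted rather than read aloud.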

You can also customize the pronunciation of specific words and phrases by uploading lexicon files in the PLS format.
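
To give an idea of what a lexicon looks like, the sketch below generates a small PLS document in Python. It only uses the `<alias>` form (replace a written form with a spoken phrase), and the function name `build_lexicon` and the sample entry are our own illustration. Treat it as a starting point and check the PLS specification and the Polly documentation for the full set of supported elements.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Build a PLS document that maps each written form to a spoken alias."""
    lexemes = "".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(written)}</grapheme>\n"
        f"    <alias>{escape(spoken)}</alias>\n"
        "  </lexeme>\n"
        for written, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}"
        "</lexicon>\n"
    )


lexicon = build_lexicon({"W3C": "World Wide Web Consortium"})
# Parsing the result is a quick check that the XML is well formed.
root = ET.fromstring(lexicon.encode("utf-8"))
print(lexicon)
```

The resulting string can then be uploaded, for example with boto3's `put_lexicon(Name=..., Content=lexicon)`, and referenced by name in later speech requests.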


You can try Amazon Polly within the Free Tier, which includes 5 million characters per month for speech or Speech Marks requests for the first 12 months, starting from your first request for speech.

After the first year, Amazon Polly’s Standard voices are priced at $4.00 per 1 million characters for speech or Speech Marks requests. Neural voices are priced at $16.00 per 1 million characters for speech or Speech Marks requests.
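
For a quick feel of what these prices mean in practice, here is a small estimator. It uses only the figures quoted above and, as a simplification of ours, applies the 5 million character allowance to whichever voice type you pass in. Always check the current AWS pricing page before budgeting.

```python
from decimal import Decimal

# Figures quoted in this post; they may be out of date.
FREE_TIER_CHARS = 5_000_000
PRICE_PER_MILLION = {"standard": Decimal("4.00"), "neural": Decimal("16.00")}


def polly_monthly_cost(characters, voice_type="standard", free_tier=False):
    """Estimate one month's Polly bill in USD for the given character count."""
    billable = characters
    if free_tier:
        # During the first 12 months only characters above the allowance count.
        billable = max(0, characters - FREE_TIER_CHARS)
    return PRICE_PER_MILLION[voice_type] * billable / Decimal(1_000_000)


print(polly_monthly_cost(2_000_000))                  # standard voices
print(polly_monthly_cost(2_000_000, "neural"))        # neural voices
print(polly_monthly_cost(6_000_000, free_tier=True))  # within the free tier
```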


Amazon Transcribe

Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text.

Amazon Transcribe’s features allow you to ingest audio input, produce easy-to-read transcripts, improve accuracy with language customization, and filter content to ensure customer privacy. Practical use cases for Amazon Transcribe include transcribing and analyzing customer-agent calls and creating closed captions for videos.

With Amazon Transcribe, you can add speech-to-text capabilities to any application.


Amazon Transcribe allows you to perform real-time transcription, submit transcription jobs, and train custom language models for audio that is specific to your use case. The transcription accuracy of a custom language model can be better than that of the general model. You can also create a custom vocabulary, a collection of (usually domain-specific) words or phrases that improves the transcription accuracy of special terms.

You can create a vocabulary filter from a text file containing a list of words that are profane, offensive, or otherwise undesirable to show to the readers of your transcripts. In a transcription job, you can use this filter to mask or remove those words from the results. In real-time streams, you can mask, remove, or tag them.
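
As an illustration of how a custom vocabulary and a vocabulary filter plug into a transcription job, here is a minimal sketch that prepares the arguments for boto3's `start_transcription_job`. The helper name `build_transcription_job` is our own, and the job name, S3 URI, and filter name in the usage note are hypothetical placeholders.

```python
def build_transcription_job(job_name, media_uri, language_code="en-US",
                            vocabulary_name=None, filter_name=None,
                            filter_method="mask"):
    """Return keyword arguments for transcribe.start_transcription_job()."""
    # Only the two methods the text mentions for transcription jobs.
    if filter_method not in {"mask", "remove"}:
        raise ValueError(f"unknown filter method: {filter_method}")

    settings = {}
    if vocabulary_name:
        settings["VocabularyName"] = vocabulary_name
    if filter_name:
        settings["VocabularyFilterName"] = filter_name
        settings["VocabularyFilterMethod"] = filter_method

    request = {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
    }
    if settings:  # omit the Settings key entirely when nothing is customized
        request["Settings"] = settings
    return request
```

With credentials configured, the result could be passed on as `boto3.client("transcribe").start_transcription_job(**build_transcription_job("demo-job", "s3://example-bucket/call.mp3", filter_name="profanity"))`.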


There are also two sub-services, Call Analytics and Transcribe Medical, that may be useful for specific companies.

Amazon Transcribe supports 12 languages, e.g. English, Chinese, French, German, Italian, Spanish, Japanese and Korean. It can identify or redact one or more types of personally identifiable information (PII) in your transcript.


With Amazon Transcribe, you pay as you go based on the seconds of audio transcribed per month. It’s easy to get started with the Amazon Transcribe Free Tier: upon signup, you can analyze up to 60 minutes of audio per month, free for the first 12 months. After 12 months, pricing depends on the type of functionality that you use, the volume of data and the AWS Region. For example, standard batch transcription costs $0.02400 per minute for the first 250,000 minutes in N. Virginia.
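
Because billing is based on seconds while the price is quoted per minute, a tiny calculation helps. The sketch below uses only the N. Virginia figure quoted above, so it is valid only within the first 250,000 minutes, and the function name is our own.

```python
from decimal import Decimal

# Figures quoted in this post; they may be out of date.
PRICE_PER_MINUTE = Decimal("0.024")  # standard batch, N. Virginia, first tier
FREE_MINUTES = 60                    # monthly allowance, first 12 months


def transcribe_monthly_cost(audio_seconds, free_tier=False):
    """Estimate one month's batch transcription bill in USD."""
    minutes = Decimal(audio_seconds) / 60
    if free_tier:
        minutes = max(Decimal(0), minutes - FREE_MINUTES)
    return minutes * PRICE_PER_MINUTE


print(transcribe_monthly_cost(3600))                   # one hour of audio
print(transcribe_monthly_cost(3600, free_tier=True))   # covered by the free tier
```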


Amazon Textract

Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.


How Textract works:


Amazon Textract currently supports PNG, JPEG, TIFF, and PDF formats.

It can detect raw text:
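
A minimal sketch of pulling the detected lines out of a response: Textract returns a list of blocks, and the `LINE` blocks carry the text of each line. The helper name `extract_lines` is our own, and `sample_response` is a hand-written stand-in with made-up values, not real service output.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a real response; the values are made up.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #123"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "WORD", "Text": "#123"},
        {"BlockType": "LINE", "Text": "Total: $42.00"},
    ]
}
print(extract_lines(sample_response))
```

A real response can be obtained with boto3, for example `boto3.client("textract").detect_document_text(Document={"Bytes": image_bytes})`, where `image_bytes` is the content of a PNG or JPEG file.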


Or even tables:


It also works well with receipts and invoices:


You can get started for free with the AWS Free Tier. For the first three months after account sign-up, new customers can analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API. After those three months, the Detect Document Text API costs $0.0015 per page for the first 1 million pages and $0.0006 per page beyond 1 million pages.
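
The two-tier price is easy to get wrong by applying the lower rate to all pages, so here is a short sketch of the calculation using the figures above. The function name is our own and the free tier is ignored.

```python
from decimal import Decimal

# Figures quoted in this post; they may be out of date.
TIER_LIMIT = 1_000_000
PRICE_FIRST_TIER = Decimal("0.0015")  # per page, up to 1 million pages
PRICE_ABOVE_TIER = Decimal("0.0006")  # per page, beyond 1 million pages


def detect_text_cost(pages):
    """Estimate the Detect Document Text bill in USD for a page count."""
    first = min(pages, TIER_LIMIT)      # pages billed at the higher rate
    rest = max(0, pages - TIER_LIMIT)   # pages billed at the lower rate
    return first * PRICE_FIRST_TIER + rest * PRICE_ABOVE_TIER


print(detect_text_cost(1_000))
print(detect_text_cost(1_500_000))
```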


Amazon Comprehend

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend processes any text file in UTF-8 format, and semi-structured documents, like PDF and Word documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.
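
To show what working with these insights can look like, the sketch below filters an entity detection response by confidence score. The helper name `confident_entities`, the 0.9 threshold, and the hand-written `sample_response` are our own illustration, not real service output.

```python
def confident_entities(response, min_score=0.9):
    """Return (text, type) pairs for entities at or above min_score."""
    return [
        (entity["Text"], entity["Type"])
        for entity in response.get("Entities", [])
        if entity["Score"] >= min_score
    ]


# Hand-written stand-in for a real response; the values are made up.
sample_response = {
    "Entities": [
        {"Text": "Amazon", "Type": "ORGANIZATION", "Score": 0.99},
        {"Text": "Seattle", "Type": "LOCATION", "Score": 0.97},
        {"Text": "soon", "Type": "DATE", "Score": 0.41},
    ]
}
print(confident_entities(sample_response))
```

A real response can be obtained with boto3, for example `boto3.client("comprehend").detect_entities(Text=text, LanguageCode="en")`.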


Amazon Comprehend lets you perform real-time analysis, submit analysis jobs, create custom classifications and use the service in the medical field.


Some of the insights that Amazon Comprehend develops about a document include:

  • Entities – Amazon Compr