Machine Learning services in AWS (part 2)

Updated: Aug 30

In the previous post we started an overview of Machine Learning and Artificial Intelligence services in AWS, including Amazon SageMaker and Amazon Rekognition. In this one we will take a look at Amazon Polly, Amazon Transcribe, Amazon Textract, Amazon Comprehend and Amazon Translate.


Amazon Polly

Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Polly's Text-to-Speech (TTS) service uses advanced deep learning technologies to synthesize natural-sounding human speech. Amazon Polly supports 31 languages and 9 different voices (the available voices vary by language).


In addition to Standard TTS voices, Amazon Polly offers Neural Text-to-Speech (NTTS) voices that deliver advanced improvements in speech quality through a new machine learning approach. Polly’s Neural TTS technology also supports a Newscaster speaking style that is tailored to news narration use cases.


Several output formats are available: MP3, OGG Vorbis and PCM audio at different sample rates (8000 Hz, 16000 Hz, 22050 Hz, 24000 Hz), as well as Speech Marks metadata.
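As a rough illustration of how the output formats and sample rates fit together, here is a minimal Python sketch that builds the arguments for Polly's `synthesize_speech` call in boto3. The helper names (`build_synthesis_request`, `synthesize_to_file`) and the default voice are assumptions made for this example; the sketch leaves out Speech Marks, and the second function needs boto3 and AWS credentials to run.

```python
# Sample rates Amazon Polly accepts per audio format (the API takes them as strings).
VALID_SAMPLE_RATES = {
    "mp3": {"8000", "16000", "22050", "24000"},
    "ogg_vorbis": {"8000", "16000", "22050", "24000"},
    "pcm": {"8000", "16000"},
}


def build_synthesis_request(text, voice_id="Joanna", output_format="mp3",
                            sample_rate="22050", engine="standard"):
    """Return keyword arguments for the Polly SynthesizeSpeech call.

    Hypothetical helper: it only validates the format / sample rate pair.
    """
    rates = VALID_SAMPLE_RATES.get(output_format)
    if rates is None:
        raise ValueError(f"unsupported output format: {output_format}")
    if sample_rate not in rates:
        raise ValueError(f"{output_format} does not support {sample_rate} Hz")
    return {
        "Text": text,
        "VoiceId": voice_id,
        "OutputFormat": output_format,
        "SampleRate": sample_rate,
        "Engine": engine,  # "standard" or "neural"
    }


def synthesize_to_file(text, path, **options):
    """Call Polly and write the returned audio stream to `path`."""
    import boto3  # imported lazily so the builder above works without boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_synthesis_request(text, **options))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

For example, `synthesize_to_file("Hello from Polly", "hello.mp3", engine="neural", sample_rate="24000")` would request a Neural voice at 24000 Hz, provided the chosen voice supports that engine.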


The Amazon Polly web console contains just a couple of small tabs where you can test the service.


You can also use Amazon Polly to generate speech from documents marked up with Speech Synthesis Markup Language (SSML). Using SSML-enhanced text gives you additional control over how Amazon Polly generates speech from the text you provide.


For example, you can include a long pause within your text (as in the example below), or change the speech rate or pitch.

<speak>
     Mary had a little lamb <break time="2s"/>Whose fleece was white as snow.
</speak>

Other options include:

  • using phonetic pronunciation

  • using the Newscaster speaking style

  • including breathing sounds

  • emphasizing specific words or phrases (example below)

<speak>
     I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>

  • whispering (example below)

<speak>
     When any voice is made to whisper, <amazon:effect name="whispered"><prosody rate="-10%">the sound is slower and quieter than normal speech</prosody></amazon:effect>
</speak>
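SSML documents like the ones above can also be assembled in code. The following is a small sketch, not an official helper library: the function names (`plain`, `pause`, `emphasize`, `whisper`, `speak`) are invented for this example, and only the tags shown in this post are covered.

```python
from xml.sax.saxutils import escape


def plain(text):
    """Escape &, < and > so ordinary text is safe inside SSML."""
    return escape(text)


def pause(seconds):
    """A pause lasting the given whole number of seconds."""
    return f'<break time="{int(seconds)}s"/>'


def emphasize(text, level="strong"):
    """Wrap text in an <emphasis> tag (strong, moderate or reduced)."""
    if level not in {"strong", "moderate", "reduced"}:
        raise ValueError(f"unknown emphasis level: {level}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def whisper(text):
    """Wrap text in Polly's whispered effect."""
    return f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'


def speak(*parts):
    """Join already-built SSML fragments into a complete document."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    plain("Mary had a little lamb "),
    pause(2),
    plain("Whose fleece was white as snow."),
)
```

To have Polly interpret the result as SSML rather than plain text, pass it as `Text` together with `TextType="ssml"` in the `synthesize_speech` call.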

You can also customize the pronunciation of specific words and phrases by uploading lexicon files in the PLS format.
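To give an idea of what such a file looks like, the sketch below generates a minimal PLS lexicon that maps written words to the phrase Polly should speak instead. The function name `build_lexicon` and the sample entry are assumptions for illustration; real lexicons can also carry phonetic pronunciations, which this sketch does not cover.

```python
from xml.sax.saxutils import escape

# Namespace defined by the W3C Pronunciation Lexicon Specification (PLS).
PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, language="en-US"):
    """Build a PLS document from a {written form: spoken form} mapping."""
    lexemes = "".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(word)}</grapheme>\n"
        f"    <alias>{escape(spoken)}</alias>\n"
        "  </lexeme>\n"
        for word, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{language}">\n'
        f"{lexemes}"
        "</lexicon>\n"
    )


lexicon_xml = build_lexicon({"W3C": "World Wide Web Consortium"})
```

With boto3, the document could then be uploaded with `put_lexicon(Name="acronyms", Content=lexicon_xml)` and applied to a request through the `LexiconNames` parameter of `synthesize_speech`; the lexicon name here is only an example.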


You can try Amazon Polly within the AWS Free Tier, which includes 5 million characters per month for speech or Speech Marks requests for the first 12 months, starting from your first request for speech.

After the first year, Amazon Polly’s Standard voices are priced at $4.00 per 1 million characters for speech or Speech Marks requests. Amazon Polly’s Neural voices are priced at $16.00 per 1 million characters for speech or Speech Marks requests.
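As a quick worked example of the paid rates, the sketch below estimates a monthly bill from a character count. It uses only the two prices quoted in this post and ignores the Free Tier; the function name `polly_monthly_cost` is made up for the example, and actual prices may differ by region or change over time.

```python
from decimal import Decimal

# Prices per 1 million characters, as quoted in this post.
PRICE_PER_MILLION = {
    "standard": Decimal("4.00"),
    "neural": Decimal("16.00"),
}


def polly_monthly_cost(characters, engine="standard"):
    """Estimated cost in USD for the given number of characters."""
    if characters < 0:
        raise ValueError("characters must not be negative")
    return PRICE_PER_MILLION[engine] * characters / Decimal(1_000_000)


standard_cost = polly_monthly_cost(2_500_000)            # 2.5M characters, Standard
neural_cost = polly_monthly_cost(250_000, "neural")      # 250k characters, Neural
```

Under these assumptions, 2.5 million characters with a Standard voice come to $10.00, and 250,000 characters with a Neural voice come to $4.00.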


Amazon Transcribe

Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text.

Amazon Transcribe’s features allow you to ingest audio input, produce easy-to-read transcripts, improve accuracy with language customization, and filter content to ensure customer privacy. Practical use cases for Amazon Transcribe include transcribing and analyzing customer-agent calls and creating closed captions for videos.

With Amazon Transcribe, you can add speech-to-text capabilities to any application.


Amazon Transcribe allows you to perform real-time transcription, submit transcription jobs, and train custom language models for audio that is specific to your use case. The transcription accuracy of a custom language model can be better than that of the general model. You can also create a custom vocabulary, a collection of words or phrases (usually domain-specific) that improves the transcription accuracy of special terms.

You can create a vocabulary filter from a text file containing a list of words that are profane, offensive, or otherwise undesirable to show to the readers of your transcripts. You can use this filter to mask or remove words from the results of a transcription job, and to mask, remove, or tag words in your real-time streams.


Two specialized sub-services, Call Analytics and Transcribe Medical, may be useful for specific companies.

Amazon Transcribe supports 12 languages, including English, Chinese, French, German, Italian, Spanish, Japanese, and Korean. It can also identify or redact one or more types of personally identifiable information (PII) in your transcript.
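The job options described above (custom vocabulary, vocabulary filter and PII redaction) map onto parameters of the `start_transcription_job` call in boto3. Here is a minimal sketch of a request builder; the function name, the job name and the S3 URI in the usage note are placeholders invented for this example, and PII redaction is only available for some languages.

```python
def build_transcription_job(job_name, media_uri, language_code="en-US",
                            vocabulary=None, vocabulary_filter=None,
                            filter_method="mask", redact_pii=False):
    """Return keyword arguments for the Transcribe StartTranscriptionJob call.

    Hypothetical helper: it only assembles the request dictionary.
    """
    request = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "LanguageCode": language_code,
    }
    settings = {}
    if vocabulary:
        # Name of a custom vocabulary created beforehand.
        settings["VocabularyName"] = vocabulary
    if vocabulary_filter:
        if filter_method not in {"mask", "remove", "tag"}:
            raise ValueError(f"unknown filter method: {filter_method}")
        settings["VocabularyFilterName"] = vocabulary_filter
        settings["VocabularyFilterMethod"] = filter_method
    if settings:
        request["Settings"] = settings
    if redact_pii:
        # Ask for a transcript in which PII is replaced by placeholders.
        request["ContentRedaction"] = {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        }
    return request
```

A job could then be started with `boto3.client("transcribe").start_transcription_job(**build_transcription_job("demo-job", "s3://my-bucket/call.mp3", vocabulary_filter="profanity", redact_pii=True))`, assuming the bucket, audio file and filter already exist.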


With Amazon Transcribe, you pay as you go based on the seconds of audio transcribed per month. It’s easy to get started with the Amazon Transcribe Free Tier: after signing up, you can transcribe up to 60 minutes of audio per month, free for the first 12 months. After 12 months, pricing depends on the type of functionality that you use, the volume of data and the AWS region. For example, standard batch transcription costs $0.024 per minute for the first 250,000 minutes in N. Virginia.


Amazon Textract

Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.



Amazon Textract currently supports PNG, JPEG, TIFF, and PDF formats.

It can detect raw text and extract tables, and it also works well with receipts and invoices.
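Textract returns its results as a list of "blocks" (pages, lines, words, tables and so on). The sketch below shows one way to pull the detected text lines out of such a response; the helper names and the confidence threshold are assumptions for this example, and the last function needs boto3 and AWS credentials.

```python
def extract_lines(response, min_confidence=0.0):
    """Return the text of every LINE block at or above a confidence (0-100)."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def count_tables(response):
    """Count TABLE blocks (present when table analysis was requested)."""
    return sum(
        1 for block in response.get("Blocks", [])
        if block.get("BlockType") == "TABLE"
    )


def detect_text(image_bytes):
    """Send an image to Textract and return the raw response."""
    import boto3  # imported lazily so the helpers above work without boto3

    textract = boto3.client("textract")
    return textract.detect_document_text(Document={"Bytes": image_bytes})
```

For tables and form fields, the `analyze_document` call with `FeatureTypes=["TABLES", "FORMS"]` returns additional block types in the same overall structure.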


You can get started for free with the AWS Free Tier. For the first three months after account sign-up, new customers can analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API. After the first three months, the Detect Document Text API costs $0.0015 per page for the first 1 million pages and $0.0006 per page beyond 1 million pages.
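The tiered price can be worked through with a short calculation. This sketch assumes the tiers are counted per month and uses only the two per-page prices quoted in this post; the function name `detect_text_cost` is invented for the example.

```python
from decimal import Decimal

TIER_LIMIT_PAGES = 1_000_000
FIRST_TIER_PRICE = Decimal("0.0015")  # USD per page, first 1 million pages
NEXT_TIER_PRICE = Decimal("0.0006")   # USD per page, beyond 1 million pages


def detect_text_cost(pages):
    """Estimated cost in USD of the Detect Document Text API for `pages` pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    first_tier_pages = min(pages, TIER_LIMIT_PAGES)
    next_tier_pages = max(pages - TIER_LIMIT_PAGES, 0)
    return first_tier_pages * FIRST_TIER_PRICE + next_tier_pages * NEXT_TIER_PRICE


small_batch = detect_text_cost(10_000)      # stays inside the first tier
large_batch = detect_text_cost(1_500_000)   # spans both tiers
```

Under these assumptions, 10,000 pages cost $15, while 1.5 million pages cost $1,500 for the first million plus $300 for the remaining 500,000, or $1,800 in total.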


Amazon Comprehend

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend processes any text file in UTF-8 format, and semi-structured documents, like PDF and Word documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.


Amazon Comprehend lets you perform real-time analysis, submit analysis jobs, create custom classifiers, and use the service in the medical field.


Some of the insights that Amazon Comprehend develops about a document include:

  • Entities – Amazon Comprehend returns a list of entities, such as people, places, and locations, identified in a document.


  • Key phrases – Amazon Comprehend extracts key phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score.


  • Language – Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can identify 100 languages.


  • PII – Amazon Comprehend analyzes documents to detect personal data that could be used to identify an individual, such as an address, bank account number, or phone number.


  • Sentiment – Amazon Comprehend determines the emotional sentiment of a document. Sentiment can be positive, neutral, negative, or mixed.


  • Syntax – Amazon Comprehend parses each word in your document and determines the part of speech for the word. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun.
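Several of the insights listed above are available as separate calls in boto3. The sketch below combines language detection, sentiment and PII detection, and shows how the offsets in a PII response can be used to mask the text. The helper names are assumptions for this example; the `analyze` function needs boto3 and AWS credentials, and it will fail if the detected language is not supported by the sentiment or PII operations.

```python
def dominant_language(response):
    """Pick the highest-scoring language code from a DetectDominantLanguage response."""
    best = max(response["Languages"], key=lambda language: language["Score"])
    return best["LanguageCode"]


def redact_pii(text, response, mask="*"):
    """Replace every span reported by DetectPiiEntities with mask characters."""
    redacted = text
    for entity in response["Entities"]:
        start, end = entity["BeginOffset"], entity["EndOffset"]
        # The replacement has the same length, so later offsets stay valid.
        redacted = redacted[:start] + mask * (end - start) + redacted[end:]
    return redacted


def analyze(text):
    """Run three Comprehend operations on `text` and summarize the results."""
    import boto3  # imported lazily so the helpers above work without boto3

    comprehend = boto3.client("comprehend")
    language = dominant_language(comprehend.detect_dominant_language(Text=text))
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode=language)
    pii = comprehend.detect_pii_entities(Text=text, LanguageCode=language)
    return {
        "language": language,
        "sentiment": sentiment["Sentiment"],
        "redacted_text": redact_pii(text, pii),
    }
```

Entities, key phrases and syntax follow the same pattern through `detect_entities`, `detect_key_phrases` and `detect_syntax`.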


Amazon Comprehend pricing depends on features that are used and data volume.


Amazon Translate

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms.

With Amazon Translate, you can localize content such as websites and applications for your diverse users, easily translate large volumes of text for analysis, and efficiently enable cross-lingual communication between users.
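A translation request is a single call in boto3. Here is a minimal sketch; the function names are invented for this example, the `translate` function needs boto3 and AWS credentials unless a client is passed in, and the source language `"auto"` asks the service to detect the language itself.

```python
def build_translate_request(text, target_language, source_language="auto"):
    """Return keyword arguments for the Translate TranslateText call."""
    return {
        "Text": text,
        "SourceLanguageCode": source_language,
        "TargetLanguageCode": target_language,
    }


def translate(text, target_language, source_language="auto", client=None):
    """Translate `text` and return only the translated string."""
    if client is None:
        import boto3  # imported lazily so the builder above works without boto3

        client = boto3.client("translate")
    request = build_translate_request(text, target_language, source_language)
    response = client.translate_text(**request)
    return response["TranslatedText"]
```

For example, `translate("Hello, world", "es")` would return the Spanish translation of the text, with the source language detected automatically.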