Speech-to-Text Conversion

You can test API methods on the interactive API browser page without writing code.
Some of the resources described on the page may not be available by default due to the type of application (see Creating and authorizing applications).

There are two types of speech-to-text conversion: 

  1. Voice navigation rule (voice_helper) in the voice menu. 

  2. Sending recognized text by events during a conversation. 

Speech-to-text conversion operates only when the speech_to_text function is activated in the client configuration (only a platform administrator can configure this setting).

Voice Navigation Rule in Voice Menu 

Before voice navigation rules can be defined, a platform administrator must enable the voice_helper rule for the voice menu. Until it is enabled, the rule cannot be created; if the functionality is disabled, a voice_helper rule in the voice menu settings is ignored. The API application permissions must be all (they are also granted by the platform administrator). The rule is configured in the context options in the same way as other voice menu rules, and the number of rules is not limited. During recognition, both the final recognition result and guesses are available (the guesses are available if recognition is interrupted by dialing or by a timeout).

Description of Data Structures 

Name 

Type 

Description 

action 

string 

voice_helper: sets the voice navigation rule.
The main options for this action: sound, voice_helper_digits_max, voice_helper_rules, voice_helper_timeout, voice_helper_classic_term, voice_helper_final_count

sound 

integer 

The identifier of the sound file to be played. It can be obtained using the resource: GET /client/{client_id}/sound/
Recognition of the caller's speech begins while the file is playing and ends when the caller presses the terminator key #, dials the required number of extension digits (voice_helper_digits_max), or the timeout expires (voice_helper_timeout)

voice_helper_digits_max 

integer 

Number of dialed extension digits after which speech recognition ends. If the final recognition result is not yet available when the required number of digits has been dialed, the guesses are used

voice_helper_rules 

string 

List of forwarding rules as an array of objects of the form {"to_option": option_number, "transcription": "recognized word"}
The context option referenced by to_option must already exist; otherwise it is skipped during the call. Its existence is not checked when the rule is created.
The option is created with the request POST /extension/{extension_id}/ivr/context/{context_id}/options/
The transcription can have the form "hi|hello": words are separated by "|", spaces are stripped or must be absent, and character case does not matter. It is not necessary to specify the entire word: if the word or word part given in the transcription is contained in the recognized word, it is considered a match

voice_helper_timeout  

integer 

The time, in milliseconds, after which speech recognition ends. If the final result is not available within this time, the guesses are used. The minimum value is 3000

voice_helper_classic_term 

boolean 

Enables or disables classic extension dialing from the keypad (it is disabled by default, in which case dialing is used only to interrupt recognition; see the option voice_helper_digits_max).
The logic is similar to extension dialing in the IVR rule Play sound, but before the call moves to the option or to the extension number, voice_helper_rules are checked; if there is a match, it is used for the transition instead of the dialed number (speech has priority over dialing, although dialing ends speech recognition)
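
The matching behaviour described for voice_helper_rules can be sketched in Python as follows. This is only an illustration of the documented rules (alternatives separated by "|", spaces ignored, case ignored, partial-word match). The function name match_option and the first-match-wins order are assumptions, not the platform's actual implementation.

```python
from typing import Optional


def match_option(rules: list[dict], recognized: str) -> Optional[int]:
    """Return to_option of the first rule whose transcription matches, else None."""
    words = recognized.lower().split()
    for rule in rules:
        # Alternatives are separated by "|"; spaces are dropped and case is ignored.
        patterns = [p.replace(" ", "").lower() for p in rule["transcription"].split("|")]
        for pattern in patterns:
            # A pattern matches when it is contained in a recognized word.
            if pattern and any(pattern in word for word in words):
                return rule["to_option"]
    return None


rules = [
    {"to_option": 1, "transcription": "hi|hello|whatsup"},
    {"to_option": 2, "transcription": "bye|goodbye"},
]
print(match_option(rules, "Hello there"))  # matches "hello"
print(match_option(rules, "Goodbyes"))     # matches the word part "bye"
print(match_option(rules, "operator"))     # no rule matches
```

Because a word part is enough for a match, short transcriptions such as "hi" would also match longer words that contain them (for example "this"), so longer and more specific transcriptions are likely to be safer.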

Creation of Voice Menu Rule

For example, a call arrives at the extension number with extension_id 204, in context 1, on the "start" option. We will assume that the context is empty and has no rules in it yet. We add a rule by sending a request.

System response (other parameters are irrelevant and are excluded from the example): 

{
  "voice_helper_sound": 52,
  "voice_helper_timeout": 7000,
  "id": 39,
  "voice_helper_digits_max": "2",
  "voice_helper_rules": [
    {
      "transcription": "hi|hello|whatsup",
      "to_option": 1
    },
    {
      "transcription": "bye|goodbye",
      "to_option": 2
    }
  ],
  "final": true,
  "action": "voice_helper"
}

The response contains the identifier of the created rule: "id": 39. The request creates a voice navigation rule according to which the default melody is played.

You must create the options referenced by to_option yourself (they are not created automatically); the call moves to them when the caller says a word from the transcription. The transcription value can contain either a whole word or a part of a word.
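
As a hedged illustration of the creation step, the sketch below builds (but does not send) such a request in Python. The host in BASE_URL is hypothetical, the POST path is an assumption modelled on the rule update path of this API (PUT .../options/{option_digits}/rules/{rule_id}) without the trailing rule identifier, and authorization is omitted entirely.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host, replace with your platform URL


def build_create_rule_request(extension_id, context_id, option, rule):
    # Assumed path: the rule collection of the given context option.
    url = (
        f"{BASE_URL}/api/ver1.0/extension/{extension_id}"
        f"/ivr/context/{context_id}/options/{option}/rules/"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(rule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


rule = {
    "action": "voice_helper",
    "voice_helper_timeout": 7000,
    "voice_helper_digits_max": 2,
    "voice_helper_rules": [
        {"to_option": 1, "transcription": "hi|hello|whatsup"},
        {"to_option": 2, "transcription": "bye|goodbye"},
    ],
}
request = build_create_rule_request(204, 1, "start", rule)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; the authorization headers your application needs would have to be added first.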

Creation of Sound Greeting 

If you need a specific audio greeting that tells the caller which word is expected, prepare a sound file in advance. The resource "Sound Files" lets you upload a file and find out its identifier. Then update the voice helper rule parameters by specifying the identifier of the required sound file.

Updating Voice Helper Rule Settings 

You can update any rule parameter using the method
PUT /api/ver1.0/extension/{extension_id}/ivr/context/{context_id}/options/{option_digits}/rules/{rule_id}

For example, to set a new greeting with the sound file identifier SOUND_ID, send the above request with the body
{"sound": SOUND_ID} 

Any parameter of the voice_helper rule can be updated in the same way. For example, to replace voice_helper_rules, send the request
PUT /api/ver1.0/extension/{extension_id}/ivr/context/{context_id}/options/{option_digits}/rules/{rule_id} with the body

{
  "voice_helper_rules": [
    {
      "to_option": 1,
      "transcription": "food|meal"
    },
    {
      "to_option": 2,
      "transcription": "comics|comic|mix"
    }
  ]
}

It is not recommended to set many conditions in one rule voice_helper_rules (preferably no more than 500). 
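
Before sending an update, the documented constraints can be checked on the client side. The Python sketch below is one possible way to do that: it checks the minimum voice_helper_timeout of 3000 ms, the recommended ceiling of 500 conditions, and spaces in transcriptions. The function name check_voice_helper is hypothetical, and the checks are a convenience only, not part of the API.

```python
MIN_TIMEOUT_MS = 3000        # documented minimum for voice_helper_timeout
MAX_RECOMMENDED_RULES = 500  # documented recommendation, not a hard limit


def check_voice_helper(params: dict) -> list[str]:
    """Return a list of human-readable problems; empty when none were found."""
    problems = []
    timeout = params.get("voice_helper_timeout")
    if timeout is not None and timeout < MIN_TIMEOUT_MS:
        problems.append(f"voice_helper_timeout {timeout} is below {MIN_TIMEOUT_MS} ms")
    rules = params.get("voice_helper_rules", [])
    if len(rules) > MAX_RECOMMENDED_RULES:
        problems.append(f"{len(rules)} conditions exceed the recommended {MAX_RECOMMENDED_RULES}")
    for index, rule in enumerate(rules):
        # Spaces are stripped by the platform or must be absent, so flag them early.
        if " " in rule["transcription"]:
            problems.append(f"rule {index}: transcription contains spaces")
    return problems


update = {
    "voice_helper_timeout": 1000,
    "voice_helper_rules": [{"to_option": 1, "transcription": "food | meal"}],
}
for problem in check_voice_helper(update):
    print(problem)
```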

The to_option value refers to a context option (start, invalid, timeout, 1, 2, 3, 4, etc.). For example, the voice_helper rule is created in the option start, and the options 1-10 (or, for example, 4-40) are used for voice navigation.

The options are created by the request
POST /api/ver1.0/extension/{extension_id}/ivr/context/{context_id}/options/ with its body

{"digits": "string"}

where string is the name of the context option (start, invalid, timeout, 1, 2, 3, 4, etc.).
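
Since the target options are not created automatically, it can help to derive the option creation bodies from the rule list itself. The Python sketch below does this under the assumption that to_option values map directly to the "digits" string; the helper name option_bodies is hypothetical.

```python
import json


def option_bodies(voice_helper_rules: list[dict]) -> list[str]:
    """Return one JSON request body per distinct to_option, in first-seen order."""
    seen = []
    for rule in voice_helper_rules:
        digits = str(rule["to_option"])
        if digits not in seen:
            seen.append(digits)
    return [json.dumps({"digits": digits}) for digits in seen]


rules = [
    {"to_option": 1, "transcription": "food|meal"},
    {"to_option": 2, "transcription": "comics|comic|mix"},
    {"to_option": 1, "transcription": "lunch"},
]
for body in option_bodies(rules):
    # Each body would be sent with the option creation POST request.
    print(body)
```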

Getting Recognized Data by Remote Server 

It is possible to get recognition data on your remote server. 

The "Call Interactive" function, used as an action of a voice menu (IVR) context option, sends an HTTP request to the specified URL and processes the response. The request always carries a fixed set of parameters with information about the call in the IVR. To control the actions after recognition, the "Call Interactive" request can also carry the following optional parameters:

  • voice_navigator_DTMF: the extension digits dialed from the telephone terminal during the voice menu rule action='voice_helper';

  • voice_navigator_STT: the speech recognized during the voice menu rule action='voice_helper'.

For example, a "Call Interactive" rule with the POST request http://mysite.com/myscript?check_number can return the desired greeting as TTS with the additional options play_now="false" and save_to_var="true". In this case, a voice_helper rule with the option play_sound_from_variable specified ignores the greeting set in the rule itself.

The call enters the starting context (start) where, in addition to the standard context options (start, timeout, invalid), custom options are configured, for example: "1" for 'call_interactive' and "2" for 'voice_helper'. The system waits for the caller to say something or to dial an option (this is defined in the rule voice_helper). For example, if the caller says "operator", the call goes to option "0". If the caller says any of the specified words (for example: "know, date, ready, readiness, shipped, shipment, goods, invoice") or dials 1, the call goes to option "1". There the rule "Call Interactive" sends a POST request to http://mysite.com/myscript?check_stt_res, and the server receives the data:
voice_navigator_STT=%D1%85%D0%BE%D1%87%D1%83+%D1%83%D0%B7%D0%BD%D0%B0%D1%82%D1%8C+%D0%B4%D0%B0%D1%82%D1%83+%D0%B3%D0%BE%D1%82%D0%BE%D0%B2%D0%BD%D0%BE%D1%81%D1%82%D0%B8+%D0%BA+%D0%BE%D1%82%D0%B3%D1%80%D1%83%D0%B7%D0%BA%D0%B5+%D1%82%D0%BE%D0%B2%D0%B0%D1%80%D0%B0
After URL decoding (translated into English): voice_navigator_STT=I want to know the date when the goods are ready to be shipped
or, if the caller dialed the option: voice_navigator_DTMF=1
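
On the receiving server, these parameters could be decoded roughly as in the Python sketch below. It assumes the data arrives as a form-encoded string (the exact transport, body or query string, is an assumption), and the function name recognized_input is hypothetical. Speech is checked before dialed digits, mirroring the priority of speech over dialing in the voice_helper rule.

```python
from urllib.parse import parse_qs

SAMPLE = (
    "voice_navigator_STT=%D1%85%D0%BE%D1%87%D1%83+%D1%83%D0%B7%D0%BD%D0%B0%D1%82%D1%8C"
    "+%D0%B4%D0%B0%D1%82%D1%83+%D0%B3%D0%BE%D1%82%D0%BE%D0%B2%D0%BD%D0%BE%D1%81%D1%82%D0%B8"
    "+%D0%BA+%D0%BE%D1%82%D0%B3%D1%80%D1%83%D0%B7%D0%BA%D0%B5"
    "+%D1%82%D0%BE%D0%B2%D0%B0%D1%80%D0%B0"
)


def recognized_input(form_data: str) -> tuple[str, str]:
    """Return ("speech", text), ("dtmf", digits) or ("none", "")."""
    # parse_qs turns "+" into spaces and decodes %XX sequences as UTF-8.
    fields = parse_qs(form_data)
    if "voice_navigator_STT" in fields:
        return "speech", fields["voice_navigator_STT"][0]
    if "voice_navigator_DTMF" in fields:
        return "dtmf", fields["voice_navigator_DTMF"][0]
    return "none", ""


print(recognized_input(SAMPLE))
print(recognized_input("voice_navigator_DTMF=1"))
```

The first call yields the original recognized phrase in Russian, the second yields the dialed digit as a string.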

If no option in the rule voice_helper was activated (neither "0" nor "1" was chosen), the default greeting in the start context is played, prompting the caller to connect to the operator ("Say "operator" or press "0""). In this case, you can add a "Call Interactive" rule with the request POST http://mysite.com/myscript?no_option_voice_helper, where voice_navigator_STT, if it is present at all, contains a value other than those leading to "0" and "1" (for example, if the caller asks: "Where did I get to?"). After that, you can define further actions either by controlling them from "Call Interactive" or with static rules in the IVR.

Events with Recognized Text

To receive events with the final recognized text during a conversation, use the following scheme. When an event occurs on an extension number (dial-in for incoming calls in IVR, answer for incoming and outgoing calls on an extension number of the "phone terminal" type), remember, depending on the CallFlow of the event, the extension_id (the extension number identifier: CalledExtensionID for in and CallerExtensionID for out) and the CallAPIID, and then use the resource:
PUT /extension/{extension_id}/speech_to_text/{call_api_id}

Description of Data Structures 

Name 

Type 

Description 

extension_id 

string 

Identifier of the extension number 

call_api_id 

string 

The identifier of the call to begin speech recognition 

action  

string 

Action, may be start or stop 

direction  

string 

Direction of the recognized speech relative to extension_id: out if the voice goes from the extension number, in if the voice goes to the extension number 

url  

string 

URL to send events with speech-recognized text 

If you repeat the same action with the same direction for the same conversation, you will get an error message.
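
The scheme above can be sketched in Python as follows. The sketch only builds the PUT request and does not send it. Several things here are assumptions: the host in BASE_URL is hypothetical, the event is assumed to be available as a dictionary keyed by the documented field names, the request options are assumed to be sent as a JSON body, and authorization is omitted.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host, replace with your platform URL


def extension_id_for(event: dict) -> str:
    # CalledExtensionID for CallFlow "in", CallerExtensionID for CallFlow "out".
    key = "CalledExtensionID" if event["CallFlow"] == "in" else "CallerExtensionID"
    return str(event[key])


def build_stt_request(event: dict, action: str, direction: str, callback_url: str):
    if action not in ("start", "stop"):
        raise ValueError("action must be start or stop")
    if direction not in ("in", "out"):
        raise ValueError("direction must be in or out")
    url = (
        f"{BASE_URL}/extension/{extension_id_for(event)}"
        f"/speech_to_text/{event['CallAPIID']}"
    )
    body = {"action": action, "direction": direction, "url": callback_url}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


event = {"CallFlow": "in", "CalledExtensionID": 204, "CallerExtensionID": 17, "CallAPIID": "abc123"}
request = build_stt_request(event, "start", "in", "http://mysite.com/stt_events")
print(request.get_method(), request.full_url)
```

Keeping a local record of the (call, direction) pairs that were already started is one way to avoid the error returned for repeated actions.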

The events look like this: 

The content length is given in the Content-Length header.
To get the recognized text in readable form, URL-decode it as UTF-8.
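
A minimal Python sketch of the decoding step is shown below. It reads exactly Content-Length bytes from the incoming stream and URL-decodes them as UTF-8. The function name read_event_text is hypothetical, the sample payload is invented for illustration, and treating "+" as a space is an assumption about how the platform encodes the text.

```python
import io
from urllib.parse import unquote_plus


def read_event_text(headers: dict, stream) -> str:
    """Read Content-Length bytes from the stream and URL-decode them as UTF-8."""
    length = int(headers.get("Content-Length", 0))
    raw = stream.read(length)
    # Percent-encoded data is plain ASCII; the decoded result is UTF-8 text.
    return unquote_plus(raw.decode("utf-8"), encoding="utf-8")


payload = b"%D0%B4%D0%B0+ok"  # invented sample, not a real event
headers = {"Content-Length": str(len(payload))}
print(read_event_text(headers, io.BytesIO(payload)))
```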

"Extension Number” Section Resources 

PUT /extension/{extension_id}/speech_to_text/{call_api_id} 

URL Options 

Name 

Type 

extension_id 

string 

call_api_id 

string 

Request Options 

Name 

Type 

action  

string 

direction  

string 

url  

string