Speech Application Platform Glossary

Published: June 20, 2005


A standard compression algorithm, used in digital communications systems of the European digital hierarchy, to modify and optimize the dynamic range of an analog signal for digitizing.


Augmented Backus-Naur Form. A metasyntax used for expressing context-free grammars (CFGs).


Access Control List. A list that specifies the rules for access to a particular resource.


Acoustic Model. A visual representation of a sound that shows the characteristics and behaviors of that sound.



A stream of speech that continues beyond a time limit set by the application. For example, if a telephony application user begins a conversation with a colleague instead of responding to the application, and the duration of the user’s conversational utterance exceeds the time limit for a response set by the application, the application treats the user’s speech as babble.


The ability of the user to voice-interrupt ("barge in on") the system during a prompt. The bargein attribute allows the user to speak or use dual tone multi-frequency (DTMF) input to interrupt a prompt.



Computer and Business Equipment Manufacturers Association. An organization of hardware vendors and manufacturers in the United States involved in standardizing information processing and related equipment.


The file name extension for context-free grammar files.


Context-free grammar.

coarticulatory effects

In the context of speech-recognition systems, the acoustic effects produced by the influence of one phone on the articulation of a neighboring phone.

confidence score

A value indicating the likelihood that the word or phrase recognized by the speech engine matches the word or phrase actually uttered by the speaker.


Computer Supported Telecommunications Applications. A set of API calls that provides an international standard interface between network servers and telephone switches. Established by the European Computer Manufacturers Association (ECMA).


Computer-Telephony Integration. The enabling of computer applications to integrate and control telephony functions.



Direct Dial Inward. A telephone service that provides companies or businesses with a block of numbers for calling into their Private Branch Exchange (PBX) system. With DDI, outside callers can dial individuals directly without intervention from a switchboard operator.


A programming construct, component, or element that structures, enables, or manages dialogue.


A turn-taking exchange of audio, such as a human-to-human or human-to-computer exchange.

directed dialogue

Also known as system-initiative. A speech dialogue in which the application prompts for specific information and recognizes only the requested information at that point in the application.


Dialed Number Identification Service. A telephone service that enables the receiver of a call to determine the number that the caller dialed. Commonly used by companies that have multiple 1-800 or 1-900 numbers.


Document Object Model. A programming interface specification developed by the World Wide Web Consortium (W3C) that enables HTML and XML pages and documents to be created and manipulated as objects.


Design Requirements Specification


Digital Signal Processing. High-speed data manipulation, typically of audio or image data, that improves the accuracy and reliability of digital communications.


Dual Tone Multi-frequency. The signaling system used in telephones with touch-tone keypads, in which each digit is associated with two specific frequencies.
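
The two-frequency scheme described above can be illustrated with a small lookup. This is a minimal sketch: the row and column frequencies are the standard DTMF values, which the glossary itself does not list, and the name `dtmf_frequencies` is hypothetical.

```python
# Standard DTMF tone grid: each key sits at the intersection of one
# low-group (row) frequency and one high-group (column) frequency.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key: str) -> tuple[int, int]:
    """Return the (low, high) frequency pair in Hz for a single keypad key."""
    if len(key) != 1:
        raise ValueError("expected exactly one key character")
    for row_index, row in enumerate(KEYPAD):
        col_index = row.find(key)
        if col_index != -1:
            return ROW_HZ[row_index], COL_HZ[col_index]
    raise ValueError(f"not a DTMF key: {key!r}")
```

For example, `dtmf_frequencies("5")` yields the pair for the second row and second column of the keypad.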



See explicit confirmation


European Computer Manufacturers Association. An industry organization based in Geneva, Switzerland, whose goal is the standardization of Information and Communication Technology (ICT) systems. ECMA developed the Computer Supported Telecommunications Applications (CSTA) standards, which define the functionality and content of messages used by the SALT application. The American counterpart is the Computer and Business Equipment Manufacturers Association (CBEMA).


Enterprise Instrumentation Framework. A Microsoft technology that provides an extensible event schema and unified API that leverage existing event, logging, and tracing mechanisms built into Microsoft Windows.


A processor or component that determines how the application manages and manipulates data. Three examples of specific types of engines used by Microsoft sas-formal (sas) serv-speech (serv-speech-acro) are:

  • Prompt Engine

  • Speech Recognition Engine

  • TTS Engine


File name extension for a Windows Event Trace log.

Event Trace log

A file containing event trace log data.

explicit confirmation (EC)

The most basic form of confirmation. Of the three styles of confirmation (implicit, explicit, and short time-out), explicit requires the most user time, because it introduces an extra prompt to explicitly confirm information that the user has provided.


A segment of a prompt that can combine dynamically with other extractions at run time to create a prompt.



A property whose value determines how confirmation is handled in a dialogue. If the value of FirstInitialTimeout is zero, the QA control performs normal confirmation and the user must explicitly accept or deny the confirmation. If the value of FirstInitialTimeout is non-zero, the QA control waits for that number of milliseconds before raising the silence event.

form factor

In computer hardware, the size, configuration, or physical arrangement of a computer case or chassis, or one of its internal components. In computer software or programming, form factor typically refers to the size of the program or the amount of memory required to run the program effectively. Analogous to footprint.



A structured list of rules that identify the words or phrases that can be used for speech recognition.

GRN Referencing

Grammar Rule Name (GRN) Referencing. A type of semantic markup language (SML) script referencing in which the script expression evaluates semantic values of, or assigns semantic values to, the Rule Variable (RV) of the rule element that contains the expression.

GRN Rule Variable

Grammar Rule Name Rule Variable. A predefined object that holds a semantic value which may be composed of multiple properties. Every rule element in a grammar has a single GRN Rule Variable. The GRN Rule Variable is identified by a dollar sign ($).

GRR Referencing

Grammar Rule Reference (GRR) Referencing. A type of semantic markup language (SML) script referencing in which the script expression evaluates semantic values of the Rule Variable (RV) of a rule element outside of the rule element that contains the expression.


The file name extension for XML Form grammar documents, as adopted by the W3C Voice Browser Working Group. The gets in Microsoft sdk-formal sdk-ver creates files in this format.


Globally unique identifier. A program-generated number that creates a unique identity for an object.
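
As an illustration of a program-generated identifier, the sketch below uses Python's standard `uuid` module to produce a random (version 4) value. The helper name `new_guid` is hypothetical, and other platforms generate GUIDs through their own APIs.

```python
import uuid


def new_guid() -> str:
    """Return a freshly generated random identifier in its 36-character text form."""
    return str(uuid.uuid4())
```

Two calls to `new_guid()` are, for practical purposes, never expected to return the same value, which is what makes the identifier usable as a unique object identity.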



Hypertext Markup Language. The language most commonly used in World Wide Web pages.



See implicit confirmation


Internet Engineering Task Force. The international community of network designers and professionals that defines standard Internet protocols and addresses Internet architecture issues.

implicit confirmation (IC)

The confirmation method that combines the confirmation question with the next information retrieval question to form a single prompt. Uses fewer prompts than explicit confirmation (EC).

inbound call

A telephone call originated by a user and directed toward the telephony server. Synonymous with incoming call.

inline grammar

Grammar logic that exists as XML markup in the code of an .aspx page rather than in a separate grammar file.

inline prompt

Static text that the prompt engine plays when the application activates a control. Only one inline prompt exists per control. If a control has an inline prompt, that prompt is the only prompt the control plays. An application cannot change an inline prompt at run time.


An individual object of a particular class. A class is the definition of a type, and an actual occurrence of the class is called an instance. Each instance of a class can have different values for its variables.


Inverse text normalization. Enables a spoken numeric or symbolic value to appear as a number or symbol when translated by a speech recognition program. For example, if "twenty three" is dictated, it appears as "23" on the computer screen.
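
The "twenty three" example can be sketched in a few lines. This is a deliberately small illustration that covers only the numbers zero through ninety-nine; the function name `inverse_normalize` and the word tables are assumptions made for the example, not part of any speech engine.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def inverse_normalize(phrase: str) -> str:
    """Convert a spoken number from zero to ninety-nine into digits.

    Phrases that are not recognized as such a number are returned unchanged.
    """
    words = phrase.lower().replace("-", " ").split()
    if len(words) == 1:
        if words[0] in UNITS:
            return str(UNITS[words[0]])
        if words[0] in TENS:
            return str(TENS[words[0]])
    if len(words) == 2 and words[0] in TENS and 1 <= UNITS.get(words[1], 0) <= 9:
        return str(TENS[words[0]] + UNITS[words[1]])
    return phrase
```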


Interactive Voice Response. A telephony application that leads a telephone caller through a hierarchy of menus, delivers voice responses, collects voice and data inputs, and performs other operations on behalf of the caller or the program sponsor.






Multipurpose Internet Mail Extensions. An e-mail protocol extension that enables the exchange of different file types, such as audio, video, and applications, through e-mail.

mixed-initiative dialogue

A speech dialogue in which the application prompts for specific information, but the user may respond with additional or different information that the application recognizes.


Microsoft Management Console. An application that provides a graphical user interface (GUI) and an operational framework for administrative and management tools.


Abbreviation for millisecond (1/1000 of a second).


Microsoft Message Queue. Application-level messaging software that allows applications to asynchronously send and receive messages in disconnected environments.


Microsoft Speech Server


The mode that allows a combination of speech and other means of input/output. Enables a user to speak to an application while, for example, pressing a stylus on a Pocket PC or clicking a mouse on a desktop application. Also allows an application to speak to the user while it displays graphics on the screen.


An utterance that the application recognizes with a confidence level that falls below the recognition rejection threshold. A speech recognizer often classifies an utterance as a mumble when:

  • The user’s pronunciation does not match the pronunciation expected by the speech recognizer.

  • Excessive noise (background noise or line noise) is present in the input.



The recognition results in which the speech recognition engine has the highest levels of confidence. 'N' is the number of results returned.
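
Selecting the N results with the highest confidence might look like the sketch below. The `Hypothesis` type and the `n_best` function are hypothetical names for illustration; a real speech recognition engine returns its own result objects.

```python
from typing import NamedTuple


class Hypothesis(NamedTuple):
    """One candidate recognition result (illustrative structure only)."""
    text: str
    confidence: float


def n_best(hypotheses: list[Hypothesis], n: int) -> list[Hypothesis]:
    """Return up to n hypotheses, ordered from highest to lowest confidence."""
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    return ranked[:n]
```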


Natural language. A human language, as opposed to a command or programming language traditionally used to communicate with a computer.


Shortened form of "No Recognition." NoReco is the name of the event generated by the speech recognition (SR) engine when the engine is unable to recognize input. The NoReco event is generated by one of four conditions:

  1. Sound detected, but no speech could be interpreted. The SR engine detects and parses speech, but is unable to match the parsed speech to the active grammar.

  2. Mumble. The SR engine detects and parses speech, and returns a result, but the confidence level for the result is below the recognition rejection threshold.

  3. Babble. The SR engine detects speech, but does not detect silence for the duration of time specified by the BabbleTimeout property.

  4. No Sound. The SR engine stops listening before speech is detected during the period of time specified by the value of the InitialTimeout property. A NoReco event due to no sound differs from a Silence event. A NoReco due to no sound can be generated only before the time specified by the InitialTimeout property is reached, whereas the Silence event is generated only after the time specified by the value of the InitialTimeout property is exceeded.
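
The four conditions above can be summarized as a decision function. This is a simplified sketch, not the engine's actual logic: the function name, the parameter names, and the order in which the checks are applied are assumptions, and the timing relationship between the NoReco and Silence events is not modeled.

```python
from typing import Optional


def classify_noreco(
    sound_detected: bool,
    speech_ms: int,
    babble_timeout_ms: int,
    matched_grammar: bool,
    confidence: float,
    reject_threshold: float,
) -> Optional[str]:
    """Return the NoReco cause, or None when the input counts as recognized."""
    if not sound_detected:
        return "no sound"   # condition 4
    if speech_ms > babble_timeout_ms:
        return "babble"     # condition 3: speech ran past the babble time limit
    if not matched_grammar:
        return "no match"   # condition 1: speech did not match the active grammar
    if confidence < reject_threshold:
        return "mumble"     # condition 2: result fell below the rejection threshold
    return None
```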


outbound call

A telephone call originated by the telephony server and directed toward a remote party. Synonymous with outgoing call.



Private Branch Exchange. An automatic telephone switching system that enables users within an organization to place calls to each other without going through the public telephone network. Also allows users to place calls directly to outside numbers.


Pulse-code modulation. A method of encoding information in a signal by varying the amplitude of pulses. Unlike pulse amplitude modulation, in which pulse amplitude can vary continuously, PCM limits pulse amplitudes to several predefined values.
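
The restriction of pulse amplitudes to predefined values can be illustrated with a uniform quantizer. This is a minimal sketch that assumes samples are normalized to the range -1.0 to 1.0 and that the levels are evenly spaced; the function name `quantize` and the default of 8 bits are choices made for the example.

```python
def quantize(sample: float, bits: int = 8) -> int:
    """Map a sample in [-1.0, 1.0] to one of 2**bits integer levels."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, sample))      # out-of-range input is clipped
    index = int((clipped + 1.0) / 2.0 * levels)
    return min(levels - 1, index)              # keep +1.0 inside the top level
```

With 8 bits, every input amplitude is reduced to one of 256 values, so small differences between nearby amplitudes are lost in exchange for a compact digital representation.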


Prompt Engine Markup Language.


The file name extension for a prompt function file.


A unique sound unit of speech.


Abstract categories of speech sounds (vowels and consonants) grouped together to create words. For example, SAPI provides two default pronunciations of the word hello: "h ax l ow" and "h eh l ow." Each group of sounds, separated by spaces, represents a phoneme.
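
The space-separated pronunciations in the example can be split into individual phonemes and compared. The sketch below uses hypothetical helper names and assumes both pronunciations contain the same number of phonemes.

```python
def phonemes(pronunciation: str) -> list[str]:
    """Split a space-separated pronunciation string into its phonemes."""
    return pronunciation.split()


def differing_positions(first: str, second: str) -> list[int]:
    """Return the indexes at which two equal-length pronunciations differ."""
    pairs = zip(phonemes(first), phonemes(second))
    return [index for index, (a, b) in enumerate(pairs) if a != b]
```

Applied to the two pronunciations of "hello" above, the comparison shows that they differ only in the second phoneme ("ax" versus "eh").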


Abbreviation for part of speech.


Optional ending words or phrase.


Optional beginning words or phrase.


A question, directive, greeting, or information spoken by a speech application.


"On what date do you wish to depart?"
"Welcome to Paris."
"Press three."

prompt engine

The component of serv-speech (serv-speech-acro) that processes text input and produces speech output by concatenating prerecorded words and phrases that match the text input. The prompt engine stores the recordings it uses on disk and indexes them in one or more prompt database files. serv-speech-acro is a component of the Microsoft sas-formal.

prompt function

Dynamically generates a prompt at run time.


The file name extension for a working file containing transcription text, extraction data, and archived versions of prompt .wav files in their original recorded format. Compiles into a .prompts file.


The file name extension for a prompt database file, a binary file that contains all the prompt information and audio data for a prompt project. Compiled from a .promptdb file.


The file name extension for a prompt project file.


A collection of phonological features, including pitch, duration, and stress, that define the rhythm of spoken language.


QA control

Defines a single interaction with the user, which is usually, but not always, a "Question and Answer" dialogue. A QA control that collects data places it in SemanticItem controls.



Request For Comments. A formal document created and reviewed by members of the Internet Engineering Task Force (IETF). Some RFCs become Internet standards, and some are informational.


Root mean square.
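
Assuming this entry refers to the root mean square, a common measure of the average level of an audio signal, the calculation can be sketched as follows. The function name `rms` is chosen for the example.

```python
import math


def rms(samples: list[float]) -> float:
    """Return the square root of the mean of the squared sample values."""
    if not samples:
        raise ValueError("at least one sample is required")
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```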


Root Rule Variable. The GRN Rule Variable of the root rule of a grammar. The RRV provides the semantic result of a recognition.

RunSpeech object

Supports dc on a client device, and is responsible for the activation of these controls and the confirmation of data they collect. Also exposes several methods and properties used in client-side scripting.



Speech Application Deployment Service.


Speech Application Language Tags. A markup language that integrates speech services into existing markup languages such as HTML and XHTML. Enables multimodal and telephony access to information and applications from PCs, telephones, and PDAs.


Speech Application Programming Interface. A set of routines, protocols, and tools that enable programmers to build speech-enabled applications for Microsoft Windows platforms.


Speech Application Software Development Kit.

semantic interpretation

The process by which a semantic interpreter generates a result based on a spoken word or phrase that matches a grammar rule or rules.


Speech Engine Services. Microsoft Speech Server component that provides speech recognition and speech output resources primarily for telephony and Pocket PC clients.

short time-out confirmation (STC)

Confirmation method that interprets silence as acceptance. With short time-out confirmation, the time period that the application waits for the user to speak is typically shorter than that in explicit confirmation.


Abbreviation for semantic interpretation.


No sound from the user is detected by the application.


Simple Messaging Extension. The communication mechanism by which SALT applications establish an asynchronous message exchange channel for sending and receiving messages between the SALT application and external components of the SALT platform.


Semantic Markup Language. An XML-based markup language that allows the application to identify and parse meaningful parts of speech recognition output.


Abbreviation for speech output.


Simple Object Access Protocol. Provides a simple mechanism for exchanging structured and typed information between peers in a decentralized, distributed environment using XML. Defines a message format in XML that travels over the Internet using HTTP.

speech application

An application in which human-computer interaction is mediated either unidirectionally or bidirectionally by speech.


Speech recognition. The ability of a computer to receive spoken-word commands and data input.

Speech Recognition Engine

The component of serv-speech (serv-speech-acro) that converts spoken input to text and delivers the text to an application. serv-speech-acro is a component of the Microsoft sas-formal.


Speech Recognition Grammar Specification. Specification developed by the World Wide Web Consortium (W3C) that defines syntax for representing grammars for use in speech recognition. Enables developers to specify the words and patterns of words to be listened for by a speech recognizer.


Speech Synthesis Markup Language. An XML-based markup language used to control characteristics of synthetic speech output, including voice, pitch, rate, volume, and pronunciation.


A speech dialogue in which the application prompts for specific information and recognizes only the requested information at that point in the application. Also known as directed dialogue.



When using devices like Windows Mobile-based Pocket PC (Pocket PC) or Tablet PC, tapping a control with the input stylus and then speaking to input data.


Telephony Application Services. The client that renders telephony applications in a staging and production environment.


Telephony Application Simulator. The client that renders telephony applications in the sdk-formal (sdk).


Triggered Call Queue. The process running on the Web server that receives alert notifications from SSNS, routes call requests to available interpreters, and monitors call request status and interpreter status.


Telephone technology (voice, fax, or modem transmissions) based on the conversion of sound into electrical signals or wireless communication.

text normalization

The process of converting non-word written symbols into the words that a speaker would say when reading those symbols aloud.



time-out confirmation

Confirmation and correction of the user's responses using a combination of implicit confirmation (IC), short time-out confirmation (STC), and explicit confirmation (EC) strategies.


A record of a speech-based conversation converted into written text. Commonly used to analyze the performance of a speech application by matching what was said during the call with the log file of what actually happened.


Text to speech. The process of converting text into spoken language by breaking down the words of the text into phonemes, analyzing the input for occurrences that require text normalization, and generating the digital audio for playback.

TTS Engine

The component of serv-speech (serv-speech-acro) that processes text input and produces speech output by synthesizing words and phrases. serv-speech-acro is a component of the Microsoft sas-formal.



A standard analog signal-compression algorithm, used in digital communications systems of the North American digital hierarchy, to optimize the dynamic range of an analog signal prior to digitizing.
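
Assuming this entry describes µ-law companding, the compression curve can be sketched with the continuous formula below. The value 255 for µ and the function names are assumptions made for illustration; deployed codecs typically use a segmented integer approximation of this curve rather than the formula itself.

```python
import math

MU = 255  # assumed companding parameter; not stated in the glossary


def mu_law_compress(x: float, mu: int = MU) -> float:
    """Compress a sample in [-1.0, 1.0]; quiet samples get more of the range."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: int = MU) -> float:
    """Invert mu_law_compress, recovering the original sample value."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)
```

Because the curve is logarithmic, a quiet sample such as 0.01 is mapped to a much larger companded value, which is how the algorithm preserves detail in low-level signals before digitizing.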


User-perceived latency. The length of time that a user perceives to occur between the end of one event and the beginning of a subsequent event.


Uniform Resource Identifier. A character string used to identify a resource (such as a file) from anywhere on the Internet by type and location. The set of Uniform Resource Identifiers includes Uniform Resource Names (URNs) and Uniform Resource Locators (URLs).


Uniform Resource Locator. An address for a resource on the Internet. Specifies the protocol used to access the resource (such as http: for a World Wide Web page or ftp: for an FTP site), the name of the server on which the resource resides (such as http://www.woodgrovebank.com), and, optionally, the path to a resource (such as an HTML document or a file on that server).
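
The three parts named above (protocol, server name, and path) can be separated with Python's standard `urllib.parse` module. In this sketch the helper name `describe_url` is hypothetical, and the path used in the test example is an invented placeholder rather than a real resource.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict[str, str]:
    """Split a URL into the protocol, server name, and path it specifies."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "server": parts.netloc,
        "path": parts.path,
    }
```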



Voice Activity Detector.


An application that is driven by using either speech or DTMF input. Telephony applications are a type of voice-only application in which users interact with the application by speaking into the telephone or pressing buttons on the numeric keypad.


Voice Over IP. Audio streaming over a network using the TCP/IP protocol.


Virtual Private Network. A set of nodes on a public network that communicate among themselves using encryption technology so that their messages are safe from being intercepted and understood by unauthorized users.


Voice User Interface.



World Wide Web Consortium. The organization that sets standards for the Web and HTML.


Wireless Markup Language. An XML-based markup language used to specify content and the user interface for narrowband devices, including cellular phones and pagers. Part of the Wireless Application Protocol (WAP).



Extensible Hypertext Markup Language. A markup language incorporating elements of HTML and XML. Web sites designed using XHTML can be more readily displayed on handheld computers and digital phones equipped with microbrowsers.


Extensible Markup Language. A condensed form of SGML (Standard Generalized Markup Language) that allows Web developers and designers to create customized tags.


Extensible Stylesheet Language Transformations. A language used to transform an existing XML document into a restructured XML document. Primarily intended for use as part of XSL. Also called XSL Transformations.


XML namespace. A collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names. XML namespaces differ from the "namespaces" conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set.
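
To show how a URI reference identifies the names in a namespace, the sketch below parses a small grammar fragment with Python's standard `xml.etree.ElementTree` module. The sample document and the rule id "city" are invented for the example; the URI is the namespace used by W3C speech recognition grammars.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

DOCUMENT = (
    '<g:grammar xmlns:g="http://www.w3.org/2001/06/grammar">'
    '<g:rule id="city"/>'
    "</g:grammar>"
)

root = ET.fromstring(DOCUMENT)

# ElementTree expands each prefixed name to "{namespace-uri}local-name".
root_name = root.tag

# The prefix is only a local alias: "x" here resolves to the same URI as
# the "g" prefix in the document, so the lookup finds the same element.
rule_ids = [rule.get("id") for rule in root.findall("x:rule", {"x": SRGS_NS})]
```

The element is identified by the combination of the namespace URI and the local name, not by the prefix that happens to be written in the document.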