All ASR systems can be divided into two types: speaker-adaptive and speaker-independent systems (Cook, 2002). A speaker-adaptive ASR system requires learning from its users (a process known as enrollment) and, with modern ASR technology, can achieve high accuracy. A robust speaker-independent ASR system, on the other hand, requires no learning and can be used readily by all types of speakers without prior training. Although different ASR engines make use of different strategies, or combinations of strategies, the basic techniques remain similar; recognition strategies are commonly contrasted along dimensions such as holistic versus analytical phonetic recognition and data-driven versus knowledge-driven recognition.
According to Cook (2002), despite the variety of ASR engines, all ASR systems follow the same six steps: (1) audio recording and utterance detection, (2) prefiltering of the audio signals, (3) framing and windowing, (4) optional filtering, (5) comparison and matching, and (6) postcomparison action. Among these steps, step (5) is considered the core of ASR. Different techniques and algorithms are available for step (5), including the use of Hidden Markov Models (HMMs), frequency analysis, linear algebra techniques, and spectral and time distortion methods.
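Of these six steps, framing and windowing (step 3) is simple enough to illustrate concretely. The sketch below is not drawn from Cook (2002); the 25 ms frame length, 10 ms hop, and Hamming window are common textbook choices assumed here for illustration.

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Step (3): split a waveform into overlapping frames and apply a
    Hamming window to each frame before spectral analysis."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop_len  # assumes signal >= one frame
    window = np.hamming(frame_len)                  # tapers frame edges
    return np.stack([signal[i * hop_len : i * hop_len + frame_len] * window
                     for i in range(n_frames)])

# One second of a 440 Hz tone at 16 kHz yields 98 frames of 400 samples each.
t = np.arange(16000) / 16000.0
print(frame_and_window(np.sin(2 * np.pi * 440 * t), 16000).shape)  # (98, 400)
```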
Parameter (or feature) extraction is carried out during step (5). According to Klevans and Rodman (1997), it involves extracting speaker-related information (parameters) from the acoustic signals using different algorithms and assessing the quality of the extracted parameters. At the end of step (5), probability and accuracy measures are generated (Cook, 2002). Although the parameters extracted vary from one ASR engine to another, they usually include the speaker's pitch, formant frequencies obtained from spectrographic analysis, linear prediction coding (LPC) coefficients, and other proprietary, engine-specific parameters (Klevans & Rodman, 1997).
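To show what extracting one such parameter set might look like, here is a minimal sketch of LPC coefficient estimation using the standard autocorrelation method and Levinson-Durbin recursion. This is a generic textbook construction, not the algorithm of any particular ASR engine; the model order of 10 and the synthetic test frame are illustrative assumptions.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate LPC coefficients for one windowed frame via the
    autocorrelation method and the Levinson-Durbin recursion."""
    # Autocorrelation lags r[0]..r[order] of the frame
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0] + 1e-12  # prediction error energy; epsilon guards silent frames
    for i in range(1, order + 1):
        # Reflection coefficient for this order from the current residual
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update lower-order coefficients
        a[i] = k
        error *= 1.0 - k * k
    return a, error

# Analyze a 25 ms Hamming-windowed frame of a synthetic 700 Hz tone.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.hamming(len(t)) * np.sin(2 * np.pi * 700 * t)
coeffs, residual = lpc_coefficients(frame)
print(coeffs.round(3))
```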
After the parameters are extracted from the preprocessed acoustic signals, they are evaluated to obtain "measures," metrics that are compared with the templates stored in the database to identify the recognized speech. The most commonly used metric is the distance measure, d(x, y), which represents the "separation" between two parameter vectors x and y; in the case of ASR, x is the parameter vector extracted from the input speech of nonnative English and y is the vector of the norm stored in the database. Distance measures based on different calculation algorithms and different parameters have been proposed, yet most are variations of either the Euclidean or the Manhattan distance between two parameter vectors. If LPC coefficients are used as parameters, the log likelihood distance is more commonly adopted (Junqua & Haton, 1996; Klevans & Rodman, 1997), its logarithmic scale representing the nonlinearity of human perception of sounds. Such distance measures indicate the separation between the input speech and the norm: a sound is correctly recognized only when the distance falls within an acceptable range. In some ASR engines the acceptable range can be changed by the user; apparently, the wider the range, the greater the chance of misrecognition.
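A minimal sketch of how such distance measures and the acceptable range might work in practice follows. The Euclidean and Manhattan metrics are the ones named above; everything else, including the toy template database, the `recognize` helper, and its threshold value, is a hypothetical illustration rather than any cited engine's design.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two parameter vectors."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    """Manhattan (city-block) distance between two parameter vectors."""
    return float(np.sum(np.abs(x - y)))

def recognize(x, templates, threshold):
    """Return the label of the nearest stored template, or None when even
    the best match falls outside the acceptable range (the threshold)."""
    label, best = min(((lbl, euclidean(x, y)) for lbl, y in templates.items()),
                      key=lambda pair: pair[1])
    return label if best <= threshold else None

# Toy database mapping labels to norm vectors y; x is the input's vector.
templates = {"bit": np.array([1.0, 0.2, 0.5]), "beat": np.array([0.9, 0.8, 0.4])}
x = np.array([0.95, 0.75, 0.45])
print(recognize(x, templates, threshold=0.3))  # -> "beat"
```

Raising `threshold` widens the acceptable range, which, as noted above, increases the chance of misrecognition; shrinking it trades misrecognitions for outright rejections.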
Accuracy and speed are commonly used to describe the performance of an ASR engine, and there is a tradeoff between the two (Junqua, 2000). Accuracy refers to the rate of error in the words identified and can be expressed in terms of a "single word error rate" or a "command success rate"; improvements in both derive from improved computer performance and large source-text databases. Even so, until recently only an average recognition accuracy of about 80% could be achieved, with no significant increase despite Google having published a trillion-word corpus in 2006 (Information Access Division, 2009). Speed is measured as the number of words identified per unit time. Beyond these two engine-level measures, text dictation using ASR can be affected by many factors.
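As one concrete reading of the word-level error rate mentioned above, the sketch below computes the conventional word error rate with a Levenshtein (edit-distance) dynamic program. The cited report may define its measures differently, so treat this as an assumed, illustrative definition; a command success rate would simply be the fraction of commands executed correctly.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

# An 80% recognition accuracy corresponds roughly to a 20% word error rate.
print(word_error_rate("please open the file", "please open a file"))  # 0.25
```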