Speech recognition

pattern has to be recognized or classified into a category that represents a meaning to a human. Every acoustic signal can be broken into smaller, more basic sub-signals. As the more complex sound signal is broken into smaller sub-sounds, different levels are created: at the top level are complex sounds, which are made of simpler sounds on the level below, and each lower level yields more basic, shorter, and simpler sounds. At the lowest level, where the sounds are the most fundamental, a machine would check simple, more probabilistic rules for what a sound should represent. Once these sounds are put together into more complex sounds on the upper level, a new set of more deterministic rules should predict what the new complex sound should represent. The uppermost level of deterministic rules should figure out the meaning of complex expressions. To expand our knowledge of speech recognition, we need to take neural networks into consideration. There are four steps of neural network approaches:
(GMM-HMM) technology based on generative models of speech trained discriminatively. A number of key difficulties had been methodologically analyzed in the 1990s, including gradient diminishing and weak temporal correlation structure in the neural predictive models. All these difficulties were in addition to the lack of big training data and big computing power in these early days. Most speech recognition researchers who understood such barriers hence subsequently moved away from neural nets to pursue generative modeling approaches until the recent resurgence of deep learning starting around 2009–2010 that had overcome all these difficulties. Hinton et al. and Deng et al. reviewed part of this recent history about how their collaboration with each other and then with colleagues across four groups (University of Toronto, Microsoft, Google, and IBM) ignited a renaissance of applications of deep feedforward neural networks for speech recognition.
(Publisher: Springer) written by Microsoft researchers D. Yu and L. Deng and published near the end of 2014, with highly mathematically oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods. A related book, published earlier in 2014, "Deep Learning: Methods and Applications" by L. Deng and D. Yu provides a less technical but more methodology-focused overview of DNN-based speech recognition during 2009–2014, placed within the more general context of deep learning applications including not only speech recognition but also image recognition, natural language processing, information retrieval, multimodal processing, and multitask learning.
assumptions and can learn all the components of a speech recognizer including the pronunciation, acoustic and language model directly. This means, during deployment, there is no need to carry around a language model making it very practical for applications with limited memory. By the end of 2016, the attention-based models have seen considerable success including outperforming the CTC models (with or without an external language model). Various extensions have been proposed since the original LAS model. Latent Sequence Decompositions (LSD) was proposed by
The report also concluded that adaptation greatly improved the results in all cases and that the introduction of models for breathing was shown to improve recognition scores significantly. Contrary to what might have been expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as might have been expected. A restricted vocabulary, and above all, a proper syntax, could thus be expected to improve recognition accuracy substantially.
Individuals with learning disabilities who have problems with thought-to-paper communication (essentially, they think of an idea but it is processed incorrectly, causing it to end up differently on paper) can possibly benefit from the software, but the technology is not bug-proof. Also, the whole idea of speech-to-text can be hard for intellectually disabled persons because it is rare that anyone tries to learn the technology in order to teach it to the person with the disability.
keyboard and mouse: voice-based navigation provides only modest ergonomic benefits. By contrast, many highly customized systems for radiology or pathology dictation implement voice "macros", where the use of certain phrases – e.g., "normal report", will automatically fill in a large number of default values and/or generate boilerplate, which will vary with the type of the exam – e.g., a chest X-ray vs. a gastrointestinal contrast series for a radiology system.
in their 2012 review paper). A Microsoft research executive called this innovation "the most dramatic change in accuracy since 1979". In contrast to the steady incremental improvements of the past few decades, the application of deep learning decreased word error rate by 30%. This innovation was quickly adopted across the field. Researchers have begun to use deep learning techniques for language modeling as well.
, a type of neural network based solely on "attention", have been widely adopted in computer vision and language modeling, sparking interest in adapting such models to new domains, including speech recognition. Some recent papers reported superior performance levels using transformer models for speech recognition, but these models usually require large-scale training datasets to reach high performance levels.
, employs a speaker-dependent system, requiring each pilot to create a template. The system is not used for any safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot
(or an approximation thereof) Instead of taking the source sentence with maximal probability, we try to take the sentence that minimizes the expectation of a given loss function with regard to all possible transcriptions (i.e., we take the sentence that minimizes the average distance to other possible sentences weighted by their estimated probability). The loss function is usually the
(how often it vibrates per second). Accuracy can be computed with the help of word error rate (WER). Word error rate can be calculated by aligning the recognized word sequence and the reference word sequence using dynamic string alignment. A complication in computing the word error rate is that the recognized and reference sequences may differ in length.
(DTW) algorithm and used it to create a recognizer capable of operating on a 200-word vocabulary. DTW processed speech by dividing it into short frames, e.g. 10 ms segments, and processing each frame as a single unit. Although DTW would be superseded by later algorithms, the technique carried on. Achieving speaker independence remained unsolved at this time period.
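The word error rate computation mentioned above (aligning the recognized words against the reference words with dynamic string alignment) can be illustrated with a minimal sketch. It assumes the common definition of WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, and whitespace tokenization; the function name `word_error_rate` is hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length (assumed WER definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    # One reference word ("the") is missing from the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the alignment allows insertions and deletions, the two sequences need not have the same length, and the result can exceed 1.0 when the hypothesis contains many extra words.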
speech recognition task should be possible. In practice, this is rarely the case. The FAA document 7110.65 details the phrases that should be used by air traffic controllers. While this document gives fewer than 150 examples of such phrases, the number of phrases supported by one simulation vendor's speech recognition system is in excess of 500,000.
/other injuries to the upper extremities can be relieved from having to worry about handwriting, typing, or working with a scribe on school assignments by using speech-to-text programs. They can also use speech recognition technology to search the Internet or use a computer at home without having to physically operate a mouse and keyboard.
independently discovered the application of HMMs to speech.) This was controversial with linguists since HMMs are too simplistic to account for many common features of human languages. However, the HMM proved to be a highly useful way for modeling speech and replaced dynamic time warping to become the dominant speech recognition algorithm in the 1980s.
devices are also accessible to visitors to the building, or even those outside the building if they can be heard inside. Attackers may be able to gain access to personal information, like calendar, address book contents, private messages, and documents. They may also be able to impersonate the user to send messages or make online purchases.
, then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems), each
(HMM) for speech recognition. James Baker had learned about HMMs from a summer job at the Institute for Defense Analyses during his undergraduate education. The use of HMMs allowed researchers to combine different sources of knowledge, such as acoustics, language, and syntax, in a unified probabilistic model.
sector, speech recognition can be implemented in front-end or back-end of the medical documentation process. Front-end speech recognition is where the provider dictates into a speech-recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing
One approach to this limitation was to use neural networks as a pre-processing, feature transformation, or dimensionality reduction step prior to HMM-based recognition. However, more recently, LSTM and related recurrent neural networks (RNNs), time delay neural networks (TDNNs), and transformers have
system at CMU. The Sphinx-II system was the first to do speaker-independent, large vocabulary, continuous speech recognition and it had the best performance in DARPA's 1992 evaluation. Handling continuous speech with a large vocabulary was a major milestone in the history of speech recognition. Huang
Speech recognition can become a means of attack, theft, or accidental operation. For example, activation words like "Alexa" spoken in an audio or video broadcast can cause devices in homes and offices to start listening for input inappropriately, or possibly take an unwanted action. Voice-controlled
Read vs. Spontaneous Speech – When a person reads it's usually in a context that has been previously prepared, but when a person uses spontaneous speech, it is difficult to recognize the speech because of the disfluencies (like "uh" and "um", false starts, incomplete sentences, stuttering, coughing,
Training for air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog that the controller would
Typically a manual control input, for example by means of a finger control on the steering-wheel, enables the speech recognition system and this is signaled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept a speech input for
Two attacks have been demonstrated that use artificial sounds. One transmits ultrasound and attempts to send commands without nearby people noticing. The other adds small, inaudible distortions to other speech or music that are specially crafted to confuse the specific speech recognition system into
This type of technology can help those with dyslexia, but its benefit for other disabilities is still in question. The product's limited accuracy is what hinders its usefulness. Although a child may be able to say a word, depending on how clearly they say it the technology may think they are
The USAF, USMC, US Army, US Navy, and FAA as well as a number of international ATC training organizations such as the Royal Australian Air Force and Civil Aviation Authorities in Italy, Brazil, and Canada are currently using ATC simulators with speech recognition from a number of different vendors.
with multiple hidden layers of units between the input and output layers. Similar to shallow neural networks, DNNs can model complex non-linear relationships. DNN architectures generate compositional models, where extra layers enable composition of features from lower layers, giving a huge learning
In 2017, Microsoft researchers reached a historic human parity milestone of transcribing conversational telephony speech on the widely benchmarked Switchboard task. Multiple deep learning models were used to optimize speech recognition accuracy. The speech recognition word error rate was reported
and his students at the University of Toronto and by Li Deng and colleagues at Microsoft Research, initially in the collaborative work between Microsoft and the University of Toronto which was subsequently expanded to include IBM and Google (hence "The shared views of four research groups" subtitle
pronunciation researchers are primarily interested in improving L2 learners' intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect
techniques offer the potential to eliminate the need for a person to act as a pseudo-pilot, thus reducing training and support personnel. In theory, air traffic controller tasks are also characterized by highly structured speech as the primary output of the controller, hence reducing the difficulty of the
As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done both in speech recognition
aircraft, and other programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and
or EHR). The use of speech recognition is more naturally suited to the generation of narrative text, as part of a radiology/pathology interpretation, progress note or discharge summary: the ergonomic gains of using speech recognition to enter structured discrete data (e.g., numeric values or codes
DNNs achieved a notable success in large-vocabulary speech recognition in 2010, when industrial researchers, in collaboration with academic researchers, adopted large DNN output layers based on context-dependent HMM states constructed by decision trees. See comprehensive reviews of this
Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another he or she were walking more quickly, or even if
recognition, and speaker independence was considered a major breakthrough. Until then, systems required a "training" period. A 1987 ad for a doll had carried the tagline "Finally, the doll that understands you." – despite the fact that it was described as "which children could train to respond to
team created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary. Jelinek's statistical approach put less emphasis on emulating the way the human brain processes and understands speech in favor of using statistical modeling techniques like HMMs. (Jelinek's group
Speech recognition by machine is a very complex problem, however. Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed. Speech is distorted by background noise, echoes, and electrical characteristics. Accuracy of speech recognition may
People with disabilities can benefit from speech recognition programs. For individuals that are Deaf or Hard of Hearing, speech recognition software is used to automatically generate a closed-captioning of conversations such as discussions in conference rooms, classroom lectures, and/or religious
and to use raw features. This principle was first explored successfully in the architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features, showing its superiority over the Mel-Cepstral features which contain a few stages of fixed transformation from spectrograms. The
since at least 2006. This technology allows analysts to search through large volumes of recorded conversations and isolate mentions of keywords. Recordings can be indexed and analysts can run queries over the database to find conversations of interest. Some government research programs focused on
A more significant issue is that most EHRs have not been expressly tailored to take advantage of voice-recognition capabilities. A large part of the clinician's interaction with the EHR involves navigation through the user interface using menus, and tab/button clicks, and is heavily dependent on
make fewer explicit assumptions about feature statistical properties than HMMs and have several qualities making them more attractive recognition models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a
Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system
Modern general-purpose speech recognition systems are based on hidden Markov models. These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary
into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker-independent" systems. Systems that use training are called "speaker dependent".
This hierarchy of constraints is exploited. By combining decisions probabilistically at all lower levels, and making more deterministic decisions only at the highest level, speech recognition by a machine is a process broken into several phases. Computationally, it is a problem in which a sound
in 2016. The model named "Listen, Attend and Spell" (LAS), literally "listens" to the acoustic signal, pays "attention" to different parts of the signal and "spells" out the transcript one character at a time. Unlike CTC-based models, attention-based models do not have conditional-independence
Simple voice commands may be used to initiate phone calls, select radio stations or play music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice recognition capabilities vary between car make and model. Some of the most recent car models offer natural-language speech
natural and efficient manner. However, in spite of their effectiveness in classifying short-time units such as individual phonemes and isolated words, early neural networks were rarely successful for continuous recognition tasks because of their limited ability to model temporal dependencies.
also uses the same features, most of the same front-end processing, and classification techniques as is done in speech recognition. A comprehensive textbook, "Fundamentals of Speaker Recognition", is an in-depth source for up-to-date details on the theory and practice. A good insight into the
Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have been used in many aspects of speech recognition such as phoneme classification, phoneme classification through multi-objective evolutionary algorithms, isolated word
and can learn "Very Deep Learning" tasks that require memories of events that happened thousands of discrete time steps ago, which is important for speech. Around 2007, LSTM trained by Connectionist Temporal Classification (CTC) started to outperform traditional speech recognition in certain
e.g. the 26 letters of the English alphabet are difficult to discriminate because they are confusable words (most notoriously, the E-set: "B, C, D, E, G, P, T, V, Z", where "Z" is pronounced "zee" rather than "zed" depending on the English region); an 8% error rate is considered good for this
assessment is the use of speech recognition to verify the correctness of pronounced speech, as distinguished from manual assessment by an instructor or proctor. Also called speech verification, pronunciation evaluation, and pronunciation scoring, the main application of this technology is
A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) with certain restrictions. That is, the sequences are "warped"
appliance control, search key words (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics, speech-to-text processing (e.g.,
In the long history of speech recognition, both shallow and deep forms (e.g. recurrent nets) of artificial neural networks had been explored for many years during the 1980s, 1990s, and a few years into the 2000s. But these methods never won over the non-uniform internal-handcrafting
system, the voice is routed through a speech-recognition machine and the recognized draft document is routed along with the original voice file to the editor, where the draft is edited and report finalized. Deferred speech recognition is widely used in the industry currently.
Speech recognition is also very useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to more involved disabilities that preclude using conventional computer input devices. In fact, people who used the keyboard a lot and developed
Speech recognition can allow students with learning disabilities to become better writers. By saying the words aloud, they can increase the fluidity of their writing, and be alleviated of concerns regarding spelling, punctuation, and other mechanics of writing. Also, see
, the first end-to-end sentence-level lipreading model, using spatiotemporal convolutions coupled with an RNN-CTC architecture, surpassing human-level performance in a restricted grammar dataset. A large-scale CNN-RNN-CTC architecture was presented in 2018 by
The four-step neural network approach can be explained with some background on sound. Sound is produced by the vibration of air (or some other medium), which humans register with their ears and machines with receivers. A basic sound creates a wave that has two descriptions:
The use of voice recognition software, in conjunction with a digital audio recorder and a personal computer running word-processing software has proven to be positive for restoring damaged short-term memory capacity, in stroke and craniotomy individuals.
, or MLLT). Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum
assumptions similar to an HMM. Consequently, CTC models can directly learn to map speech acoustics to English characters, but the models make many common spelling mistakes and must rely on a separate language model to clean up the transcripts. Later,
bias, especially in high-stakes assessments; from words with multiple correct pronunciations; and from phoneme coding errors in machine-readable pronunciation dictionaries. In 2022, researchers found that some newer speech to text systems, based on
development and of the state of the art as of October 2014 in the recent Springer book from Microsoft Research. See also the related background of automatic speech recognition and the impact of various machine learning paradigms, notably including
environment as well as in the jet fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot, in general, does not wear a
are important strategies for reusing and extending the capabilities of deep learning models, particularly due to the high costs of training models from scratch, and the small size of the available corpora in many languages and/or specific domains.
The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems.
) provides for substantial financial benefits to physicians who utilize an EMR according to "Meaningful Use" standards. These standards require that a substantial amount of data be maintained by the EMR (now more commonly referred to as an
are important parts of modern statistically based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling is also used in many other natural language processing applications such as
) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice, instead of having to look at the screen and keyboard.
there were accelerations and deceleration during the course of one observation. DTW has been applied to video, audio, and graphics – indeed, any data that can be turned into a linear representation can be analyzed with DTW.
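The idea of matching two sequences that differ in speed can be sketched with the classic dynamic-programming formulation of DTW. This is a minimal illustration, not a production recognizer: it assumes one-dimensional numeric sequences, uses the absolute difference as the local cost, and applies no warping-window restriction. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best DTW alignment between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")

    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b stays on the same element
                cost[i][j - 1],      # b advances while a stays on the same element
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    slow = [0, 0, 1, 1, 2, 2, 3, 3]  # the same pattern at half speed
    fast = [0, 1, 2, 3]
    print(dtw_distance(slow, fast))
```

In this example the slow and fast sequences trace the same pattern, so the warped alignment has zero cost even though the sequences differ in length. For speech, each element would typically be a feature vector for one short frame and the local cost a vector distance.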
to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information and combining it statically beforehand (the
is required for all HMM-based systems, and a typical n-gram language model often takes several gigabytes in memory making them impractical to deploy on mobile devices. Consequently, modern commercial ASR systems from
recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is, therefore, no need for the user to memorize a set of fixed command words.
, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of
, a Belgium-based speech recognition company, acquired several other companies, including Kurzweil Applied Intelligence in 1997 and Dragon Systems in 2000. The L&H speech technology was used in the
A good and accessible introduction to speech recognition technology and its history is provided by the general audience book "The Voice in the Machine. Building Computers That Understand Speech" by
, though it can be different distances for specific tasks; the set of possible transcriptions is, of course, pruned to maintain tractability. Efficient algorithms have been devised to rescore
operating system. L&H was an industry leader until an accounting scandal brought an end to the company in 2001. The speech technology from L&H was bought by ScanSoft which became
Much of the progress in the field is owed to the rapidly increasing capabilities of computers. At the end of the DARPA program in 1976, the best computer available to researchers was the
to normalize for a different speaker and recording conditions; for further speaker normalization, it might use vocal tract length normalization (VTLN) for male-female normalization and
systems. Despite the high level of integration with word processing in general personal computing, in the field of document production, ASR has not seen the expected increases in use.
, second edition published in 2004, and "Speech Processing: A Dynamic and Optimization-Oriented Approach" published in 2003 by Li Deng and Doug O'Shaughnessey. The updated textbook
End-to-end models jointly learn all the components of the speech recognizer. This is valuable since it simplifies the training process and deployment process. For example, a
, Connectionist Speech Recognition: A Hybrid Approach, The Kluwer International Series in Engineering and Computer Science; v. 247, Boston: Kluwer Academic Publishers, 1994.
e.g. the 10 digits "zero" to "nine" can be recognized essentially perfectly, but vocabulary sizes of 200, 5000 or 100000 may have error rates of 3%, 7%, or 45% respectively.
Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the
– Three Bell Labs researchers, Stephen Balashek, R. Biddulph, and K. H. Davis built a system called "Audrey" for single-speaker digit recognition. Their system located the
Lohrenz, Timo; Li, Zhengyang; Fingscheidt, Tim (14 July 2021). "Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition".
With continuous speech, naturally spoken sentences are used, so the speech is harder to recognize than either isolated or discontinuous speech.
From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in
applications. In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which is now available through
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (24 May 2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach.
Kriman, Samuel; Beliaev, Stanislav; Ginsburg, Boris; Huang, Jocelyn; Kuchaiev, Oleksii; Lavrukhin, Vitaly; Leary, Ryan; Li, Jason; Zhang, Yang (22 October 2019),
Wu, Haiping; Xiao, Bin; Codella, Noel; Liu, Mengchen; Dai, Xiyang; Yuan, Lu; Zhang, Lei (29 March 2021). "CvT: Introducing Convolutions to Vision Transformers".
. Substantial test and evaluation programs for speech recognition applications in helicopters have been carried out in the past decade, notably by the
With discontinuous speech, full sentences separated by silence are used, so the speech is easier to recognize, as it is with isolated speech.
deployed the Voice Recognition Call Processing service in 1992 to route telephone calls without the use of a human operator. The technology was developed by
(the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).
can be useful to acquire basic knowledge but may not be fully up to date (1993). Another good source can be "Statistical Methods for Speech Recognition" by
By this point, the vocabulary of the typical commercial speech recognition system was larger than the average human vocabulary. Raj Reddy's former student,
Deng, L.; Hassanein, K.; Elmasry, M. (1994). "Analysis of the correlation structure for a neural predictive model with application to speech recognition".
. When Mozilla redirected funding away from the project in 2020, it was forked by its original developers as Coqui STT using the same open-source license.
/ACM Transactions on Audio, Speech and Language Processing after merging with an ACM publication), Computer Speech and Language, and Speech Communication.
to map audio signals directly into words, produce word and phrase confidence scores very closely correlated with genuine listener intelligibility. In the
and a CTC layer. Jointly, the RNN-CTC model learns the pronunciation and acoustic model together; however, it is incapable of learning the language due to
) to rate these good candidates so that we may pick the best one according to this refined score. The set of candidates can be kept either as a list (the
NIPS Workshop: Deep Learning for Speech Recognition and Related Applications, Whistler, BC, Canada, Dec. 2009 (Organizers: Li Deng, Geoff Hinton, D. Yu).
A possible improvement to decoding is to keep a set of good candidates instead of just keeping the best candidate, and to use a better scoring function (
Dahl, George E.; Yu, Dong; Deng, Li; Acero, Alex (2012). "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition".
to be as low as that of 4 professional human transcribers working together on the same benchmark, which was funded by the IBM Watson speech team on the same task.
Deng, L.; Hinton, G.; Kingsbury, B. (2013). "New types of deep neural network learning for speech recognition and related applications: An overview".
Joshi, Raviraj; Singh, Anupam (May 2022). Malmasi, Shervin; Rokhlenko, Oleg; Ueffing, Nicola; Guy, Ido; Agichtein, Eugene; Kallumadi, Surya (eds.).
An alternative to CTC-based models is the attention-based model. Attention-based ASR models were introduced simultaneously by Chan et al. of
Morgan, Bourlard, Renals, Cohen, Franco (1993) "Hybrid neural network/hidden Markov model systems for continuous speech recognition. ICASSP/IJPRAI"
Ristea, Nicolae-Catalin; Ionescu, Radu Tudor; Khan, Fahad Shahbaz (20 June 2022). "SepTr: Separable Transformer for Audio Spectrogram Processing".
. Further research needs to be conducted to determine cognitive benefits for individuals whose AVMs have been treated using radiologic techniques.
Hinton, Geoffrey; Deng, Li; Yu, Dong; Dahl, George; Mohamed, Abdel-Rahman; Jaitly, Navdeep; Senior, Andrew; Vanhoucke, Vincent; Nguyen, Patrick;
, a telephone-based directory service. The recordings from GOOG-411 produced valuable data that helped Google improve its recognition systems.
toolkit is one place to start to both learn about speech recognition and to start experimenting. Another resource (free but copyrighted) is the
, and even allows the pilot to assign targets to his aircraft with two simple voice commands or to any of his wingmen with only five commands.
) but instead, knowing the expected word(s) in advance, it attempts to verify the correctness of the learner's pronunciation and ideally their
Assael, Yannis; Shillingford, Brendan; Whiteson, Shimon; de Freitas, Nando (5 November 2016). "LipNet: End-to-End Sentence-level Lipreading".
Sarangi, Susanta; Sahidullah, Md; Saha, Goutam (September 2020). "Optimization of data-driven filterbank for automatic speaker verification".
saying another word and input the wrong one. This gives them more work to do, because they have to spend extra time correcting the wrong word.
Chorowski, Jan; Jaitly, Navdeep (8 December 2016). "Towards better decoding and language model integration in sequence to sequence models".
techniques used in the best modern systems can be gained by paying attention to government sponsored evaluations such as those organised by
Li, Jason; Lavrukhin, Vitaly; Ginsburg, Boris; Leary, Ryan; Kuchaiev, Oleksii; Cohen, Jonathan M.; Nguyen, Huyen; Gadde, Ravi Teja (2019).
launched two CNN-CTC ASR models, Jasper and QuartzNet, with an overall performance WER of 3%. Similar to other deep learning applications,
; Kingsbury, Brian (2012). "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups".
Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017).
Wu, J.; Chan, C. (1993). "Isolated Word Recognition by Neural Network Models with Cross-Correlation Coefficients for Speech Dynamics".
(CEFR) assessment criteria for "overall phonological control", intelligibility outweighs formally correct pronunciation at all levels.
The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications (Human Factors and Ergonomics)
expanded on the work with extremely large datasets and demonstrated some commercial success in Chinese Mandarin and English. In 2016,
true "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.
wrote an open letter that was critical of and defunded speech recognition research. This defunding lasted until Pierce retired and
these speech dimensions and which do not. These data are essential to train ASR algorithms to assess L2 learners' intelligibility.
The performance of speech recognition systems is usually evaluated in terms of accuracy and speed. Accuracy is usually rated with
Adverse conditions – environmental noise (e.g. noise in a car or a factory) and acoustical distortions (e.g. echoes, room acoustics)
Some speech recognition systems require "training" (also called "enrollment") where an individual speaker reads text or isolated
Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K. J. (1989). "Phoneme recognition using time-delay neural networks".
in the late 1960s. Previous systems required users to pause after each word. Reddy's system issued spoken commands for playing
recognizing music as speech, or to make what sounds like one command to a human sound like a different command to the system.
In the 2000s DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002 and
Chung, Joon Son; Senior, Andrew; Vinyals, Oriol; Zisserman, Andrew (16 November 2016). "Lip Reading Sentences in the Wild".
2824: 9018: 7933: 7624: 7460: 6508: 5103: 3163: 1846:
e.g. Known word pronunciations or legal word sequences, which can compensate for errors or uncertainties at a lower level;
Common European framework of reference for languages learning, teaching, assessment: Companion volume with new descriptors
Compare "four" given as "F AO R" with the vowel AO as in "caught," to "row" given as "R OW" with the vowel OW as in "oat."
Forgrave, Karen E. "Assistive Technology: Empowering Students with Disabilities." Clearing House 75.3 (2002): 122–6. Web.
Hair, Adam; et al. (19 June 2018). "Apraxia world: A speech therapy game for children with speech sound disorders".
only 16% of the variability in word-level intelligibility can be explained by the presence of obvious mispronunciations.
Graves, Alex; Mohamed, Abdel-rahman; Hinton, Geoffrey (2013). "Speech recognition with deep recurrent neural networks".
Achievements and Challenges of Deep Learning: From Speech Analysis and Recognition To Language and Multimodal Processing
3993: 3590: 2949:"A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol" 9162: 9157: 7794: 7427: 7376: 6341:"Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction" 4721: 4369:; Morgan, N.; O'Shaughnessy, D. (2009). "Developments and Directions in Speech Recognition and Understanding, Part 1". 4294: 3053: 2494: 915:
non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
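The non-linear alignment described above can be illustrated with a minimal dynamic time warping sketch. It assumes one-dimensional numeric sequences and an absolute-difference local cost; real recognizers would compare feature vectors, and the function name `dtw_distance` is purely illustrative.

```python
def dtw_distance(a, b):
    """Return the minimal cumulative alignment cost between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest way to align a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # A step may advance in a, in b, or in both (the "warping").
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # The second sequence repeats one value, as if spoken more slowly.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because a step may stay on the same element of one sequence while advancing in the other, a stretched version of a pattern can align with the original at no extra cost.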
798:(so that phonemes with different left and right context would have different realizations as HMM states); it would use 17: 5920: 2854: 1735:
As mentioned earlier in this article, the accuracy of speech recognition may vary depending on the following factors:
to gather a large database of voices that would help build the free speech recognition project DeepSpeech (available free at
1362: 1224: 3675: 2981: 1452:
and simulation. In telephony systems, ASR is now being predominantly used in contact centers by integrating it with
's first effort at speech recognition came in 2007 after hiring some researchers from Nuance. The first product was
joined Apple where, in 1992, he helped develop a speech interface prototype for the Apple computer known as Casper.
. Results have been encouraging, and voice applications have included: control of communication radios, setting of
extended LAS to "Watch, Listen, Attend and Spell" (WLAS) to handle lip reading surpassing human-level performance.
1025: 6340: 6254: 9172: 8136: 7789: 5688: 5159: 4425: 2347: 2337: 1172: 732: 7155: 6884:
Gerbino, E.; Baggia, P.; Ciaramella, A.; Rullent, C. (1993). "Test and evaluation of a spoken dialogue system".
being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of
9177: 9073: 9013: 8611: 7784: 7529: 4303:[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing 4124: 1249:
and signing off on the document. Back-end or deferred speech recognition is where the provider dictates into a
931: 372: 311: 247: 6769:. IFIP the International Federation for Information Processing. Vol. 247. Springer US. pp. 375–388. 6701:
Tang, K. W.; Kamoua, Ridha; Sutan, Victor (2004). "Speech Recognition Technology for Disabilities Education".
Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in
Since 2014, there has been much research interest in "end-to-end" ASR. Traditional phonetic-based (i.e., all
6028:"Using automatic speech processing for foreign language pronunciation tutoring: Some issues and a prototype" 1377:
The problems of achieving high recognition accuracy under stress and noise are particularly relevant in the
originally licensed software from Nuance to provide speech recognition capability to its digital assistant
6643:"Using Speech Recognition Software to Increase Writing Fluency for Individuals with Physical Disabilities" 4345: 2587: 1463:. Speech is used mostly as a part of a user interface, for creating predefined or custom speech commands. 1017:(as of 2017) are deployed on the cloud and require a network connection as opposed to the device locally. 9147: 9099: 8395: 8321: 8091: 8076: 8048: 7913: 7908: 7483: 7185: 6582: 5766:
Chan, William; Zhang, Yu; Le, Quoc; Jaitly, Navdeep (10 October 2016). "Latent Sequence Decompositions".
Yu, D.; Deng, L. (2014). "Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer)".
and from Amira Learning. Automatic pronunciation assessment can also be used to help diagnose and treat
1000:-based model) approaches required separate components and training for the pronunciation, acoustic, and 8723: 8658: 8259: 7828: 7799: 7577: 6303: 6124: 5104:"Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition" 2472: 2437: 2417: 2292: 2253: 1601: 1148: 1091: 1078: 653: 629: 504: 5503: 3366:""There's No Data Like More Data": Automatic Speech Recognition and the Making of Algorithmic Culture" 2613: 9124: 8982: 8621: 8452: 8275: 7671: 7524: 4897: 4546:
Maners said IBM has worked on advancing speech recognition ... or on the floor of a noisy trade show.
Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets
Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe,
1219: 964: 937: 924: 345:– The IEEE Acoustics, Speech, and Signal Processing group held a conference in Newton, Massachusetts. 111:
such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"),
49: 7344:
Karat, Clare-Marie; Vergo, John; Nahamoo, David (2007). "Conversational Interface Technologies". In
7310:. Cambridge Studies in Natural Language Processing. Vol. XII–XIII. Cambridge University Press. 6715: 5200: 5072: 4884:
Vowel Classification for Computer based Visual Feedback for Speech Training for the Hearing Impaired
all participated in the program. This revived speech recognition research post John Pierce's letter.
9023: 8280: 8197: 8121: 7853: 7809: 7694: 7592: 6063:"Reading Coach in Immersive Reader plus new features coming to Reading Progress in Microsoft Teams" 5689:"Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition" 4793:"Optimisation of phonetic aware speech recognition through multi-objective evolutionary algorithms" 3258: 2377: 2372: 1557: 1503: 1483: 1266: 1180: 1137: 1041: 1037: 882: 641: 605: 581:
containing 260 hours of recorded conversations from over 500 speakers. The GALE program focused on
Ciaramella, Alberto. "A prototype performance evaluation report." Sundial workpackage 8000 (1993).
Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays and Johan Schalkwyk (September 2015): "
With isolated speech, single words are used, so the speech is easier to recognize.
Speech recognition in the JAS 39 Gripen aircraft: Adaptation to speech at different G-loads
5642:"Domain Adaptation of Low-Resource Target-Domain Models Using Well-Trained ASR Conformer Models" 2520: 2083: 359:, which since then has been a major venue for the publication of research on speech recognition. 9028: 8788: 8507: 8502: 7958: 7651: 7629: 7619: 7587: 7562: 6710: 5067: 1696: 1658: 1612: 799: 637: 582: 441: 440:
used HMM to recognize languages (both in software and in hardware specialized processors, e.g.
Amodei, Dario (2016). "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin".
Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (
1274:) are relatively minimal for people who are sighted and who can operate a keyboard and mouse. 9058: 9043: 9008: 8696: 8596: 8464: 7818: 5572:
Medeiros, Eduardo; Corado, Leonel; Rato, Luís; Quaresma, Paulo; Salgueiro, Pedro (May 2023).
5187: 3071:"ISCA Medalist: For leadership and extensive contributions to speech and language processing" 2457: 1664: 1576: 1271: 1176: 1033: 624:
In the early 2000s, speech recognition was still dominated by traditional approaches such as
522: 310:, speech recognition research seeking a minimum vocabulary size of 1,000 words. They thought 8926: 6946:"Letter Names Can Cause Confusion and Other Things to Know About Letter–Sound Relationships" 6798: 6425: 5724:
Bahdanau, Dzmitry (2016). "End-to-End Attention-based Large Vocabulary Speech Recognition".
Gong, Yuan; Chung, Yu-An; Glass, James (8 July 2021). "AST: Audio Spectrogram Transformer".
3881: 3825: 3779: 1369:
lead-in fighter trainer. These systems have produced word accuracy scores in excess of 98%.
One of the major issues relating to the use of speech recognition in healthcare is that the
649: 9078: 9033: 8479: 8424: 8270: 8265: 8171: 7847: 7823: 7676: 6925: 6434: 5574:"Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning" 5287: 4435:, Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber. 4378: 4173: 3441:
Billi, Roberto; Canavesio, Franco; Ciaramella, Alberto; Nebbia, Luciano (1 November 1995).
The key areas of growth were: vocabulary size, speaker independence, and processing speed.
Keynote talk: Recent Developments in Deep Neural Networks. ICASSP, 2013 (by Geoff Hinton).
book (and the accompanying HTK toolkit). For more recent and state-of-the-art techniques,
8: 8653: 8631: 8380: 8375: 8333: 8285: 8151: 8081: 8038: 7994: 7766: 7756: 7751: 7639: 7001:"Is it possible to control Amazon Alexa, Google Now using inaudible commands? Absolutely" 3228: 2521:"Speaker Independent Connected Speech Recognition- Fifth Generation Computer Corporation" 2452: 2427: 2422: 2382: 2222: 1564: 1491: 1347: 1295: 1218:
Assessing authentic listener intelligibility is essential for avoiding inaccuracies from
997: 984: 744: 686: 625: 598: 384: 276: 150: 145: 7407:"SpeeG2: A Speech- and Gesture-based Interface for Efficient Controller-free Text Entry" 6834:. SpringerBriefs in Electrical and Computer Engineering. Singapore: Springer Singapore. 5615:"A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data" 5542:
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
5291: 4930: 4382: 4177: 3140: 3022: 2688: 9038: 8161: 7898: 7661: 7644: 7502: 6907: 6728: 6662: 6151: 6004: 5901: 5853: 5843: 5820: 5792: 5767: 5746: 5725: 5649: 5546: 5511: 5482: 5461: 5410: 5238: 5141: 5004: 4959: 4823: 4773: 4667: 4404: 4324: 4233: 4189: 4100: 4076: 4052: 4028: 3972: 3951: 3863: 3837: 3807: 3393: 3365: 3309: 2758: 2700: 2674: 2569: 2234: 1336: 1156: 839: 751: 750:
signal. In a short time scale (e.g., 10 milliseconds), speech can be approximated as a
was the first person to take on continuous speech recognition as a graduate student at
154: 129: 7208: 4931:"Sequence labelling in structured domains with hierarchical recurrent neural networks" 3525: 2720:"Robust text-independent speaker identification using Gaussian mixture speaker models" 9104: 9092: 8896: 8548: 8419: 8412: 8166: 7878: 7686: 7597: 7423: 7391: 7372: 7353: 7330: 7311: 7216: 7061: 6897: 6843: 6829: 6778: 6732: 6666: 6394: 6390: 6141: 5996: 5905: 5893: 5857: 5810: 5667: 5595: 4827: 4815: 4671: 4475: 4314: 4223: 4193: 3855: 3799: 3458: 3397: 3385: 3301: 3195: 3049: 2973: 2750: 2742: 2704: 2639: 2559: 2202: 2174: 1570: 1546: 1538: 1415: 1328: 1250: 1164: 1070: 1066: 846: 775: 396: 265: 243: 41: 31: 7126:"A TensorFlow implementation of Baidu's DeepSpeech architecture: mozilla/DeepSpeech" 6911: 6355: 6155: 6008: 5242: 5145: 4328: 4237: 3867: 2573: 1506:
became an urgent early market for speech recognition. Speech recognition is used in
The improvement of mobile processor speeds has made speech recognition practical in
with 4 MB ram. It could take up to 100 minutes to decode just 30 seconds of speech.
in systems that have been trained on a specific person's voice or it can be used to
8849: 8839: 8646: 8440: 8390: 8385: 8328: 8316: 8043: 7928: 7903: 7704: 7607: 6889: 6835: 6770: 6720: 6654: 6616: 6455: 6351: 6133: 5986: 5976: 5883: 5824: 5802: 5663: 5659: 5622: 5585: 5525: 5521: 5295: 5230: 5133: 5077: 4996: 4865: 4807: 4777: 4763: 4755: 4713: 4657: 4471: 4394: 4386: 4306: 4298: 4215: 4181: 3847: 3811: 3791: 3454: 3377: 3313: 3293: 3144: 3026: 2963: 2762: 2734: 2692: 2551: 2447: 2432: 2412: 2392: 2299: 2265: 2198: 1606: 1428: 1418:
in order to consistently achieve performance improvements in operational settings.
1312: 1208: 1200: 1160: 779: 609: 586: 550: 487: 463: 322: 239: 93: 81: 45: 7055: 5626: 5008: 4408: 3885: 3317: 2185:
Transactions on Audio, Speech and Language Processing and since Sept 2014 renamed
demonstrated its 16-word "Shoebox" machine's speech recognition capability at the
8962: 8906: 8728: 8370: 8290: 8155: 8116: 8111: 7979: 7709: 7582: 7557: 7539: 7443: 6932: 6774: 6185:"Computer says no: Irish vet fails oral English test needed to stay in Australia" 5166: 5029:
Maas, Andrew L.; Le, Quoc V.; O'Neil, Tyler M.; Vinyals, Oriol; Nguyen, Patrick;
4508: 4432: 4421: 4352: 3892: 3851: 3775: 2218: 1778:
A speaker-independent system is intended for use by any speaker (more difficult).
intelligence applications of speech recognition, e.g. DARPA's EARS program and
364: 330: 6724: 6512: 5965:"Directions for the future of technology in pronunciation research and teaching" 5641: 4219: 3699:"The Power of Voice: A Conversation With The Head Of Google's Speech Technology" 3164:"The Acoustics, Speech, and Signal Processing Society. A Historical Perspective" 1361:
Speaker-independent systems are also being developed and are under test for the
to directly emit sub-word units which are more natural than English characters;
8936: 8901: 8891: 8716: 8474: 8300: 7863: 7843: 7567: 7303: 6893: 6658: 6308: 6027: 5300: 5275: 4811: 4613:"Microsoft researchers achieve new conversational speech recognition milestone" 4448:
Artificial Neural Networks and their Application to Speech/Sequence Recognition
4310: 3795: 3006: 2261: 1591: 1514: 1287: 1212: 1001: 723: 719: 669: 632:. Today, however, many aspects of speech recognition have been taken over by a 409: 380: 376: 261: 158: 117: 7452: 6839: 6398: 6389:. Language Policy Programme, Education Policy Division, Education Department, 5888: 5871: 5430: 5234: 5137: 4768: 4717: 3906:
An application of recurrent neural networks to discriminative keyword spotting
3196:"First-Hand:The Hidden Markov Model – Engineering and Technology History Wiki" 2696: 2555: 1339:
Gripen cockpit, Englund (2004) found recognition deteriorated with increasing
806:(MLLR) for more general speaker adaptation. The features would have so-called 9141: 8881: 8861: 8778: 8457: 8126: 7938: 7918: 7699: 7413:. 15th International Conference on Multimodal Interaction. Sydney, Australia. 7220: 6000: 5897: 5599: 5330:"Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR" 5000: 4819: 4594: 4185: 3759: 3586: 3389: 3305: 2977: 2746: 2206: 1550: 1184: 1143: 980: 973: 968:
capacity and thus the potential of modeling complex patterns of speech data.
954: 886: 682: 633: 618: 578: 495: 235: 165: 6642: 6137: 5614: 4698: 4390: 3614:
Automatic Speech Recognition – A Brief History of the Technology Development
Klatt, Dennis H. (1977). "Review of the ARPA speech understanding project".
2882:"Automatic speech recognition–a brief history of the technology development" 2754: 1427:
have to conduct with pilots in a real ATC situation. Speech recognition and
1315:. Of particular note have been the US program in speech recognition for the 963:
are also under investigation. A deep feedforward neural network (DNN) is an
8967: 8798: 8213: 8106: 7724: 7345: 7005: 5981: 5964: 4982:"Modular Construction of Time-Delay Neural Networks for Speech Recognition" 4662: 4366: 4161: 3859: 2387: 2342: 2273: 1862:
Compute spectral-domain features of the speech (with a Fourier transform);
1196: 1188: 1099: 1082: 755: 701:
recognition, also called voice recognition, was clearly differentiated from
658: 368: 356: 6216:"Australian ex-news reader with English degree fails robot's English test" 5806: 4791:
Bird, Jordan J.; Wanner, Elizabeth; Ekárt, Anikó; Faria, Diego R. (2020).
Automatic speech recognition–a brief history of the technology development
Commercial cloud-based speech recognition APIs are broadly available.
When computing accuracy, the word recognition rate (WRR) is also used. The formula is:
842:(MMI), minimum classification error (MCE), and minimum phone error (MPE). 9063: 8834: 8743: 8738: 8360: 8338: 8063: 7943: 7656: 7549: 7497: 7327:
Robustness in Automatic Speech Recognition: Fundamentals and Applications
Artificial Intelligence and Innovations 2007: From Theory to Applications
Proceedings of the 17th ACM Conference on Interaction Design and Children
Ellis Booker (14 March 1994). "Voice recognition enters the mainstream".
4341: 4299:"A real-time recurrent error propagation network word recognition system" 3671: 2546:
P. Nguyen (2010). "Automatic classification of speaker characteristics".
1653: 1245: 960: 203: 85: 57: 53: 7125: 4699:"Edit-Distance of Weighted Automata: General Definitions and Algorithms" 4399: 1807:
e.g. a querying application may dismiss the hypothesis "The apple is red."
of a short time window of speech and decorrelating the spectrum using a
8957: 8916: 8911: 8824: 8733: 8641: 8553: 8533: 7666: 6886:
IEEE International Conference on Acoustics Speech and Signal Processing
6246: 5991: 5621:. Dublin, Ireland: Association for Computational Linguistics: 244–249. 5590: 5573: 5314:
L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton (2010)
5081: 5033:(2012). "Recurrent Neural Networks for Noise Reduction in Robust ASR". 2968: 2580: 2402: 2296: 2281: 2257: 1866:
computed every 10 ms, with one 10 ms section called a frame;
1675: 1641: 1635: 1618: 1581: 1542: 1460: 1407: 1387: 1378: 1324: 1014: 870: 526: 518: 508: 499: 100: 5789:
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
5374:"Towards End-to-End Speech Recognition with Recurrent Neural Networks" 4869: 3030: 2738: 896: 8952: 8921: 8819: 8663: 8626: 8563: 8517: 8512: 8497: 7534: 7151: 6094:"Schools Are Using Voice Technology to Teach Reading. Is It Helping?" 5030: 4784: 4759: 4617: 3948: 3148: 2322: 1875: 1871: 1534: 1510: 1445: 1391: 1291: 1192: 818:(HLDA); or might skip the delta and delta-delta coefficients and use 272: 257: 149:
refers to identifying the speaker, rather than what they are saying.
For telephone speech the sampling rate is 8000 samples per second;
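The front-end steps listed here (8000 samples per second, one frame every 10 ms, spectral features via a Fourier transform) can be sketched as follows. This is a minimal illustration using only the standard library: the function names are hypothetical, the frames are non-overlapping as stated in the text, and a naive discrete Fourier transform stands in for the FFT and windowing a production system would use.

```python
import cmath
import math


def split_into_frames(samples, sample_rate=8000, frame_ms=10):
    """Cut a signal into consecutive frames of frame_ms milliseconds each."""
    frame_len = sample_rate * frame_ms // 1000  # 80 samples at 8 kHz
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


if __name__ == "__main__":
    # Synthetic test signal (an assumption): 0.1 s of a 1000 Hz tone.
    tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(800)]
    frames = split_into_frames(tone)
    spectrum = magnitude_spectrum(frames[0])
    peak_bin = spectrum.index(max(spectrum))
    # Each bin spans sample_rate / frame_len = 100 Hz.
    print(len(frames), "frames; peak at", peak_bin * 100, "Hz")
```

With 80-sample frames at 8 kHz, each spectral bin covers 100 Hz, so the frame length directly sets the frequency resolution of the features.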
1163:. Pronunciation assessment does not determine unknown speech (as in 8854: 8686: 8009: 7989: 7974: 7953: 7923: 7868: 7833: 7714: 7406: 7369:
The Voice in the Machine. Building Computers That Understand Speech
6763: 5848: 5797: 5772: 5751: 5730: 5654: 5551: 5540: 5516: 5487: 5466: 5415: 4105: 4081: 4057: 4033: 3977: 3956: 3557:"Speech Recognition Through the Decades: How We Ended Up With Siri" 3381: 2918:"Speech Recognition Through the Decades: How We Ended Up With Siri" 2679: 2467: 2462: 1775:
A speaker-dependent system is intended for use by a single speaker.
Speech to text (transcription of speech into text, real time video
1623: 1398:) in the UK. Work in France has included speech recognition in the 1355: 1107: 1058: 1029: 771: 594: 193: 169: 161:
or verify the identity of a speaker as part of a security process.
125: 6478:"Eurofighter Typhoon – The world's most advanced fighter aircraft" 5842:, Conference on Empirical Methods in Natural Language Processing, 5328:
Tüske, Zoltán; Golik, Pavel; Schlüter, Ralf; Ney, Hermann (2014).
Acoustical signals are structured into a hierarchy of units, e.g.
achieving 6 times better performance than human experts. In 2019,
8977: 8814: 8768: 8691: 8591: 8586: 8538: 8146: 8004: 7984: 7858: 7602: 7517: 5921:"Pronunciation accuracy and intelligibility of non-native speech" 5687:
Chan, William; Jaitly, Navdeep; Le, Quoc; Vinyals, Oriol (2016).
Proceedings of the Fifth Workshop on E-Commerce and NLP (ECNLP 5)
Recent Advances in Deep Learning for Speech Research at Microsoft
4559:"Voice Recognition To Ease Travel Bookings: Business Travel News" 4355:. IEEE Transactions on Acoustics, Speech, and Signal Processing." 3474: 2664: 2269: 2221:
and Martin presents the basics and the state of the art for ASR.
e.g. constraints may be semantic, rejecting "The apple is angry."
Vocabulary is hard to recognize if it contains confusing letters:
Prolonged use of speech recognition software in conjunction with
948: 795: 783: 483: 466:
with up to 4096 words support, of which only 64 could be held in
112: 5216:"Machine Learning Paradigms for Speech Recognition: An Overview" 5111:
NIPS Workshop on Deep Learning and Unsupervised Feature Learning
Santiago Fernandez, Alex Graves, and Jürgen Schmidhuber (2007).
Speech recognition is a multi-leveled pattern recognition task.
computer-aided pronunciation teaching (CAPT) when combined with
8992: 8972: 8844: 8636: 7512: 7507: 7447: 4898:"Dimensionality Reduction Methods for HMM Phonetic Recognition" 2548:
International Conference on Communications and Electronics 2010
of spoken language into text by computers. It is also known as
Advanced algorithms and architectures for speech understanding
The History of Automatic Speech Recognition Evaluation at NIST
Binary Coding of Speech Spectrograms Using a Deep Auto-encoder
Deng L., Li, J., Huang, J., Yao, K., Yu, D., Seide, F. et al.
IEEE Transactions on Pattern Analysis and Machine Intelligence
IEEE Transactions on Acoustics, Speech, and Signal Processing
systems, and control of an automated target handover system.
"Interactive voice technology at work: The CSELT experience"
/Eurospeech, and the IEEE ASRU. Conferences in the field of
Security, including usage with other biometric scanners for
weapons release parameters, and controlling flight display.
allowed language models to use multiple length n-grams, and
Survey of the state of the art in human language technology
"Jasper: An End-to-End Convolutional Neural Acoustic Model"
IEEE Transactions on Audio, Speech, and Language Processing
IEEE Transactions on Audio, Speech, and Language Processing
Just a few years ago, speech recognition was limited to ...
Transactions on Speech and Audio Processing (later renamed
$$WRR = 1 - WER = \frac{n - s - d - i}{n} = \frac{h - i}{n}$$
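The word recognition rate above can be computed once the substitution, deletion and insertion counts are known. The following is a minimal sketch, assuming whitespace-tokenized transcripts and a plain minimum-edit-distance alignment; the function names are hypothetical and not taken from any particular toolkit, and production scoring tools typically add text normalization that is omitted here.

```python
def word_error_counts(reference, hypothesis):
    """Return (s, d, i, n): substitutions, deletions, insertions and the
    number of reference words, from a minimum-edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cell[r][h] = (total_edits, s, d, i) for ref[:r] aligned with hyp[:h]
    cell = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for r in range(1, len(ref) + 1):
        cell[r][0] = (r, 0, r, 0)          # all reference words deleted
    for h in range(1, len(hyp) + 1):
        cell[0][h] = (h, 0, 0, h)          # all hypothesis words inserted
    for r in range(1, len(ref) + 1):
        for h in range(1, len(hyp) + 1):
            t, s, d, i = cell[r - 1][h - 1]
            if ref[r - 1] == hyp[h - 1]:
                diagonal = (t, s, d, i)            # correct word, no edit
            else:
                diagonal = (t + 1, s + 1, d, i)    # substitution
            t, s, d, i = cell[r - 1][h]
            deletion = (t + 1, s, d + 1, i)
            t, s, d, i = cell[r][h - 1]
            insertion = (t + 1, s, d, i + 1)
            # Keep the candidate with the fewest total edits.
            cell[r][h] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, s, d, i = cell[len(ref)][len(hyp)]
    return s, d, i, len(ref)


def word_error_rate(reference, hypothesis):
    """WER = (s + d + i) / n; the reference must contain at least one word."""
    s, d, i, n = word_error_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + i) / n


def word_recognition_rate(reference, hypothesis):
    """WRR = 1 - WER = (h - i) / n, with h = n - (s + d)."""
    return 1.0 - word_error_rate(reference, hypothesis)


# word_error_counts("the cat sat on the mat", "the cat sit on mat")
# -> (1, 1, 0, 6), so WER = 2/6 and WRR = 4/6
```

Note that WER can exceed 1 when the hypothesis contains many insertions, in which case WRR becomes negative.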
has shown benefits to short-term-memory restrengthening in
is used in education such as for spoken language learning.
"Coqui, a startup providing open speech tech for everyone"
, audiovisual speaker recognition and speaker adaptation.
. EARS funded the collection of the Switchboard telephone
dried up for several years when, in 1969, the influential
"Speech recognition in schools: An update from the field"
"Researchers fine-tune F-35 pilot-aircraft speech system"
(2015). "Deep learning in neural networks: An overview".
Huang, Xuedong; Baker, James; Reddy, Raj (January 2014).
to listeners, sometimes along with often inconsequential
The use of deep feedforward (non-recurrent) networks for
Technology And Persons With Disabilities Conference 2000
O’Brien, Mary Grantham; et al. (31 December 2018).
International Journal of Foundations of Computer Science
). Rescoring is usually done by trying to minimize the
"Māori are trying to save their language from Big Tech"
Xuedong Huang; James Baker; Raj Reddy (January 2014).
Automatic Speech Recognition: A Deep Learning Approach
to capture speech dynamics and in addition, might use
– Dragon Dictate, a consumer product released in 1990
"Attack Targets Automatic Speech Recognition Systems"
Loukina, Anastassia; et al. (6 September 2015),
"Improvements in voice recognition software increase"
Alex Graves, Santiago Fernandez, Faustino Gomez, and
Benesty, Jacob; Sondhi, M. M.; Huang, Yiteng (2008).
, EMNLP, and HLT, are beginning to include papers on
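The formula to compute the word error rate (WER) is:
$$WER = \frac{s + d + i}{n}$$
where s is the number of substitutions, d is the number of deletions, i is the number of insertions, and n is the number of words in the reference.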
(GALE). Four teams participated in the EARS program:
"GitHub - tensorflow/docs: TensorFlow documentation"
"Overcoming Communication Barriers in the Classroom"
Phoneme recognition using time-delay neural networks
Blechman, R. O.; Blechman, Nicholas (23 June 2008).
Common European Framework of Reference for Languages
. National Center for Technology Innovation. 2010.
2022 IEEE Spoken Language Technology Workshop (SLT)
S. A. Zahorian, A. M. Zimmer, and F. Meng, (2002) "
Books like "Fundamentals of Speech Recognition" by
Dynamic time warping (DTW)-based speech recognition
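Dynamic time warping measures the similarity of two sequences that may differ in timing or speed, such as two utterances of the same word spoken at different rates. The following is a minimal sketch of the classic dynamic-programming recurrence, assuming one-dimensional feature sequences and an absolute-difference local cost; the function name is hypothetical, and a real recognizer would compare vectors of acoustic features per frame.

```python
import math


def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])   # assumed local distance
            acc[i][j] = local + min(
                acc[i - 1][j],       # a advances, b repeats its frame
                acc[i][j - 1],       # b advances, a repeats its frame
                acc[i - 1][j - 1],   # both advance
            )
    return acc[n][m]


# A time-stretched copy aligns with zero cost:
# dtw_distance([1, 2, 3], [1, 2, 2, 3]) -> 0.0
```

In a template-matching recognizer, an unknown utterance would be compared against each stored template this way and labelled with the closest one.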
7418:Woelfel, Matthias; McDonough, John (26 May 2009). 5838:El Kheir, Yassine; et al. (21 October 2023), 4256:"Scientists See Promise in Deep-Learning Programs" 3043: 2119: 2062: 1934: 1739:Error rates increase as the vocabulary size grows: 713: 7343: 7247:"Why you should move from DeepSpeech to coqui.ai" 6924:National Institute of Standards and Technology. " 6247:"The English test that ruined thousands of lives" 5765: 5640:Sukhadia, Vrunda N.; Umesh, S. (9 January 2023). 4646:"Minimum Bayes-risk automatic speech recognition" 4074: 3998:Advances in Neural Information Processing Systems 2718:Reynolds, Douglas; Rose, Richard (January 1995). 2588:"British English definition of voice recognition" 608:has made use of a type of speech recognition for 476:– a recognizer from Kurzweil Applied Intelligence 286:Around this time Soviet researchers invented the 80:). It incorporates knowledge and research in the 56:and technologies that enable the recognition and 27:Automatic conversion of spoken language into text 9139: 7685: 7417: 6831:Robustness-Related Issues in Speaker Recognition 5264: 4159: 4148: 3970: 3477:"A Historical Perspective of Speech Recognition" 3282:"A historical perspective of speech recognition" 3129:The Journal of the Acoustical Society of America 3095: 2727:IEEE Transactions on Speech and Audio Processing 2632: 1448:and is becoming more widespread in the field of 945:demonstrated improved performance in this area. 672:was introduced during the later part of 2009 by 7482: 6700: 6581:Follensbee, Bob; McCloskey-Dale, Susan (2000). 5744: 5321: 5308: 5123: 4426:Untersuchungen zu dynamischen neuronalen Netzen 3923:"Google voice search: faster and more accurate" 3818: 3279: 1819:Constraints are often represented by grammar. 416: 375:. A decade later, at CMU, Raj Reddy's students 6641:Garrett, Jennifer Tumlin; et al. (2011). 6350:. INTERSPEECH 2022. ISCA. pp. 3493–3497. 6191:. Australian Associated Press. 8 August 2017. 
5962: 5932:International Speech Communication Association 5431:"LipNet: How easy do you think lipreading is?" 4415: 4365:Baker, J.; Li Deng; Glass, J.; Khudanpur, S.; 4050: 3554: 3005: 2915: 2295:applications. It can be activated through the 2240:The most recent book on speech recognition is 1439: 1317:Advanced Fighter Technology Integration (AFTI) 1259:American Recovery and Reinvestment Act of 2009 949:Deep feedforward and recurrent neural networks 8229: 7468: 6550: 6548: 6245:Main, Ed; Watson, Richard (9 February 2022). 5869: 5840:Automatic Pronunciation Assessment — A Review 5639: 5177: 5175: 4918: 3696: 3579: 2077:is the number of correctly recognized words: 1854:Digitize the speech that we want to recognize 1813:e.g. Syntactic; rejecting "Red is apple the." 1020:The first attempt at end-to-end ASR was with 774:coefficients, which are obtained by taking a 8243: 5918: 4591: 4287: 3752: 3501: 3011:Journal of the Acoustical Society of America 2717: 1785:Isolated, Discontinuous or continuous speech 1717:Isolated, discontinuous or continuous speech 1482:Students who are physically disabled have a 816:heteroscedastic linear discriminant analysis 7405:Signer, Beat; Hoste, Lode (December 2013). 7404: 7324: 5837: 5270: 5060:Foundations and Trends in Signal Processing 5046: 5044: 4896:Hu, Hongbing; Zahorian, Stephen A. (2010). 4344:, Hanazawa, Hinton, Shikano, Lang. (1989) " 4293: 3824: 2911: 2909: 2205:and "Spoken Language Processing (2001)" by 2148: 1842:Each level provides additional constraints; 1466: 1386:, which would reduce acoustic noise in the 1199:. Pronunciation assessment is also used in 421:The 1980s also saw the introduction of the 250:(NTT), while working on speech recognition. 8236: 8222: 7475: 7461: 7366: 6545: 6338: 6278:"13 Words That Can Be Pronounced Two Ways" 6244: 6122: 5870:Isaacs, Talia; Harding, Luke (July 2017). 5612: 5172: 5152: 5101: 4895: 4644:Goel, Vaibhava; Byrne, William J. (2000). 
4438: 4335: 4278: 3916: 3914: 3874: 2999: 1691:(WER), whereas speed is measured with the 1402:. There has also been much useful work in 1335:Working with Swedish pilots flying in the 865:approach) or as a subset of the models (a 317:would be key to making progress in speech 9153:Automatic identification and data capture 7390:. Springer Science & Business Media. 6714: 6703:Journal of Educational Technology Systems 6511:. United States Air Force. Archived from 6339:Tu, Zehai; Ma, Ning; Barker, Jon (2022). 5990: 5980: 5887: 5847: 5796: 5771: 5750: 5729: 5653: 5589: 5550: 5515: 5486: 5465: 5414: 5299: 5095: 5071: 5053:"Deep Learning: Methods and Applications" 4963: 4767: 4661: 4643: 4398: 4358: 4104: 4080: 4056: 4032: 3976: 3955: 3841: 3048:. Springer Science & Business Media. 2967: 2678: 2545: 6828:Zheng, Thomas Fang; Li, Lantian (2017). 6613:"Speech recognition for disabled people" 6060: 6025: 5969:Journal of Second Language Pronunciation 5723: 5356: 5041: 4455: 4139: 3908:. Proceedings of ICANN (2), pp. 220–229. 3727: 3611: 3330: 3068: 2906: 2879: 2500:Timeline of speech and voice recognition 2272:launched the open source project called 2252:In terms of freely available resources, 1549:used speech recognition technology from 1239: 208:source-filter model of speech production 196:in the power spectrum of each utterance. 107:Speech recognition applications include 6998: 6827: 6799:"What is real-time captioning? | DO-IT" 6749:. The Planetary Society. Archived from 6647:Journal of Special Education Technology 6640: 6506: 6435:Stockholm Royal Institute of Technology 6423: 6276:Joyce, Katy Spratte (24 January 2023). 6213: 6091: 5956: 4253: 3911: 3770: 3768: 3161: 3009:(1969). "Whither speech recognition?". 
2333:Applications of artificial intelligence 2280:), using Google's open source platform 1528: 1444:ASR is now commonplace in the field of 992:End-to-end automatic speech recognition 766:-dimensional real-valued vectors (with 738: 601:is now supported in over 30 languages. 543:Global Autonomous Language Exploitation 408:– Dragon Systems, founded by James and 14: 9140: 7385: 7206: 7013:from the original on 2 September 2017. 5408: 5371: 5346:from the original on 21 December 2016. 5213: 5181: 5050: 4979: 4855: 4444: 3895:. Proceedings of ICML'06, pp. 369–376. 3046:Springer Handbook of Speech Processing 2594:from the original on 16 September 2011 2143: 1714:Speaker dependence versus independence 826:-based projection followed perhaps by 30:For the human linguistic concept, see 9168:History of human–computer interaction 8217: 7456: 7279:from the original on 9 September 2024 7227:from the original on 9 September 2024 7188:from the original on 9 September 2024 7158:from the original on 9 September 2024 7132:from the original on 9 September 2024 7074:from the original on 31 January 2018. 7053: 7047: 6956:from the original on 9 September 2024 6856:from the original on 9 September 2024 6809:from the original on 9 September 2024 6673:from the original on 9 September 2024 6405:from the original on 9 September 2024 6365:from the original on 9 September 2024 6284:from the original on 9 September 2024 6275: 6257:from the original on 9 September 2024 6226:from the original on 9 September 2024 6195:from the original on 9 September 2024 6165:from the original on 9 September 2024 6104:from the original on 9 September 2024 6073:from the original on 9 September 2024 6042:from the original on 9 September 2024 5941:from the original on 9 September 2024 5705:from the original on 9 September 2024 5252:from the original on 9 September 2024 5117: 5091:from the original on 22 October 2014. 
4837:from the original on 9 September 2024 4696: 4625:from the original on 9 September 2024 4569:from the original on 9 September 2024 4266:from the original on 30 November 2012 4127:from the original on 9 September 2024 4094: 4092: 4070: 4068: 4046: 4044: 4008:from the original on 9 September 2024 3652:from the original on 19 November 2011 3536:from the original on 21 December 2016 3126: 2802:from the original on 25 February 2014 2646:from the original on 19 February 2013 2527:from the original on 11 November 2013 2443:Speech recognition software for Linux 1935:{\displaystyle WER={(s+d+i) \over n}} 1824:and laughter) and limited vocabulary. 1695:. Other measures of accuracy include 1022:Connectionist Temporal Classification 505:speech recognition group at Microsoft 9074:Generative adversarial network (GAN) 7934:Simple Knowledge Organization System 7325:Junqua, J.-C.; Haton, J.-P. (1995). 6444:from the original on 2 October 2008. 4947:from the original on 15 August 2017. 3765: 3593:from the original on 5 February 2014 3567:from the original on 13 January 2017 3483:from the original on 20 January 2015 3108:from the original on 20 January 2015 3077:from the original on 24 January 2018 2946: 2940: 2928:from the original on 3 November 2018 2792:"Speaker Identification (WhisperID)" 2620:from the original on 3 December 2011 2130: 1768:Speaker dependence vs. independence: 1553:in the Mars Microphone on the Lander 1294:patients who have been treated with 804:maximum likelihood linear regression 7352:. Lawrence Erlbaum Associates Inc. 7057:Fundamentals of Speaker Recognition 6630:Friends International Support Group 6593:from the original on 21 August 2006 6319:from the original on 15 August 2010 5102:Yu, D.; Deng, L.; Dahl, G. (2010). 4925:Fernandez, Santiago; Graves, Alex; 4511:," Interspeech, September 2014 (by 3504:"When Cole talks, computers listen" 3345:from the original on 17 August 2014 3331:Juang, B. H.; Rabiner, Lawrence R. 
3235:from the original on 28 August 2017 2987:from the original on 9 October 2022 2894:from the original on 17 August 2014 2880:Juang, B. H.; Rabiner, Lawrence R. 2485:List of speech recognition software 2310:List of speech recognition software 2291:supports speech recognition on all 1350:, currently in service with the UK 1153:computer-assisted language learning 959:Deep neural networks and denoising 836:maximum likelihood linear transform 412:, was one of IBM's few competitors. 24: 7295: 6999:Claburn, Thomas (25 August 2017). 6988:from the original on 23 July 2017. 6562:from the original on 13 April 2014 6466:from the original on 1 March 2017. 6061:Tholfsen, Mike (9 February 2023). 6032:Language Learning & Technology 5474: 5441:from the original on 27 April 2017 5018:from the original on 29 June 2016. 4727:from the original on 18 March 2012 4451:(Ph.D. thesis). McGill University. 4254:Markoff, John (23 November 2012). 4089: 4065: 4041: 3782:(1997). "Long Short-Term Memory". 3697:Jason Kincaid (13 February 2011). 3626:from the original on 9 August 2017 3555:Melanie Pinola (2 November 2011). 3518: 3363: 3176:from the original on 9 August 2017 2916:Melanie Pinola (2 November 2011). 2861:from the original on 9 August 2018 2614:"voice recognition, definition of" 2495:Outline of artificial intelligence 1960:is the number of word references. 1281: 1203:, for example in products such as 1024:(CTC)-based systems introduced by 918: 830:linear discriminant analysis or a 25: 9189: 7949:Thesaurus (information retrieval) 7437: 7207:Coffey, Donavyn (28 April 2021). 7035:from the original on 3 March 2018 6747:"Projects: Planetary Microphones" 6556:"Speech Recognition for Learning" 6533:from the original on 25 July 2013 6507:Schutte, John (15 October 2007). 6214:Ferrier, Tracey (9 August 2017). 6026:Eskenazi, Maxine (January 1999). 4914:from the original on 6 July 2012. 
4678:from the original on 25 July 2011 3740:from the original on 27 June 2015 3709:from the original on 21 July 2015 3678:from the original on 11 July 2017 3648:. Tech.pinions. 10 October 2011. 3206:from the original on 3 April 2018 2835:from the original on 4 April 2019 2772:from the original on 8 March 2014 2308:For more software resources, see 2177:. Important journals include the 1956:is the number of insertions, and 1711:Vocabulary size and confusability 1644:(e.g. vehicle Navigation Systems) 1573:listing in audiovisual production 1307:High-performance fighter aircraft 1225:end-to-end reinforcement learning 1118: 9112: 9111: 9091: 7265: 7239: 7200: 7170: 7144: 7118: 7088: 7078: 7017: 6992: 6968: 6938: 6918: 6877: 6868: 6821: 6791: 6757: 6694: 6685: 6634: 6623: 6605: 6574: 6519: 6500: 6488:from the original on 11 May 2013 6470: 6448: 6417: 6377: 6332: 6309:"The CMU Pronouncing Dictionary" 6296: 6269: 6238: 6207: 6177: 6116: 6085: 6054: 6019: 5912: 5863: 5831: 5780: 5759: 5738: 5717: 5680: 4800:Expert Systems with Applications 4532:. 27 August 2002. Archived from 3612:Juang, B.H.; Rabiner, Lawrence. 2590:. Macmillan Publishers Limited. 2408:Multimedia information retrieval 1948:is the number of substitutions, 1839:, Words, Phrases, and Sentences; 1422:Training air traffic controllers 1036:in 2014. The model consisted of 983:is to do away with hand-crafted 754:. Speech can be thought of as a 321:, but this later proved untrue. 7386:Pirani, Giancarlo, ed. (2013). 6356:10.21437/Interspeech.2022-10408 6092:Banerji, Olina (7 March 2023). 
5633: 5606: 5565: 5532: 5495: 5453: 5423: 5402: 5365: 5350: 5207: 5035:Proceedings of Interspeech 2012 5022: 4973: 4951: 4889: 4876: 4849: 4739: 4690: 4637: 4605: 4585: 4551: 4518: 4371:IEEE Signal Processing Magazine 4166:IEEE Signal Processing Magazine 4113: 4020: 3985: 3964: 3942: 3721: 3690: 3664: 3638: 3605: 3548: 3526:"ACT/Apricot - Apricot history" 3495: 3434: 3412:"History of Speech Recognition" 3404: 3357: 3324: 3273: 3255:"Pioneering Speech Recognition" 3247: 3188: 3155: 3120: 3089: 3062: 3037: 2873: 2847: 2348:Audio-visual speech recognition 2338:Articulatory speech recognition 1113: 976:, in recent overview articles. 893:verifying certain assumptions. 733:statistical machine translation 714:Models, methods, and algorithms 92:fields. The reverse process is 9024:Recurrent neural network (RNN) 9014:Differentiable neural computer 7530:Natural language understanding 7329:. Kluwer Academic Publishers. 6393:. February 2018. p. 136. 5664:10.1109/SLT54892.2023.10023233 5526:10.21437/Interspeech.2019-1819 5359:Speech and Language Processing 4650:Computer Speech & Language 3257:. 7 March 2012. Archived from 2825:"Obituaries: Stephen Balashek" 2817: 2784: 2711: 2658: 2539: 2513: 2215:Speech and Language Processing 2111: 2099: 2030: 2006: 1923: 1905: 1804:Task and language constraints 1723:Read versus spontaneous speech 1682: 1590:: Speech recognition computer 1372: 1234: 932:audiovisual speech recognition 758:for many stochastic purposes. 373:Institute for Defense Analysis 248:Nippon Telegraph and Telephone 238:method, was first proposed by 13: 1: 9069:Variational autoencoder (VAE) 9029:Long short-term memory (LSTM) 8296:Computational learning theory 8054:Optical character recognition 3730:"THE COMPUTERS ARE LISTENING" 3502:Kevin McKean (8 April 1980). 3479:. Communications of the ACM. 3364:Li, Xiaochang (1 July 2023). 2506: 2490:List of emerging technologies 2353:Automatic Language Translator 2268:toolkit can be used. 
In 2017 1720:Task and language constraints 1513:, such as voicemail to text, 1367:Alenia Aermacchi M-346 Master 1327:), the program in France for 979:One fundamental principle of 652:in 1997. LSTM RNNs avoid the 507:in 1993. Raj Reddy's student 455:Two practical products were: 367:developed the mathematics of 308:Speech Understanding Research 9049:Convolutional neural network 7747:Multi-document summarization 7367:Pieraccini, Roberto (2012). 6775:10.1007/978-0-387-74161-1_41 6529:. MassMATCH. 18 March 2010. 6067:Techcommunity Education Blog 4476:10.1016/0893-6080(94)90027-2 3852:10.1016/j.neunet.2014.09.003 3728:Froomkin, Dan (5 May 2015). 3459:10.1016/0167-6393(95)00030-R 3416:Dragon Medical Transcription 2956:Found. Trends Signal Process 2209:etc., "Computer Speech", by 1952:is the number of deletions, 1475:Students who are blind (see 1131: 889:represented themselves as a 832:global semi-tied co variance 417:Practical speech recognition 293: 206:developed and published the 62:automatic speech recognition 7: 9044:Multilayer perceptron (MLP) 8077:Latent Dirichlet allocation 8049:Natural language generation 7914:Machine-readable dictionary 7909:Linguistic Linked Open Data 7484:Natural language processing 6725:10.2190/K6K8-78K2-59Y7-R9R2 6424:Englund, Christine (2004). 5627:10.18653/v1/2022.ecnlp-1.28 5214:Deng, L.; Li, Xiao (2013). 5051:Deng, Li; Yu, Dong (2014). 4220:10.1109/ICASSP.2013.6639344 3994:"Attention is All you Need" 2315: 2247: 2163:natural language processing 1730: 1630:multi-factor authentication 1440:Telephony and other domains 1301: 1085:and Bahdanau et al. of the 335:Stanford Research Institute 183: 70:computer speech recognition 10: 9194: 9120:Artificial neural networks 9034:Gated recurrent unit (GRU) 8260:Differentiable programming 7829:Explicit semantic analysis 7578:Deep linguistic processing 7420:Distant Speech Recognition 7348:; Jacko, Julie A. (eds.). 6894:10.1109/ICASSP.1993.319250 6888:. pp. 135–138 vol.2. 
6659:10.1177/016264341102600104 5872:"Pronunciation assessment" 5648:. IEEE. pp. 295–301. 5301:10.4249/scholarpedia.32832 4812:10.1016/j.eswa.2020.113402 4311:10.1109/ICASSP.1992.225833 4305:. pp. 617–620 vol.1. 3796:10.1162/neco.1997.9.8.1735 2473:Windows Speech Recognition 2438:Speech interface guideline 2418:Phonetic search technology 2254:Carnegie Mellon University 2120:{\displaystyle h=n-(s+d).} 1602:Interactive voice response 1149:computer-aided instruction 1135: 1092:Carnegie Mellon University 1079:Carnegie Mellon University 1032:and Navdeep Jaitly of the 952: 922: 900: 742: 654:vanishing gradient problem 630:artificial neural networks 628:combined with feedforward 604:In the United States, the 175: 29: 9163:User interface techniques 9158:Computational linguistics 9087: 9001: 8945: 8874: 8807: 8679: 8579: 8572: 8526: 8490: 8453:Artificial neural network 8433: 8309: 8276:Automatic differentiation 8249: 8180: 8135: 8090: 8062: 8022: 7967: 7889: 7877: 7808: 7765: 7737: 7672:Word-sense disambiguation 7548: 7525:Computational linguistics 7490: 7096:"Common Voice by Mozilla" 6840:10.1007/978-981-10-3238-7 6433:(Masters thesis thesis). 6220:The Sydney Morning Herald 5889:10.1017/S0261444817000118 5357:Jurafsky, Daniel (2016). 5235:10.1109/TASL.2013.2244083 5138:10.1109/TASL.2011.2134090 4718:10.1142/S0129054103002114 3672:"Switchboard-1 Release 2" 3286:Communications of the ACM 2697:10.1016/j.dsp.2020.102795 2667:Digital Signal Processing 2556:10.1109/ICCE.2010.5670700 1707:vary with the following: 1038:recurrent neural networks 965:artificial neural network 925:Artificial neural network 834:transform (also known as 661:to all smartphone users. 
565:, and a team composed of 153:can simplify the task of 50:computational linguistics 8281:Neuromorphic engineering 8244:Differentiable computing 8198:Natural Language Toolkit 8122:Pronunciation assessment 8024:Automatic identification 7854:Latent semantic analysis 7810:Distributional semantics 7695:Compound-term processing 7593:Named-entity recognition 7411:Proceedings of ICMI 2013 7054:Beigi, Homayoon (2011). 5165:9 September 2024 at the 5001:10.1162/neco.1989.1.1.39 4351:25 February 2021 at the 4186:10.1109/MSP.2012.2205597 3891:9 September 2024 at the 3587:"Ray Kurzweil biography" 2947:Gray, Robert M. (2010). 2857:. androidauthority.net. 2378:Fluency Voice Technology 2373:Dragon NaturallySpeaking 2192: 2149:Conferences and journals 1874:(how strong is it), and 1650:(digital speech-to-text) 1609:, including mobile email 1484:Repetitive strain injury 1467:People with disabilities 1267:Electronic Health Record 1138:Pronunciation assessment 1042:conditional independence 883:finite state transducers 881:represented as weighted 812:delta-delta coefficients 692: 642:recurrent neural network 606:National Security Agency 575:University of Washington 536: 490:and others at Bell Labs. 232:Linear predictive coding 134:pronunciation assessment 9054:Residual neural network 8470:Artificial Intelligence 8102:Automated essay scoring 8072:Document classification 7739:Automatic summarization 6138:10.1145/3202185.3202733 4806:. Elsevier BV: 113402. 4391:10.1109/MSP.2009.932166 3229:"James Baker interview" 2855:"IBM-Shoebox-front.jpg" 1560:with speech recognition 1477:Blindness and education 1169:automatic transcription 891:finite state transducer 852:finite state transducer 729:document classification 589:broadcast news speech. 
151:Recognizing the speaker 9173:Computer accessibility 7959:Universal Dependencies 7652:Terminology extraction 7635:Semantic decomposition 7630:Semantic role labeling 7620:Part-of-speech tagging 7588:Information extraction 7573:Coreference resolution 7563:Collocation extraction 7273:"Type with your voice" 7060:. New York: Springer. 6931:8 October 2013 at the 6348:Proc. Interspeech 2022 5982:10.1075/jslp.17001.obr 5934:, pp. 1917–1921, 5791:. pp. 3444–3453. 5195:Cite journal requires 4663:10.1006/csla.2000.0138 4563:BusinessTravelNews.com 3506:. Sarasota Journal. AP 3338:(Report). p. 10. 2121: 2064: 1936: 1697:Single Word Error Rate 1613:Multimodal interaction 1087:University of Montreal 800:cepstral normalization 683:Gaussian mixture model 638:Long short-term memory 363:During the late 1960s 306:funded five years for 146:speaker identification 9178:Machine learning task 9009:Neural Turing machine 8597:Human image synthesis 7720:Sentence segmentation 6313:www.speech.cs.cmu.edu 5807:10.1109/CVPR.2017.367 5372:Graves, Alex (2014). 4980:Waibel, Alex (1989). 4004:. Curran Associates. 3589:. KurzweilAINetwork. 2640:"The Mailbag LG #114" 2458:Subtitle (captioning) 2122: 2065: 1937: 1577:Automatic translation 1272:controlled vocabulary 1240:Medical documentation 1034:University of Toronto 1006:n-gram language model 854:, or FST, approach). 515:Lernout & Hauspie 503:went on to found the 109:voice user interfaces 9100:Computer programming 9079:Graph neural network 8654:Text-to-video models 8632:Text-to-image models 8480:Large language model 8465:Scientific computing 8271:Statistical manifold 8266:Information geometry 8172:Voice user interface 7883:datasets and corpora 7824:Document-term matrix 7677:Word-sense induction 6132:. pp. 119–131. 5930:, Dresden, Germany: 4938:Proceedings of IJCAI 4507:5 March 2021 at the 4431:6 March 2015 at the 3447:Speech Communication 2642:. Linuxgazette.net. 2550:. pp. 147–152. 2368:Cache language model 2358:Automotive head unit 2211:Manfred R. 
Schroeder 2084: 1970: 1887: 1701:Command Success Rate 1588:Hands-free computing 1529:Further applications 1104:University of Oxford 1051:University of Oxford 903:Dynamic time warping 875:Levenshtein distance 739:Hidden Markov models 626:hidden Markov models 563:Cambridge University 288:dynamic time warping 90:computer engineering 8446:In-context learning 8286:Pattern recognition 8152:Interactive fiction 8082:Pachinko allocation 8039:Speech segmentation 7995:Google Ngram Viewer 7767:Machine translation 7757:Text simplification 7752:Sentence extraction 7640:Semantic similarity 7166:– via GitHub. 7154:. 9 November 2019. 7140:– via GitHub. 7128:. 9 November 2019. 7106:on 27 February 2020 7031:. 31 January 2018. 6753:on 27 January 2012. 6515:on 20 October 2007. 6482:www.eurofighter.com 6460:Eurofighter Typhoon 6280:. Reader's Digest. 5437:. 4 November 2016. 5292:2015SchpJ..1032832S 5272:Schmidhuber, Jürgen 4927:Schmidhuber, Jürgen 4445:Bengio, Y. (1991). 4383:2009ISPM...26...75B 4178:2012ISPM...29...82H 3826:Schmidhuber, Jürgen 3758:Herve Bourlard and 3320:on 8 December 2023. 3261:on 19 February 2015 3202:. 12 January 2015. 3141:1977ASAJ...62.1345K 3023:1969ASAJ...46.1049P 2689:2020DSP...10402795S 2616:. WebFinance, Inc. 2453:Speech verification 2428:Speaker recognition 2423:Speaker diarisation 2383:Google Voice Search 2223:Speaker recognition 2144:Further information 1668:as working examples 1659:Tom Clancy's EndWar 1638:, Court reporting ) 1565:emotion recognition 1519:captioned telephone 1492:Learning disability 1348:Eurofighter Typhoon 985:feature engineering 745:Hidden Markov model 687:hidden Markov model 599:Google Voice Search 559:Univ. 
of Pittsburgh 462:– was released the 385:hidden Markov model 277:Stanford University 246:and Shuzo Saito of 9148:Speech recognition 9039:Echo state network 8927:Jürgen Schmidhuber 8622:Facial recognition 8617:Speech recognition 8527:Software libraries 8162:Question answering 8034:Speech recognition 7899:Corpus linguistics 7879:Language resources 7662:Textual entailment 7645:Sentiment analysis 6803:www.washington.edu 5591:10.3390/fi15050159 5510:. pp. 71–75. 5390:on 10 January 2017 5082:10.1561/2000000039 4989:Neural Computation 4769:10338.dmlcz/135496 4697:Mohri, M. (2002). 4621:. 21 August 2017. 4536:on 23 October 2018 3882:Jürgen Schmidhuber 3784:Neural Computation 2969:10.1561/2000000036 2796:Microsoft Research 2235:Roberto Pieraccini 2117: 2060: 1932: 1726:Adverse conditions 840:mutual information 792:context dependency 752:stationary process 650:Jürgen Schmidhuber 155:translating speech 130:direct voice input 38:Speech recognition 18:Speech Recognition 9135: 9134: 8897:Stephen Grossberg 8870: 8869: 8211: 8210: 8167:Virtual assistant 8092:Computer-assisted 8018: 8017: 7775:Computer-assisted 7733: 7732: 7725:Word segmentation 7687:Text segmentation 7625:Semantic analysis 7613:Syntactic parsing 7598:Ontology learning 7444:Speech Technology 7397:978-3-642-84341-9 7371:. The MIT Press. 7359:978-0-8058-5870-9 7336:978-0-7923-9646-8 7317:978-0-521-59277-2 7251:Mozilla Discourse 7100:voice.mozilla.org 7067:978-0-387-77591-3 6849:978-981-10-3237-0 6784:978-0-387-74160-4 6391:Council of Europe 5876:Language Teaching 5816:978-1-5386-0457-1 5673:979-8-3503-9690-4 4870:10.1109/34.244678 4864:(11): 1174–1185. 4229:978-1-4799-0356-6 3422:on 13 August 2015 3031:10.1121/1.1911801 3017:(48): 1049–1051. 
2739:10.1109/89.365379 2565:978-1-4244-7055-6 2203:Frederick Jelinek 2175:speech processing 2131:Security concerns 2058: 2037: 1930: 1672:Virtual assistant 1584:(Legal discovery) 1547:Mars Polar Lander 1539:space exploration 1416:speech technology 1363:F-35 Lightning II 1270:from a list or a 1251:digital dictation 1071:domain adaptation 1067:transfer learning 847:Viterbi algorithm 776:Fourier transform 724:language modeling 720:acoustic modeling 670:acoustic modeling 266:James L. Flanagan 244:Nagoya University 222:1962 World's Fair 141:voice recognition 42:interdisciplinary 32:Speech perception 16:(Redirected from 9185: 9125:Machine learning 9115: 9114: 9095: 8850:Action selection 8840:Self-driving car 8647:Stable Diffusion 8612:Speech synthesis 8577: 8576: 8441:Machine learning 8317:Gradient descent 8238: 8231: 8224: 8215: 8214: 8188:Formal semantics 8137:Natural language 8044:Speech synthesis 8026:and data capture 7929:Semantic network 7904:Lexical resource 7887: 7886: 7705:Lexical analysis 7683: 7682: 7608:Semantic parsing 7477: 7470: 7463: 7454: 7453: 7433: 7414: 7401: 7382: 7363: 7340: 7321: 7289: 7288: 7286: 7284: 7269: 7263: 7262: 7260: 7258: 7243: 7237: 7236: 7234: 7232: 7204: 7198: 7197: 7195: 7193: 7174: 7168: 7167: 7165: 7163: 7148: 7142: 7141: 7139: 7137: 7122: 7116: 7115: 7113: 7111: 7102:. Archived from 7092: 7086: 7082: 7076: 7075: 7051: 7045: 7044: 7042: 7040: 7021: 7015: 7014: 6996: 6990: 6989: 6984:. 6 March 2016. 6972: 6966: 6965: 6963: 6961: 6942: 6936: 6922: 6916: 6915: 6881: 6875: 6872: 6866: 6865: 6863: 6861: 6825: 6819: 6818: 6816: 6814: 6795: 6789: 6788: 6761: 6755: 6754: 6743: 6737: 6736: 6718: 6698: 6692: 6689: 6683: 6682: 6680: 6678: 6638: 6632: 6627: 6621: 6620: 6619:on 4 April 2008. 6615:. 
4555: 4549: 4548: 4543: 4541: 4530:TechRepublic.com 4522: 4516: 4498: 4489: 4486: 4480: 4479: 4459: 4453: 4452: 4442: 4436: 4419: 4413: 4412: 4402: 4362: 4356: 4339: 4333: 4332: 4291: 4285: 4282: 4276: 4275: 4273: 4271: 4251: 4242: 4241: 4214:. p. 8599. 4207: 4198: 4197: 4157: 4146: 4143: 4137: 4136: 4134: 4132: 4123:. Li Deng Site. 4117: 4111: 4110: 4108: 4096: 4087: 4086: 4084: 4072: 4063: 4062: 4060: 4048: 4039: 4038: 4036: 4024: 4018: 4017: 4015: 4013: 3989: 3983: 3982: 3980: 3968: 3962: 3961: 3959: 3946: 3940: 3938: 3936: 3934: 3925:. Archived from 3918: 3909: 3902: 3896: 3878: 3872: 3871: 3845: 3822: 3816: 3815: 3790:(8): 1735–1780. 3772: 3763: 3756: 3750: 3749: 3747: 3745: 3725: 3719: 3718: 3716: 3714: 3694: 3688: 3687: 3685: 3683: 3668: 3662: 3661: 3659: 3657: 3642: 3636: 3635: 3633: 3631: 3625: 3618: 3609: 3603: 3602: 3600: 3598: 3583: 3577: 3576: 3574: 3572: 3552: 3546: 3545: 3543: 3541: 3522: 3516: 3515: 3513: 3511: 3499: 3493: 3492: 3490: 3488: 3472: 3463: 3462: 3438: 3432: 3431: 3429: 3427: 3418:. Archived from 3408: 3402: 3401: 3361: 3355: 3354: 3352: 3350: 3344: 3337: 3328: 3322: 3321: 3316:. Archived from 3277: 3271: 3270: 3268: 3266: 3251: 3245: 3244: 3242: 3240: 3225: 3216: 3215: 3213: 3211: 3192: 3186: 3185: 3183: 3181: 3175: 3168: 3162:Rabiner (1984). 3159: 3153: 3152: 3149:10.1121/1.381666 3135:(6): 1345–1366. 3124: 3118: 3117: 3115: 3113: 3093: 3087: 3086: 3084: 3082: 3066: 3060: 3059: 3041: 3035: 3034: 3003: 2997: 2996: 2994: 2992: 2986: 2971: 2953: 2944: 2938: 2937: 2935: 2933: 2913: 2904: 2903: 2901: 2899: 2893: 2886: 2877: 2871: 2870: 2868: 2866: 2851: 2845: 2844: 2842: 2840: 2831:. 22 July 2012. 2821: 2815: 2814: 2809: 2807: 2788: 2782: 2781: 2779: 2777: 2771: 2724: 2715: 2709: 2708: 2682: 2662: 2656: 2655: 2653: 2651: 2636: 2630: 2629: 2627: 2625: 2610: 2604: 2603: 2601: 2599: 2584: 2578: 2577: 2543: 2537: 2536: 2534: 2532: 2523:. Fifthgen.com. 
2517: 2448:Speech synthesis 2433:Speech analytics 2413:Origin of speech 2393:Keyword spotting 2199:Lawrence Rabiner 2126: 2124: 2123: 2118: 2069: 2067: 2066: 2061: 2059: 2054: 2043: 2038: 2033: 2004: 1941: 1939: 1938: 1933: 1931: 1926: 1903: 1693:real time factor 1607:Mobile telephony 1313:fighter aircraft 1209:speech disorders 1201:reading tutoring 1161:accent reduction 780:cosine transform 610:keyword spotting 549:, a team led by 498:, developed the 488:Lawrence Rabiner 464:Apricot Portable 425:language model. 383:began using the 240:Fumitada Itakura 128:(usually termed 94:speech synthesis 82:computer science 46:computer science 21: 9193: 9192: 9188: 9187: 9186: 9184: 9183: 9182: 9138: 9137: 9136: 9131: 9083: 8997: 8963:Google DeepMind 8941: 8907:Geoffrey Hinton 8866: 8803: 8729:Project Debater 8675: 8573:Implementations 8568: 8522: 8486: 8429: 8371:Backpropagation 8305: 8291:Tensor calculus 8245: 8242: 8212: 8207: 8176: 8156:Syntax guessing 8138: 8131: 8117:Predictive text 8112:Grammar checker 8093: 8086: 8058: 8025: 8014: 7980:Bank of English 7963: 7891: 7882: 7873: 7804: 7761: 7729: 7681: 7583:Distant reading 7558:Argument mining 7544: 7540:Text processing 7486: 7481: 7440: 7430: 7398: 7379: 7360: 7337: 7318: 7304:Mariani, Joseph 7298: 7296:Further reading 7293: 7292: 7282: 7280: 7271: 7270: 7266: 7256: 7254: 7245: 7244: 7240: 7230: 7228: 7205: 7201: 7191: 7189: 7176: 7175: 7171: 7161: 7159: 7150: 7149: 7145: 7135: 7133: 7124: 7123: 7119: 7109: 7107: 7094: 7093: 7089: 7083: 7079: 7068: 7052: 7048: 7038: 7036: 7023: 7022: 7018: 6997: 6993: 6974: 6973: 6969: 6959: 6957: 6944: 6943: 6939: 6933:Wayback Machine 6923: 6919: 6904: 6882: 6878: 6873: 6869: 6859: 6857: 6850: 6826: 6822: 6812: 6810: 6797: 6796: 6792: 6785: 6762: 6758: 6745: 6744: 6740: 6716:10.1.1.631.3736 6699: 6695: 6690: 6686: 6676: 6674: 6639: 6635: 6628: 6624: 6611: 6610: 6606: 6596: 6594: 6579: 6575: 6565: 6563: 6554: 6553: 6546: 6536: 6534: 6525: 6524: 6520: 6505: 6501: 6491: 6489: 6476: 6475: 
6471: 6454: 6453: 6449: 6441: 6430: 6422: 6418: 6408: 6406: 6383: 6382: 6378: 6368: 6366: 6362: 6343: 6337: 6333: 6322: 6320: 6307: 6301: 6297: 6287: 6285: 6274: 6270: 6260: 6258: 6243: 6239: 6229: 6227: 6212: 6208: 6198: 6196: 6183: 6182: 6178: 6168: 6166: 6162: 6148: 6129: 6121: 6117: 6107: 6105: 6090: 6086: 6076: 6074: 6059: 6055: 6045: 6043: 6024: 6020: 5961: 5957: 5944: 5942: 5938: 5923: 5917: 5913: 5868: 5864: 5836: 5832: 5817: 5785: 5781: 5764: 5760: 5743: 5739: 5722: 5718: 5708: 5706: 5702: 5691: 5685: 5681: 5674: 5638: 5634: 5611: 5607: 5578:Future Internet 5570: 5566: 5557: 5555: 5537: 5533: 5500: 5496: 5479: 5475: 5458: 5454: 5444: 5442: 5429: 5428: 5424: 5407: 5403: 5393: 5391: 5387: 5376: 5370: 5366: 5355: 5351: 5343: 5332: 5326: 5322: 5313: 5309: 5276:"Deep Learning" 5269: 5265: 5255: 5253: 5249: 5218: 5212: 5208: 5196: 5194: 5185: 5184: 5180: 5173: 5169:. ICASSP, 2013. 5167:Wayback Machine 5157: 5153: 5122: 5118: 5106: 5100: 5096: 5088: 5073:10.1.1.691.3679 5055: 5049: 5042: 5027: 5023: 5015: 4984: 4978: 4974: 4956: 4952: 4944: 4933: 4923: 4919: 4911: 4900: 4894: 4890: 4881: 4877: 4854: 4850: 4840: 4838: 4834: 4795: 4789: 4785: 4744: 4740: 4730: 4728: 4724: 4701: 4695: 4691: 4681: 4679: 4642: 4638: 4628: 4626: 4611: 4610: 4606: 4590: 4586: 4572: 4570: 4557: 4556: 4552: 4539: 4537: 4524: 4523: 4519: 4509:Wayback Machine 4500:Keynote talk: " 4499: 4492: 4487: 4483: 4464:Neural Networks 4460: 4456: 4443: 4439: 4433:Wayback Machine 4422:Sepp Hochreiter 4420: 4416: 4363: 4359: 4353:Wayback Machine 4340: 4336: 4321: 4292: 4288: 4283: 4279: 4269: 4267: 4252: 4245: 4230: 4208: 4201: 4158: 4149: 4144: 4140: 4130: 4128: 4119: 4118: 4114: 4097: 4090: 4073: 4066: 4049: 4042: 4025: 4021: 4011: 4009: 3990: 3986: 3969: 3965: 3947: 3943: 3932: 3930: 3929:on 9 March 2016 3921: 3919: 3912: 3903: 3899: 3893:Wayback Machine 3879: 3875: 3830:Neural Networks 3823: 3819: 3776:Sepp Hochreiter 3773: 3766: 3757: 3753: 3743: 3741: 3726: 3722: 3712: 3710: 3695: 3691: 3681: 
3679: 3670: 3669: 3665: 3655: 3653: 3644: 3643: 3639: 3629: 3627: 3623: 3616: 3610: 3606: 3596: 3594: 3585: 3584: 3580: 3570: 3568: 3553: 3549: 3539: 3537: 3524: 3523: 3519: 3509: 3507: 3500: 3496: 3486: 3484: 3473: 3466: 3439: 3435: 3425: 3423: 3410: 3409: 3405: 3362: 3358: 3348: 3346: 3342: 3335: 3329: 3325: 3298:10.1145/2500887 3278: 3274: 3264: 3262: 3253: 3252: 3248: 3238: 3236: 3227: 3226: 3219: 3209: 3207: 3194: 3193: 3189: 3179: 3177: 3173: 3166: 3160: 3156: 3125: 3121: 3111: 3109: 3094: 3090: 3080: 3078: 3067: 3063: 3056: 3042: 3038: 3004: 3000: 2990: 2988: 2984: 2951: 2945: 2941: 2931: 2929: 2914: 2907: 2897: 2895: 2891: 2884: 2878: 2874: 2864: 2862: 2853: 2852: 2848: 2838: 2836: 2829:The Star-Ledger 2823: 2822: 2818: 2805: 2803: 2790: 2789: 2785: 2775: 2773: 2769: 2722: 2716: 2712: 2663: 2659: 2649: 2647: 2638: 2637: 2633: 2623: 2621: 2612: 2611: 2607: 2597: 2595: 2586: 2585: 2581: 2566: 2544: 2540: 2530: 2528: 2519: 2518: 2514: 2509: 2504: 2318: 2250: 2195: 2151: 2146: 2133: 2085: 2082: 2081: 2044: 2042: 2005: 2003: 1971: 1968: 1967: 1904: 1902: 1888: 1885: 1884: 1799: 1733: 1689:word error rate 1685: 1597:Home automation 1545:, etc.) 
NASA's 1531: 1469: 1450:computer gaming 1442: 1424: 1414:and in overall 1400:Puma helicopter 1375: 1309: 1304: 1288:word processors 1284: 1282:Therapeutic use 1242: 1237: 1205:Microsoft Teams 1173:intelligibility 1155:(CALL), speech 1140: 1134: 1121: 1116: 1108:Google DeepMind 1059:Google DeepMind 1030:Google DeepMind 994: 957: 951: 938:Neural networks 927: 921: 919:Neural networks 905: 899: 828:heteroscedastic 747: 741: 716: 697:By early 2010s 695: 674:Geoffrey Hinton 646:Sepp Hochreiter 539: 419: 331:Carnegie Mellon 296: 186: 178: 118:word processors 35: 28: 23: 22: 15: 12: 11: 5: 9191: 9181: 9180: 9175: 9170: 9165: 9160: 9155: 9150: 9133: 9132: 9130: 9129: 9128: 9127: 9122: 9109: 9108: 9107: 9102: 9088: 9085: 9084: 9082: 9081: 9076: 9071: 9066: 9061: 9056: 9051: 9046: 9041: 9036: 9031: 9026: 9021: 9016: 9011: 9005: 9003: 8999: 8998: 8996: 8995: 8990: 8985: 8980: 8975: 8970: 8965: 8960: 8955: 8949: 8947: 8943: 8942: 8940: 8939: 8937:Ilya Sutskever 8934: 8929: 8924: 8919: 8914: 8909: 8904: 8902:Demis Hassabis 8899: 8894: 8892:Ian Goodfellow 8889: 8884: 8878: 8876: 8872: 8871: 8868: 8867: 8865: 8864: 8859: 8858: 8857: 8847: 8842: 8837: 8832: 8827: 8822: 8817: 8811: 8809: 8805: 8804: 8802: 8801: 8796: 8791: 8786: 8781: 8776: 8771: 8766: 8761: 8756: 8751: 8746: 8741: 8736: 8731: 8726: 8721: 8720: 8719: 8709: 8704: 8699: 8694: 8689: 8683: 8681: 8677: 8676: 8674: 8673: 8668: 8667: 8666: 8661: 8651: 8650: 8649: 8644: 8639: 8629: 8624: 8619: 8614: 8609: 8604: 8599: 8594: 8589: 8583: 8581: 8574: 8570: 8569: 8567: 8566: 8561: 8556: 8551: 8546: 8541: 8536: 8530: 8528: 8524: 8523: 8521: 8520: 8515: 8510: 8505: 8500: 8494: 8492: 8488: 8487: 8485: 8484: 8483: 8482: 8475:Language model 8472: 8467: 8462: 8461: 8460: 8450: 8449: 8448: 8437: 8435: 8431: 8430: 8428: 8427: 8425:Autoregression 8422: 8417: 8416: 8415: 8405: 8403:Regularization 8400: 8399: 8398: 8393: 8388: 8378: 8373: 8368: 8366:Loss functions 8363: 8358: 8353: 8348: 8343: 8342: 8341: 8331: 8326: 8325: 8324: 8313: 
8311: 8307: 8306: 8304: 8303: 8301:Inductive bias 8298: 8293: 8288: 8283: 8278: 8273: 8268: 8263: 8255: 8253: 8247: 8246: 8241: 8240: 8233: 8226: 8218: 8209: 8208: 8206: 8205: 8200: 8195: 8190: 8184: 8182: 8178: 8177: 8175: 8174: 8169: 8164: 8159: 8149: 8143: 8141: 8139:user interface 8133: 8132: 8130: 8129: 8124: 8119: 8114: 8109: 8104: 8098: 8096: 8088: 8087: 8085: 8084: 8079: 8074: 8068: 8066: 8060: 8059: 8057: 8056: 8051: 8046: 8041: 8036: 8030: 8028: 8020: 8019: 8016: 8015: 8013: 8012: 8007: 8002: 7997: 7992: 7987: 7982: 7977: 7971: 7969: 7965: 7964: 7962: 7961: 7956: 7951: 7946: 7941: 7936: 7931: 7926: 7921: 7916: 7911: 7906: 7901: 7895: 7893: 7884: 7875: 7874: 7872: 7871: 7866: 7864:Word embedding 7861: 7856: 7851: 7844:Language model 7841: 7836: 7831: 7826: 7821: 7815: 7813: 7806: 7805: 7803: 7802: 7797: 7795:Transfer-based 7792: 7787: 7782: 7777: 7771: 7769: 7763: 7762: 7760: 7759: 7754: 7749: 7743: 7741: 7735: 7734: 7731: 7730: 7728: 7727: 7722: 7717: 7712: 7707: 7702: 7697: 7691: 7689: 7680: 7679: 7674: 7669: 7664: 7659: 7654: 7648: 7647: 7642: 7637: 7632: 7627: 7622: 7617: 7616: 7615: 7610: 7600: 7595: 7590: 7585: 7580: 7575: 7570: 7568:Concept mining 7565: 7560: 7554: 7552: 7546: 7545: 7543: 7542: 7537: 7532: 7527: 7522: 7521: 7520: 7515: 7505: 7500: 7494: 7492: 7488: 7487: 7480: 7479: 7472: 7465: 7457: 7451: 7450: 7439: 7438:External links 7436: 7435: 7434: 7429:978-0470517048 7428: 7415: 7402: 7396: 7383: 7378:978-0262016858 7377: 7364: 7358: 7341: 7335: 7322: 7316: 7302:Cole, Ronald; 7297: 7294: 7291: 7290: 7264: 7238: 7199: 7169: 7143: 7117: 7087: 7077: 7066: 7046: 7016: 6991: 6967: 6937: 6917: 6902: 6876: 6867: 6848: 6820: 6790: 6783: 6756: 6738: 6693: 6684: 6633: 6622: 6604: 6573: 6544: 6518: 6499: 6469: 6447: 6416: 6376: 6331: 6295: 6268: 6237: 6206: 6176: 6146: 6115: 6084: 6053: 6018: 5975:(2): 182–207. 5955: 5911: 5882:(3): 347–366. 
5862: 5830: 5815: 5779: 5758: 5737: 5716: 5679: 5672: 5632: 5605: 5564: 5531: 5494: 5473: 5452: 5422: 5401: 5364: 5349: 5320: 5318:. Interspeech. 5307: 5263: 5206: 5197:|journal= 5171: 5151: 5116: 5094: 5040: 5021: 4972: 4950: 4917: 4888: 4875: 4848: 4783: 4754:(3): 328–339. 4738: 4712:(6): 957–982. 4689: 4656:(2): 115–135. 4636: 4604: 4598:. p. 45. 4584: 4550: 4517: 4490: 4481: 4470:(2): 331–339. 4454: 4437: 4414: 4357: 4334: 4319: 4286: 4277: 4260:New York Times 4243: 4228: 4199: 4147: 4138: 4112: 4088: 4064: 4040: 4019: 3984: 3963: 3941: 3910: 3897: 3873: 3817: 3780:J. Schmidhuber 3764: 3751: 3720: 3689: 3663: 3637: 3604: 3578: 3547: 3530:actapricot.org 3517: 3494: 3464: 3453:(3): 263–271. 3433: 3403: 3382:10.1086/725132 3356: 3323: 3272: 3246: 3217: 3187: 3154: 3119: 3102:The New Yorker 3088: 3069:John Makhoul. 3061: 3055:978-3540491255 3054: 3036: 3007:John R. Pierce 2998: 2962:(4): 203–303. 2939: 2905: 2872: 2846: 2816: 2783: 2710: 2657: 2631: 2605: 2579: 2564: 2538: 2511: 2510: 2508: 2505: 2503: 2502: 2497: 2492: 2487: 2481: 2480: 2476: 2475: 2470: 2465: 2460: 2455: 2450: 2445: 2440: 2435: 2430: 2425: 2420: 2415: 2410: 2405: 2400: 2395: 2390: 2385: 2380: 2375: 2370: 2365: 2360: 2355: 2350: 2345: 2340: 2335: 2330: 2325: 2319: 2317: 2314: 2249: 2246: 2194: 2191: 2150: 2147: 2145: 2142: 2132: 2129: 2128: 2127: 2116: 2113: 2110: 2107: 2104: 2101: 2098: 2095: 2092: 2089: 2071: 2070: 2057: 2053: 2050: 2047: 2041: 2036: 2032: 2029: 2026: 2023: 2020: 2017: 2014: 2011: 2008: 2002: 1999: 1996: 1993: 1990: 1987: 1984: 1981: 1978: 1975: 1929: 1925: 1922: 1919: 1916: 1913: 1910: 1907: 1901: 1898: 1895: 1892: 1864: 1863: 1856: 1855: 1852: 1844: 1843: 1840: 1829: 1828: 1825: 1817: 1816: 1815: 1814: 1811: 1808: 1796: 1795: 1794: 1793: 1787: 1786: 1782: 1781: 1780: 1779: 1776: 1770: 1769: 1765: 1764: 1763: 1762: 1755: 1754: 1750: 1749: 1748: 1747: 1741: 1740: 1732: 1729: 1728: 1727: 1724: 1721: 1718: 1715: 1712: 1684: 1681: 1680: 1679: 1669: 1651: 1645: 1639: 1632: 1626: 
1621: 1615: 1610: 1604: 1599: 1594: 1592:user interface 1585: 1579: 1574: 1567: 1561: 1554: 1530: 1527: 1515:relay services 1468: 1465: 1441: 1438: 1423: 1420: 1374: 1371: 1365:(JSF) and the 1308: 1305: 1303: 1300: 1283: 1280: 1241: 1238: 1236: 1233: 1136:Main article: 1133: 1130: 1124:recognition. 1120: 1119:In-car systems 1117: 1115: 1112: 1002:language model 993: 990: 953:Main article: 950: 947: 923:Main article: 920: 917: 901:Main article: 898: 895: 887:edit distances 743:Main article: 740: 737: 715: 712: 706:their voice". 694: 691: 636:method called 538: 535: 492: 491: 477: 471: 446: 445: 434:back-off model 418: 415: 414: 413: 410:Janet M. Baker 402: 401: 397:Fred Jelinek's 381:Janet M. Baker 361: 360: 346: 339: 338: 295: 292: 270: 269: 251: 225: 211: 197: 185: 182: 177: 174: 74:speech-to-text 52:that develops 26: 9: 6: 4: 3: 2: 9190: 9179: 9176: 9174: 9171: 9169: 9166: 9164: 9161: 9159: 9156: 9154: 9151: 9149: 9146: 9145: 9143: 9126: 9123: 9121: 9118: 9117: 9110: 9106: 9103: 9101: 9098: 9097: 9094: 9090: 9089: 9086: 9080: 9077: 9075: 9072: 9070: 9067: 9065: 9062: 9060: 9057: 9055: 9052: 9050: 9047: 9045: 9042: 9040: 9037: 9035: 9032: 9030: 9027: 9025: 9022: 9020: 9017: 9015: 9012: 9010: 9007: 9006: 9004: 9002:Architectures 9000: 8994: 8991: 8989: 8986: 8984: 8981: 8979: 8976: 8974: 8971: 8969: 8966: 8964: 8961: 8959: 8956: 8954: 8951: 8950: 8948: 8946:Organizations 8944: 8938: 8935: 8933: 8930: 8928: 8925: 8923: 8920: 8918: 8915: 8913: 8910: 8908: 8905: 8903: 8900: 8898: 8895: 8893: 8890: 8888: 8885: 8883: 8882:Yoshua Bengio 8880: 8879: 8877: 8873: 8863: 8862:Robot control 8860: 8856: 8853: 8852: 8851: 8848: 8846: 8843: 8841: 8838: 8836: 8833: 8831: 8828: 8826: 8823: 8821: 8818: 8816: 8813: 8812: 8810: 8806: 8800: 8797: 8795: 8792: 8790: 8787: 8785: 8782: 8780: 8779:Chinchilla AI 8777: 8775: 8772: 8770: 8767: 8765: 8762: 8760: 8757: 8755: 8752: 8750: 8747: 8745: 8742: 8740: 8737: 8735: 8732: 8730: 8727: 8725: 8722: 8718: 8715: 8714: 8713: 8710: 8708: 8705: 
8703: 8700: 8698: 8695: 8693: 8690: 8688: 8685: 8684: 8682: 8678: 8672: 8669: 8665: 8662: 8660: 8657: 8656: 8655: 8652: 8648: 8645: 8643: 8640: 8638: 8635: 8634: 8633: 8630: 8628: 8625: 8623: 8620: 8618: 8615: 8613: 8610: 8608: 8605: 8603: 8600: 8598: 8595: 8593: 8590: 8588: 8585: 8584: 8582: 8578: 8575: 8571: 8565: 8562: 8560: 8557: 8555: 8552: 8550: 8547: 8545: 8542: 8540: 8537: 8535: 8532: 8531: 8529: 8525: 8519: 8516: 8514: 8511: 8509: 8506: 8504: 8501: 8499: 8496: 8495: 8493: 8489: 8481: 8478: 8477: 8476: 8473: 8471: 8468: 8466: 8463: 8459: 8458:Deep learning 8456: 8455: 8454: 8451: 8447: 8444: 8443: 8442: 8439: 8438: 8436: 8432: 8426: 8423: 8421: 8418: 8414: 8411: 8410: 8409: 8406: 8404: 8401: 8397: 8394: 8392: 8389: 8387: 8384: 8383: 8382: 8379: 8377: 8374: 8372: 8369: 8367: 8364: 8362: 8359: 8357: 8354: 8352: 8349: 8347: 8346:Hallucination 8344: 8340: 8337: 8336: 8335: 8332: 8330: 8327: 8323: 8320: 8319: 8318: 8315: 8314: 8312: 8308: 8302: 8299: 8297: 8294: 8292: 8289: 8287: 8284: 8282: 8279: 8277: 8274: 8272: 8269: 8267: 8264: 8262: 8261: 8257: 8256: 8254: 8252: 8248: 8239: 8234: 8232: 8227: 8225: 8220: 8219: 8216: 8204: 8201: 8199: 8196: 8194: 8193:Hallucination 8191: 8189: 8186: 8185: 8183: 8179: 8173: 8170: 8168: 8165: 8163: 8160: 8157: 8153: 8150: 8148: 8145: 8144: 8142: 8140: 8134: 8128: 8127:Spell checker 8125: 8123: 8120: 8118: 8115: 8113: 8110: 8108: 8105: 8103: 8100: 8099: 8097: 8095: 8089: 8083: 8080: 8078: 8075: 8073: 8070: 8069: 8067: 8065: 8061: 8055: 8052: 8050: 8047: 8045: 8042: 8040: 8037: 8035: 8032: 8031: 8029: 8027: 8021: 8011: 8008: 8006: 8003: 8001: 7998: 7996: 7993: 7991: 7988: 7986: 7983: 7981: 7978: 7976: 7973: 7972: 7970: 7966: 7960: 7957: 7955: 7952: 7950: 7947: 7945: 7942: 7940: 7939:Speech corpus 7937: 7935: 7932: 7930: 7927: 7925: 7922: 7920: 7919:Parallel text 7917: 7915: 7912: 7910: 7907: 7905: 7902: 7900: 7897: 7896: 7894: 7888: 7885: 7880: 7876: 7870: 7867: 7865: 7862: 7860: 7857: 7855: 7852: 7849: 7845: 7842: 7840: 7837: 
7835: 7832: 7830: 7827: 7825: 7822: 7820: 7817: 7816: 7814: 7811: 7807: 7801: 7798: 7796: 7793: 7791: 7788: 7786: 7783: 7781: 7780:Example-based 7778: 7776: 7773: 7772: 7770: 7768: 7764: 7758: 7755: 7753: 7750: 7748: 7745: 7744: 7742: 7740: 7736: 7726: 7723: 7721: 7718: 7716: 7713: 7711: 7710:Text chunking 7708: 7706: 7703: 7701: 7700:Lemmatisation 7698: 7696: 7693: 7692: 7690: 7688: 7684: 7678: 7675: 7673: 7670: 7668: 7665: 7663: 7660: 7658: 7655: 7653: 7650: 7649: 7646: 7643: 7641: 7638: 7636: 7633: 7631: 7628: 7626: 7623: 7621: 7618: 7614: 7611: 7609: 7606: 7605: 7604: 7601: 7599: 7596: 7594: 7591: 7589: 7586: 7584: 7581: 7579: 7576: 7574: 7571: 7569: 7566: 7564: 7561: 7559: 7556: 7555: 7553: 7551: 7550:Text analysis 7547: 7541: 7538: 7536: 7533: 7531: 7528: 7526: 7523: 7519: 7516: 7514: 7511: 7510: 7509: 7506: 7504: 7501: 7499: 7496: 7495: 7493: 7491:General terms 7489: 7485: 7478: 7473: 7471: 7466: 7464: 7459: 7458: 7455: 7449: 7445: 7442: 7441: 7431: 7425: 7421: 7416: 7412: 7408: 7403: 7399: 7393: 7389: 7384: 7380: 7374: 7370: 7365: 7361: 7355: 7351: 7347: 7346:Sears, Andrew 7342: 7338: 7332: 7328: 7323: 7319: 7313: 7309: 7305: 7300: 7299: 7278: 7274: 7268: 7253:. 7 July 2021 7252: 7248: 7242: 7226: 7222: 7218: 7214: 7210: 7203: 7187: 7183: 7179: 7173: 7157: 7153: 7147: 7131: 7127: 7121: 7105: 7101: 7097: 7091: 7081: 7073: 7069: 7063: 7059: 7058: 7050: 7034: 7030: 7026: 7020: 7012: 7008: 7007: 7002: 6995: 6987: 6983: 6982: 6977: 6971: 6955: 6951: 6947: 6941: 6934: 6930: 6927: 6921: 6913: 6909: 6905: 6903:0-7803-0946-4 6899: 6895: 6891: 6887: 6880: 6871: 6855: 6851: 6845: 6841: 6837: 6833: 6832: 6824: 6808: 6804: 6800: 6794: 6786: 6780: 6776: 6772: 6768: 6760: 6752: 6748: 6742: 6734: 6730: 6726: 6722: 6717: 6712: 6709:(2): 173–84. 
6708: 6704: 6697: 6688: 6672: 6668: 6664: 6660: 6656: 6652: 6648: 6644: 6637: 6631: 6626: 6618: 6614: 6608: 6592: 6588: 6584: 6577: 6561: 6557: 6551: 6549: 6532: 6528: 6522: 6514: 6510: 6503: 6487: 6483: 6479: 6473: 6465: 6461: 6457: 6456:"The Cockpit" 6451: 6440: 6436: 6429: 6428: 6420: 6404: 6400: 6396: 6392: 6388: 6387: 6380: 6361: 6357: 6353: 6349: 6342: 6335: 6318: 6314: 6310: 6305: 6299: 6283: 6279: 6272: 6256: 6252: 6248: 6241: 6225: 6221: 6217: 6210: 6194: 6190: 6186: 6180: 6161: 6157: 6153: 6149: 6147:9781450351522 6143: 6139: 6135: 6128: 6127: 6119: 6103: 6099: 6095: 6088: 6072: 6069:. Microsoft. 6068: 6064: 6057: 6041: 6037: 6033: 6029: 6022: 6015: 6010: 6006: 6002: 5998: 5993: 5988: 5983: 5978: 5974: 5970: 5966: 5959: 5952: 5937: 5933: 5929: 5922: 5915: 5907: 5903: 5899: 5895: 5890: 5885: 5881: 5877: 5873: 5866: 5859: 5855: 5850: 5845: 5841: 5834: 5826: 5822: 5818: 5812: 5808: 5804: 5799: 5794: 5790: 5783: 5774: 5769: 5762: 5753: 5748: 5741: 5732: 5727: 5720: 5701: 5697: 5690: 5683: 5675: 5669: 5665: 5661: 5656: 5651: 5647: 5643: 5636: 5628: 5624: 5620: 5616: 5609: 5601: 5597: 5592: 5587: 5583: 5579: 5575: 5568: 5553: 5548: 5544: 5543: 5535: 5527: 5523: 5518: 5513: 5509: 5505: 5498: 5489: 5484: 5477: 5468: 5463: 5456: 5440: 5436: 5432: 5426: 5417: 5412: 5405: 5386: 5382: 5375: 5368: 5360: 5353: 5342: 5338: 5331: 5324: 5317: 5311: 5302: 5297: 5293: 5289: 5286:(11): 32832. 5285: 5281: 5277: 5273: 5267: 5248: 5244: 5240: 5236: 5232: 5228: 5224: 5217: 5210: 5202: 5189: 5178: 5176: 5168: 5164: 5161: 5155: 5147: 5143: 5139: 5135: 5131: 5127: 5120: 5112: 5105: 5098: 5087: 5083: 5079: 5074: 5069: 5065: 5061: 5054: 5047: 5045: 5036: 5032: 5031:Ng, Andrew Y. 
5025: 5014: 5010: 5006: 5002: 4998: 4994: 4990: 4983: 4976: 4966: 4961: 4954: 4943: 4939: 4932: 4928: 4921: 4910: 4906: 4899: 4892: 4885: 4879: 4871: 4867: 4863: 4859: 4852: 4833: 4829: 4825: 4821: 4817: 4813: 4809: 4805: 4801: 4794: 4787: 4779: 4775: 4770: 4765: 4761: 4757: 4753: 4749: 4742: 4723: 4719: 4715: 4711: 4707: 4700: 4693: 4677: 4673: 4669: 4664: 4659: 4655: 4651: 4647: 4640: 4624: 4620: 4619: 4614: 4608: 4601: 4597: 4596: 4595:Computerworld 4588: 4581: 4568: 4564: 4560: 4554: 4547: 4535: 4531: 4527: 4521: 4514: 4510: 4506: 4503: 4497: 4495: 4485: 4477: 4473: 4469: 4465: 4458: 4450: 4449: 4441: 4434: 4430: 4427: 4423: 4418: 4410: 4406: 4401: 4396: 4392: 4388: 4384: 4380: 4376: 4372: 4368: 4361: 4354: 4350: 4347: 4343: 4338: 4330: 4326: 4322: 4320:0-7803-0532-9 4316: 4312: 4308: 4304: 4300: 4296: 4290: 4281: 4265: 4261: 4257: 4250: 4248: 4239: 4235: 4231: 4225: 4221: 4217: 4213: 4206: 4204: 4195: 4191: 4187: 4183: 4179: 4175: 4171: 4167: 4163: 4162:Sainath, Tara 4156: 4154: 4152: 4142: 4126: 4122: 4116: 4107: 4102: 4095: 4093: 4083: 4078: 4071: 4069: 4059: 4054: 4047: 4045: 4035: 4030: 4023: 4007: 4003: 3999: 3995: 3988: 3979: 3974: 3967: 3958: 3953: 3945: 3928: 3924: 3917: 3915: 3907: 3901: 3894: 3890: 3887: 3883: 3877: 3869: 3865: 3861: 3857: 3853: 3849: 3844: 3839: 3835: 3831: 3827: 3821: 3813: 3809: 3805: 3801: 3797: 3793: 3789: 3785: 3781: 3777: 3771: 3769: 3761: 3760:Nelson Morgan 3755: 3739: 3735: 3734:The Intercept 3731: 3724: 3708: 3704: 3700: 3693: 3677: 3673: 3667: 3651: 3647: 3641: 3622: 3615: 3608: 3592: 3588: 3582: 3566: 3562: 3558: 3551: 3535: 3531: 3527: 3521: 3505: 3498: 3482: 3478: 3471: 3469: 3460: 3456: 3452: 3448: 3444: 3437: 3421: 3417: 3413: 3407: 3399: 3395: 3391: 3387: 3383: 3379: 3375: 3371: 3367: 3360: 3341: 3334: 3327: 3319: 3315: 3311: 3307: 3303: 3299: 3295: 3292:(1): 94–103. 
3291: 3287: 3283: 3276: 3260: 3256: 3250: 3234: 3230: 3224: 3222: 3205: 3201: 3197: 3191: 3172: 3165: 3158: 3150: 3146: 3142: 3138: 3134: 3130: 3123: 3107: 3103: 3099: 3092: 3076: 3072: 3065: 3057: 3051: 3047: 3040: 3032: 3028: 3024: 3020: 3016: 3012: 3008: 3002: 2983: 2979: 2975: 2970: 2965: 2961: 2957: 2950: 2943: 2927: 2923: 2919: 2912: 2910: 2890: 2887:. p. 6. 2883: 2876: 2860: 2856: 2850: 2834: 2830: 2826: 2820: 2813: 2801: 2798:. Microsoft. 2797: 2793: 2787: 2768: 2764: 2760: 2756: 2752: 2748: 2744: 2740: 2736: 2732: 2728: 2721: 2714: 2706: 2702: 2698: 2694: 2690: 2686: 2681: 2676: 2672: 2668: 2661: 2645: 2641: 2635: 2619: 2615: 2609: 2593: 2589: 2583: 2575: 2571: 2567: 2561: 2557: 2553: 2549: 2542: 2526: 2522: 2516: 2512: 2501: 2498: 2496: 2493: 2491: 2488: 2486: 2483: 2482: 2478: 2477: 2474: 2471: 2469: 2466: 2464: 2461: 2459: 2456: 2454: 2451: 2449: 2446: 2444: 2441: 2439: 2436: 2434: 2431: 2429: 2426: 2424: 2421: 2419: 2416: 2414: 2411: 2409: 2406: 2404: 2401: 2399: 2396: 2394: 2391: 2389: 2386: 2384: 2381: 2379: 2376: 2374: 2371: 2369: 2366: 2364: 2361: 2359: 2356: 2354: 2351: 2349: 2346: 2344: 2341: 2339: 2336: 2334: 2331: 2329: 2326: 2324: 2321: 2320: 2313: 2311: 2306: 2303: 2301: 2298: 2294: 2290: 2285: 2283: 2279: 2275: 2271: 2267: 2263: 2259: 2255: 2245: 2243: 2238: 2236: 2231: 2229: 2224: 2220: 2216: 2212: 2208: 2207:Xuedong Huang 2204: 2200: 2190: 2188: 2184: 2180: 2176: 2172: 2168: 2164: 2160: 2156: 2141: 2137: 2114: 2108: 2105: 2102: 2096: 2093: 2090: 2087: 2080: 2079: 2078: 2076: 2055: 2051: 2048: 2045: 2039: 2034: 2027: 2024: 2021: 2018: 2015: 2012: 2009: 2000: 1997: 1994: 1991: 1988: 1985: 1982: 1979: 1976: 1973: 1966: 1965: 1964: 1961: 1959: 1955: 1951: 1947: 1942: 1927: 1920: 1917: 1914: 1911: 1908: 1899: 1896: 1893: 1890: 1882: 1879: 1877: 1873: 1867: 1861: 1860: 1859: 1853: 1849: 1848: 1847: 1841: 1838: 1834: 1833: 1832: 1826: 1822: 1821: 1820: 1812: 1809: 1806: 1805: 1803: 1802: 1801: 1791: 1790: 1789: 1788: 1784: 1783: 1777: 1774: 1773: 
1772: 1771: 1767: 1766: 1759: 1758: 1757: 1756: 1752: 1751: 1745: 1744: 1743: 1742: 1738: 1737: 1736: 1725: 1722: 1719: 1716: 1713: 1710: 1709: 1708: 1704: 1702: 1698: 1694: 1690: 1677: 1673: 1670: 1667: 1666: 1661: 1660: 1655: 1652: 1649: 1648:Transcription 1646: 1643: 1640: 1637: 1633: 1631: 1627: 1625: 1622: 1620: 1616: 1614: 1611: 1608: 1605: 1603: 1600: 1598: 1595: 1593: 1589: 1586: 1583: 1580: 1578: 1575: 1572: 1568: 1566: 1562: 1559: 1555: 1552: 1551:Sensory, Inc. 1548: 1544: 1540: 1536: 1533: 1532: 1526: 1522: 1520: 1516: 1512: 1509: 1505: 1499: 1495: 1493: 1487: 1485: 1480: 1478: 1473: 1464: 1462: 1457: 1455: 1451: 1447: 1437: 1433: 1430: 1419: 1417: 1411: 1409: 1405: 1401: 1397: 1393: 1389: 1385: 1380: 1370: 1368: 1364: 1359: 1357: 1353: 1349: 1344: 1342: 1338: 1333: 1330: 1326: 1322: 1318: 1314: 1299: 1297: 1293: 1289: 1279: 1275: 1273: 1268: 1264: 1260: 1255: 1252: 1247: 1232: 1230: 1226: 1221: 1216: 1214: 1210: 1206: 1202: 1198: 1194: 1190: 1186: 1182: 1178: 1174: 1170: 1166: 1162: 1158: 1154: 1150: 1145: 1144:pronunciation 1139: 1129: 1125: 1111: 1109: 1105: 1101: 1097: 1093: 1088: 1084: 1080: 1075: 1072: 1068: 1064: 1060: 1056: 1052: 1048: 1043: 1039: 1035: 1031: 1027: 1023: 1018: 1016: 1012: 1007: 1003: 999: 989: 986: 982: 981:deep learning 977: 975: 974:deep learning 969: 966: 962: 956: 955:Deep learning 946: 942: 939: 935: 933: 930:recognition, 926: 916: 912: 908: 904: 894: 892: 888: 884: 880: 876: 872: 868: 864: 860: 855: 853: 848: 843: 841: 837: 833: 829: 825: 821: 817: 813: 809: 805: 801: 797: 793: 787: 785: 781: 777: 773: 769: 765: 759: 757: 753: 746: 736: 734: 730: 725: 721: 711: 707: 704: 700: 690: 688: 684: 678: 675: 671: 666: 664: 660: 655: 651: 647: 644:published by 643: 639: 635: 634:deep learning 631: 627: 622: 620: 619:Babel program 616: 611: 607: 602: 600: 596: 592: 588: 584: 580: 579:speech corpus 576: 572: 568: 564: 560: 556: 552: 548: 544: 534: 532: 528: 524: 520: 516: 512: 510: 506: 501: 497: 496:Xuedong Huang 489: 485: 481: 478: 
475: 472: 469: 465: 461: 458: 457: 456: 453: 451: 443: 439: 435: 431: 428: 427: 426: 424: 411: 407: 404: 403: 398: 394: 390: 389: 388: 386: 382: 378: 374: 370: 369:Markov chains 366: 358: 354: 350: 347: 344: 341: 340: 336: 332: 328: 324: 320: 316: 315: 314:understanding 309: 305: 301: 298: 297: 291: 289: 284: 282: 278: 274: 267: 263: 259: 256:– Funding at 255: 252: 249: 245: 241: 237: 236:speech coding 233: 229: 226: 223: 219: 215: 212: 209: 205: 201: 198: 195: 191: 188: 187: 181: 173: 171: 167: 166:deep learning 162: 160: 156: 152: 148: 147: 142: 137: 135: 132:). Automatic 131: 127: 123: 119: 114: 110: 105: 102: 97: 95: 91: 87: 83: 79: 75: 71: 67: 63: 59: 55: 54:methodologies 51: 47: 43: 39: 33: 19: 8968:Hugging Face 8932:David Silver 8616: 8580:Audio–visual 8434:Applications 8413:Augmentation 8258: 8107:Concordancer 8033: 7503:Bag-of-words 7419: 7410: 7387: 7368: 7349: 7326: 7307: 7281:. Retrieved 7267: 7255:. Retrieved 7250: 7241: 7229:. Retrieved 7212: 7202: 7190:. Retrieved 7181: 7172: 7160:. Retrieved 7146: 7134:. Retrieved 7120: 7108:. Retrieved 7104:the original 7099: 7090: 7080: 7056: 7049: 7037:. Retrieved 7028: 7019: 7006:The Register 7004: 6994: 6979: 6970: 6958:. Retrieved 6949: 6940: 6920: 6885: 6879: 6870: 6858:. Retrieved 6830: 6823: 6811:. Retrieved 6802: 6793: 6766: 6759: 6751:the original 6741: 6706: 6702: 6696: 6687: 6675:. Retrieved 6653:(1): 25–41. 6650: 6646: 6636: 6625: 6617:the original 6607: 6595:. Retrieved 6586: 6576: 6564:. Retrieved 6535:. Retrieved 6521: 6513:the original 6502: 6490:. Retrieved 6481: 6472: 6459: 6450: 6426: 6419: 6407:. Retrieved 6385: 6379: 6367:. Retrieved 6347: 6334: 6321:. Retrieved 6312: 6298: 6286:. Retrieved 6271: 6259:. Retrieved 6250: 6240: 6228:. Retrieved 6219: 6209: 6197:. Retrieved 6189:The Guardian 6188: 6179: 6167:. Retrieved 6125: 6118: 6106:. Retrieved 6098:EdSurge News 6097: 6087: 6075:. Retrieved 6066: 6056: 6044:. Retrieved 6038:(2): 62–76. 

Index

Speech Recognition
Speech perception
interdisciplinary
computer science
computational linguistics
methodologies
translation
linguistics
computer engineering
speech synthesis
vocabulary
voice user interfaces
domotic
word processors
emails
aircraft
direct voice input
pronunciation assessment
speaker identification
Recognizing the speaker
translating speech
authenticate
deep learning
big data
formants
Gunnar Fant
source-filter model of speech production
IBM
1962 World's Fair

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.