| |
Abstract
Continuous quantities are ubiquitous in models of real-world phenomena, but are surprisingly difficult to reason about automatically. Probabilistic graphical models such as Bayesian networks and Markov random fields, and algorithms for approximate inference such as belief propagation (BP), have proven to be powerful tools in a wide range of applications in statistics and artificial intelligence. However, applying these methods to models with continuous variables remains a challenging task. In this work we describe an extension of BP to continuous variable models, ...
|
| |
Abstract
This paper studies single-channel speech separation, assuming unknown, arbitrary temporal dynamics for the speech signals to be separated. A data-driven approach is described, which matches each mixed speech segment against a composite training segment to separate the underlying clean speech segments. To advance the separation accuracy, the new approach seeks and separates the longest mixed speech segments with matching composite training segments. Lengthening the mixed speech segments to match reduces the uncertainty of the constituent training segments, and hence the error ...
|
| |
In Signal Processing, IEEE Transactions on [see also Acoustics, Speech, and Signal Processing, IEEE Transactions on], Vol. 54, No. 11. (November 2006), pp. 4311-4322, doi:10.1109/tsp.2006.881199
Abstract
In recent years there has been a growing interest in the study of sparse representation of signals. Using an overcomplete dictionary that contains prototype signal-atoms, signals are described by sparse linear combinations of these atoms. Applications that use sparse representation are many and include compression, regularization in inverse problems, feature extraction, and more. Recent activity in this field has concentrated mainly on the study of pursuit algorithms that decompose signals with respect to a given dictionary. Designing dictionaries to better fit ...
|
| |
Abstract
Distant microphone speech recognition systems that operate with human-like robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple competing sound sources. This paper describes a recent speech recognition evaluation that was designed to bring together researchers from multiple communities in order to foster novel approaches to this problem. The task was to identify keywords from sentences reverberantly mixed into ...
|
| |
In Advances in Neural Information Processing Systems 21 (2009), pp. 1697-1704
|
| |
In Applications of Signal Processing to Audio and Acoustics, 1995., IEEE ASSP Workshop on (October 1995), pp. 213-216, doi:10.1109/aspaa.1995.482993
posted to cepstrum frequencydomainlpc
by asterix77
on 2013-03-15 17:10:12
Abstract
This paper presents an improved method for the estimation of a continuous frequency-envelope when the value of this envelope is specified only at discrete frequencies. It is based on the Galas/Rodet (1990) approach which consists of fitting a cepstral amplitude envelope to the specified frequency points by minimizing a frequency-domain least-squares criterion. This paper introduces a regularization technique which increases the robustness of the estimation procedure. Used in combination with a warped frequency-scale, the proposed method is shown to provide an ...
|
| |
posted to missingdata speechrecognition
by asterix77
on 2013-03-04 21:35:18
Abstract
Human speech perception is robust in the face of a wide variety of distortions, both experimentally applied and naturally occurring. In these conditions, state-of-the-art automatic speech recognition (ASR) technology fails. This paper describes an approach to robust ASR which acknowledges the fact that some spectro-temporal regions will be dominated by noise. For the purposes of recognition, these regions are treated as missing or unreliable. The primary advantage of this viewpoint is that it makes minimal assumptions about any noise background. Instead, ...
|
| |
posted to lpc plp warping
by asterix77
on 2013-02-21 22:39:10
Abstract
Linear prediction is considered with respect to a nonlinear frequency scale obtained by a first‐order all‐pass transformation. The predictor can be computed from a frequency‐warped autocorrelation function obtained from the power spectrum or by a direct linear transformation of the original acf. Three numerical procedures are compared. Alternatively, the predictor can be determined from a covariance matrix or (adaptively) from continuously formed correlations, suitably defined according to the all‐pass transformation. Prediction‐error minimization and spectral flattening are no longer equivalent criteria. In ...
|
| |
In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '85., Vol. 10 (April 1985), pp. 509-512, doi:10.1109/icassp.1985.1168384
posted to lpc plp
by asterix77
on 2013-02-21 21:46:48
Abstract
A novel speech analysis method which uses several established psychoacoustic concepts, the perceptually based linear predictive analysis (PLP), models the auditory spectrum by the spectrum of the low-order all-pole model. The auditory spectrum is derived from the speech waveform by critical-band filtering, equal-loudness curve pre-emphasis, and intensity-loudness root compression. We demonstrate through analysis of both synthetic and natural speech that psychoacoustic concepts of spectral auditory integration in vowel perception, namely the F1, F2' concept of Carlson and Fant and the 3.5 ...
|
| |
posted to noise noiseestimator tracking
by asterix77
on 2013-02-21 14:26:34
Abstract
We propose a new approach for online noise power spectral density (psd) tracking. In this approach, the prior and posterior probabilities of speech absence and also noise statistics are analytically retrieved from a maximum-likelihood-based criterion at every time-frequency slot. The recursive update rules of these three terms are performed in a unified manner and without relying on the conventional tracking of speech psd minima. A single parameter (a forgetting factor) is needed in this process. Comparisons with state of the art ...
|
| |
posted to bregman divergence nmf
by asterix77
on 2013-02-19 14:25:35
Abstract
In this paper, we present a complete proof that the β-divergence is a particular case of Bregman divergence. This little-known result makes it possible to straightforwardly apply theorems about Bregman divergences to β-divergences. This is of interest for numerous applications since these divergences are widely used, for instance in non-negative matrix factorization (NMF). ...
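The claim in this abstract can be checked numerically for scalar arguments. The sketch below is not taken from the paper: it assumes the commonly used definition of the β-divergence (for β other than 0 and 1) and the power-function generator φ(u) = u^β / (β(β−1)), and the function names `beta_divergence` and `bregman_from_generator` are hypothetical.

```python
import math


def beta_divergence(x, y, beta):
    """Scalar beta-divergence d_beta(x | y) for x, y > 0 (assumed standard form)."""
    if beta == 1:
        # Limit case: generalized Kullback-Leibler divergence.
        return x * math.log(x / y) - x + y
    if beta == 0:
        # Limit case: Itakura-Saito divergence.
        return x / y - math.log(x / y) - 1
    return (x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1)) / (
        beta * (beta - 1)
    )


def bregman_from_generator(x, y, beta):
    """Bregman divergence phi(x) - phi(y) - phi'(y) * (x - y).

    Uses the assumed generator phi(u) = u**beta / (beta * (beta - 1)),
    valid for beta not in {0, 1}.
    """
    phi = lambda u: u**beta / (beta * (beta - 1))
    dphi = lambda u: u**(beta - 1) / (beta - 1)
    return phi(x) - phi(y) - dphi(y) * (x - y)


if __name__ == "__main__":
    # Compare the two expressions at a few points.
    for beta in (0.5, 2.0, 3.0):
        for x, y in ((3.0, 1.0), (0.5, 2.0)):
            print(beta, x, y,
                  beta_divergence(x, y, beta),
                  bregman_from_generator(x, y, beta))
```

Under these assumptions the two functions should agree up to floating-point error; for β = 2 both reduce to half the squared difference.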
|
| |
posted to fourier stft
by asterix77
on 2013-02-12 15:00:43
Abstract
A theory of short term spectral analysis, synthesis, and modification is presented with an attempt at pointing out certain practical and theoretical questions. The methods discussed here are useful in designing filter banks when the filter bank outputs are to be used for synthesis after multiplicative modifications are made to the spectrum. ...
|
| |
(26 Jun 2009)
posted to filtering gradient gradient-descent
by asterix77
on 2013-01-31 18:49:24
Abstract
A thorough discussion and development of the calculus of real-valued functions of complex-valued vectors is given using the framework of the Wirtinger Calculus. The presented material is suitable for exposition in an introductory Electrical Engineering graduate level course on the use of complex gradients and complex Hessian matrices, and has been successfully used in teaching at UC San Diego. Going beyond the commonly encountered treatments of the first-order complex vector calculus, second-order considerations are examined in some detail filling a gap in the pedagogic literature. ...
|
| |
posted to lpc lsf robustlpc
by asterix77
on 2013-01-28 16:02:39
Abstract
This study presents a new technique called weighted-sum line spectrum pair (WLSP) where an all-pole filter is defined by using a sum of weighted line spectrum pair polynomials. The WLSP yields a stable all-pole filter of order m, whose autocorrelation function coincides with that of the input signal between indices 0 and m-1. By sacrificing the exact matching at index m, the WLSP models the autocorrelation of the input signal at the indices above m more accurately than conventional linear prediction ...
|
| |
In Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on, Vol. 1 (April 2003), pp. I-828-I-831 vol.1, doi:10.1109/icassp.2003.1198909
posted to frequencydomainlpc plp
by asterix77
on 2013-01-25 21:34:30
Abstract
The paper presents a new method for all-pole model estimation based on minimization of the weighted mean square error in the sampled spectral domain. Due to the discrete nature of the proposed distance measure, emphasis can be put on an arbitrary set of spectral samples, which can greatly improve the model accuracy for periodic signals. Weighting can also be applied to improve the fitting in certain spectral regions according to any desired fidelity criterion. Iterative algorithm for determination of the optimal model ...
|
| |
(2004)
posted to lpc lsp prior
by asterix77
on 2013-01-25 20:20:34
Abstract
In an exploration of the spectral modelling of speech, this thesis presents theory and applications of constrained linear predictive (LP) models. Spectral models are essential in many applications of speech technology, such as speech coding, synthesis and recognition. At present, the prevailing approach in speech spectral modelling is linear prediction. In speech coding, spectral models obtained by LP are typically quantised using a polynomial transform called the Line Spectrum Pair (LSP) decomposition. An inherent drawback of conventional LP is its inability ...
|
| |
posted to lpc maxent missingdata
by asterix77
on 2013-01-25 20:05:54
Abstract
This paper reviews the fundamental concepts of Linear Prediction (LP) and Maximum Entropy (ME) spectral analysis, and elucidates the reasons for their practical importance in the world of real signals. Subsequently, the powerful principle of Minimum Cross-Entropy (MCE) spectral analysis is introduced. MCE permits the incorporation of prior information into signal analysis. In a new approach to speech signal analysis, application of the MCE principle reduces the average number of predictor coefficients (poles) that have to be specified per time frame ...
|
| |
In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on (April 2009), pp. 3845-3848, doi:10.1109/icassp.2009.4960466
Abstract
We address the problem of single-channel speech separation and recognition using loopy belief propagation in a way that enables efficient inference for an arbitrary number of speech sources. The graphical model consists of a set of N Markov chains, each of which represents a language model or grammar for a given speaker. A Gaussian mixture model with shared states is used to model the hidden acoustic signal for each grammar state of each source. The combination of sources is modeled in ...
|
| |
(29 Aug 2012)
posted to distance probability
by asterix77
on 2013-01-23 19:55:08
Abstract
We introduce a new divergence measure, the bounded Bhattacharyya distance (BBD), for quantifying the dissimilarity between probability distributions. BBD is based on the Bhattacharyya coefficient (fidelity), and is symmetric, positive semi-definite, and bounded. Unlike the Kullback-Leibler divergence, BBD does not require probability density functions to be absolutely continuous with respect to each other. We show that BBD belongs to the class of Csiszar f-divergence and derive certain relationships between BBD and well known measures such as Bhattacharyya, Hellinger and Jensen-Shannon divergence. Bounds on the Bayesian error probability are established ...
|
| |
Abstract
Approaches that abandon traditional speech categories offer promise for developing statistical descriptions that encapsulate how speech conveys information. Grandparents would be among the beneficiaries. Our own ease of understanding speech belies its underlying complexity. At an abstract level, it is easy enough to describe speech as a sequence of words or of phonemes, but it's notoriously difficult to analyse at the level of the acoustic signal. ...
|
| |
posted to dbn neuralnet representation speech
by asterix77
on 2013-01-15 15:38:51
Abstract
State of the art speech recognition systems rely on preprocessed speech features such as Mel cepstrum or linear predictive coding coefficients that collapse high dimensional speech sound waves into low dimensional encodings. While these have been successfully applied in speech recognition systems, such low dimensional encodings may lose some relevant information and express other information in a way that makes it difficult to use for discrimination. Higher dimensional encodings could both improve performance in recognition tasks, and also be applied to ...
|
| |
In Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, Vol. 2 (May 2004), pp. ii-589-92 vol.2, doi:10.1109/icassp.2004.1326326
Abstract
The operation of digital signal processors in continuous time is discussed. It is shown that the main advantages of digital arithmetic can be maintained in such operations, while aliasing of the signal and the quantization error is avoided altogether. Continuous-time operation makes possible a smaller number of bits for a given signal-to-quantization error ratio. Simulation results are presented. ...
|
| |
Abstract
Weighted finite-state transducers (WFSTs) have been widely adopted as efficient representations of a general speech recognition model. The WFST for a speech recognizer is typically assembled or composed from several components (the language model, the pronunciation mapping and the acoustic model), which are estimated separately without any end-to-end optimization. This paper examines how the weights of such transducers can be learned in a manner that captures the interaction between the components. The paths in the transducer are represented as ...
|
| |
Abstract
In the development process of noise-reduction algorithms, an objective machine-driven intelligibility measure which shows high correlation with speech intelligibility is of great interest. Besides reducing time and costs compared to real listening experiments, an objective intelligibility measure could also help provide answers on how to improve the intelligibility of noisy unprocessed speech. In this paper, a short-time objective intelligibility measure (STOI) is presented, which shows high correlation with the intelligibility of noisy and time-frequency weighted noisy speech (e.g., resulting from noise ...
|
| |
Abstract
Modern automatic speech recognition systems handle large vocabularies of words, making it infeasible to collect enough repetitions of each word to train individual word models. Instead, large-vocabulary recognizers represent each word in terms of subword units. Typically the subword unit is the phone, a basic speech sound such as a single consonant or vowel. Each word is then represented as a sequence, or several alternative sequences, of phones specified in a pronunciation dictionary. Other choices of subword units have been studied ...
|
| |
The Journal of the Acoustical Society of America, Vol. 22, No. 2. (01 March 1950), pp. 167-173, doi:10.1121/1.1906584
Abstract
This paper concerns the effects of interrupting speech waves—turning them on and off intermittently or masking them with intermittent noise—upon their intelligibility. The effects were studied with various rates of interruption and with the speech left undisturbed various percentages of the time. Tests were conducted (1) with speech turned on and off in quiet, (2) with continuous speech masked by interrupted white noise, and (3) with speech and noise interrupted alternately, the speech wave being turned on as the noise wave ...
|
| |
The Journal of the Acoustical Society of America, Vol. 27, No. 2. (01 March 1955), pp. 338-352, doi:10.1121/1.1907526
Abstract
Sixteen English consonants were spoken over voice communication systems with frequency distortion and with random masking noise. The listeners were forced to guess at every sound and a count was made of all the different errors that resulted when one sound was confused with another. With noise or low‐pass filtering the confusions fall into consistent patterns, but with high‐pass filtering the errors are scattered quite randomly. An articulatory analysis of these 16 consonants provides a system of five articulatory features or ...
|
| |
Abstract
The ability of listeners to “glimpse” acoustic cues during the quieter sections of an interrupted noise has primarily been studied using maskers with interruptions occurring simultaneously across the entire frequency range of the masker (broadband comodulated interruptions). Here, the possibility of uncomodulated glimpsing (the glimpsing of acoustic cues separated both in time and frequency) was investigated. To achieve this, speech reception thresholds for a set of intervocalic consonants were adaptively measured in 100‐Hz to 10‐kHz pink noise divided into a varying number ...
|
| |
posted to idealbinarymask separation
by asterix77
on 2012-12-10 21:55:51
Abstract
In his famous treatise on computational vision, Marr (1982) makes a compelling argument for separating different levels of analysis in order to understand complex information processing. In particular, the computational theory level, concerned with the goal of computation and general processing strategy, must be separated from the algorithm level, or the separation of what from how. This chapter is an attempt at a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the CASA ...
|
| |
Abstract
Intelligibility of ideal binary masked noisy speech was measured on a group of normal hearing individuals across mixture signal to noise ratio (SNR) levels, masker types, and local criteria for forming the binary mask. The binary mask is computed from time-frequency decompositions of target and masker signals using two different schemes: an ideal binary mask computed by thresholding the local SNR within time-frequency units and a target binary mask computed by comparing the local target energy against the long-term average speech ...
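The first masking scheme named in this abstract, thresholding the local SNR within each time-frequency unit, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ideal_binary_mask`, the nested-list layout of the energies, and the default local criterion of 0 dB are all assumptions.

```python
import math


def ideal_binary_mask(target_energy, masker_energy, lc_db=0.0):
    """Return a 0/1 mask with 1 where the local SNR in dB exceeds lc_db.

    target_energy and masker_energy are equally shaped lists of rows, each
    holding non-negative energies for one time-frequency unit (hypothetical
    layout chosen for this sketch).
    """
    mask = []
    for t_row, m_row in zip(target_energy, masker_energy):
        row = []
        for t, m in zip(t_row, m_row):
            if t <= 0.0:
                snr_db = -math.inf      # no target energy in this unit
            elif m <= 0.0:
                snr_db = math.inf       # no masker energy in this unit
            else:
                snr_db = 10.0 * math.log10(t / m)
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


if __name__ == "__main__":
    target = [[4.0, 0.1], [1.0, 9.0]]
    masker = [[1.0, 1.0], [2.0, 3.0]]
    print(ideal_binary_mask(target, masker))             # local criterion 0 dB
    print(ideal_binary_mask(target, masker, lc_db=5.0))  # stricter criterion
```

Raising `lc_db` keeps fewer units, which is one way to explore the effect of the local criterion that the abstract mentions.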
|
| |
Abstract
This paper presents the results of a closed-set recognition task for 64 consonant-vowel sounds (16 C×4 V, spoken by 18 talkers) in speech-weighted noise (−22, −20, −16, −10, −2 dB) and in quiet. The confusion matrices were generated using responses of a homogeneous set of ten listeners and the confusions were analyzed using a graphical method. In speech-weighted noise the consonants separate into three sets: a low-scoring set C1 (/f/, /θ/, /v/, /ð/, /b/, /m/), a high-scoring set C2 (/t/, /s/, /z/, /ʃ/, /ʒ/) and set C3 (/n/, ...
|
| |
Abstract
Synthetic speech has been widely used in the study of speech cues. A serious disadvantage of this method is that it requires prior knowledge about the cues to be identified in order to synthesize the speech. Incomplete or inaccurate hypotheses about the cues often lead to speech sounds of low quality. In this research a psychoacoustic method, named three-dimensional deep search (3DDS), is developed to explore the perceptual cues of stop consonants from naturally produced speech. For a given sound, it ...
|
| |
Abstract
In this paper, we show how uncertainty propagation, combined with observation uncertainty techniques, can be applied to a realistic implementation of robust distributed speech recognition (DSR) to further improve recognition robustness with little increase in computational complexity. Uncertainty propagation, or error propagation, techniques employ a probabilistic description of speech to reflect the information lost during speech enhancement or source separation in the time or frequency domain. This uncertain description is then propagated through the feature extraction process to the domain of ...
|
| |
posted to earlyechos intelligibility
by asterix77
on 2012-12-06 18:11:24
Abstract
The auditory system takes advantage of early reflections (ERs) in a room by integrating them with the direct sound (DS) and thereby increasing the effective speech level. In the present paper the benefit from realistic ERs on speech intelligibility in diffuse speech-shaped noise was investigated for normal-hearing and hearing-impaired listeners. Monaural and binaural speech intelligibility tests were performed in a virtual auditory environment where the spectral characteristics of ERs from a simulated room could be preserved. The useful ER energy was ...
|
| |
Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 59, No. 1. (1997), pp. 255-268, doi:10.1111/1467-9868.00067
posted to lpc robustlpc
by asterix77
on 2012-11-28 17:21:47
Abstract
A Bayesian analysis is presented of a time series which is the sum of a stationary component with a smooth spectral density and a deterministic component consisting of a linear combination of a trend and periodic terms. The periodic terms may have known or unknown frequencies. The advantage of our approach is that different features of the data—such as the regression parameters, the spectral density, unknown frequencies and missing observations—are combined in a hierarchical Bayesian framework and estimated simultaneously. A Bayesian ...
|
| |
In Applications of Signal Processing to Audio and Acoustics, 1997. 1997 IEEE ASSP Workshop on (October 1997), 4 pp., doi:10.1109/aspaa.1997.625612
posted to frequencydomainlpc lpc robustlpc
by asterix77
on 2012-11-28 16:56:32
Abstract
Finding a smooth spectral envelope that connects estimated sinusoids is a topic of major importance in audio signal processing. A penalized likelihood criterion is introduced for the estimation of the spectral envelope in the presence of measurement noise. Various simulation results are presented that highlight the efficiency of the proposed performance criterion ...
|
| |
Signal Processing, IEEE Transactions on, Vol. 39, No. 2. (February 1991), pp. 411-423, doi:10.1109/78.80824
posted to frequencydomainlpc lpc robustlpc
by asterix77
on 2012-11-28 16:07:01
Abstract
A method for parametric modeling of spectral envelopes when only a discrete set of spectral points is given is introduced. This method, called discrete all-pole (DAP) modeling, uses a discrete version of the Itakura-Saito distortion measure as its error criterion. One result is an autocorrelation matching condition that overcomes the limitations of linear prediction and produces better fitting spectral envelopes for spectra that are representable by a relatively small discrete set of values, such as in voiced speech. An iterative algorithm ...
|
| |
In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, Vol. 3 (April 2007), pp. III-1001-III-1004, doi:10.1109/icassp.2007.366851
posted to frequencydomainlpc lpc robustlpc
by asterix77
on 2012-11-28 16:01:13
Abstract
Frequency-selective autoregressive (AR) estimation is arousing increasing interest. We propose herein a new method to estimate the AR model from a reduced set of spectral samples. The proposed method is founded on the maximum likelihood criterion over the logarithmic spectral residue, and it is implemented efficiently with a multivariate Newton-Raphson algorithm. Results over deterministic and stochastic scenarios show its excellent performance ...
|
| |
posted to frequencydomainlpc lpc robustlpc
by asterix77
on 2012-11-27 17:08:53
|
| |
In Neural Information Processing Systems (2003)
posted to graphicalmodels phase
by asterix77
on 2012-11-02 18:28:57
Abstract
Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient ...
|
| |
In Interspeech (2007)
posted to articulatory
by asterix77
on 2012-11-02 16:13:19
|
| |
In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '85., Vol. 10 (April 1985), pp. 473-476, doi:10.1109/icassp.1985.1168377
posted to lpc robustlpc
by asterix77
on 2012-11-02 16:07:52
Abstract
Contamination of speech, for example by environmental noise, is sometimes unavoidable. Under such circumstances the familiar LPC analysis technique, either for low bit-rate coding or for automated recognition at the receiver, becomes fragile thus jeopardizing the system objective. In this paper we present an extended correlation matching approach for LPC analysis which results in good spectral matching between the true speech spectrum and the all-pole model spectrum, especially for the first three formants. The method has been tested on both synthetic ...
|
| |
posted to phase subjective
by asterix77
on 2012-11-02 16:06:33
Abstract
The importance of Fourier transform phase in speech enhancement is considered. Results indicate that a more accurate estimation of phase is unwarranted in speech enhancement at the S/N ratios where the intelligibility scores of unprocessed speech range from 5 to 95 percent, if the phase estimate is used to reconstruct speech by combining it with an independently estimated magnitude or to reconstruct speech using the phase-only signal reconstruction algorithm. ...
|
| |
In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (2009), pp. 611-619
posted to em gmm online
by asterix77
on 2012-09-24 21:31:05
Abstract
The (batch) EM algorithm plays an important role in unsupervised induction, but it sometimes suffers from slow convergence. In this paper, we show that online variants (1) provide significant speedups and (2) can even find better solutions than those found by batch EM. We support these findings on four unsupervised tasks: part-of-speech tagging, document classification, word segmentation, and word alignment. ...
|
| |
(2007)
posted to crf graphicalmodel hmm statistics
by asterix77
on 2010-09-15 23:21:41
|
| |
J. Mach. Learn. Res., Vol. 5 (2004), pp. 1089-1105
Abstract
Most machine learning researchers perform quantitative experiments to estimate generalization error and compare the performance of different algorithms (in particular, their proposed algorithm). In order to be able to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the very commonly used K-fold cross-validation estimator of generalization performance. The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation. The analysis that accompanies ...
|
| |
Abstract
Spike trains recorded from populations of neurons can exhibit substantial pairwise correlations between neurons and rich temporal structure. Thus, for the realistic simulation and analysis of neural systems, it is essential to have efficient methods for generating artificial spike trains with specified correlation structure. Here we show how correlated binary spike trains can be simulated by means of a latent multivariate gaussian model. Sampling from the model is computationally very efficient and, in particular, feasible even for large populations of neurons. ...
|
| |
posted to localization motion tracking
by asterix77
on 2010-06-11 23:21:51
Abstract
Speaker location estimation techniques based on time-difference-of-arrival measurements have attracted much attention recently. Many existing localization ideas assume that only one speaker is active at a time. In this paper, we focus on a more realistic assumption that the number of active speakers is unknown and time-varying. Such an assumption results in a more complex localization problem, and we employ the random finite set (RFS) theory to deal with that problem. The RFS concepts provide us with an effective, solid foundation ...
|
| |
Abstract
Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate the subword probability posteriors, then using ...
|
| |
In ISCA ITRW ASR2000 (2000), pp. 29-32
Abstract
This paper describes a database designed to evaluate the performance of speech recognition algorithms in noisy conditions. The database may either be used to measure frontend feature extraction algorithms, using a defined HMM recognition back-end, or complete recognition systems. The source speech for this database is the TIdigits, consisting of a connected digits task spoken by American English talkers (downsampled to 8 kHz). A selection of 8 different real-world noises has been added to the speech over a range of signal to noise ...
|