| |
Abstract
Acoustic scene reconstruction is a process that aims to infer characteristics of the environment from acoustic measurements. We investigate the problem of locating planar reflectors in rooms, such as walls and furniture, from signals obtained using distributed microphones. Specifically, localization of multiple two-dimensional (2-D) reflectors is achieved by estimation of the time of arrival (TOA) of reflected signals by analysis of acoustic impulse responses (AIRs). The estimated TOAs are converted into elliptical constraints about the location of the line reflector, ...
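As a rough illustration of the elliptical constraint mentioned in this abstract: a first-order reflection with TOA τ has travelled a path of length c·τ from source to reflector to microphone, so the reflection point lies on an ellipse whose foci are the source and the microphone, and the line reflector is tangent to that ellipse. The sketch below is not the paper's implementation; the function name `toa_to_ellipse`, the nominal speed of sound, and the 2-D coordinate convention are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed nominal value for air


def toa_to_ellipse(source, mic, toa, c=SPEED_OF_SOUND):
    """Convert one reflection TOA into ellipse parameters (hypothetical helper).

    source, mic: (x, y) positions in metres; toa: time of arrival in seconds.
    Returns (center, semi_major, semi_minor, rotation_in_radians).
    """
    path_length = c * toa                      # total source-reflector-mic path
    dx, dy = mic[0] - source[0], mic[1] - source[1]
    focal_dist = math.hypot(dx, dy)            # distance between the two foci
    if path_length <= focal_dist:
        raise ValueError("reflected path must be longer than the direct path")
    semi_major = path_length / 2.0
    semi_minor = math.sqrt(semi_major ** 2 - (focal_dist / 2.0) ** 2)
    center = ((source[0] + mic[0]) / 2.0, (source[1] + mic[1]) / 2.0)
    rotation = math.atan2(dy, dx)              # orientation of the major axis
    return center, semi_major, semi_minor, rotation
```

With several microphones, each TOA yields one such ellipse, and a candidate reflector line would be one that is tangent to all of them.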
|
| |
posted to exemplar speechrecognition
by asterix77
on 2013-04-09 20:43:44
Abstract
Solving real-world classification and recognition problems requires a principled way of modeling the physical phenomena generating the observed data and the uncertainty in it. The uncertainty originates from the fact that many data generation aspects are influenced by nondirectly measurable variables or are too complex to model and hence are treated as random fluctuations. For example, in speech production, uncertainty could arise from vocal tract variations among different people or corruption by noise. The goal of modeling is to establish a ...
|
| |
posted to review sparse
by asterix77
on 2013-03-01 16:56:46
Abstract
Sparse representations have proved a powerful tool in the analysis and processing of audio signals and already lie at the heart of popular coding standards such as MP3 and Dolby AAC. In this paper we give an overview of a number of current and emerging applications of sparse representations in areas from audio coding, audio enhancement and music transcription to blind source separation solutions that can solve the “cocktail party problem.” In each case we will show how the prior assumption ...
|
| |
Signal Processing, IEEE Transactions on, Vol. 47, No. 2. (February 1999), pp. 306-320, doi:10.1109/78.740104
Abstract
We investigate the application of expectation maximization (EM) algorithms to the classical problem of multiple target tracking (MTT) for a known number of targets. Conventional algorithms, which deal with this problem, have a computational complexity that depends exponentially on the number of targets, and usually divide the problem into a localization stage and a tracking stage. The new algorithms achieve a linear dependency and integrate these two stages. Three optimization criteria are proposed, using deterministic and stochastic dynamic models for the ...
|
| |
posted to clustering temporal tracking
by asterix77
on 2013-02-11 22:15:01
Abstract
This paper introduces an algorithm for tracking targets whose locations are inferred from clusters of observations. This method, which we call MHTC, expands the traditional multiple hypothesis tracking (MHT) hypothesis tree to include model hypotheses - possible ways the data can be clustered in each time step - as well as ways the measurements can be associated with existing targets across time steps. We present this new hypothesis framework and its probability expressions and demonstrate MHTC's operation in a robotic solution ...
|
| |
Abstract
Summary. We propose a generic on-line (also sometimes called adaptive or recursive) version of the expectation–maximization (EM) algorithm applicable to latent variable models of independent observations. Compared with the algorithm of Titterington, this approach is more directly connected to the usual EM algorithm and does not rely on integration with respect to the complete-data distribution. The resulting algorithm is usually simpler and is shown to achieve convergence to the stationary points of the Kullback–Leibler divergence between the marginal distribution of the ...
|
| |
No. CS-TR-4966. (September 2010)
posted to researchmethodology
by asterix77
on 2013-01-23 19:58:06
|
| |
Abstract
Band-importance functions were created using the “compound” technique [Apoux and Healy, J. Acoust. Soc. Am. 132, 1078–1087 (2012)] that accounts for the multitude of synergistic and redundant interactions that take place among speech bands. Functions were created for standard recordings of the speech perception in noise (SPIN) sentences and the Central Institute for the Deaf (CID) W-22 words using 21 critical-band divisions and steep filtering to eliminate the influence of filter slopes. On a given trial, a band of interest was ...
|
| |
posted to importance intelligibility noise
by asterix77
on 2013-01-18 16:26:21
Abstract
Speech sounds are traditionally divided into consonants and vowels. When only vowels or only consonants are replaced by noise, listeners are more accurate understanding sentences in which consonants are replaced but vowels remain. From such data, vowels have been suggested to be more important for understanding sentences; however, such conclusions are mitigated by the fact that replaced consonant segments were roughly one-third shorter than vowels. We report two experiments that demonstrate listener performance to be better predicted by simple psychoacoustic measures ...
|
| |
Abstract
A new method for the estimation of multiple concurrent pitches in piano recordings is presented. It addresses the issue of overlapping overtones by modeling the spectral envelope of the overtones of each note with a smooth autoregressive model. For the background noise, a moving-average model is used and the combination of both tends to eliminate harmonic and sub-harmonic erroneous pitch estimations. This leads to a complete generative spectral model for simultaneous piano notes, which also explicitly includes the typical deviation from ...
|
| |
(27 Jun 2012)
posted to music transcription
by asterix77
on 2012-09-26 21:25:05
Abstract
We investigate the problem of modeling symbolic sequences of polyphonic music in a completely general piano-roll representation. We introduce a probabilistic model based on distribution estimators conditioned on a recurrent neural network that is able to discover temporal dependencies in high-dimensional sequences. Our approach outperforms many traditional models of polyphonic music on a variety of realistic datasets. We show how our musical language model can serve as a symbolic prior to improve the accuracy of polyphonic transcription. ...
|
| |
Abstract
The initial success of Web-image search was based exclusively on the text around an image. Certainly we have progressed since then. But recent research results dramatically beg to differ. For example, if you want to judge the similarity of two different pieces of music, should you look at the musical notes, or should you look at what people say about the music? Similarly, how should you find the best movie to recommend to a friend? Shouldn't the genre of the movie ...
|
| |
Abstract
Statistical estimators of the magnitude-squared spectrum are derived based on the assumption that the magnitude-squared spectrum of the noisy speech signal can be computed as the sum of the (clean) signal and noise magnitude-squared spectra. Maximum a posteriori (MAP) and minimum mean square error (MMSE) estimators are derived based on a Gaussian statistical model. The gain function of the MAP estimator was found to be identical to the gain function used in the ideal binary mask (IdBM) that is widely used ...
|
| |
Vol. 67, No. 4. (26 August 2010), pp. 643-655
Abstract
In the precedence effect, sounds emanating directly from the source are localized preferentially over their reflections. Although most studies have focused on the delay between the onset of a sound and its echo, humans still experience the precedence effect when this onset delay is removed. We tested in barn owls the hypothesis that an ongoing delay, equivalent to the onset delay, is discernible from the envelope features of amplitude-modulated stimuli and may be sufficient to evoke this effect. With sound pairs ...
|
| |
In International Society of Music Information Retrieval Conference (2010), pp. 345-350
|
| |
In International Society of Music Information Retrieval Conference (2010), pp. 297-302
posted to classification dirichlet tags
by asterix77
on 2010-08-23 21:16:41
|
| |
No. MIT-CSAIL-TR-2010-037. (August 2010)
Abstract
We present a new learning algorithm for Boltzmann Machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistic that enter into the gradient of the log likelihood makes it practical to learn Boltzmann Machines with multiple hidden layers and millions of parameters. The learning can be made more ...
|
| |
(1994)
Abstract
The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written so that even their own authors would be mystified, if they bothered to read their own writing. For this reason, an understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas that ...
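For readers who want to see the method this abstract refers to in concrete form, here is a minimal sketch of the standard conjugate gradient iteration for a symmetric positive-definite system Ax = b. It is a textbook-style illustration written with plain Python lists, not code taken from the report; the function name and the default tolerance are assumptions.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A (list of rows)."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    d = list(r)            # first search direction is the residual
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break          # residual norm is small enough
        Ad = matvec(A, d)
        alpha = rs_old / dot(d, Ad)                 # step length along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old                      # keeps directions A-conjugate
        d = [ri + beta * di for ri, di in zip(r, d)]
        rs_old = rs_new
    return x
```

In exact arithmetic the iteration terminates in at most n steps for an n-by-n system; in practice it is usually stopped earlier by the residual tolerance.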
|
| |
Abstract
The area under the ROC curve (AUC) is a very widely used measure of performance for classification and diagnostic rules. It has the appealing property of being objective, requiring no subjective input from the user. On the other hand, the AUC has disadvantages, some of which are well known. For example, the AUC can give potentially misleading results if ROC curves cross. However, the AUC also has a much more serious deficiency, and one which appears not to have been previously ...
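To make the quantity under discussion concrete, the sketch below computes the AUC through its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the quadratic pairwise loop are illustrative assumptions chosen for clarity, not for efficiency.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of scores."""
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))
```

Because only the ordering of scores enters the computation, the value needs no user-supplied threshold or cost, which is the "objective" property the abstract mentions.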
|
| |
In Proc. International Symposium on Music Information Retrieval (2009)
|
| |
Abstract
This review presents an overview of a challenging problem in auditory perception, the cocktail party phenomenon, the delineation of which goes back to a classic paper by Cherry in 1953. In this review, we address the following issues: (1) human auditory scene analysis, which is a general process carried out by the auditory system of a human listener; (2) insight into auditory perception, which is derived from Marr's vision theory; (3) computational auditory scene analysis, which focuses on specific approaches aimed ...
|
| |
posted to localization neural psych reverb
by asterix77
on 2009-04-16 17:44:23
Abstract
In reverberant environments, acoustic reflections interfere with the direct sound arriving at a listener's ears, distorting the spatial cues for sound localization. Yet, human listeners have little difficulty localizing sounds in most settings. Because reverberant energy builds up over time, the source location is represented relatively faithfully during the early portion of a sound, but this representation becomes increasingly degraded later in the stimulus. We show that the directional sensitivity of single neurons in the auditory midbrain of anesthetized cats follows ...
|
| |
Acta Acustica united with Acustica, pp. 359-366
Abstract
A new monaural method for the suppression of late room reverberation from speech signals, based on spectral subtraction, is presented. The problem of reverberation suppression differs from classical speech de-noising in that the "reverberation noise" is non-stationary. In this paper, the use of a novel estimator of the non-stationary reverberation-noise power spectrum, based on a statistical model of late reverberation, is presented. The algorithm is tested on real reverberated signals. The performances for different RIRs with Tr ranging from 0.34 ...
|
| |
In Proc. International Workshop on Acoustic Echo and Noise Control (September 2008)
Abstract
The direct-to-reverberant energy ratio has long been recognized as an absolute auditory cue for sound source distance perception in listeners. Traditional methods to extract this energy ratio are based on post-processing of the estimated room impulse response, which is computationally expensive and inaccurate in practice. An alternative is based on estimating the energy arriving from the azimuth of the direct source, under the assumption that reverberant components result in a spatially-diffuse sound field. We propose a binaural equalization-cancellation technique to calculate ...
|
| |
Journal of the Acoustical Society of America, Vol. 45, No. 1. (1969), pp. 337-337
posted to acoustics coherence reverb
by asterix77
on 2009-02-19 22:36:38
Abstract
Point-to-point correlations of reverberant sound fields are important both for high-intensity noise tests of spacecraft and for exploring the state of diffusion of the sound field. The classic paper on reverberant field correlation by Cook and others [J. Acoust. Soc. Amer. 27, 1072 (1955)] derived a narrow-band correlation coefficient of (sin kr)/kr, where r is the separation of any two points considered, on the assumption that the field is completely diffuse. The primary intent of the present paper is to consider ...
|
| |
Journal of the Acoustical Society of America, Vol. 34, No. 11. (1962), pp. 1732-1736
Abstract
Observations indicate that noise in the ocean is a superposition of an isotropic noise field and an anisotropic noise field originating at the surface. Models which produce such noise fields are described, and the spatial-correlation functions are obtained. The volume-noise model, which produces an isotropic noise field, consists of noise sources uniformly distributed within a sphere. A single-frequency component of each noise source is considered; the mean-square output of each is the same, the relative phases are random, and inverse spreading ...
|
| |
Abstract
We propose a new method for speech source separation that is based on directionally-disjoint estimation of the transfer functions between microphones and sources at different frequencies and at multiple times. The spatial transfer functions are estimated from eigenvectors of the microphones' correlation matrix. Smoothing and association of transfer function parameters across different frequencies are performed by simultaneous extended Kalman filtering of the amplitude and phase estimates. This approach allows transfer function estimation even if the number of sources is greater than ...
|
| |
In International Symposium on Music Information Retrieval (September 2008), pp. 603-608
posted to ismir representation sparse
by asterix77
on 2008-11-01 19:42:23
|
| |
Journal of the Acoustical Society of America, Vol. 110, No. 6. (2001), pp. 3218-3231, doi:10.1121/1.1419090
posted to model separation
by asterix77
on 2008-07-24 21:27:40
Abstract
This paper describes algorithms for signal extraction for use as a front-end of telecommunication devices, speech recognition systems, as well as hearing aids that operate in noisy environments. The development was based on some independent, hypothesized theories of the computational mechanics of biological systems in which directional hearing is enabled mainly by binaural processing of interaural directional cues. Our system uses two microphones as input devices and a signal processing method based on the two input channels. The signal processing procedure ...
|
| |
Journal of the Acoustical Society of America, Vol. 118, No. 2. (2005), pp. 887-906, doi:10.1121/1.1945807
posted to model timefrequency
by asterix77
on 2008-03-25 16:30:58
Abstract
A computational model of auditory analysis is described that is inspired by psychoacoustical and neurophysiological findings in early and central stages of the auditory system. The model provides a unified multiresolution representation of the spectral and temporal features likely critical in the perception of sound. Simplified, more specifically tailored versions of this model have already been validated by successful application in the assessment of speech intelligibility [Elhilali et al., Speech Commun. 41(2-3), 331–348 (2003); Chi et al., J. Acoust. Soc. Am. ...
|
| |
International Journal of Imaging Systems and Technology, Vol. 15, No. 1. (2005), pp. 18-33, doi:10.1002/ima.20035
posted to ica model separation timefrequency
by asterix77
on 2008-02-13 18:27:37
Abstract
Source separation arises in a variety of signal processing applications, ranging from speech processing to medical image analysis. The separation of a superposition of multiple signals is accomplished by taking into account the structure of the mixing process and by making assumptions about the sources. When the information about the mixing process and sources is limited, the problem is called “blind”. By assuming that the sources can be represented sparsely in a given basis, recent research has demonstrated that solutions to ...
|
| |
Acoustical Science and Technology, Vol. 24, No. 4. (2003), pp. 172-178
Abstract
We can communicate with others in a noisy environment. This phenomenon is known as a “Cocktail Party Effect” and is one of the most important binaural functions. This paper addresses a frequency domain binaural model that plays the role of a binaural function based on an interaural phase and level difference. The proposed model is evaluated not only as a front-end of the speech recognition system, but also as a speech enhancer. According to the evaluation, when the direction of arrival ...
|
| |
Journal of the Acoustical Society of America, Vol. 27, No. 6. (1955), pp. 1072-1077, doi:10.1121/1.1908122
Abstract
Reverberation chambers used for acoustical measurements should have completely random sound fields. We denote by R the cross-correlation coefficient for the sound pressures at two points a distance r apart. $R = \langle p_1 p_2 \rangle_{Av} / (\langle p_1^2 \rangle_{Av} \langle p_2^2 \rangle_{Av})^{1/2}$, where $p_1$ is the sound pressure at one point, $p_2$ that at the other, and the angular brackets denote long time averages. In a random sound field, $R = (\sin kr)/kr$, where $k = 2\pi/$(the wavelength of the sound). An instrument for measuring and recording R as a function ...
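The diffuse-field result quoted in this abstract, R = sin(kr)/(kr) with wavenumber k = 2π/λ, is easy to evaluate numerically. The sketch below is only an illustration of that formula; the function name, the frequency-based parameterization (k = 2πf/c), and the nominal speed of sound are assumptions.

```python
import math


def diffuse_field_correlation(freq_hz, distance_m, c=343.0):
    """Correlation coefficient between two points in an ideal diffuse field."""
    k = 2.0 * math.pi * freq_hz / c   # wavenumber, equivalent to 2*pi / wavelength
    kr = k * distance_m
    if kr == 0.0:
        return 1.0                    # limit of sin(kr)/kr as kr -> 0
    return math.sin(kr) / kr
```

The first zero falls at a spacing of half a wavelength (kr = π), which is why closely spaced points in a diffuse field remain strongly correlated at low frequencies.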
|
| |
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), Vol. 3 (2005), pp. 81-84, doi:10.1109/icassp.2005.1415651
posted to masking model separation
by asterix77
on 2008-01-10 20:33:10
Abstract
Musical noise is a typical problem with blind source separation using a time-frequency mask. We report that a fine-shift and overlap-add method reduces the musical noise without degrading the separation performance. The effectiveness was confirmed by results of a listening test undertaken in a room with a reverberation time of RT60 = 130 ms. ...
|
| |
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 36, No. 5. (October 2006), pp. 982-994, doi:10.1109/tsmcb.2006.872263
Abstract
This paper proposes a biologically inspired and technically implemented sound localization system to robustly estimate the position of a sound source in the frontal azimuthal half-plane. For localization, binaural cues are extracted using cochleagrams generated by a cochlear model that serve as input to the system. The basic idea of the model is to separately measure interaural time differences and interaural level differences for a number of frequencies and process these measurements as a whole. This leads to two-dimensional frequency versus ...
|
| |
Journal of the Acoustical Society of America, Vol. 116, No. 5. (2004), pp. 3141-3151, doi:10.1121/1.1781621
Abstract
Reverberation interferes with the ability to understand speech in rooms. Overlap-masking explains this degradation by assuming reverberant phonemes endure in time and mask subsequent reverberant phonemes. Most listeners benefit from binaural listening when reverberation exists, indicating that the listener's binaural system processes the two channels to reduce the reverberation. This paper investigates the hypothesis that the binaural word intelligibility advantage found in reverberation is a result of binaural overlap-masking release with the reverberation acting as masking noise. The tests utilize phonetically ...
|
| |
J. Mach. Learn. Res., Vol. 10 (December 2009), pp. 2233-2271
posted to boosting ranking
by asterix77
on 2013-04-17 19:59:08
Abstract
We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This "push" near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose lp-norms to provide a specific type of push; ...
|
| |
posted to review sparse
by asterix77
on 2013-03-04 17:18:08
Abstract
Musical signals are, strictly speaking, acoustic signals where some aesthetically relevant information is conveyed through propagating pressure waves. Although the human auditory system exhibits a remarkable ability to interpret and understand these sound waves, these types of signals cannot be processed as such by computers. Obviously, the signals have to be converted into digital form, and this first implies sampling and quantization. In time-domain digital formats, such as the Pulse Code Modulation (PCM)—or newer formats such as one-bit oversampled bitstreams used ...
|
| |
Abstract
Itakura and Saito [1] used the maximum likelihood (ML) method to derive a spectral matching criterion for autoregressive (i.e., all-pole) random processes. In this paper, their results are generalized to periodic processes having arbitrary model spectra. For the all-pole model, Kay's [2] covariance domain solution to the recursive ML (RML) problem is cast into the spectral domain and used to obtain the RML solution for periodic processes. When applied to speech, this leads to a method for solving the joint pitch ...
|
| |
posted to binaural dereverb
by asterix77
on 2013-02-12 19:01:04
Abstract
The ability of the human auditory system for sound localization mainly depends on the binaural cues, especially interaural time and level differences (ITD and ILD). In the context of digital hearing aids and binaural audio transmission systems, these cues can be severely degraded by independent bilateral signal processing such as dereverberation or noise reduction. This contribution presents a novel two-stage binaural dereverberation algorithm which explicitly preserves the binaural cues. The first stage is based on a statistical model of the room ...
|
| |
Abstract
Random finite sets (RFSs) are natural representations of multitarget states and observations that allow multisensor multitarget filtering to fit in the unifying random set framework for data fusion. Although the foundation has been established in the form of finite set statistics (FISST), its relationship to conventional probability is not clear. Furthermore, optimal Bayesian multitarget filtering is not yet practical due to the inherent computational hurdle. Even the probability hypothesis density (PHD) filter, which propagates only the first moment (or PHD) instead ...
|
| |
posted to casa localization reverb
by asterix77
on 2013-02-11 22:16:12
Abstract
Sound source localization from a binaural input is a challenging problem, particularly when multiple sources are active simultaneously and reverberation or background noise are present. In this work, we investigate a multi-source localization framework in which monaural source segregation is used as a mechanism to increase the robustness of azimuth estimates from a binaural input. We demonstrate performance improvement relative to binaural only methods assuming a known number of spatially stationary sources. We also propose a flexible azimuth-dependent model of binaural ...
|
| |
Journal of the Royal Statistical Society. Series B (Methodological), Vol. 46, No. 2. (1984)
posted to em online temporal
by asterix77
on 2013-02-11 16:21:09
Abstract
Stochastic approximation procedures are considered for the estimation of parameters using incomplete data. One procedure is stated and illustrated which often leads to asymptotically efficient estimators. Others are developed which, although possibly not optimal in the above sense, will be very much easier to apply. This will be particularly advantageous when quick recursive estimates are required. Examples are given and a link is made between one of the sub-optimal methods and the EM algorithm. ...
|
| |
posted to lsf lsp
by asterix77
on 2013-02-04 16:49:30
Abstract
It has been known that the linear predictor coefficients (LPC) of speech signals can be transformed into a “pseudo” vocal‐tract area function whose boundary conditions are (a) a complete opening at the lips and (b) a matching resistance termination at the glottis. If the boundary condition at the glottis is replaced by a complete opening or a complete closure, all the poles of the resulting system function will move onto the unit circle in z plane. Using this fact it is ...
|
| |
In Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings, 1996 IEEE International Conference on, Vol. 2 (May 1996), pp. 805-808, doi:10.1109/icassp.1996.543243
posted to lsp temporal tracking
by asterix77
on 2013-01-25 21:55:32
Abstract
We present an adaptive path-following method based on the technique of homotopy, which efficiently computes the line spectral pairs by exploiting their natural ordering and low frame-to-frame variation. We first define continuous paths from known roots of the LSP polynomials of a prior speech frame to the unknown roots of the next frame in the sequence. A gradient-search based numerical predictor-corrector procedure is then used for tracing these paths in order to compute the unknown roots. This method uses only scalar ...
|
| |
In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on (March 2008), pp. 1833-1836, doi:10.1109/icassp.2008.4517989
posted to mixturemodel nmf separation
by asterix77
on 2013-01-23 21:08:20
Abstract
We present a new probabilistic architecture for analyzing composite non-negative data, called Non-negative Subspace Analysis (NSA). The NSA model provides a framework for understanding the relationships between sparse subspace and mixture model based approaches, and encompasses a range of models, including Sparse Non-negative Matrix Factorization (SNMF) [1] and mixture-model based analysis as special cases. We present a convenient instantiation of the NSA model, and an efficient variational approximate learning and inference algorithm that combines the advantages of SNMF and mixture model-based ...
|
| |
posted to asynchronous review sequence
by asterix77
on 2013-01-15 15:31:50
Abstract
Event-driven analog-to-digital conversion and associated digital signal processing techniques are reviewed. Such techniques, still in the research stage, have the potential to significantly reduce the consumption of energy and bandwidth resources in several important applications. ...
|
| |
Abstract
We present a self-learning algorithm using a bottom-up based approach to automatically discover, acquire and recognize the words of a language. First, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time-frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed for static and dynamic speech features using a spectral representation of both short and long acoustic events. By describing speech in terms of the discovered time-frequency patches, patch activations are obtained which express ...
|
| |
posted to autotag image topicmodel
by asterix77
on 2013-01-15 15:17:09
Abstract
This paper presents a new probabilistic model for the task of image annotation. Our model, which we call sLDA-bin, extends the supervised Latent Dirichlet Allocation (sLDA) model to handle a multi-variate binary response variable of the annotation data. Unlike correspondence LDA (cLDA), the association model in sLDA allows each caption word to be associated with more than one image region and is thus more appropriate for annotation words that globally describe the scene. By modeling the response variable as a multi-variate Bernoulli, ...
|
| |
J. Mach. Learn. Res., Vol. 6 (December 2005), pp. 1453-1484
Abstract
Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish ...
|