| |
Speech Commun., Vol. 18 (May 1996), pp. 205-231
posted to no-tag
by marlar
to the group searchingspeech2010
on 2011-10-04 00:39:10
|
| |
(02 May 1997)
Abstract
Foreword by Karen Spärck Jones
Intelligent multimedia information retrieval lies at the intersection of artificial intelligence, information retrieval, human-computer interaction, and multimedia computing. Its systems enable users to create, process, summarize, present, interact with, and organize information within and across different media such as text, speech, graphics, imagery, and video. These systems go beyond traditional hypermedia and hypertext environments to analyze and generate media, and support intelligent interaction with or via multiple media. ...
|
| |
Abstract
Advances in multimedia technologies have enabled the creation of huge archives of audio-video recordings of meetings, and there is burgeoning interest in developing meeting browsers to help users better leverage these archives. A recent study has shown that extractive summaries provide a more efficient way of navigating meeting content than simply reading through the transcript and using the audio-video record, or navigating via keyword search (Murray, 2007). The extractive summary technique identifies informative dialogue acts to generate general purpose summaries. These ...
|
| |
Database and Expert Systems Applications, International Workshop on, Vol. 0 (2008), pp. 130-134, doi:10.1109/dexa.2008.22
posted to no-tag
by marlar
to the group searchingspeech2010
on 2011-10-02 19:49:01
Abstract
Automatic genre classification is a simple and effective solution to describe semantic properties of multimedia data. In this paper, a method to classify the genre of TV programmes is presented. In our approach, four multimodal vectors, including both low-level perceptual descriptors and higher-level, human-centred features are employed. These vectors serve as the input for a parallel neural network system that performs classification of seven video genres. The experiment results confirm the effectiveness of our method, reaching a classification accuracy rate of ...
|
| |
Abstract
While storytelling has long been recognized as an important part of effective knowledge management in organizations, knowledge management technologies have generally not distinguished between stories and other types of discourse. In this paper we describe a new type of technological support for storytelling that involves automatically capturing the stories that people tell to each other in conversations. We describe our first attempt at constructing an automated story extraction system using statistical text classification and a simple voting scheme. We evaluate the ...
|
| |
In Proceedings of the seventh ACM international conference on Multimedia (Part 1) (1999), pp. 393-400, doi:10.1145/319463.319658
Abstract
The role of audio in the context of multimedia applications involving video is becoming increasingly important. Many efforts in this area focus on audio data that contains some built-in semantic information structure such as in broadcast news, or focus on classification of audio that contains a single type of sound such as clear speech or clear music only. In the CueVideo system, we detect and classify audio that consists of mixed audio, i.e. combinations of speech and music together with other ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2011-10-02 16:01:33
Abstract
Joke-o-mat HD is a system that allows a user to navigate sitcoms (such as Seinfeld) by "narrative themes", including scenes, punchlines, and dialog segments. The themes can be filtered by the main actors and by keyword. For example, the user can select to see only punchlines by Kramer that contain the word "armoire". The system infers the narrative themes using segmentation of the audio track into laughter, actors, words, and music. The segmentation can be generated either by an expert annotator, ...
|
| |
Abstract
Voice search is the technology underlying many spoken dialog systems (SDSs) that provide users with the information they request with a spoken query. The information normally exists in a large database, and the query has to be compared with a field in the database to obtain the relevant information. The contents of the field, such as business or product names, are often unstructured text. This article categorized spoken dialog technology into form filling, call routing, and voice search, and reviewed the ...
|
| |
In Proceedings of the fifth international ACM conference on Assistive technologies (2002), pp. 192-196, doi:10.1145/638249.638284
posted to searchingspeech
by marlar
to the group searchingspeech2010
on 2011-10-02 10:41:20
Abstract
The LIBERATED LEARNING PROJECT (LLP) is an applied research project studying two core questions: 1) Can speech recognition (SR) technology successfully digitize lectures to display spoken words as text in university classrooms? 2) Can speech recognition technology be used successfully as an alternative to traditional classroom notetaking for persons with disabilities? This paper addresses these intriguing questions and explores the underlying complex relationship between speech recognition technology, university educational environments, and disability issues. ...
|
| |
Abstract
Recently, increasing attention has been directed to the study of the emotional content of speech signals, and hence, many systems have been proposed to identify the emotional content of a spoken utterance. This paper is a survey of speech emotion classification addressing three important aspects of the design of a speech emotion recognition system. The first one is the choice of suitable features for speech representation. The second issue is the design of an appropriate classification scheme and the third issue ...
|
| |
Abstract
After two successful years at SIGIR in 2007 and 2008, the third workshop on Searching Spontaneous Conversational Speech (SSCS 2009) was held in conjunction with ACM Multimedia 2009. The goal of the SSCS series is to serve as a forum that brings together the disciplines that collaborate on spoken content retrieval, including information retrieval, speech recognition and multimedia analysis. Multimedia collections often contain a speech track, but in many cases it is ignored or not fully exploited for information retrieval. Currently, ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2011-02-16 14:22:49
Abstract
In this paper we introduce a new search algorithm that provides a simple, clean, and efficient interface between the speech and natural language components of a spoken language system. The N-Best algorithm is a time-synchronous Viterbi-style beam search algorithm that can be made to find the most likely N whole sentence alternatives that are within a given "beam" of the most likely sentence. The algorithm can be shown to be exact under some reasonable constraints. That is, it guarantees that ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2011-02-16 14:21:27
Abstract
Word graphs are directed acyclic graphs where each edge is labeled with a word and a score, and each node is labeled with a point in time. Word graphs form an efficient feedforward interface between continuous-speech recognition and linguistic processors. Word graphs with high coverage and modest graph densities can be generated with a computational load comparable with bigram best-sentence recognition. Results on word graph error rates and word graph densities are presented for the ASL (Architecture Speech/Language) benchmark test ...
|
| |
pp. 9-22
Abstract
A new technique is presented for searching digital audio at the word/phrase level. Unlike previous methods based upon Large Vocabulary Continuous Speech Recognition (LVCSR, with inherent problems of closed vocabulary and high word error rate), phonetic searching combines high speed and accuracy, supports open vocabulary, imposes low penalty for new words, permits phonetic and inexact spelling, enables user-determined depth of search, and is amenable to parallel execution for highly scalable deployment. A detailed comparison of accuracy between phonetic searching and one ...
|
| |
Abstract
This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target ...
|
| |
In Proceedings of the ACM Multimedia Workshop on Searching Spontaneous Conversational Speech (2010), pp. 11-14
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-12-12 12:03:04
|
| |
In ACM International Conference on Image and Video Retrieval 2010 (CIVR 2010) (July 2010)
posted to av
by marlar
to the group searchingspeech2010
on 2010-11-28 15:34:04
along with 1 group
mirlit
Abstract
Content-based video retrieval is maturing to the point where it can be used in real-world retrieval practices. One such practice is the audiovisual archive, whose users increasingly require fine-grained access to broadcast television content. We investigate to what extent content-based video retrieval methods can improve search in the audiovisual archive. In particular, we propose an evaluation methodology tailored to the specific needs and circumstances of the audiovisual archive, which are typically missed by existing evaluation initiatives. We utilize logged searches and ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-28 12:46:11
Abstract
Errors in speech recognition transcripts have a negative impact on effectiveness of content-based speech retrieval and present a particular challenge for collections containing conversational spoken content. We propose a Global Semantic Distortion (GSD) metric that measures the collection-wide impact of speech recognition error on spoken content retrieval in a query-independent manner. We deploy our metric to examine the effects of speech recognition substitution errors. First, we investigate frequent substitutions, cases in which the recognizer habitually mis-transcribes one word as another. Although ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-28 12:42:27
Abstract
In this paper, we describe the SemanticVox project. SemanticVox aims at providing a real link between speech transcription technologies from Vecsys [8] based on LIMSI research [9] and multimedia documents analysis and retrieval technologies from the Multilingual Multimedia Knowledge Engineering Laboratory (LIC2M) of the CEA-LIST [1]. The first application of the project is a cross-lingual automatic video indexing and retrieval system based on speech transcription and video analysis. The two main novelties of the system are: (i) its ability to manage ...
|
| |
Abstract
It is a conventional wisdom in the speech community that better speech recognition accuracy is a good indicator for better spoken language understanding accuracy, given a fixed understanding component. The findings in this work reveal that this is not always the case. More important than word error rate reduction, the language model for recognition should be trained to match the optimization objective for understanding. In this work, we applied a spoken language understanding model as the language model in speech recognition. ...
|
| |
In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (2006), pp. 673-674, doi:10.1145/1148170.1148311
Abstract
Early speech retrieval experiments focused on news broadcasts, for which adequate Automatic Speech Recognition (ASR) accuracy could be obtained. Like newspapers, news broadcasts are a manually selected and arranged set of stories. Evaluation designs reflected that, using known story boundaries as a basis for evaluation. Substantial advances in ASR accuracy now make it possible to build search systems for some types of spontaneous conversational speech, but present evaluation designs continue to rely on known topic boundaries that are no longer well ...
|
| |
posted to error
by marlar
to the group searchingspeech2010
on 2010-11-27 22:20:38
Abstract
Digital document archives are increasingly derived from various different media sources. At present such archives are stored and searched independently. The Information Retrieval from Mixed-Media Collections (IRMMC) project is investigating retrieval from combined document collections composed of items originating from differing media forms. Experimental investigation of a mixed-media retrieval task based on the existing TREC Spoken Document Retrieval task combining Text, Spoken and Scanned Image is described. Results show that nontext media perform well within the mixed-media collection. Also ...
|
| |
In International Conference on Language Resources and Evaluation (LREC) (2002)
posted to alignment
by marlar
to the group searchingspeech2010
on 2010-11-27 20:04:32
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-27 19:59:28
Abstract
This paper describes the MPEG-7 compliant indexing and retrieval system iFinder based on XML and open source database technology. The iFinder system automatically extracts metadata from A/V-content and allows access to the enriched content by means of a client/server-based retrieval engine. This multimedia retrieval system allows for search and retrieval of short video segments in huge multimedia archives. As a reference application, the iFinder system is used to index speeches from the German Parliament. The user can search for fragments of ...
|
| |
No. TR-99-013. (July 1999)
posted to ui
by marlar
to the group searchingspeech2010
on 2010-11-27 18:30:58
Abstract
We present the results of a study of users' perception of relevance of documents. Documents retrieved in response to a query are presented to users in a variety of ways, from full text to a machine spoken query-biased automatically-generated summary, and the difference in users' perception of relevance is studied. The aim is to study experimentally how users' perception of relevance varies depending on the form in which retrieved documents are presented. The experimental results suggest that the effectiveness of advanced multimedia ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-25 19:02:31
Abstract
This document describes the realization of a spoken information retrieval system and its application to word search in an indexed video database. The system uses automatic speech recognition (ASR) software to convert the audio signal of a video file into a transcript file and then a document indexing tool to index this transcribed file. Then, a spoken query, uttered by any user, is presented to the ASR to decode the audio signal and propose a hypothesis that is later used ...
|
| |
posted to filtering
by marlar
to the group searchingspeech2010
on 2010-11-25 19:00:39
|
| |
posted to std
by marlar
to the group searchingspeech2010
on 2010-11-22 09:20:40
Abstract
While spoken term detection (STD) systems based on word indices provide good accuracy, there are several practical applications where it is infeasible or too costly to employ an LVCSR engine. An STD system is presented, which is designed to incorporate a fast phonetic decoding front-end and be robust to decoding errors whilst still allowing for rapid search speeds. This goal is achieved through monophone open-loop decoding coupled with fast hierarchical phone lattice search. Results demonstrate that an STD system that is ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-22 09:19:28
Abstract
Enterprise-scale search engines are generally designed for linear text. Linear text is suboptimal for audio search, where accuracy can be significantly improved if the search includes alternate recognition candidates, commonly represented as word lattices. We propose two methods to enable text indexers to approximately index lattices with little or no code change: "TMI" (Time-based Merging for Indexing) aims at lattice-index size reduction, and the "sausage"-like "TALE" (Time-Anchored Lattice Expansion) approximation requires no indexer-code or data-format changes at all. On four enterprise-type ...
|
| |
In Proceedings of the 4th international conference on Machine learning for multimodal interaction (2008), pp. 237-247
posted to std
by marlar
to the group searchingspeech2010
on 2010-11-22 09:18:33
Abstract
The paper presents the Brno University of Technology (BUT) system for indexing and search of speech, combining LVCSR and phonetic approach. It brings a complete description of individual building blocks of the system from signal processing, through the recognizers, indexing and search until the normalization of detection scores. It also describes the data used in the first edition of NIST Spoken term detection (STD) evaluation. The results are presented on three US-English conditions - meetings, broadcast news and conversational telephone speech, ...
|
| |
posted to std
by marlar
to the group searchingspeech2010
on 2010-11-22 09:16:27
Abstract
This paper presents methods to improve retrieval of Out-Of-Vocabulary (OOV) terms in a Spoken Term Detection (STD) system. We demonstrate that automated tagging of OOV regions helps to reduce false alarms while incorporating phonetic confusability increases the hits. Additional features that boost the probability of a hit in accordance with the number of neighboring hits for the same query and query-length normalization also improve the overall performance of the spoken-term detection system. We show that these methods can be combined effectively ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-22 09:15:20
Abstract
This paper deals with comparison of sub-word based methods for spoken term detection (STD) task and phone recognition. The sub-word units are needed for search for out-of-vocabulary words. We compared words, phones and multigrams. The maximal length and pruning of multigrams were investigated first. Then two constrained methods of multigram training were proposed. We evaluated on the NIST STD06 dev-set CTS data. The conclusion is that the proposed method improves the phone accuracy more than 9% relative and STD accuracy more ...
|
| |
In International Conference on Intelligence Analysis (2005)
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-21 20:55:27
|
| |
In Workshop On Content Visualization And Intermediate Representations (1998)
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-21 12:52:14
|
| |
In AAAI Technical Report SS-97-03 (1997)
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-21 12:48:09
|
| |
posted to weights
by marlar
to the group searchingspeech2010
on 2010-11-20 21:25:44
Abstract
Because different indexing features actually have different discriminative capabilities for spoken term detection and different levels of reliability in recognition, it is reasonable to weight the indexing features in the transcribed lattices differently during spoken term detection. In this paper, we present an initial attempt at using two weighting schemes, one context independent (fixed weight for each feature) and one context dependent (different weights for the same feature in different contexts). These weights can be learned by optimizing a desired spoken term ...
|
| |
In Proceedings of the tenth international conference on Information and knowledge management (2001), pp. 580-582, doi:10.1145/502585.502697
posted to subwords
by marlar
to the group searchingspeech2010
on 2010-11-20 20:52:32
Abstract
Phonetic speech retrieval is used to augment word based retrieval in spoken document retrieval systems, for in and out of vocabulary words. In this paper, we present a new indexing and ranking scheme using metaphones and a Bayesian phonetic edit distance. We conduct an extensive set of experiments using a hundred hours of HUB4 data with ground truth transcript and twenty-four thousand query words. We show improvement of up to 15% in precision compared to results obtained with speech recognition alone, at ...
|
| |
posted to confidence
by marlar
to the group searchingspeech2010
on 2010-11-20 19:20:52
Abstract
There is increasing interest in systems which attempt to automate a task or a transaction using speech input and output. To function effectively with imperfect speech recognition, such systems require an estimate of which words in the output from the recogniser are likely to be correct and which can probably be disregarded as incorrect, i.e. a confidence-measure for each decoded word. We define a measure for evaluating the effectiveness of a post-classifier which estimates confidence-measures, and describe the development of a post-classifier for words decoded from the SWITCHBOARD database, ...
|
| |
In Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, Vol. ii (1994), pp. II/21-II/24, doi:10.1109/icassp.1994.389728
posted to asr confidence
by marlar
to the group searchingspeech2010
on 2010-11-20 17:18:14
Abstract
This paper describes and evaluates a new technique for evaluating confidence in word strings produced by a speech recognition system. It detects misrecognized and out-of-vocabulary words in spontaneous spoken dialogs. The system uses multiple, diverse knowledge sources including acoustics, semantics, pragmatics and discourse to determine if a word string is misrecognized. When likely misrecognitions are detected, a series of tests distinguishes out-of-vocabulary words from other error sources. The work is part of a larger effort to automatically recognize and understand new words when spoken in a spontaneous spoken dialog. ...
|
| |
In Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on (1989), pp. 627-630, doi:10.1109/icassp.1989.266505
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-20 14:51:13
Abstract
A word-spotting system using Gaussian hidden Markov models is presented. Several aspects of this problem are investigated. Specifically, results are reported on the use of various signal processing and feature transformation techniques. The authors have observed that performance can be greatly affected by the choice of features used, the covariance structure of the Gaussian models, and transformations based on energy and feature distributions. Due to the open-set nature of the problem, the specific techniques for modeling out-of-vocabulary speech and the choice of scoring metric can have a significant effect on ...
|
| |
posted to confidence
by marlar
to the group searchingspeech2010
on 2010-11-20 11:46:11
Abstract
The authors report on the detection of new words for the speaker-dependent and speaker-independent paradigms. A useful operating point in a speaker-dependent paradigm is defined at 71% detection rate and 1% false alarm rate. The authors present a novel technique for obtaining a phonetic transcription for a new word, which is needed to add the new word to the system. The technique utilizes DECtalk's text-to-sound rules to obtain an initial phonetic transcription for the new word. Since these text-to-sound rules are ...
|
| |
In Proceedings of the Acoustics, Speech, and Signal Processing, 1996. on Conference Proceedings., 1996 IEEE International Conference - Volume 01 (1996), pp. 515-517, doi:10.1109/icassp.1996.541146
posted to confidence
by marlar
to the group searchingspeech2010
on 2010-11-20 10:58:03
Abstract
An acoustic confidence measure for acceptance/rejection of recognition hypotheses for continuous speech utterances is proposed. This measure is useful for rejecting utterances that are out of domain, or contain out-of-vocabulary words or speech disfluencies. A phone-based approach is implemented so that a single global threshold can be applied to hypothesis rejection for any word sequence. Phone confidence is computed for each frame of speech as the posterior phone probability given the acoustic observation. Word sequence confidence is evaluated as the average ...
|
| |
posted to confidence
by marlar
to the group searchingspeech2010
on 2010-11-20 10:53:09
Abstract
In this paper we propose a novel way of estimating confidences for words that are recognized by a speech recognition system, together with a natural methodology for evaluating the overall quality of those confidence estimates. Our approach is based on an interpretation of a confidence as the probability that the corresponding recognized word is correct, and makes use of generalized linear models as a means for combining various predictor scores so as to arrive at confidence estimates. Experimental results using these models are presented based on four different sources ...
|
| |
In Third International Conference on Spoken Language Processing (ICSLP 94) (1994), pp. 2195-2198
posted to confidence
by marlar
to the group searchingspeech2010
on 2010-11-20 09:50:16
Abstract
To use a word spotting system efficiently, it is helpful to be able to predict the performance of the system accurately. In this paper, we investigate performance prediction under different conditions. First, we discuss how to use statistical techniques to predict performance, and its variability on new unseen testing data. Second, we show that classification trees can be used to estimate the posterior probability of putative hits and that posterior probability can predict performance of unlabeled test data. Thirdly, we show ...
|
| |
Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, Vol. 1 (1995), pp. 221-224, doi:10.1109/icassp.1995.479404
Abstract
The goal of this work is to highlight aspects of an experiment other than the word error rate. When a speech recognition experiment is performed, the word error rate provides no insight into the factors responsible for the recognition errors. We begin this paper by describing an experiment which contrasts the language of conversational speech with that of text from the Wall Street Journal. The remainder of the paper is devoted to the description of a more general approach to performance diagnosis which identifies significant sources of error ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-20 08:57:56
Abstract
There has been a considerable focus on information retrieval for multimedia databases. When speech is used as the source material for multimedia indexing, the effect of transcriber error on retrieval effectiveness must be considered. This paper describes a method for measuring the relevance of documents to queries when information about the probability of word transcription error is available. To support the use of this technique, a method is presented for estimating word error probability in speech recognition engines that use word graphs (lattices). An information retrieval experiment using this ...
|
| |
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-14 20:12:34
Abstract
The authors describe the design of a system to learn new words from spontaneous speech input, and present an initial experiment on detecting the new words to be learned. Learning a new word involves detecting an out-of-vocabulary word in the input, determining its meaning, and adding the word to the system lexicon and grammars. Such learning would enable later recognition, parsing, and interpretation of the new words. ...
|
| |
In Proceedings of HLT-NAACL 2004: Short Papers (2004), pp. 85-88
Abstract
In this paper we present preliminary results of a novel unsupervised approach for high-precision detection and correction of errors in the output of automatic speech recognition systems. We model the likely contexts of all words in an ASR system vocabulary by performing a lexical co-occurrence analysis using a large corpus of output from the speech system. We then identify regions in the data that contain likely contexts for a given query word. Finally, we detect words or sequences of words in ...
|
| |
In Proceedings of WWW/Internet (2007), pp. 145-152
posted to no-tag
by marlar
to the group searchingspeech2010
on 2010-11-14 19:57:29
|