| |
In Proceeding of the 17th ACM conference on Information and knowledge management (CIKM'08) (2008), pp. 911-920
posted to no-tag
by mehrbod
on 2011-05-09 06:12:02
|
| |
In Signal Processing Magazine, IEEE, Vol. 22, No. 5. (2005), pp. 70-80
posted to no-tag
by mehrbod
on 2011-05-09 06:06:40
Abstract
This article has described LSM, a data-driven framework for modeling globally meaningful relationships implicit in large volumes of data. LSM generalizes a paradigm originally developed to capture hidden word patterns in a text document corpus. Over the past decade, this paradigm has proven effective in an increasing variety of fields, gradually spreading from query-based information retrieval to word clustering, document/topic clustering, large-vocabulary speech recognition language modeling, automated call routing, semantic inference for spoken interface control, and several other speech processing applications. ...
|
| |
Abstract
Proactive learning is a generalization of active learning designed to relax unrealistic assumptions and thereby reach practical applications. Active learning seeks to select the most informative unlabeled instances and ask an omniscient oracle for their labels, so as to retrain the learning algorithm maximizing accuracy. However, the oracle is assumed to be infallible (never wrong), indefatigable (always answers), individual (only one oracle), and insensitive to costs (always free or always charges the same). Proactive learning relaxes all four of these assumptions, ...
|
| |
In SS'07: Proceedings of the 16th USENIX Security Symposium (2007), pp. 1-14
|
| |
In TREC, Vol. Special Publication 500-274 (2007)
posted to no-tag
by mehrbod
on 2011-03-25 13:34:16
|
| |
posted to no-tag
by mehrbod
on 2011-03-24 17:20:12
Abstract
While many cybersecurity tools are available to computer users, their default configurations often do not match needs of specific users. Since most modern users are not computer experts, they are often unable to customize these tools, thus getting either insufficient or excessive security. To address this problem, we are developing an automated assistant that learns security needs of the user and helps customize available tools. ...
|
| |
In 15th ACM Conference on Computer and Communications Security (CCS) (2008)
posted to no-tag
by mehrbod
on 2011-03-24 17:15:35
along with 1 person
Mutjake
Abstract
Cross-Site Request Forgery (CSRF) is a widely exploited web site vulnerability. In this paper, we present a new variation on CSRF attacks, login CSRF, in which the attacker forges a cross-site request to the login form, logging the victim into the honest web site as the attacker. The severity of a login CSRF vulnerability varies by site, but it can be as severe as a cross-site scripting vulnerability. We detail three major CSRF defense techniques and find shortcomings with each technique. ...
|
| |
In Proceedings of the 20th international joint conference on Artificial intelligence (2007), pp. 714-719
posted to no-tag
by mehrbod
on 2011-03-24 17:08:43
Abstract
Many real-world classification problems involve large numbers of overlapping categories that are arranged in a hierarchy or taxonomy. We propose to incorporate prior knowledge on category taxonomy directly into the learning architecture. We present two concrete multi-label classification methods, a generalized version of Perceptron and a hierarchical multi-label SVM learning. Our method works with arbitrary, not necessarily singly connected taxonomies, and can be applied more generally in settings where categories are characterized by attributes and relations that are not necessarily induced ...
|
| |
In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006), pp. 821-826, doi:10.1145/1150402.1150510
Abstract
We present a risk minimization formulation for learning from both text and graph structures which is motivated by the problem of collective inference for hypertext document categorization. The method is based on graph regularization formulated as a well-formed convex optimization problem. We present numerical algorithms for our formulation, and show that such combination of local text features and link information can lead to improved predictive accuracy. ...
|
| |
In Proceedings of the NAACL HLT 2010 Workshop on Semantic Search (2010), pp. 10-18
posted to no-tag
by mehrbod
on 2011-03-24 17:01:24
Abstract
In this paper, we propose a multiword-enhanced author topic model that clusters authors with similar interests and expertise, and apply it to an information retrieval system that returns a ranked list of authors related to a keyword. For example, we can retrieve Eugene Charniak via search for statistical parsing. The existing works on author topic modeling assume a "bag-of-words" representation. However, many semantic atomic concepts are represented by multiwords in text documents. This paper presents a pre-computation step as a way ...
|
| |
In AIStats (2009)
posted to no-tag
by mehrbod
on 2011-03-24 16:06:20
|
| |
Abstract
A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. The approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. Variational approximations based on Kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into ...
|
| |
In CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management (2008), pp. 911-920, doi:10.1145/1458082.1458202
Abstract
Topic modeling has been a key problem for document analysis. One of the canonical approaches for topic modeling is Probabilistic Latent Semantic Indexing, which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document on the hidden topics independently and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. Latent Dirichlet Allocation ...
|
| |
In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers (2006), pp. 77-80
posted to no-tag
by mehrbod
on 2011-03-24 14:42:24
Abstract
The goal of the on-going project described in this paper is evaluation of the utility of Latent Semantic Analysis (LSA) for unsupervised word sense discrimination. The hypothesis is that LSA can be used to compute context vectors for ambiguous words that can be clustered together – with each cluster corresponding to a different sense of the word. In this paper we report first experimental result on tightness, separation and purity of sense-based clusters as a function of vector space dimensionality and ...
|
| |
Journal of the American Society of Information Science, Vol. 41, No. 6. (1990), pp. 391-407
Abstract
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be ...
|
| |
In Proceedings of the 11th Annual Conference on Computational Learning Theory (1998), pp. 92-100
Abstract
We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in ...
|
| |
In Proceedings of the 34th annual meeting on Association for Computational Linguistics (1996), pp. 310-318, doi:10.3115/981863.981904
Abstract
We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, ...
|
| |
In STOC '97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (1997), pp. 334-343, doi:10.1145/258533.258616
|
| |
Abstract
We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the lp norm. It also yields the first known provably efficient approximate NN algorithm for the case p<1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on ...
|
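The hashing idea named in the abstract above can be sketched for the p = 2 case, where the Gaussian distribution is 2-stable. This is only a minimal illustration under stated assumptions: the hash form floor((a·v + b) / w) is the standard one for this family, but the function name `make_pstable_hash` and the `width` and `seed` parameters are hypothetical choices made here, not taken from the paper.

```python
import math
import random


def make_pstable_hash(dim, width=4.0, seed=0):
    """Return one LSH function h(v) = floor((a . v + b) / width).

    Assumption: p = 2, so the entries of `a` are drawn from a standard
    Gaussian (a 2-stable distribution). `width` and `seed` are illustrative.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection direction
    b = rng.uniform(0.0, width)                    # random offset within one bucket

    def h(v):
        if len(v) != dim:
            raise ValueError("vector has the wrong dimension")
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / width)

    return h


if __name__ == "__main__":
    h = make_pstable_hash(dim=3, width=4.0, seed=1)
    # Nearby points tend to share a bucket; distant points tend not to.
    print(h([1.0, 2.0, 3.0]), h([1.1, 2.0, 2.9]), h([40.0, -25.0, 60.0]))
```

In practice several such functions are concatenated into one bucket key and several independent tables are used, which trades memory for a higher chance of retrieving true near neighbors.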
| |
In Signal Processing Magazine, IEEE, Vol. 25, No. 2. (March 2008), pp. 128-131, doi:10.1109/msp.2007.914237
Abstract
This lecture note describes a technique known as locality-sensitive hashing (LSH) that allows one to quickly find similar entries in large databases. This approach belongs to a novel and interesting class of algorithms that are known as randomized algorithms. A randomized algorithm does not guarantee an exact answer but instead provides a high probability guarantee that it will return the correct answer or one close to it. By investing additional computational effort, the probability can be pushed as high as desired. ...
|
| |
Abstract
We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs "semantic hashing": Documents are mapped to memory addresses in such a way that semantically ...
|
| |
In CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management (2004), pp. 78-87, doi:10.1145/1031171.1031186
Abstract
Automatically categorizing documents into pre-defined topic hierarchies or taxonomies is a crucial step in knowledge and content management. Standard machine learning techniques like Support Vector Machines and related large margin methods have been successfully applied for this task, albeit the fact that they ignore the inter-class relationships. In this paper, we propose a novel hierarchical classification method that generalizes Support Vector Machine learning and that is based on discriminant functions that are structured in a way that mirrors the class hierarchy. ...
|
| |
Abstract
Very large-scale classification taxonomies typically have hundreds of thousands of categories, deep hierarchies, and skewed category distribution over documents. However, it is still an open question whether the state-of-the-art technologies in automated text categorization can scale to (and perform well on) such large taxonomies. In this paper, we report the first evaluation of Support Vector Machines (SVMs) in web-page classification over the full taxonomy of the Yahoo! categories. Our accomplishments include: 1) a data analysis on the Yahoo! taxonomy; 2) the ...
|
| |
In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (June 2010), pp. 591-598
posted to no-tag
by mehrbod
on 2011-03-21 07:17:05
|
| |
In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (2001), pp. 137-145, doi:10.1145/383952.383975
Abstract
Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the testbeds, three common thresholding methods were investigated, including rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut); in addition, new variants of these methods are proposed to overcome significant problems in ...
|
| |
Journal of Machine Learning Research, Vol. 2 (December 2001), pp. 265-292
posted to no-tag
by mehrbod
on 2011-03-21 06:52:37
along with 1 person
kira
Abstract
In this paper we describe the algorithmic implementation of multiclass kernel-based vector machines. Our starting point is a generalized notion of the margin to multiclass problems. Using this notion we cast multiclass categorization problems as a constrained optimization problem with a quadratic objective function. Unlike most of previous approaches which typically decompose a multiclass problem into multiple independent binary classification tasks, our notion of margin yields a direct method for training multiclass predictors. By using the dual of the optimization problem ...
|
| |
In Proceedings of ICML-00, 17th International Conference on Machine Learning (2000), pp. 303-310
Abstract
This paper explores in detail the use of Error Correcting Output Coding (ECOC) for learning text classifiers. We show that the accuracy of a Naive Bayes Classifier over text classification tasks can be significantly improved by taking advantage of the error-correcting properties of the code. We also explore the use of different kinds of codes, namely Error-Correcting Codes, Random Codes, and Domain and Data-specific codes and give experimental results for each of them. The ECOC method ...
|
| |
In Machine Learning, Vol. 45 (2001), pp. 5-32
posted to no-tag
by mehrbod
on 2011-03-21 06:40:20
|
| |
IEEE International Conference on Granular Computing, Vol. 2 (2005), pp. 718-721
Abstract
In multi-label learning, each instance in the training set is associated with a set of labels, and the task is to output a label set whose size is unknown a priori for each unseen instance. In this paper, a multi-label lazy learning approach named ML-kNN is presented, which is derived from the traditional k-nearest neighbor (kNN) algorithm. In detail, for each new instance, its k-nearest neighbors are firstly identified. After that, according to the label sets of these neighboring instances, maximum ...
|
| |
In Proceedings of the third ACM international conference on Web search and data mining (2010), pp. 101-110
|
| |
In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2002)
Abstract
We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based ...
|
| |
(2009)
posted to no-tag
by mehrbod
on 2011-03-21 05:25:13
|
| |
In UAI (2005), pp. 658-666
posted to no-tag
by mehrbod
on 2011-03-21 05:10:42
|
| |
Abstract
This survey covers fifteen years of research in the Named Entity Recognition and Classification (NERC) field, from 1991 to 2006. We report observations about languages, named entity types, domains and textual genres studied in the literature. From the start, NERC systems have been developed using hand-made rules, but now machine learning techniques are widely used. These techniques are surveyed along with other critical aspects of NERC such as features and evaluation methods. Features are word-level, dictionary-level and corpus-level representations of words ...
|
| |
Abstract
PhishGuru is an embedded training system that teaches users to avoid falling for phishing attacks by delivering a training message when the user clicks on the URL in a simulated phishing email. In previous lab and real-world experiments, we validated the effectiveness of this approach. Here, we extend our previous work with a 515-participant, real-world study in which we focus on long-term retention and the effect of two training messages. We also investigate demographic factors that influence training and general phishing ...
|
| |
In First International Workshop on Adversarial Information Retrieval on the Web (2005)
Abstract
Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures. ...
|
| |
Abstract
We describe efficient algorithms for projecting a vector onto the l1-ball. We present two methods for projection. The first performs exact projection in O(n) expected time, where n is the dimension of the space. The second works on vectors k of whose elements are perturbed outside the l1-ball, projecting in O(k log(n)) time. This setting is especially useful for online learning in sparse feature spaces such as text categorization applications. We demonstrate the merits and effectiveness of our algorithms in numerous ...
|
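The projection described in the abstract above can be illustrated with a short sketch. Note that this is the simpler sort-based variant, which runs in O(n log n) time; it is not the O(n) expected-time or the O(k log(n)) algorithm the abstract refers to. The function name `project_onto_l1_ball` and the `radius` parameter are assumptions chosen for illustration.

```python
def project_onto_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {w : sum(|w_i|) <= radius}.

    Sort-based sketch (O(n log n)); names and defaults are illustrative.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in v]
    if sum(magnitudes) <= radius:
        return list(v)  # already inside the ball, nothing to do

    # Scan magnitudes in decreasing order to find the soft threshold theta.
    cumulative = 0.0
    theta = 0.0
    for j, u in enumerate(sorted(magnitudes, reverse=True), start=1):
        cumulative += u
        candidate = (cumulative - radius) / j
        if u - candidate > 0:
            theta = candidate  # coordinate j stays non-zero after shrinking
        else:
            break  # all remaining (smaller) coordinates are zeroed out

    # Shrink every coordinate toward zero by theta, keeping its sign.
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - theta, 0.0) for x in v]


if __name__ == "__main__":
    w = project_onto_l1_ball([3.0, -1.0, 0.5], radius=2.0)
    print(w, sum(abs(x) for x in w))
```

Because the threshold zeroes out small coordinates, the result is typically sparse, which is why this kind of projection suits the sparse text-feature setting the abstract mentions.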
| |
Abstract
Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. In this paper, we study this issue in the context of ...
|
| |
In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010), pp. 1268-1277
posted to no-tag
by mehrbod
on 2010-12-12 20:11:28
Abstract
In this paper, we address the task of mapping high-level instructions to sequences of commands in an external environment. Processing these instructions is challenging---they posit goals to be achieved without specifying the steps required to complete them. We describe a method that fills in missing information using an automatically derived environment model that encodes states, transitions, and commands that cause these transitions to happen. We present an efficient approximate approach for learning this environment model as part of a policy-gradient reinforcement ...
|
| |
(2001)
Abstract
Natural language is an easy and effective medium for describing visual ideas and mental images. Thus, we foresee the emergence of language-based 3D scene generation systems to let ordinary users quickly create 3D scenes without having to learn special software, acquire artistic skills, or even touch a desktop window-oriented interface. WordsEye is such a system for automatically converting text into representative 3D scenes. WordsEye relies on a large database of 3D models and poses to depict entities and actions. Every 3D ...
|
| |
In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb) (2007)
posted to no-tag
by mehrbod
on 2010-11-15 15:54:04
Abstract
... of ten content-based classifiers stacked using logistic regression. Each classifier used one of two state-of-the-art email filters – DMC [2] or OSBF-Lua [1] – applied to simple text files, with each text file acting as a proxy for a host to be classified. All text files were derived from the home page (including ...
|
| |
Abstract
Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and financial transactions. Low-rank decompositions, such as singular value decomposition (SVD) and CUR, are powerful techniques for revealing latent-hidden variables and associated patterns from high dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually incur too high memory and computational cost ...
|
| |
In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (2007), pp. 687-696, doi:10.1145/1281192.1281266
Abstract
How can we find communities in dynamic networks of social interactions, such as who calls whom, who emails whom, or who sells to whom? How can we spot discontinuity time-points in such streams of graphs, in an on-line, any-time fashion? We propose GraphScope, that addresses both problems, using information theoretic principles. Contrary to the majority of earlier methods, it needs no user-defined parameters. Moreover, it is designed to operate on large graphs, in a streaming fashion. We demonstrate the efficiency and effectiveness ...
|
| |
In VLDB '04: Proceedings of the Thirtieth international conference on Very large data bases (2004), pp. 576-587
Abstract
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that ...
|
| |
In Proceedings of the 28th international conference on Human factors in computing systems (2010), pp. 373-382, doi:10.1145/1753326.1753383
Abstract
In this paper we present the results of a roleplay survey instrument administered to 1001 online survey respondents to study both the relationship between demographics and phishing susceptibility and the effectiveness of several anti-phishing educational materials. Our results suggest that women are more susceptible than men to phishing and participants between the ages of 18 and 25 are more susceptible to phishing than other age groups. We explain these demographic factors through a mediation analysis. Educational materials reduced users' tendency to ...
|
| |
In International Conference on Weblogs and Social Media (May 2010)
|
| |
|