LDA-based document models for ad-hoc retrieval (2006), pp. 178-185.
Notes for this article
At the current stage of our work, the parameters are selected through exhaustive search or manual hill-climbing search. All parameter values are tuned based on average precision since retrieval is our final task. The parameter selection process, including the training set selection, also follows Liu and Croft (2004) to make the results comparable. Mean average precision is used as the basis of evaluation throughout this study.
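As a reminder of what is being optimized, mean average precision can be sketched as follows. This uses the standard textbook definition (precision averaged over the ranks of the relevant documents, divided by the total number of relevant documents); the function names are illustrative, and the note does not say which evaluation script was actually used.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

An exhaustive parameter search would then evaluate this quantity on the training queries for every candidate parameter setting and keep the best one.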
We formulate our model through a linear combination obtained in one of the following ways: (a) linearly combining the original document model and LDA, which is illustrated in (7), (b) additively combining the LDA model with the maximum likelihood estimate of word w in the document D, and (c) combining the LDA model with the Dirichlet smoothing part, i.e. the maximum likelihood estimate of word w in the entire collection. Option (c) is similar to the combination used in Liu and Croft (2004). All methods have empirically shown similar performance with appropriate parameters, and we will only report results of Option (a), which performs slightly better in our experiments (parameter setting in our paper is for (a); it may be necessary to adjust λ and μ in (b)
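Option (a) can be sketched in code. Equation (7) is not reproduced in these notes, so the sketch below rests on assumptions: the original document model is taken to be the Dirichlet-smoothed unigram model with parameter μ, λ is taken to weight that model and 1 − λ the LDA model, and the LDA probability is taken to be a sum over topics of P(w|z)·P(z|D). All function and argument names are hypothetical.

```python
def dirichlet_smoothed(p_ml_doc, p_ml_coll, doc_len, mu):
    """Dirichlet-smoothed unigram probability of a word given a document."""
    weight = doc_len / (doc_len + mu)
    return weight * p_ml_doc + (1.0 - weight) * p_ml_coll


def lda_word_prob(word_given_topic, topic_given_doc):
    """Assumed LDA document model: sum over topics z of P(w|z) * P(z|D)."""
    return sum(pw * pz for pw, pz in zip(word_given_topic, topic_given_doc))


def combined_word_prob(p_ml_doc, p_ml_coll, doc_len, mu, lam, p_lda):
    """Option (a): lam * smoothed document model + (1 - lam) * LDA model."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    smoothed = dirichlet_smoothed(p_ml_doc, p_ml_coll, doc_len, mu)
    return lam * smoothed + (1.0 - lam) * p_lda
```

With `lam = 1.0` the score falls back to the smoothed document model alone, which makes the LDA contribution easy to isolate when tuning.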
Compared to the pLSI model, LDA possesses fully consistent generative semantics by treating the topic mixture distribution as a k-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set; thus LDA overcomes the overfitting problem and the problem of generating new documents in pLSI.
In pLSI, the topic mixture is conditioned on each document. In LDA, the topic mixture is drawn from a conjugate Dirichlet prior that remains the same for all documents.
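The LDA side of this contrast can be illustrated with a minimal generative sketch. It assumes the topic-word distributions are fixed and given, draws the per-document topic mixture from a shared Dirichlet prior, and then draws a topic and a word for each position. The names `alpha`, `topic_word` and `vocab` are illustrative, and this shows generation only, not the inference procedure used in the paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_lda_document(alpha, topic_word, vocab, length, rng):
    """Generate one document under LDA with fixed topic-word distributions."""
    # The same prior alpha is used for every document; only theta differs.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(alpha)))
    words = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=theta)[0]       # topic for this word
        w = rng.choices(vocab, weights=topic_word[z])[0]   # word from that topic
        words.append(w)
    return theta, words
```

Because `theta` is sampled rather than stored per training document, the same procedure applies unchanged to a document that was never seen before.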
The cluster model possesses fully generative semantics, but the assumption that each string (document) is generated from a single topic is limiting and may become problematic for long documents and large collections.
The pLSI model itself has a problem in that its generative semantics are not well-defined (Blei et al., 2003); thus there is no natural way to predict a previously unseen document, and the number of parameters of pLSI grows linearly with the number of training documents, which makes the model susceptible to overfitting.
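The growth in parameters can be made concrete with a small count. This follows the accounting in Blei et al. (2003): a k-topic pLSI model over a vocabulary of size V and M training documents has kV + kM parameters, while a k-topic LDA model has k + kV. The function names and the example sizes in the usage note are illustrative only.

```python
def plsi_parameter_count(num_topics, vocab_size, num_training_docs):
    """k*V topic-word parameters plus k*M per-document mixture weights."""
    return num_topics * vocab_size + num_topics * num_training_docs


def lda_parameter_count(num_topics, vocab_size):
    """k Dirichlet parameters plus k*V topic-word parameters."""
    return num_topics + num_topics * vocab_size
```

For example, with 100 topics and a 50,000-word vocabulary, doubling the training set from 10,000 to 20,000 documents adds 1,000,000 parameters to pLSI and none to LDA.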
The roots of pLSI go back to Latent Semantic Indexing/Analysis (Deerwester et al., 1990). pLSI was designed as a discrete counterpart of LSI to provide a better fit to text data. It can also be regarded as an attempt to relax the assumption made in the mixture of unigrams model that each document is generated from a single topic. pLSI models each document as a mixture of topics.
As a much simpler topic model, the mixture of unigrams model generates a whole document from one topic under the assumption that each document is related to exactly one topic.
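The single-topic assumption is easy to see in a generative sketch: one topic is drawn for the whole document, and every word is then drawn from that topic's word distribution. The names `topic_prior`, `topic_word` and `vocab` are illustrative, and the distributions are assumed to be given.

```python
import random


def generate_mixture_of_unigrams_document(topic_prior, topic_word, vocab, length, rng):
    """Pick one topic for the whole document, then draw every word from it."""
    topic_ids = list(range(len(topic_prior)))
    z = rng.choices(topic_ids, weights=topic_prior)[0]        # one topic per document
    words = rng.choices(vocab, weights=topic_word[z], k=length)
    return z, words
```

Models that let each document mix several topics differ in where the topic draw happens: it moves inside the per-word loop instead of being made once per document.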
Abstract
Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 60s and has recently produced good results in the language model framework. An approach to building topic models based on a formal generative model of documents, Latent Dirichlet Allocation (LDA), is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval is mostly unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework, and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA and the computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.