On Text-Based Estimation of Document Relevance
Eerika Savia, Samuel Kaski, Ville Tuulos and Petri Myllymäki
Helsinki Institute for Information Technology HIIT
URL: http://cosco.hiit.fi/Articles/hiit-2004-5.pdf
NB. The HIIT Technical Reports series is intended for rapid dissemination of results produced by the HIIT researchers. Therefore, some of the results may also be later published as scientific articles elsewhere.
On Text-Based Estimation of Document Relevance
Helsinki Institute for Information Technology
University of Helsinki & Helsinki University of Technology
P.O. Box 26, FIN-00014 University of Helsinki, Finland
E-mail: {Ville.Tuulos, Petri.Myllymaki}@cs.helsinki.fi
E-mail: {Eerika.Savia,Samuel.Kaski}@hut.fi
Abstract- This work is part of a proactive information retrieval project that aims at estimating relevance from implicit user feedback. The noisy feedback signal needs to be complemented with all available information, and textual content is one of the natural sources. Here we take the first steps by investigating whether this source is at all useful in the challenging setting of estimating the relevance of a new document based on only a few samples with known relevance. It turns out that even sophisticated unsupervised methods like multinomial PCA (or Latent Dirichlet Allocation) cannot help much. By contrast, feature extraction supervised by relevant auxiliary data may help.
In proactive information retrieval, the system adapts to the interests of the user that are inferred from implicit feedback. Feedback given by explicitly indicating which documents are relevant is naturally more accurate, but users often consider it too laborious and time-consuming. The usability and accuracy of information retrieval applications would be greatly enhanced by complementing explicit feedback with implicit feedback signals measured from the user and the interface. Research on implicit feedback potentially has even wider-ranging implications. If the feedback signal is reliable enough, it will be useful in a range of other applications as well. Ultimately, a genuine personal assistant could adapt to the goals and interests of the user and learn to disambiguate her vague commands and anticipate her actions.

In this first stage of the work we start with a simplified setting, where the user is reading a given document. That is, we assume that the document has already been chosen in some way or is a new one, and the task is to estimate whether it is relevant or not. This will be done using implicit feedback such as eye movements, which we studied in [1].

The problem with implicit relevance signals is that they will necessarily be noisy, and they need to be complemented with any available sources of relevant information. Textual content is a natural one, since it is the basis of all standard information retrieval. In this paper we study how accurately relevance can be estimated based on textual content only, when only a few documents with known relevance are available. If textual content helps in prediction (compared to random performance), it will be used as prior knowledge in inferring relevance from implicit feedback. This problem is closely related to standard text classification, and some simple standard methods will be included in the comparisons.

Here we report the results of a feasibility study that aims at answering the following research questions: (1) is the prediction accuracy high enough, (2) does a rigorous unsupervised model of the document collection help in the task, and (3) does suitable auxiliary data help.

In this paper we focus on the following setting. Let D denote a collection of I documents D1, ..., DI. Each document Di consists of words w, and the number of different possible words is J. In the following we make the standard simplifying "bag-of-words" assumption: the order of the words within a single document is considered irrelevant, and only the counts of different words in a document are used as features.

A lot of research related to this type of setting focuses on unsupervised data exploration tasks such as data clustering or dimensionality reduction. In data clustering, the document collection D is partitioned into several subsets such that the documents within each subset are in some sense similar to each other, and different from the documents in the other subsets. In dimensionality reduction, the goal is to find a low-dimensional representation of the document collection so that the coordinates of the resulting low-dimensional space correspond to some interesting factors that cannot be directly observed in the data. The Websom system [2] is an example of both data clustering and dimensionality reduction.

The models produced by data clustering or dimensionality reduction methods can be used for unsupervised data exploration tasks where the goal is to achieve a better understanding of the regularities governing the domain the data comes from. However, it is obvious that this type of model can also be used for information retrieval tasks, where the goal can, for example, be to find from a document collection D the document Di that is the most similar to a given, previously unseen document.

In this work we deviate from the standard unsupervised data exploration setting and consider the following supervised modeling problem. We assume that each text document is provided with some labels. For simplicity, let us assume that the labels are simply binary, and let us denote the label of document Di by ri. If ri = 1, we say that the document is relevant; otherwise it is considered irrelevant.

The meaning of relevance depends, of course, on the semantic interpretation of the binary labels r1, ..., rI, and is subjective to the person doing the labeling. Consequently, label ri = 1 could, for example, represent the fact that the person doing the labeling liked the text in document Di, or that she liked the matter that the text refers to. In any case, the task we are facing is now the following: given a document collection D = {D1, ..., DI, DI+1} and the corresponding relevance labels {r1, ..., rI} for all the documents except the last, infer the relevance of the last document DI+1. Note that this setting differs from standard information retrieval in that we are not searching for relevant documents from D but instead want to predict the relevance of a given new document DI+1.

It should be noted that the task given above is supervised in the sense that all we are interested in is predicting the value of rI+1, the relevance of the unlabeled document; we are not necessarily interested in understanding the deeper structure of the domain if that does not help us in our supervised prediction task. Of course, one can first build an unsupervised model of the problem domain and then use that model in the prediction task, and as a matter of fact, that is one of the approaches explored in this paper. However, as discussed and demonstrated in [3], [4], one should acknowledge that in this approach we are faced with the danger that the domain model only represents regularities that are irrelevant with respect to the supervised prediction task, in which case the prediction task becomes impossible to solve.

As already noted, the relevances {r1, ..., rI} are subjective. In a more general setting one could assume to have a relevance vector for each document, consisting of the relevance labels given by several individuals. In this case one could then use collaborative filtering [5] techniques in our supervised prediction task. However, in this paper we restrict ourselves to the single-user case, where this type of technique cannot be applied.

If we restrict ourselves to simple binary labels as above, the prediction problem we are addressing is similar to the problem of e-mail spam filtering, where the goal is to distinguish useful e-mail messages from uninteresting ads, viruses and such. However, in that case the relevance of a document can be considered objective, not subjective, as most people seem to agree upon what is spam and what is not. This means that the amount of available data in spam filtering tasks is typically huge, whereas we in our single-user setting need to work with relatively small data sets. On the other hand, the spam filtering task can be considered relatively easy, as the contents of spam messages typically contain certain elements, for example key words like "offer", "viagra", etc., so that detecting these messages is easy, while we address problem domains where the textual contents give only very weak signals of the relevance of the document. A typical example of such a domain is the movie database discussed below.

We experimented with a data set of labeled text documents. The user-specific labels were collected from a movie rating database, where people have given ratings to movies according to their likes and dislikes. The textual descriptions of the movies were retrieved from the Internet, and the users' ratings were associated with these text documents. Given a set of subjectively labeled documents of an individual user, we build a model of this particular user's relevances and use the model to predict the relevance of a new document.

This specific data set was chosen because of its size; it contained more than 70,000 users and 2 million ratings. We will later combine the ratings of different users with collaborative filtering.

The data was collected from a publicly available database of people's ratings for a set of movies (EachMovie) [6]. Synopses of a set of 1398 movies were gathered from the Allmovie database [7] and used as the text documents. The ratings in the EachMovie database had been gathered on a numeric scale.

Low-level preprocessing included removing words inside brackets "()", which were typically names, and stemming according to Porter's algorithm [8]. Terms were required to appear at least 5 times in the document collection.
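To make the preprocessing concrete, the sketch below shows one possible implementation of the steps described above. The bracket removal, Porter stemming and the minimum count of 5 follow the text; the function name, the simple tokenizer and the use of NLTK's stemmer are our own assumptions and not part of the original system.

import re
from collections import Counter
from nltk.stem import PorterStemmer  # assumption: any Porter stemmer implementation would do

def preprocess(raw_documents, min_count=5):
    """Remove text inside brackets, stem with Porter's algorithm, and keep
    only terms that occur at least min_count times in the whole collection."""
    stemmer = PorterStemmer()
    tokenized = []
    for text in raw_documents:
        text = re.sub(r"\([^)]*\)", " ", text)        # drop words inside brackets "()"
        tokens = re.findall(r"[a-z]+", text.lower())  # simple tokenization (our assumption)
        tokenized.append([stemmer.stem(t) for t in tokens])
    counts = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    vocab_set = set(vocab)
    docs = [[t for t in doc if t in vocab_set] for doc in tokenized]
    return docs, vocab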
We gathered a data set that conforms to our assumption of binary "relevance" labels by picking, for each user, the 10% of the movies with the best ratings ("relevant") and the 10% with the lowest ratings ("irrelevant"). This has the additional desirable consequence that the originally possibly very different rating scales of different users become normalized. In this data set, the success probability of random guessing is 0.5. Finally, we only accepted those users who had at least 80 ratings after this filtering.
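As a sketch of how the binary relevance labels could be derived from one user's raw ratings (the function name, the pandas dependency and the handling of ties are our assumptions, not the original implementation):

import pandas as pd

def binary_labels_for_user(user_ratings, fraction=0.10):
    """user_ratings: a pandas Series of one user's ratings indexed by movie id.
    Returns binary relevance labels: 1 for the best-rated 10% of the movies,
    0 for the worst-rated 10%; the remaining movies are dropped."""
    n = max(1, int(round(fraction * len(user_ratings))))
    ranked = user_ratings.sort_values()
    irrelevant = pd.Series(0, index=ranked.index[:n])   # 10% lowest ratings
    relevant = pd.Series(1, index=ranked.index[-n:])    # 10% highest ratings
    return pd.concat([irrelevant, relevant])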
To reduce the dimensionality of the term space, 1000 terms were selected with the Odds Ratio algorithm [9], as described in Section IV-A.4. In some of the experiments the set was reduced further to 500 terms (LDA500) by filtering with Linear Discriminant Analysis, as described in the same section.

D. Auxiliary Data About Movie Genres

There was also a classification of the movies into 10 genres available in the EachMovie database. This classification was utilized in some of the experiments (details below). The genres were: Action, Animation, Art Foreign, Classic, Comedy, Drama, Family, Horror, Romance and Thriller. Each movie may belong to several genres.

The methods we used for estimating relevance consist of two stages. First, a representation for the document was formed, and then the relevance was predicted based on this representation. A few alternatives were tried for each stage; they vary in the degree of sophistication and in the kind of data they use for optimizing the predictions. The methods were tested on left-out data as described in Section V.

For computational simplicity, all methods are based on the bag-of-words assumption: the order of the words is neglected.

1) Simple Unsupervised Features: The simplest representation is a binary term vector, where the entry corresponding to term wj is zero if the term does not occur in the document and one if it does. The next, slightly more complex alternative would be to replace the binary numbers by frequency counts, or some simple functions of them, as in the standard "vector-space model" of information retrieval. In preliminary experiments this did not improve the results (probably because the most important terms rarely occur multiple times in our short documents), and we decided to use the binary vectors as the simple unsupervised features.
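A minimal sketch of the binary term-vector representation, assuming the tokenized documents and the fixed vocabulary produced by the preprocessing above (the function name and the dense NumPy matrix are our own choices):

import numpy as np

def binary_term_vectors(tokenized_docs, vocabulary):
    """Return an I x J matrix of 0/1 entries: element (i, j) is 1 iff term j
    occurs at least once in document i."""
    index = {term: j for j, term in enumerate(vocabulary)}
    X = np.zeros((len(tokenized_docs), len(vocabulary)), dtype=np.int8)
    for i, doc in enumerate(tokenized_docs):
        for term in doc:
            j = index.get(term)
            if j is not None:
                X[i, j] = 1
    return X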
2) Unsupervised Feature Extraction with Multinomial PCA: An alternative method, which takes the frequency of occurrence of words into account in a rigorous probabilistic fashion, starts from a J-component vector wi, where the jth component of wi gives the number of occurrences of word wj in document Di. In the multinomial PCA (mPCA) approach [10] the document collection is modeled by assuming that the words are generated from K probability distributions, where K is a much smaller number than the number of words J. Each of these K probability distributions can be represented as a J-component vector where the jth component gives the probability of occurrence of word wj. As these probability distributions define which words occur together with high probability, they are often called "topics".

Let Ω denote a J × K matrix whose jth row gives the probabilities of term wj in each of the K topic distributions. Now, intuitively it makes sense that a textual document may contain text from several topic distributions, that is, a single document can be related to several different topics. In the mPCA approach this is modeled by assuming that the text generating probability distribution of each document is a weighted linear combination of all the topic distributions,

wi ~ Multinomial(Li, Ω θi),

where Li denotes the number of words in document Di, and θi gives the mixing coefficients of the text generating probability distribution corresponding to document Di. The prior distribution of the vectors θi is usually assumed to be the Dirichlet distribution, the conjugate distribution of the multinomial.

Intuitively, the components of the vector θi reveal to what degree the document is related to each of the K topics. Consequently, as discussed in [11], mPCA can be seen as a multi-faceted clustering method, where each document belongs to each cluster (topic) with some probability. On the other hand, the model can also be viewed as a dimensionality reduction scheme: for those familiar with standard principal component analysis (see [12]), it is evident that the above model is a discrete equivalent of standard PCA, with the Gaussian data generating function replaced by the multinomial. It should be noted that although technically possible, it does not make rigorous sense to apply the PCA model directly to textual data, as discrete text data is typically very non-normally distributed.

In summary, so far we have three different representations of the text documents D1, ..., DI. First, they can be seen as strings of words. Second, ignoring the ordering of the words, they can be thought of as word count vectors w1, ..., wI (and in the experiments we will further simplify them to binary vectors). Third, they can be treated as topic probability vectors θ1, ..., θI. We used these topic probability vectors θi as feature vectors for the classification. To see how the mPCA model can be used for tasks like information retrieval, see for example [13].
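The topic-proportion features θi can be sketched with a closely related model, Latent Dirichlet Allocation, which the abstract treats as interchangeable with mPCA for this purpose. The snippet below uses scikit-learn's estimator as a stand-in; the library choice, its variational inference and the defaults are our assumptions, not the estimator used in the original experiments.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topic_features(word_counts, n_topics=10, seed=0):
    """word_counts: I x J matrix of term counts w_i. Returns an I x K matrix
    whose rows are point estimates of the topic proportions theta_i."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(word_counts)           # document-topic weights
    return theta / theta.sum(axis=1, keepdims=True)  # ensure each row is a probability vector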
3) Given Supervised Features: For comparison, we also used the movie genres assigned to each movie (see Section III-D). Documents were coded as binary vectors of these features, where each component of the vector corresponds to a genre. The components of these 10-dimensional vectors indicate which genres the document belongs to.

The genre assignments have been carefully chosen to describe the movies and hence they are expected to be better features than the very noisy texts. Since the genres are not known for new documents, however, they do not solve our problem, but they will be used as a kind of measure of "best possible performance".

4) Genre-Based Feature Extraction and Linear Discriminant Analysis: The Odds Ratio algorithm [9] was used to initially reduce the number of terms to the 1000 terms that best discriminate between the given movie genres. The Odds Ratio is defined as

OR(wk, c) = [ P(wk|c) (1 - P(wk|¬c)) ] / [ (1 - P(wk|c)) P(wk|¬c) ],

where wk is a term, P(wk|c) is the frequency-based estimate of the probability that term wk occurs in a document of class c, and ¬c is the complement of class c. Terms that had the highest Odds Ratio on average were selected.
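A sketch of Odds Ratio term selection against the genre labels follows. The averaging over genre classes follows the text; the smoothing constant eps and the function names are our assumptions where the text does not pin them down.

import numpy as np

def odds_ratio_scores(X_binary, genres, eps=1e-6):
    """X_binary: I x J 0/1 term-occurrence matrix; genres: I x G 0/1 genre matrix.
    Returns the Odds Ratio of each term, averaged over the genre classes."""
    scores = np.zeros(X_binary.shape[1])
    for g in range(genres.shape[1]):
        in_c = genres[:, g] == 1
        p_c = X_binary[in_c].mean(axis=0)    # P(w_k | c), frequency-based estimate
        p_nc = X_binary[~in_c].mean(axis=0)  # P(w_k | not c)
        scores += (p_c * (1.0 - p_nc) + eps) / ((1.0 - p_c) * p_nc + eps)
    return scores / genres.shape[1]

# keep, e.g., the 1000 terms with the highest average score:
# selected = np.argsort(odds_ratio_scores(X, G))[::-1][:1000]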
In some of the experiments we further reduced the dimensionality with Linear Discriminant Analysis (LDA), a classical linear classification method (see [14]). It finds a projection that best discriminates between the classes, and in the two-class case the projection is onto a one-dimensional feature. Since our classes are non-exclusive, that is, each movie may belong to several genres, we sought one feature for each genre, to discriminate between movies belonging and not belonging to it. As a result we got 10 discriminative features. Projection of the term vectors onto these directions yielded a 10-dimensional feature vector.

LDA assumes that the given classes are normally distributed with equal covariance matrices. This clearly does not hold for our data, but the discrimination turned out to work well enough in practice.

In other experiments we also used LDA to reduce the dimensionality of the binary term space; from the 1000 terms we chose those 500 terms (LDA500) that had the greatest overall loadings on the discriminative directions.
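A sketch of the genre-supervised feature extraction described above: one two-class discriminant direction per genre, stacked into a 10-dimensional projection. scikit-learn's LinearDiscriminantAnalysis is our stand-in for the classical LDA; the function names are hypothetical.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def genre_lda_projection(X_terms, genre_matrix):
    """Fit one two-class LDA per genre (member vs. non-member) and return a
    function that projects term vectors onto the resulting 10 directions."""
    models = []
    for g in range(genre_matrix.shape[1]):
        lda = LinearDiscriminantAnalysis()
        lda.fit(X_terms, genre_matrix[:, g])  # two classes -> one-dimensional projection
        models.append(lda)

    def project(X_new):
        return np.column_stack([m.transform(X_new)[:, 0] for m in models])

    return project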
Two simple but powerful methods, the log-linear model and the K-nearest-neighbor classifier, were used for the final classification into relevant and irrelevant documents, using the different vectorial representations of the documents. The classification was done for each user separately. A spam-filtering algorithm was used as a baseline method for the classification.

1) Log-linear Model: The log-linear classifier was used to model the relevances of each user. The input xi denotes one of the vectorial representations of document Di, for instance a binary term vector. The probability of document Di being relevant (ri = 1) to the user is assumed to follow a Bernoulli model,

ri ~ Bernoulli(μi).

The logit of the mean is assumed to obey a linear model with parameters c,

logit(μi) = log( μi / (1 - μi) ) = c^T xi.

The parameters c are sought by maximizing the likelihood of the observed data, i.e., the ratings of the individual user. For details of the optimization see [15]. The predicted relevance of a new document is then obtained from the fitted model. In the tests the predictions were rounded to binary predictions.
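The log-linear model above is equivalent to logistic regression on the chosen feature vector. A minimal per-user sketch follows; scikit-learn's LogisticRegression, with its default regularization, stands in here for the maximum-likelihood fit of [15].

from sklearn.linear_model import LogisticRegression

def fit_user_model(X_features, relevance_labels):
    """X_features: feature vectors x_i of one user's rated documents;
    relevance_labels: the binary labels r_i. Returns a fitted classifier."""
    model = LogisticRegression(max_iter=1000)
    return model.fit(X_features, relevance_labels)

def predict_relevance(model, x_new):
    """Predicted probability of r = 1, rounded to a binary prediction as in the text."""
    p = model.predict_proba(x_new.reshape(1, -1))[0, 1]
    return int(p >= 0.5), p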
2) K-Nearest-Neighbor Classification: A K-nearest-neighbor classifier (KNN) stores a reference set of labeled samples. A new unlabeled sample is classified according to a majority vote of its K nearest neighbors in the reference set. The size of the neighborhood is a free parameter, and the distance measure that defines the neighborhood needs to be chosen as well. We used Euclidean distances, since our preliminary tests did not show marked differences in the results for the other metric considered (the Hellinger distance [16]).
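A corresponding sketch for the K-nearest-neighbor classifier with K = 9 and Euclidean distance, as used in the experiments; scikit-learn is again our choice of implementation.

from sklearn.neighbors import KNeighborsClassifier

def fit_knn(X_features, relevance_labels, k=9):
    """Store the labeled reference set; prediction is a majority vote of the
    k nearest neighbors under Euclidean distance."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    return knn.fit(X_features, relevance_labels)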
3) Spam Filtering Method: A state-of-the-art spam filtering algorithm, CRM114 [17], was used as a baseline method. CRM114 works by sliding a five-word window over the document. Each window increases the frequency counts of the corresponding words. Finally, a Naive Bayes classifier based on the empirical frequencies gives the classification.
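The following is only a toy illustration of the windowing-plus-Naive-Bayes idea as described above; the real CRM114 uses considerably more elaborate feature construction, and all names and smoothing choices here are our assumptions.

from collections import Counter
import math

def window_word_counts(tokens, width=5):
    """Slide a five-word window over the document and accumulate the counts
    of the words appearing in each window position."""
    counts = Counter()
    for start in range(max(1, len(tokens) - width + 1)):
        counts.update(tokens[start:start + width])
    return counts

def naive_bayes_log_odds(tokens, counts_rel, counts_irr, prior_rel=0.5):
    """counts_rel / counts_irr: Counters built with window_word_counts over the
    training documents of each class. Positive log-odds means 'relevant'."""
    total_rel = sum(counts_rel.values()) + len(counts_rel) + 1
    total_irr = sum(counts_irr.values()) + len(counts_irr) + 1
    score = math.log(prior_rel / (1.0 - prior_rel))
    for w, c in window_word_counts(tokens).items():
        score += c * (math.log((counts_rel[w] + 1) / total_rel)
                      - math.log((counts_irr[w] + 1) / total_irr))
    return score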
The classification was initially computed with both the log-linear model and the K-nearest-neighbor classifier with K = 9, but since the log-linear model consistently performed better, the KNN results were left out of the discussion. All the models were trained for each user separately and tested with leave-one-out crossvalidation. Mean prediction error over each user's predictions was taken as a user-specific error, and mean prediction error over all users was used as the performance measure between models. Since all the users had equal numbers of relevant (r = 1) and irrelevant (r = 0) ratings, a prediction error of 0.50 corresponds to random guessing.
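The evaluation protocol, sketched in code under the assumption of per-user feature matrices and label vectors; scikit-learn's LeaveOneOut iterator and the helper names are our choices.

import numpy as np
from sklearn.model_selection import LeaveOneOut

def user_loo_error(make_model, X, r):
    """Leave-one-out prediction error for one user; make_model returns a fresh classifier."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_model().fit(X[train_idx], r[train_idx])
        errors.append(float(model.predict(X[test_idx])[0] != r[test_idx][0]))
    return np.mean(errors)

# The performance measure between models is the mean of user_loo_error over all users,
# e.g. with make_model = lambda: LogisticRegression(max_iter=1000).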
A. Comparison of Unsupervised Feature Extraction Methods

Our first hypothesis was that a multinomial PCA (mPCA), computed from the whole text collection, would find useful topics that would help in reducing noise in the texts and thereby help in predicting relevance. We compared mPCA-based feature extraction with the completely unsupervised spam filter and with binary term vectors. To get an estimate of a lower limit for the prediction error, we further included genre vectors, which are supposed to be superior to the other features.

In detail, the experiments were carried out as follows. mPCA: The number of topics was fixed to 10, and the output of the mPCA model was a point estimate of the topic distribution θ for each document. The log-linear model was fitted for each user in this topic space. genre: The log-linear model was fitted to the genre vectors of each user. LDA500: The binary term vectors are not strictly speaking unsupervised, since the set of terms was reduced, for computational reasons, with a partly supervised method (LDA500, described in Section IV-A.4). A log-linear model was fit to the term vectors. crm114: a state-of-the-art spam filtering algorithm [17].

The results shown in Figure 1 reveal that the term vector-based classification (LDA500) does not differ from that obtained by chance. The spam filter (crm114) is slightly better and mPCA considerably better, but both are far from the performance of the supervised genre vector.

Fig. 1. Classification errors for predicting relevance of left-out documents with a log-linear model, based on 4 different feature sets. genre: binary genre vector. mPCA: posterior estimates of mPCA topic probabilities. crm114: spam filter CRM114. LDA500: binary term vector. Dotted line: random performance.

The reason for the weak performance of the spam filter is probably that it has been designed for a different task. Typical spam is relatively homogeneous and there is plenty of training material available. Hence, there has been no need to optimize the performance of the filter for very small data sets such as ours.

The mPCA feature extraction was clearly better than the binary term vectors but still far from the "best possible performance" of the genre vectors. Note, though, that at this stage of the experiments it was of course not clear whether the performance of the genre vectors could be reached by texts only, and we were simply searching for the limits.

The mPCA was included to reduce the dimensionality. It was, however, optimized in a purely unsupervised fashion, and there is no theoretical reason why it should help in our discriminative task. It should help if the variation it models is useful for discrimination, but otherwise not. So the main question was whether the bad performance was due to overfitting of the log-linear model or to the mPCA losing the information required for the classification. We checked this by computing the performance on the training set (Table I). Since the performance on the training set was clearly better than on the test set, the tentative conclusion was that the mPCA does not lose all relevant information, but that there still was too much variation in the result even after the mPCA-based dimensionality reduction. The few labeled samples are not sufficient for building reliable predictors in the mPCA space.

TABLE I. Difference of the performance of the mPCA feature extraction in the training and test sets. The figures are mean prediction errors in leave-one-out crossvalidation.

Finally, we also checked whether selecting the terms discriminatively before training the mPCA would lead to any improvement.

B. Feature Extraction Supervised with Auxiliary Genre Data

The conclusion from the previous section is that the number of labeled samples was too small. On the other hand, we know that prediction based on the genre vectors was more successful, and there are plenty of texts available with known genres. Hence, the next idea was to supervise the feature extraction by the genre vectors: optimize a feature extractor for texts such that it gives good predictions of the genres. This will reduce the dimensionality of the term space, and a classifier in this reduced-dimensional space might perform better in predicting relevance. Such a feature extractor would be applicable to new documents with unknown genres as well.

Linear Discriminant Analysis was used to obtain one discriminative direction for each genre in the term space, and a new document was then projected onto this 10-dimensional feature space as described in Section IV-A.4. The results (LDAproj) of predicting relevance with a log-linear model in this space are shown in Figure 2. The genre-supervised features give almost the same performance as the original genre vectors.

Fig. 2. Genre-supervised LDA features (LDAproj) perform well. For a description of the other features see Figure 1.

In this first feasibility study we have investigated prediction of the relevance of a given document based on only a small set of documents with known relevance. It turned out that a completely unsupervised multinomial PCA model of the whole document collection helped somewhat. If suitable auxiliary data is available for a larger set of documents, here the genre classifications, it can be used to help reduce the problem of small data sets. Supervising the feature selection by the genres improved the performance of the subsequent prediction of relevance.

In this work we focused on single-user systems and did not combine the models optimized for different users. Such collaborative filtering will be studied later, and the models will additionally be combined with models of implicit feedback for proactive information retrieval.

This work was supported by the Academy of Finland, decision no. 79017, and by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views. The authors would like to thank Drs Kai Puolamäki and Janne Sinkkonen, and all other people in the PRIMA project, for useful discussions, and acknowledge that access rights to the data sets and other materials produced in the PRIMA project are restricted due to other commitments.
[1] J. Salojärvi, I. Kojo, J. Simola, and S. Kaski, "Can relevance be inferred from eye movements in information retrieval?" in Proceedings of the Workshop on Self-Organizing Maps (WSOM'03), Hibikino, Kitakyushu, Japan, September 2003, pp. 261-266.
[2] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela, "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, pp. 574-585, 2000.
[3] P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, "On supervised selection of Bayesian networks," in Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence (UAI'99), K. Laskey and H. Prade, Eds., 1999, pp. 334-342.
[4] R. Cowell, "On searching for optimal classifiers among Bayesian networks," in Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics, T. Jaakkola and T. Richardson, Eds.
[5] J. Griffith and C. O'Riordan, "Collaborative filtering," National University of Ireland, Galway, Tech. Rep. NUIG-IT-160900, 2000.
[6] The EachMovie data set. [Online]. Available: http://research.compaq.com/SRC/eachmovie/
[7] Allmovie database containing movie information. [Online].
[8] M. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
[9] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[10] W. Buntine, "Variational extensions to EM and multinomial PCA," in Proceedings of the 13th European Conference on Machine Learning, ser. Lecture Notes in Artificial Intelligence, T. Elomaa, H. Mannila, and H. Toivonen, Eds., vol. 2430. Springer, 2002.
[11] W. Buntine and S. Perttu, "Is multinomial PCA multi-faceted clustering or dimensionality reduction?" in Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, C. Bishop and B. Frey, Eds. Society for Artificial Intelligence and Statistics, 2003, pp. 300-.
[12] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analysers," Neural Computation, vol. 11, pp. 443-482, 1999.
[13] W. Buntine, P. Myllymäki, and S. Perttu, "Language models for intelligent search using multinomial PCA," in Proceedings of the First European Web Mining Forum at the 14th European Conference on Machine Learning and the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, B. Berendt, D. Mladenić, A. Hotho, M. van Someren, M. Spiliopoulou, and G. Stumme, Eds. Ruder Boskovic Institute, 2003, pp. 37-50.
[14] N. H. Timm, Applied Multivariate Analysis. New York: Springer, 2002.
[15] I. Nabney, "Efficient training of RBF networks for classification," in Proc.
[16] M. Fannes and P. Spincemaille, "The mutual affinity of random
[17] The CRM114 spam filtering software. [Online]. Available: http://crm114.sourceforge.net/CRM114 paper.html