On Text-Based Estimation of Document Relevance
Eerika Savia, Samuel Kaski, Ville Tuulos and Petri Myllymäki
Helsinki Institute for Information Technology HIIT
URL: http://cosco.hiit.fi/Articles/hiit-2004-5.pdf
NB. The HIIT Technical Reports series is intended for rapid dissemination of results produced by the HIIT researchers. Therefore, some of the results may also be later published as scientific articles elsewhere.
On Text-Based Estimation of Document Relevance
Helsinki Institute for Information Technology
University of Helsinki & Helsinki University of Technology
P.O. Box 26, FIN-00014 University of Helsinki, Finland
E-mail: {Ville.Tuulos, Petri.Myllymaki}@cs.helsinki.fi
E-mail: {Eerika.Savia,Samuel.Kaski}@hut.fi
Abstract- This work is part of a proactive information retrieval project that aims at estimating relevance from implicit user feedback. The noisy feedback signal needs to be complemented with all available information, and textual content is one of the natural sources. Here we take the first steps by investigating whether this source is at all useful in the challenging setting of estimating the relevance of a new document based on only a few samples with known relevance. It turns out that even sophisticated unsupervised methods like multinomial PCA (or Latent Dirichlet Allocation) cannot help much. By contrast, feature extraction supervised by relevant auxiliary data may help.
In proactive information retrieval, the system adapts to the interests of the user that are inferred from implicit feedback. Feedback given by explicitly indicating which documents are relevant is naturally more accurate, but users often consider it too laborious and time-consuming. The usability and accuracy of information retrieval applications would be greatly enhanced by complementing explicit feedback with implicit feedback signals measured from the user and the interface. Research on implicit feedback potentially has even wider-ranging implications. If the feedback signal is reliable enough, it will be useful in a range of other applications as well. Ultimately, a genuine personal assistant could adapt to the goals and interests of the user and learn to disambiguate her vague commands and anticipate her actions.

In this first stage of the work we start with a simplified setting, where the user is reading a given document. That is, we assume that the document has already been chosen in some way or is a new one, and the task is to estimate whether it is relevant or not. This will be done using implicit feedback such as eye movements, which we studied in [1].

The problem with implicit relevance signals is that they will necessarily be noisy, and they need to be complemented with any available sources of relevant information. Textual content is a natural one, since it is the basis of all standard information retrieval. In this paper we study how accurately relevance can be estimated based on textual content only, when only a few documents with known relevance are available. If textual content helps in prediction (compared to random performance), it will be used as prior knowledge in inferring relevance from implicit feedback. This problem is closely related to standard text classification, and some simple standard methods will be included in the comparisons.

Here we report the results of a feasibility study that aims at answering the following research questions: (1) is the prediction accuracy high enough, (2) does a rigorous unsupervised model of the document collection help in the task, and (3) does suitable auxiliary data help.

In this paper we focus on the following setting. Let D denote a collection of I documents D1, ..., DI. Each document Di consists of words w, and the number of different possible words is J. In the following we make the standard simplifying "bag-of-words" assumption: the order of the words within a single document is considered irrelevant, and only the counts of different words in a document are used as features.

A lot of research related to this type of setting focuses on unsupervised data exploration tasks such as data clustering or dimensionality reduction. In data clustering, the document collection D is partitioned into several subsets such that the documents within each subset are in some sense similar to each other, and different from the documents in the other subsets. In dimensionality reduction, the goal is to find a low-dimensional representation of the document collection so that the coordinates of the resulting low-dimensional space correspond to some interesting factors that cannot be directly observed in the data. The Websom system [2] is an example of both data clustering and dimensionality reduction.

The models produced by data clustering or dimensionality reduction methods can be used for unsupervised data exploration tasks where the goal is to achieve a better understanding of the regularities governing the domain the data comes from. However, it is obvious that this type of model can also be used for information retrieval tasks, where the goal can, for example, be to find from a document collection D the document Di that is the most similar to a given, previously unseen document.

In this work we deviate from the standard unsupervised data exploration setting and consider the following supervised modeling problem. We assume that each text document is provided with some labels. For simplicity, let us assume that the labels are simply binary, and let us denote the label of document Di by ri. If ri = 1, we say that the document is relevant; otherwise it is considered irrelevant.

The meaning of relevance depends, of course, on the semantic interpretation of the binary labels r1, ..., rI, and is subjective to the person doing the labeling. Consequently, label ri = 1 could, for example, represent the fact that the person doing the labeling liked the text in document Di, or that she liked the matter that the text refers to. In any case, the task we are facing is now the following: given a document collection D = {D1, ..., DI, DI+1} and the corresponding relevance labels {r1, ..., rI} for all the documents except the last, infer the relevance of the last document DI+1. Note that this setting differs from standard information retrieval in that we are not searching for relevant documents from D but instead want to predict the relevance of a given new document DI+1.

It should be noted that the task given above is supervised in the sense that all we are interested in is predicting the value of rI+1, the relevance of the unlabeled document; we are not necessarily interested in understanding the deeper structure of the domain if that does not help us in our supervised prediction task. Of course, one can first build an unsupervised model of the problem domain and then use that model in the prediction task, and as a matter of fact, that is one of the approaches explored in this paper. However, as discussed and demonstrated in [3], [4], one should acknowledge that in this approach we are faced with the danger that the domain model only represents regularities that are irrelevant with respect to the supervised prediction task, in which case the prediction task becomes impossible to solve.

As already noted, the relevances {r1, ..., rI} are subjective. In a more general setting one could assume to have a relevance vector for each document, consisting of the relevance labels given by several individuals. In this case one could then use collaborative filtering [5] techniques in our supervised prediction task. However, in this paper we restrict ourselves to the single-user case, where this type of technique cannot be applied.

If we restrict ourselves to simple binary labels as above, the prediction problem we are addressing is similar to the problem of e-mail spam filtering, where the goal is to distinguish useful e-mail messages from uninteresting ads, viruses and such. However, in that case the relevance of a document can be considered objective, not subjective, as most people seem to agree upon what is spam and what is not. This means that the amount of available data in spam filtering tasks is typically huge, whereas we in our single-user setting need to work with relatively small data sets. On the other hand, the spam filtering task can be considered relatively easy, as the contents of spam messages typically contain certain elements, for example key words like "offer", "viagra", etc., so that detecting these messages is easy, while we address problem domains where the textual contents give only very weak signals of the relevance of the document. A typical example of such a domain is the movie database discussed below.

We experimented with a data set of labeled text documents. The user-specific labels were collected from a movie rating database, where people have given ratings to movies according to their likes and dislikes. The textual descriptions of the movies were retrieved from the Internet, and the users' ratings were associated with these text documents. Given a set of subjectively labeled documents of an individual user, we build a model of this particular user's relevances and use the model to predict the relevance of a new document.

This specific data set was chosen because of its size; it contained more than 70,000 users and 2 million ratings. We will later combine the ratings of different users with collaborative filtering.

The data was collected from a publicly available database of people's ratings for a set of movies (EachMovie) [6]. Synopses of a set of 1398 movies were gathered from the Allmovie database [7] and used as the text documents. The ratings in the EachMovie database had been gathered on a numeric scale.

Low-level preprocessing included removing words inside brackets "()", which were typically names, and stemming according to Porter's algorithm [8]. Terms were required to appear at least 5 times in the document collection.
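To make the preprocessing concrete, the sketch below shows one possible implementation of the steps described above. The bracket removal, Porter stemming and the minimum count of 5 follow the text; the function name, the simple tokenizer and the use of NLTK's stemmer are our own assumptions and not part of the original system.

import re
from collections import Counter
from nltk.stem import PorterStemmer  # assumption: any Porter stemmer implementation would do

def preprocess(raw_documents, min_count=5):
    """Remove text inside brackets, stem with Porter's algorithm, and keep
    only terms that occur at least min_count times in the whole collection."""
    stemmer = PorterStemmer()
    tokenized = []
    for text in raw_documents:
        text = re.sub(r"\([^)]*\)", " ", text)        # drop words inside brackets "()"
        tokens = re.findall(r"[a-z]+", text.lower())  # simple tokenization (our assumption)
        tokenized.append([stemmer.stem(t) for t in tokens])
    counts = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    vocab_set = set(vocab)
    docs = [[t for t in doc if t in vocab_set] for doc in tokenized]
    return docs, vocab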
We gathered a data set that conforms to our assumption of binary "relevance" labels by picking, for each user, the 10% of the movies with the best ratings ("relevant") and the 10% with the lowest ratings ("irrelevant"). This has the additional desirable consequence that the originally possibly very different rating scales of different users become normalized. In this data set, the success probability of random guessing is 0.5. Finally, we only accepted those users who had at least 80 ratings after this filtering.
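As a sketch of how the binary relevance labels could be derived from one user's raw ratings (the function name, the pandas dependency and the handling of ties are our assumptions, not the original implementation):

import pandas as pd

def binary_labels_for_user(user_ratings, fraction=0.10):
    """user_ratings: a pandas Series of one user's ratings indexed by movie id.
    Returns binary relevance labels: 1 for the best-rated 10% of the movies,
    0 for the worst-rated 10%; the remaining movies are dropped."""
    n = max(1, int(round(fraction * len(user_ratings))))
    ranked = user_ratings.sort_values()
    irrelevant = pd.Series(0, index=ranked.index[:n])   # 10% lowest ratings
    relevant = pd.Series(1, index=ranked.index[-n:])    # 10% highest ratings
    return pd.concat([irrelevant, relevant])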
To reduce the dimensionality of the term space, 1000 terms were selected with the Odds Ratio algorithm [9], as described in Section IV-A.4. In some of the experiments the set was reduced further to 500 terms (LDA500) by filtering with Linear Discriminant Analysis, as described in the same section.

D. Auxiliary Data About Movie Genres

There was also a classification of the movies into 10 genres available in the EachMovie database. This classification was utilized in some of the experiments (details below). The genres were: Action, Animation, Art Foreign, Classic, Comedy, Drama, Family, Horror, Romance and Thriller. Each movie may belong to several genres.

The methods we used for estimating relevance consist of two stages. First, a representation for the document was formed, and then the relevance was predicted based on this representation. A few alternatives were tried for each stage; they vary in the degree of sophistication and in the kind of data they use for optimizing the predictions. The methods were tested on left-out data as described in Section V.

For computational simplicity, all methods are based on the bag-of-words assumption: the order of the words is neglected.

1) Simple Unsupervised Features: The simplest representation is a binary term vector, where the entry corresponding to term wj is zero if the term does not occur in the document and one if it does. The next, slightly more complex alternative would be to replace the binary numbers by frequency counts, or some simple functions of them, as in the standard "vector-space model" of information retrieval. In preliminary experiments this did not improve the results (probably because the most important terms rarely occur multiple times in our short documents), and we decided to use the binary vectors as the simple unsupervised features.
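A minimal sketch of the binary term-vector representation, assuming the tokenized documents and the fixed vocabulary produced by the preprocessing above (the function name and the dense NumPy matrix are our own choices):

import numpy as np

def binary_term_vectors(tokenized_docs, vocabulary):
    """Return an I x J matrix of 0/1 entries: element (i, j) is 1 iff term j
    occurs at least once in document i."""
    index = {term: j for j, term in enumerate(vocabulary)}
    X = np.zeros((len(tokenized_docs), len(vocabulary)), dtype=np.int8)
    for i, doc in enumerate(tokenized_docs):
        for term in doc:
            j = index.get(term)
            if j is not None:
                X[i, j] = 1
    return X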
2) Unsupervised Feature Extraction with Multinomial PCA: An alternative method, which takes the frequency of occurrence of words into account in a rigorous probabilistic fashion, starts from a J-component vector wi, where the jth component of wi gives the number of occurrences of word wj in document Di. In the multinomial PCA (mPCA) approach [10] the document collection is modeled by assuming that the words are generated from K probability distributions, where K is a much smaller number than the number of words J. Each of these K probability distributions can be represented as a J-component vector where the jth component gives the probability of occurrence of word wj. As these probability distributions define which words occur together with high probability, they are often called "topics".

Let Ω denote a J × K matrix whose jth row gives the probabilities of term wj in each of the K topic distributions. Now, intuitively it makes sense that a textual document may contain text from several topic distributions, that is, a single document can be related to several different topics. In the mPCA approach this is modeled by assuming that the text generating probability distribution of each document is a weighted linear combination of all the topic distributions,

wi ~ Multinomial(Li, Ω θi),

where Li denotes the number of words in document Di, and θi gives the mixing coefficients of the text generating probability distribution corresponding to document Di. The prior distribution of the vectors θi is usually assumed to be the Dirichlet distribution, the conjugate distribution of the multinomial.

Intuitively, the components of the vector θi reveal to what degree the document is related to each of the K topics. Consequently, as discussed in [11], mPCA can be seen as a multi-faceted clustering method, where each document belongs to each cluster (topic) with some probability. On the other hand, the model can also be viewed as a dimensionality reduction scheme: for those familiar with standard principal component analysis (see [12]), it is evident that the above model is a discrete equivalent of standard PCA, with the Gaussian data generating function replaced by the multinomial. It should be noted that although technically possible, it does not make rigorous sense to apply the PCA model directly to textual data, as discrete text data is typically very non-normally distributed.

In summary, so far we have three different representations of the text documents D1, ..., DI. First, they can be seen as strings of words. Second, ignoring the ordering of the words, they can be thought of as word count vectors w1, ..., wI (and in the experiments we will further simplify them to binary vectors). Third, they can be treated as topic probability vectors θ1, ..., θI. We used these topic probability vectors θi as feature vectors for the classification. To see how the mPCA model can be used for tasks like information retrieval, see for example [13].
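The topic-proportion features θi can be sketched with a closely related model, Latent Dirichlet Allocation, which the abstract treats as interchangeable with mPCA for this purpose. The snippet below uses scikit-learn's estimator as a stand-in; the library choice, its variational inference and the defaults are our assumptions, not the estimator used in the original experiments.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topic_features(word_counts, n_topics=10, seed=0):
    """word_counts: I x J matrix of term counts w_i. Returns an I x K matrix
    whose rows are point estimates of the topic proportions theta_i."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(word_counts)           # document-topic weights
    return theta / theta.sum(axis=1, keepdims=True)  # ensure each row is a probability vector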
3) Given Supervised Features: For comparison, we also used the movie genres assigned to each movie (see Section III-D). Documents were coded as binary vectors of these features, where each component of the vector corresponds to a genre. The components of these 10-dimensional vectors indicate which genres the document belongs to.

The genre assignments have been carefully chosen to describe the movies and hence they are expected to be better features than the very noisy texts. Since the genres are not known for new documents, however, they do not solve our problem, but they will be used as a kind of measure of "best possible performance".

4) Genre-Based Feature Extraction and Linear Discriminant Analysis: The Odds Ratio algorithm [9] was used to initially reduce the number of terms to the 1000 terms that best discriminate between the given movie genres. The Odds Ratio is defined as

OR(wk, c) = [ P(wk|c) (1 - P(wk|¬c)) ] / [ (1 - P(wk|c)) P(wk|¬c) ],

where wk is a term, P(wk|c) is the frequency-based estimate of the probability that term wk occurs in a document of class c, and ¬c is the complement of class c. Terms that had the highest Odds Ratio on average were selected.
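A sketch of Odds Ratio term selection against the genre labels follows. The averaging over genre classes follows the text; the smoothing constant eps and the function names are our assumptions where the text does not pin them down.

import numpy as np

def odds_ratio_scores(X_binary, genres, eps=1e-6):
    """X_binary: I x J 0/1 term-occurrence matrix; genres: I x G 0/1 genre matrix.
    Returns the Odds Ratio of each term, averaged over the genre classes."""
    scores = np.zeros(X_binary.shape[1])
    for g in range(genres.shape[1]):
        in_c = genres[:, g] == 1
        p_c = X_binary[in_c].mean(axis=0)    # P(w_k | c), frequency-based estimate
        p_nc = X_binary[~in_c].mean(axis=0)  # P(w_k | not c)
        scores += (p_c * (1.0 - p_nc) + eps) / ((1.0 - p_c) * p_nc + eps)
    return scores / genres.shape[1]

# keep, e.g., the 1000 terms with the highest average score:
# selected = np.argsort(odds_ratio_scores(X, G))[::-1][:1000]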
In some of the experiments we further reduced the dimensionality with Linear Discriminant Analysis (LDA), a classical linear classification method (see [14]). It finds a projection that best discriminates between the classes, and in the two-class case the projection is onto a one-dimensional feature. Since our classes are non-exclusive, that is, each movie may belong to several genres, we sought one feature for each genre, to discriminate between movies belonging and not belonging to it. As a result we got 10 discriminative features. Projection of the term vectors onto these directions yielded a 10-dimensional feature vector.

LDA assumes that the given classes are normally distributed with equal covariance matrices. This clearly does not hold for our data, but the discrimination turned out to work well enough in practice.

In other experiments we also used LDA to reduce the dimensionality of the binary term space; from the 1000 terms we chose those 500 terms (LDA500) that had the greatest overall loadings on the discriminative directions.
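A sketch of the genre-supervised feature extraction described above: one two-class discriminant direction per genre, stacked into a 10-dimensional projection. scikit-learn's LinearDiscriminantAnalysis is our stand-in for the classical LDA; the function names are hypothetical.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def genre_lda_projection(X_terms, genre_matrix):
    """Fit one two-class LDA per genre (member vs. non-member) and return a
    function that projects term vectors onto the resulting 10 directions."""
    models = []
    for g in range(genre_matrix.shape[1]):
        lda = LinearDiscriminantAnalysis()
        lda.fit(X_terms, genre_matrix[:, g])  # two classes -> one-dimensional projection
        models.append(lda)

    def project(X_new):
        return np.column_stack([m.transform(X_new)[:, 0] for m in models])

    return project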
Two simple but powerful methods, the log-linear model and the K-nearest-neighbor classifier, were used for the final classification into relevant and irrelevant documents, using the different vectorial representations of the documents. The classification was done for each user separately. A spam-filtering algorithm was used as a baseline method for the classification.

1) Log-linear Model: The log-linear classifier was used to model the relevances of each user. The input xi denotes one of the vectorial representations of document Di, for instance a binary term vector. The probability of document Di being relevant (ri = 1) to the user is assumed to follow a Bernoulli model,

ri ~ Bernoulli(μi).

The logit of the mean is assumed to obey a linear model with parameters c,

logit(μi) = log( μi / (1 - μi) ) = c^T xi.

The parameters c are sought by maximizing the likelihood of the observed data, i.e., the ratings of the individual user. For details of the optimization see [15]. The predicted relevance of a new document is then obtained from the fitted model. In the tests the predictions were rounded to binary predictions.
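The log-linear model above is equivalent to logistic regression on the chosen feature vector. A minimal per-user sketch follows; scikit-learn's LogisticRegression, with its default regularization, stands in here for the maximum-likelihood fit of [15].

from sklearn.linear_model import LogisticRegression

def fit_user_model(X_features, relevance_labels):
    """X_features: feature vectors x_i of one user's rated documents;
    relevance_labels: the binary labels r_i. Returns a fitted classifier."""
    model = LogisticRegression(max_iter=1000)
    return model.fit(X_features, relevance_labels)

def predict_relevance(model, x_new):
    """Predicted probability of r = 1, rounded to a binary prediction as in the text."""
    p = model.predict_proba(x_new.reshape(1, -1))[0, 1]
    return int(p >= 0.5), p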
2) K-Nearest-Neighbor Classification: A K-nearest-neighbor classifier (KNN) stores a reference set of labeled samples. A new unlabeled sample is classified according to a majority vote of its K nearest neighbors in the reference set. The size of the neighborhood is a free parameter, and the distance measure that defines the neighborhood needs to be chosen as well. We used Euclidean distances, since our preliminary tests did not show marked differences in the results for the other metric considered (the Hellinger distance [16]).
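A corresponding sketch for the K-nearest-neighbor classifier with K = 9 and Euclidean distance, as used in the experiments; scikit-learn is again our choice of implementation.

from sklearn.neighbors import KNeighborsClassifier

def fit_knn(X_features, relevance_labels, k=9):
    """Store the labeled reference set; prediction is a majority vote of the
    k nearest neighbors under Euclidean distance."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    return knn.fit(X_features, relevance_labels)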
3) Spam Filtering Method: A state-of-the-art spam filtering algorithm, CRM114 [17], was used as a baseline method. CRM114 works by sliding a five-word window over the document. Each window increases the frequency counts of the corresponding words. Finally, a Naive Bayes classifier based on the empirical frequencies gives the classification.
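The following is only a toy illustration of the windowing-plus-Naive-Bayes idea as described above; the real CRM114 uses considerably more elaborate feature construction, and all names and smoothing choices here are our assumptions.

from collections import Counter
import math

def window_word_counts(tokens, width=5):
    """Slide a five-word window over the document and accumulate the counts
    of the words appearing in each window position."""
    counts = Counter()
    for start in range(max(1, len(tokens) - width + 1)):
        counts.update(tokens[start:start + width])
    return counts

def naive_bayes_log_odds(tokens, counts_rel, counts_irr, prior_rel=0.5):
    """counts_rel / counts_irr: Counters built with window_word_counts over the
    training documents of each class. Positive log-odds means 'relevant'."""
    total_rel = sum(counts_rel.values()) + len(counts_rel) + 1
    total_irr = sum(counts_irr.values()) + len(counts_irr) + 1
    score = math.log(prior_rel / (1.0 - prior_rel))
    for w, c in window_word_counts(tokens).items():
        score += c * (math.log((counts_rel[w] + 1) / total_rel)
                      - math.log((counts_irr[w] + 1) / total_irr))
    return score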
The classification was initially computed with both the log-linear model and the K-nearest-neighbor classifier with K = 9, but since the log-linear model consistently performed better, the KNN results were left out of the discussion. All the models were trained for each user separately and tested with leave-one-out crossvalidation. Mean prediction error over each user's predictions was taken as a user-specific error, and mean prediction error over all users was used as the performance measure between models. Since all the users had equal numbers of relevant (r = 1) and irrelevant (r = 0) ratings, a prediction error of 0.50 corresponds to random guessing.
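The evaluation protocol, sketched in code under the assumption of per-user feature matrices and label vectors; scikit-learn's LeaveOneOut iterator and the helper names are our choices.

import numpy as np
from sklearn.model_selection import LeaveOneOut

def user_loo_error(make_model, X, r):
    """Leave-one-out prediction error for one user; make_model returns a fresh classifier."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_model().fit(X[train_idx], r[train_idx])
        errors.append(float(model.predict(X[test_idx])[0] != r[test_idx][0]))
    return np.mean(errors)

# The performance measure between models is the mean of user_loo_error over all users,
# e.g. with make_model = lambda: LogisticRegression(max_iter=1000).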
A. Comparison of Unsupervised Feature Extraction Methods

Our first hypothesis was that a multinomial PCA (mPCA), computed from the whole text collection, would find useful topics that would help in reducing noise in the texts and thereby help in predicting relevance. We compared mPCA-based feature extraction with the completely unsupervised spam filter and with binary term vectors. To get an estimate of a lower limit for the prediction error, we further included genre vectors, which are supposed to be superior to the other features.

In detail, the experiments were carried out as follows. mPCA: The number of topics was fixed to 10, and the output of the mPCA model was a point estimate of the topic distribution θ for each document. The log-linear model was fitted for each user in this topic space. genre: The log-linear model was fitted to the genre vectors of each user. LDA500: The binary term vectors are not strictly speaking unsupervised, since the set of terms was reduced, for computational reasons, with a partly supervised method (LDA500, described in Section IV-A.4). A log-linear model was fit to the term vectors. crm114: a state-of-the-art spam filtering algorithm [17].

The results shown in Figure 1 reveal that the term vector-based classification (LDA500) does not differ from that obtained by chance. The spam filter (crm114) is slightly better and mPCA considerably better, but both are far from the performance of the supervised genre vector.

Fig. 1. Classification errors for predicting relevance of left-out documents with a log-linear model, based on 4 different feature sets. genre: binary genre vector. mPCA: posterior estimates of mPCA topic probabilities. crm114: spam filter CRM114. LDA500: binary term vector. Dotted line: random performance.

The reason for the weak performance of the spam filter is probably that it has been designed for a different task. Typical spam is relatively homogeneous and there is plenty of training material available. Hence, there has been no need to optimize the performance of the filter for very small data sets such as ours.

The mPCA feature extraction was clearly better than the binary term vectors but still far from the "best possible performance" of the genre vectors. Note, though, that at this stage of the experiments it was of course not clear whether the performance of the genre vectors could be reached by texts only, and we were simply searching for the limits.

The mPCA was included to reduce the dimensionality. It was, however, optimized in a purely unsupervised fashion, and there is no theoretical reason why it should help in our discriminative task. It should help if the variation it models is useful for discrimination, but otherwise not. So the main question was whether the bad performance was due to overfitting of the log-linear model or to the mPCA losing the information required for the classification. We checked this by computing the performance on the training set (Table I). Since the performance on the training set was clearly better than on the test set, the tentative conclusion was that the mPCA does not lose all relevant information, but that there still was too much variation in the result even after the mPCA-based dimensionality reduction. The few labeled samples are not sufficient for building reliable predictors in the mPCA space.

TABLE I. Difference of the performance of the mPCA feature extraction in the training and test sets. The figures are mean prediction errors in leave-one-out crossvalidation.

Finally, we also checked whether selecting the terms discriminatively before training the mPCA would lead to any improvement.

B. Feature Extraction Supervised with Auxiliary Genre Data

The conclusion from the previous section is that the number of labeled samples was too small. On the other hand, we know that prediction based on the genre vectors was more successful, and there are plenty of texts available with known genres. Hence, the next idea was to supervise the feature extraction by the genre vectors: optimize a feature extractor for texts such that it gives good predictions of the genres. This will reduce the dimensionality of the term space, and a classifier in this reduced-dimensional space might perform better in predicting relevance. Such a feature extractor would be applicable to new documents with unknown genres as well.

Linear Discriminant Analysis was used to obtain one discriminative direction for each genre in the term space, and a new document was then projected onto this 10-dimensional feature space as described in Section IV-A.4. The results (LDAproj) of predicting relevance with a log-linear model in this space are shown in Figure 2. The genre-supervised features give almost the same performance as the original genre vectors.

Fig. 2. Genre-supervised LDA features (LDAproj) perform well. For a description of the other features see Figure 1.

In this first feasibility study we have investigated prediction of the relevance of a given document based on only a small set of documents with known relevance. It turned out that a completely unsupervised multinomial PCA model of the whole document collection helped somewhat. If suitable auxiliary data is available for a larger set of documents, here the genre classifications, it can be used to help reduce the problem of small data sets. Supervising the feature selection by the genres improved the performance of the subsequent prediction of relevance.

In this work we focused on single-user systems and did not combine the models optimized for different users. Such collaborative filtering will be studied later, and the models will additionally be combined with models of implicit feedback for proactive information retrieval.

This work was supported by the Academy of Finland, decision no. 79017, and by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views. The authors would like to thank Drs Kai Puolamäki and Janne Sinkkonen, and all other people in the PRIMA project, for useful discussions, and acknowledge that access rights to the data sets and other materials produced in the PRIMA project are restricted due to other commitments.
[1] J. Salojärvi, I. Kojo, J. Simola, and S. Kaski, "Can relevance be inferred from eye movements in information retrieval?" in Proceedings of the Workshop on Self-Organizing Maps (WSOM'03), Hibikino, Kitakyushu, Japan, September 2003, pp. 261-266.
[2] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela, "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, pp. 574-585, 2000.
[3] P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, "On supervised selection of Bayesian networks," in Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence (UAI'99), K. Laskey and H. Prade, Eds., 1999, pp. 334-342.
[4] R. Cowell, "On searching for optimal classifiers among Bayesian networks," in Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics, T. Jaakkola and T. Richardson, Eds.
[5] J. Griffith and C. O'Riordan, "Collaborative filtering," National University of Ireland, Galway, Tech. Rep. NUIG-IT-160900, 2000.
[6] The EachMovie data set. [Online]. Available: http://research.compaq.com/SRC/eachmovie/
[7] Allmovie database containing movie information. [Online].
[8] M. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
[9] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[10] W. Buntine, "Variational extensions to EM and multinomial PCA," in Proceedings of the 13th European Conference on Machine Learning, ser. Lecture Notes in Artificial Intelligence, T. Elomaa, H. Mannila, and H. Toivonen, Eds., vol. 2430. Springer, 2002.
[11] W. Buntine and S. Perttu, "Is multinomial PCA multi-faceted clustering or dimensionality reduction?" in Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, C. Bishop and B. Frey, Eds. Society for Artificial Intelligence and Statistics, 2003, pp. 300-.
[12] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analysers," Neural Computation, vol. 11, pp. 443-482, 1999.
[13] W. Buntine, P. Myllymäki, and S. Perttu, "Language models for intelligent search using multinomial PCA," in Proceedings of the First European Web Mining Forum at the 14th European Conference on Machine Learning and the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, B. Berendt, D. Mladenić, A. Hotho, M. van Someren, M. Spiliopoulou, and G. Stumme, Eds. Ruder Boskovic Institute, 2003, pp. 37-50.
[14] N. H. Timm, Applied Multivariate Analysis. New York: Springer, 2002.
[15] I. Nabney, "Efficient training of RBF networks for classification," in Proc.
[16] M. Fannes and P. Spincemaille, "The mutual affinity of random
[17] The CRM114 spam filtering software. [Online]. Available: http://crm114.sourceforge.net/CRM114 paper.html