Topic: [Recommended] Part 6 Lecture Notes: Information Retrieval Evaluation


8 Evaluation in information retrieval



How to measure user happiness
The key utility measure is user happiness
Speed of response
Size of the index
Relevance of results
User interface design, which is independent of the quality of the results

8.1 Introduction
How to measure information retrieval effectiveness
We need a test collection consisting of three things
A document collection
A test suite of information needs, expressible as queries
A set of relevance judgments, standardly a binary assessment of either relevant or not relevant for each query-document pair

And in this test collection
A document in the test collection is given a binary classification as either relevant or not relevant
Collection and suite of information needs have to be of a reasonable size
Results are highly variable over different documents and information needs
As a rule of thumb, at least 50 information needs are needed

The difficulty
The difference between the stated information need and the query
Relevance is assessed relative to an information need, not a query
System issue
User issue
Example: for the one-word query "python", a document is relevant only if it addresses the underlying information need (e.g. the programming language rather than the snake), not merely because it contains the query word
The subjectivity of relevance decisions
Many systems contain various parameters that can be adjusted to tune system performance
The correct procedure is to tune these parameters on one or more development test collections, not on the test collection used to report final results

8.2 Standard test collections
The Cranfield collection
Text Retrieval Conference (TREC)
GOV2
NII Test Collections for IR Systems (NTCIR)
Reuters-21578 and Reuters-RCV1

The Cranfield collection
Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments for all (query, document) pairs
Allowing precise quantitative measures of retrieval effectiveness
But nowadays too small for anything but the most elementary pilot experiments

Text Retrieval Conference (TREC)
The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992
Over a range of different test collections
But the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations from 1992 to 1999
Comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles)

Relevance judgments for 450 information needs, which are called topics and specified in detailed text passages
TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles
Relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation

GOV2
Contains 25 million web pages crawled from the .gov domain
GOV2 is one of the largest Web test collections available for research, but it is still more than 3 orders of magnitude smaller than the document collections indexed by the large web search companies

NII Test Collections for IR Systems (NTCIR)
Similar sizes to the TREC collections
Focusing on East Asian language and cross-language information retrieval

Reuters-21578 and Reuters-RCV1
Mostly used for text classification evaluation
Reuters-21578 collection of 21578 newswire articles
Reuters Corpus Volume 1 (RCV1) is much larger, consisting of 806,791 documents

8.3 Evaluation of unranked retrieval sets
Precision
Recall
Accuracy
F measure

Precision
Precision (P) is the fraction of retrieved documents that are relevant

Recall
Recall (R) is the fraction of relevant documents that are retrieved

Contingency table

                 Relevant                  Not relevant
Retrieved        true positives (tp)       false positives (fp)
Not retrieved    false negatives (fn)      true negatives (tn)

P and R can also be defined in terms of the contingency table counts
P = tp/(tp + fp)
R = tp/(tp + fn)
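A minimal Python sketch (not from the original slides; the function name, variable names, and example document IDs are illustrative) of how P and R follow from the contingency counts for one query:

def precision_recall(retrieved, relevant, collection_size):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)       # relevant documents that were retrieved
    fp = len(retrieved - relevant)       # retrieved but not relevant
    fn = len(relevant - retrieved)       # relevant but not retrieved
    tn = collection_size - tp - fp - fn  # everything else
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall, (tp, fp, fn, tn)

# Hypothetical document IDs: P = 2/4 = 0.5, R = 2/3
print(precision_recall({1, 2, 3, 4}, {2, 4, 5}, collection_size=100))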

Accuracy
Accuracy is the fraction of the system's classifications that are correct
An information retrieval system can be thought of as a two-class classifier
accuracy = (tp + tn)/(tp + fp + fn + tn)

Often used for evaluating machine learning classification problems
Not an appropriate measure for information retrieval problems
Normally over 99.9% of the documents are in the not relevant category
A system can therefore maximize accuracy by simply deeming all documents nonrelevant to all queries, because tn is overwhelmingly large
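A tiny numeric illustration of this point (the collection size and relevant-document count are made up): on a skewed collection, a system that retrieves nothing at all still scores almost perfect accuracy.

collection_size = 10_000   # hypothetical collection
n_relevant = 5             # hypothetical number of relevant documents
tp, fp = 0, 0              # the system retrieves nothing at all
fn = n_relevant - tp
tn = collection_size - tp - fp - fn
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)            # 0.9995, even though recall is 0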

The comparison of P and R
Typical web surfers prefer P to R
Various professional searchers such as paralegals and intelligence analysts prefer R to P
Individuals searching their hard disks prefer R to P

The two quantities clearly trade off against one another
Recall is a non-decreasing function of the number of documents retrieved
Can always get a recall of 1 (but very low precision) by retrieving all documents for all queries
Precision usually decreases as the number of documents retrieved is increased

F measure
A single measure that trades off precision versus recall


F = 1 / (α/P + (1 - α)/R) = (β² + 1)PR / (β²P + R), where β² = (1 - α)/α
α ∈ [0, 1] and thus β² ∈ [0, ∞)
The default balanced F measure uses β = 1 (i.e. α = 1/2), giving F1 = 2PR / (P + R)

Values of β < 1 emphasize precision, while values of β > 1 emphasize recall
Why use the harmonic mean rather than the simpler arithmetic mean?
Recall that we can always get 100% recall by just returning all documents, and therefore we can always get at least a 50% arithmetic mean by the same process
The harmonic mean is always less than or equal to the arithmetic and geometric means, and it is often quite close to the minimum of the two numbers
This is why the arithmetic mean is an unsuitable measure: it is pulled toward the larger of the two values, whereas the harmonic mean stays close to the smaller one and so punishes an extreme imbalance between precision and recall
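A short sketch of the F measure defined above (the function name and the example numbers are illustrative), contrasting it with the arithmetic mean for the "return everything" case:

def f_measure(precision, recall, beta=1.0):
    # F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives the balanced F1
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Returning all documents: R = 1 but P is tiny, e.g. 0.001
P, R = 0.001, 1.0
print((P + R) / 2)      # arithmetic mean: about 0.5, looks respectable
print(f_measure(P, R))  # F1: about 0.002, close to min(P, R)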

8.4 Evaluation of ranked retrieval results
Precision, recall, and the F measure are set-based measures and are computed using unordered sets of documents
In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents

The types
Precision-recall curve
Mean Average Precision (MAP)
Precision at k
R-precision
ROC curve

Precision-recall curve
If the (k + 1)th document retrieved is nonrelevant then recall is the same as for the top k documents, but precision has dropped
If it is relevant, then both precision and recall increase, and the curve jags up and to the right

Precision-recall curve with interpolated precision
Removes the sawtooth jiggles in the curve
The interpolated precision at a certain recall level r is defined as the highest precision found for any recall level q ≥ r
The justification is that almost anyone would be prepared to look at more documents if it would increase the percentage of the viewed set that were relevant
Precision-recall curve with 11-point interpolated average precision
Boils the curve's information down to a few numbers

11 points means 0.0, 0.1, 0.2, … , 1.0
For each recall level, we then calculate the arithmetic mean of the interpolated precision at that recall level for each information need in the test collection
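A sketch of this computation for a single information need, assuming the result list is given as 0/1 relevance flags in rank order and the total number of relevant documents is known (both input names are illustrative):

def eleven_point_interpolated_ap(relevance_flags, total_relevant):
    points, hits = [], 0
    for i, rel in enumerate(relevance_flags, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / i))  # (recall, precision) after rank i
    def interpolated(r):
        # highest precision found at any recall level >= r
        candidates = [p for rec, p in points if rec >= r]
        return max(candidates) if candidates else 0.0
    levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return sum(interpolated(r) for r in levels) / 11

print(eleven_point_interpolated_ap([1, 0, 1, 0, 0, 1], total_relevant=3))

For a whole test collection, the interpolated precision at each of the 11 recall levels would then be averaged over all information needs, as described above.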

Mean Average Precision (MAP)
Most standard among the TREC community
A single-figure measure of quality across recall levels
Has especially good discrimination and stability

For a single information need, average precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved; MAP is the mean of these average precision values over all information needs

The set of relevant documents for an information need qj ∈ Q is {d1, ..., dmj}
Rjk is the set of ranked retrieval results from the top result until you get to document dk
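With this notation the measure can be written as MAP(Q) = (1/|Q|) Σj (1/mj) Σk Precision(Rjk). A sketch under the same assumed input format as earlier (0/1 relevance flags in rank order plus the number of relevant documents per information need; names are illustrative):

def average_precision(relevance_flags, total_relevant):
    hits, precisions = 0, []
    for i, rel in enumerate(relevance_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision of the top i documents
    # relevant documents that are never retrieved contribute a precision of 0
    return sum(precisions) / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    # runs: one (relevance_flags, total_relevant) pair per information need
    return sum(average_precision(f, m) for f, m in runs) / len(runs)

print(mean_average_precision([([1, 0, 1], 2), ([0, 1, 0, 0], 3)]))  # (0.833 + 0.167) / 2 = 0.5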

Properties and disadvantages of MAP
Fixed recall levels are not chosen and there is no interpolation; MAP roughly corresponds to the area under the uninterpolated precision-recall curve
Calculated scores normally vary widely across information needs (e.g. from 0.1 to 0.7)
So a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries

Precision at k
What matters is rather how many good results there are on the first page or the first few pages, especially for web search engines
This method measures precision at fixed low levels of retrieved results, such as 10 or 30 documents
It has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantage that it is the least stable of the commonly used evaluation measures
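A minimal sketch (same assumed 0/1 relevance-flag input as above):

def precision_at_k(relevance_flags, k):
    # fraction of the top k results that are relevant
    return sum(relevance_flags[:k]) / k

print(precision_at_k([1, 0, 1, 0, 1, 0, 0, 0, 0, 0], k=10))  # 0.3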

R-precision
It requires having a set of known relevant documents of size Rel, from which we calculate the precision of the top Rel documents returned
Like Precision at k, R-precision describes only one point on the precision-recall curve
If r of the top Rel returned documents are relevant, then both the precision and the recall of this set are r/Rel, so R-precision is identical to the break-even point of the precision-recall curve

By contrast, with precision at k even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 relevant documents, whereas R-precision adapts to the size of the relevant set
Averaging R-precision across queries thus makes more sense
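A sketch of R-precision under the same input assumptions; the returned value r/Rel is simultaneously the precision and the recall of the top Rel results, which is the break-even property noted above:

def r_precision(relevance_flags, num_relevant):
    # precision of the top |Rel| results, where |Rel| = number of known relevant documents
    r = sum(relevance_flags[:num_relevant])
    return r / num_relevant

print(r_precision([1, 0, 1, 0, 0, 1, 0, 0], num_relevant=4))  # 2/4 = 0.5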

ROC curve
ROC stands for Receiver Operating Characteristic
An ROC curve plots the true positive rate (sensitivity, i.e. recall) against the false positive rate (1 - specificity)
The false positive rate is given by fp/( fp+tn )
Specificity is given by tn/( fp + tn)

ROC curve always goes from the bottom left to the top right of the graph
For a good system, the graph climbs steeply on the left side
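A sketch of how the ROC points could be traced out over a ranked list, assuming 0/1 relevance flags and known totals of relevant and nonrelevant documents in the collection (names and example numbers are illustrative):

def roc_points(relevance_flags, total_relevant, total_nonrelevant):
    tp = fp = 0
    points = [(0.0, 0.0)]
    for rel in relevance_flags:
        if rel:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_nonrelevant, tp / total_relevant))  # (FPR, TPR)
    return points

print(roc_points([1, 1, 0, 1, 0], total_relevant=3, total_nonrelevant=7))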

8.5 Assessing relevance
About information needs
Information needs must be germane to the documents
Information needs are best designed by domain experts
Using random combinations of query terms as an information need is generally not a good idea because typically they will not resemble the actual distribution of information needs

About assessment
Time-consuming and expensive process involving human beings
For tiny collections like Cranfield, exhaustive judgments of relevance for each query and document pair were obtained
For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query
The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by each of a number of different IR systems (usually the ones being evaluated)

Humans and their relevance judgments are quite variable
But this is not a problem to be solved
In the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these variable humans
But we still need to measure how much agreement there is among these assessments

Kappa statistic
It measures how much judges agree on relevance judgments, beyond the agreement expected by chance


kappa = (P(A) - P(E)) / (1 - P(E))
P(A) is the proportion of the times the judges agreed
P(E) is the proportion of the times they would be expected to agree by chance, which needs to be estimated

  

    
Example: relevance judgments by two judges on 400 (query, document) pairs

                Judge 2: Yes   Judge 2: No   Total
Judge 1: Yes         300            20        320
Judge 1: No           10            70         80
Total                310            90        400

P(A) = (300 + 70)/400 = 370/400 = 0.925
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665
kappa = (P(A) - P(E))/(1 - P(E)) = (0.925 - 0.665)/(1 - 0.665) = 0.776
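The same calculation as a small Python sketch, using the pooled marginals for P(E) as above (the variable names are illustrative):

yes_yes, yes_no, no_yes, no_no = 300, 20, 10, 70  # counts from the agreement table
total = yes_yes + yes_no + no_yes + no_no         # 400 judged pairs

p_a = (yes_yes + no_no) / total                                       # P(A) = 0.925
p_relevant = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * total)  # 0.7875
p_nonrelevant = ((yes_no + no_no) + (no_yes + no_no)) / (2 * total)   # 0.2125
p_e = p_relevant ** 2 + p_nonrelevant ** 2                            # P(E) ≈ 0.665
kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 3))                                                # 0.776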

Result
Kappa value above 0.8 is taken as good agreement
Kappa value between 0.67 and 0.8 is taken as fair agreement
Kappa value below 0.67 is seen as dubious basis for evaluation
If there are more than two judges, it is normal to calculate an average pairwise kappa value

The level of agreement in TREC normally falls in the range of “fair” (0.67–0.8)
This means that more fine-grained relevance labeling is not required
These variations between assessors have in general been found to have little impact on the relative effectiveness rankings of systems, despite the variation in individual assessors' judgments

About relevance
Some dubious assumptions
The relevance of one document is treated as independent of the relevance of other documents in the collection
Assessments are binary and not nuanced
Information need is treated as an absolute, objective decision
So any results are heavily skewed by the choice of collection, queries, and relevance judgment set

Marginal relevance
It means whether a document still has distinctive usefulness after the user has looked at certain other documents
The most extreme case of this is documents that are duplicates – a phenomenon that is actually very common on the Web
Maximizing marginal relevance requires returning documents that exhibit diversity and novelty
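One widely used way to operationalize this idea (not part of the original slides) is maximal marginal relevance (MMR), which greedily selects the next document by trading query relevance off against similarity to documents already chosen; the similarity functions and the lam weight below are placeholders.

def mmr_select(candidates, sim_to_query, sim, k, lam=0.7):
    # sim_to_query: dict doc -> relevance score; sim(d1, d2): inter-document similarity
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            # penalize redundancy with already-selected documents
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim_to_query[d] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected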
