
[Recommended] Part 5 Lecture Notes: Vector Space Model

4 Vector space model

Contents of this Chapter
Parametric and zone indexes
Term frequency and weighting
The vector space model for scoring
Variant tf-idf functions
Heuristics for speeding up VSM computation
Components of an information retrieval system

VSM
How do you choose a desirable apple at the fruit stall?
We face a large collection but need only a few items
Rank-order the apples based on a score
In fact, we need not check every apple
The VSM is the study of assigning a score to a query-document pair

4.1 Parametric and zone indexes
Besides being viewed as a sequence of terms, most documents have additional structure: metadata
author(s)
title
date of publication
Metadata is typically composed of fields and their corresponding values

Parametric indexes
A parametric index is built for each field (say, date of creation) and allows us to select only the documents matching a value specified in the query
Example of retrieval based on fields
Find documents authored by William Shakespeare in 1601
Merge postings from the parametric indexes, i.e. the separate indexes built for each field
Some of the fields may assume ordered values, such as dates
The search engine may support querying ranges on such ordered values

Zones and fields
Zones are similar to fields, except that the contents of a zone can be arbitrary free text
titles and abstracts
Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text
The possible values of a field should be thought of as finite, for instance, the set of all dates of authorship

The difference between a field index and a zone index
The dictionary for a parametric index comes from a fixed vocabulary
The dictionary for a zone index must accommodate whatever vocabulary stems from the text of that zone
There are two ways to implement a zone index: encode the zone in the dictionary (as zone-qualified terms) or encode it in the postings
The latter keeps the dictionary smaller

Weighted zone scoring
Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value
Weighted zone scoring is sometimes referred to also as ranked Boolean retrieval

How to compute weighted zone scoring
Let g1, ..., gl ∈ [0, 1] with ∑gi = 1; the weighted zone score is ∑gi·si
where si is the Boolean score denoting a match (or absence thereof) between q and the ith zone
Consider the query shakespeare in a collection in which each document has three zones: author, title and body
Weighted zone scoring in such a collection would require three weights g1, g2 and g3, respectively corresponding to the author, title and body zones. Suppose we set g1 = 0.2, g2 = 0.3 and g3 = 0.5 (so that the three weights add up to 1); this corresponds to an application in which a match in the author zone is least important to the overall score, the title zone somewhat more, and the body contributes even more
Thus if the term shakespeare were to appear in the title and body zones but not the author zone of a document, the score of this document would be 0.8

How to implement weighted zone scoring
A simple approach would be to compute the score for each document in turn, adding in all the contributions from the various zones
We may instead compute weighted zone scores directly from inverted indexes, traversing the postings of the query terms zone by zone (a sketch follows this list)
We can consider more complex functions than the Boolean match, such as ones based on term frequency
We may assign a non-zero score to a document even if it does not contain all query terms
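
A minimal sketch of weighted zone scoring over a toy in-memory index, using the author/title/body weights from the shakespeare example above; the index layout and document IDs are illustrative assumptions.

```python
# Minimal weighted zone scoring sketch; index layout and doc IDs are illustrative.
ZONE_WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}  # the weights gi, summing to 1

# Toy zone-qualified inverted index: term -> zone -> set of doc IDs
index = {
    "shakespeare": {
        "author": {1},
        "title": {1, 2},
        "body": {2, 3},
    },
}

def weighted_zone_score(term: str, doc_id: int) -> float:
    """Sum gi * si over zones, where si is 1 iff the term occurs in zone i of the doc."""
    score = 0.0
    for zone, g in ZONE_WEIGHTS.items():
        if doc_id in index.get(term, {}).get(zone, set()):
            score += g
    return score

print(weighted_zone_score("shakespeare", 2))  # title + body match: 0.3 + 0.5 = 0.8
```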

How to set the weights
They could be specified by an expert
They could be learned from training examples using a machine-learned relevance algorithm

4.2 Term frequency and weighting
A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score
Especially for free text queries, a plausible scoring mechanism is to compute a score that is the sum, over the query terms, of the match scores between each term and the document

How to score the documents
The simplest approach is to assign the weight to be equal to the number of occurrences of term t in document d
This weighting scheme is referred to as term frequency and is denoted tft,d
More generally, map the number of occurrences of t in d to a positive real value
The document is viewed under the bag of words model, in which the exact ordering of the terms is ignored
We may decide not to index all words, e.g. stop words

The disadvantage of TF
Raw term frequency as above suffers from a critical problem: all terms are considered equally important
Stop words
Other terms that have little or no discriminating power in determining relevance
A collection of documents on the auto industry is likely to have the term auto in almost every document, so auto has little discriminating power there

Collection frequency & document frequency
Collection frequency
Defined to be the total number of occurrences of a term in the collection
Document frequency
Defined to be the number of documents in the collection that contain a term t
DF is the more commonly used statistic

The comparison of CF and DF
Complexity of computing
DF is a document-level statistic
CF is a collection-wide statistic
A more important reason: discriminating power
An example from the Reuters collection: insurance and try have roughly equal collection frequencies but very different document frequencies
We want the few documents that contain insurance to get a higher boost for a query on insurance than the many documents containing try get for a query on try

Inverse document frequency
idft = log(N/dft), where N is the total number of documents in the collection
The idf of a rare term is high, whereas the idf of a frequent term is likely to be low
An example: the Reuters collection of 806,791 documents
Tf-idf weighting
tf-idft,d = tft,d × idft
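
A minimal sketch of the tf, df, idf and tf-idf computations on a toy corpus; the documents and the whitespace tokenization are illustrative assumptions.

```python
# Minimal tf-idf sketch over a toy corpus; documents and tokenization are illustrative.
import math
from collections import Counter

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}  # tft,d per document
df = Counter(t for counts in tf.values() for t in counts)    # dft: docs containing t

def idf(t: str) -> float:
    return math.log10(N / df[t])            # idft = log(N/dft); rare terms score high

def tf_idf(t: str, d: int) -> float:
    return tf[d][t] * idf(t)                # tf-idft,d = tft,d x idft

print(tf_idf("july", 2))   # appears in 2 of 3 docs: modest idf
print(tf_idf("sales", 1))  # appears in every doc: idf (and hence tf-idf) is 0
```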

Document vector
Each document is viewed as a vector with one component corresponding to each term in the dictionary, together with a weight for each component
Overlap score measure
The score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d

4.3 The vector space model for scoring
How do we quantify the similarity between two documents in this vector space or between query and documents?
Overlap score measure
Dot products

Cosine similarity
sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)
The numerator is the dot product (also known as the inner product) of the vectors V(d1) and V(d2), while the denominator is the product of their Euclidean lengths
The effect of the denominator is thus to length-normalize the vectors V(d1) and V(d2) to unit vectors

Utility of this scoring
Identify a document and seek others like it
“more like this” feature in search engines
Compute the similarity of a query and a document
Note that a document may have a high cosine score for a query even if it does not contain all query terms
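
A minimal sketch of cosine similarity between sparse term-weight vectors (term -> weight dicts); the example vectors are illustrative. Note that d1 scores well against d2 even though d2 contains a term absent from d1, matching the last point above.

```python
# Minimal cosine similarity sketch over sparse vectors; the vectors are illustrative.
import math

def cosine(v1: dict, v2: dict) -> float:
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())  # numerator: V(d1) . V(d2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))       # |V(d1)|
    n2 = math.sqrt(sum(w * w for w in v2.values()))       # |V(d2)|
    return dot / (n1 * n2) if n1 and n2 else 0.0

d1 = {"jealous": 10.0, "gossip": 2.0}
d2 = {"jealous": 7.0, "gossip": 11.0, "wuthering": 6.0}
print(cosine(d1, d2))
```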

4.4 Variant tf-idf functions
Sublinear tf scaling
Maximum tf normalization
Document and query weighting schemes
Pivoted normalized document length

4.4.1 Sublinear tf scaling
It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence
A common modification therefore uses the logarithm of the term frequency: wft,d = 1 + log tft,d if tft,d > 0, and 0 otherwise

4.4.2 Maximum tf normalization
This method normalizes the tf weights of all terms occurring in a document by the maximum tf in that document: ntft,d = a + (1 − a) × tft,d / tfmax(d)
The term a is a smoothing term (a ∈ [0, 1], with values around 0.4 commonly used) whose role is to damp the contribution of the tf term as scaled by the largest tf value in d

The motivation for this method
Longer documents often have higher term frequencies merely because they tend to repeat the same words over and over again
Extreme example: take a document d and create a new document d′ by simply appending a copy of d to itself; under raw tf, d′ would be assigned twice as high a score as d, although it contains no new information

The disadvantages of this method
The method is unstable: a change in the stop word list can dramatically alter term weightings
A document may contain an outlier term, with an unusually large number of occurrences that is not representative of the document's content

4.4.3 Document and query weighting schemes
Vector space scoring hinges on the specific choice of weights
Some principal weighting schemes

SMART notation
The mnemonic for representing a combination of weights takes the form ddd.qqq
The first triplet gives the term weighting of the document vector
The second triplet gives the weighting in the query vector
The first letter in each triplet specifies the term frequency component of the weighting
The second letter specifies the document frequency component
The third letter specifies the form of normalization used
For example:
lnc.ltc means the document vector has log-weighted term frequency, no idf (for both effectiveness and efficiency reasons), and cosine normalization, while the query vector uses log-weighted term frequency, idf weighting, and cosine normalization
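
A minimal sketch of lnc.ltc scoring for a single query-document pair, following the description above; the collection size N, the df values and the tf counts are illustrative assumptions.

```python
# Minimal lnc.ltc sketch; N, df values and tf counts are illustrative.
import math

N = 1_000_000  # assumed collection size

def log_tf(tf: int) -> float:
    return 1 + math.log10(tf) if tf > 0 else 0.0

def cosine_normalize(v: dict) -> dict:
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else v

def lnc_ltc_score(query_tf: dict, doc_tf: dict, df: dict) -> float:
    # Document vector: log-weighted tf, no idf, cosine normalization ("lnc")
    d = cosine_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})
    # Query vector: log-weighted tf, idf, cosine normalization ("ltc")
    q = cosine_normalize(
        {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}
    )
    # The score is the dot product of the two normalized vectors
    return sum(w * d.get(t, 0.0) for t, w in q.items())

df = {"best": 50_000, "car": 10_000, "insurance": 1_000}
print(lnc_ltc_score({"best": 1, "car": 1, "insurance": 1},
                    {"car": 1, "insurance": 2}, df))
```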

4.4.4 Pivoted normalized document length
Normalizing by the Euclidean length of the vector eliminates all information on the length of the original document, turning all document vectors into unit vectors
But document length still carries information: longer documents
Have higher tf values
Contain more distinct terms

Pivoted document length normalization
Suppose that we were given, for each query q and for each document d, a Boolean judgment of whether or not d is relevant to q
Compute the probability of relevance as a function of document length, averaged over all queries
Bucket documents by length and compute the fraction of relevant documents in each bucket; the resulting plot may look like the curve drawn in thick lines (in the accompanying figure)
The curve in thin lines shows what might happen with the same documents and query ensemble if we were to use the relevance prescribed by the cosine normalization equation
The thin and thick curves cross over at a point p corresponding to document length Lp, which we refer to as the pivot length
The idea of pivoted document length normalization is to "rotate" the cosine normalization curve counter-clockwise about p so that it more closely matches the thick line representing the relevance vs. document length curve
The normalization factor for each document vector is then not the Euclidean length of that vector, but one that is larger than the Euclidean length for documents shorter than Lp, and smaller for longer documents (a sketch follows)
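
A minimal sketch of such a pivoted normalization factor, assuming the common linear form a·|V(d)| + (1 − a)·pivot; the slope a and the pivot value are illustrative tuning parameters, not values from the slides.

```python
# Minimal pivoted length normalization sketch; a and pivot are illustrative.
import math

def euclidean_length(v: dict) -> float:
    return math.sqrt(sum(w * w for w in v.values()))

def pivoted_factor(v: dict, a: float = 0.75, pivot: float = 10.0) -> float:
    u = euclidean_length(v)
    # Larger than u when u < pivot (short docs), smaller when u > pivot (long docs)
    return a * u + (1 - a) * pivot

short_doc = {"home": 2.0, "sales": 1.0}            # Euclidean length well below the pivot
long_doc = {t: 3.0 for t in "abcdefghijklmnop"}    # Euclidean length well above the pivot
for v in (short_doc, long_doc):
    print(euclidean_length(v), pivoted_factor(v))
```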

4.5 Heuristics for speeding up VSM computation
The principal cost in computing the output stems from computing cosine similarities between the query and a large number of documents
Achieve speed at the risk of not finding quite the top K documents matching the query

4.5.1 Speedup of standard cosine similarity
We are really interested in the relative (rather than absolute) scores
All non-zero components of the query vector can therefore be set to 1
For any document d, the cosine similarity is then the weighted sum, over all terms in the query q, of the weights of those terms in d
This can be computed by a postings intersection
The final presentation step need not sort the complete set of scores
A better approach is to select only the top K documents in order, e.g. with a heap (a sketch follows)
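
A minimal sketch of selecting the top K scores with a heap instead of a full sort; the scores and K are illustrative.

```python
# Minimal top-K selection sketch; scores and K are illustrative.
import heapq

scores = {1: 0.3, 2: 0.9, 3: 0.1, 4: 0.7, 5: 0.5}  # doc ID -> cosine score
K = 3

# heapq.nlargest selects the K best without sorting all N scores
top_k = heapq.nlargest(K, scores.items(), key=lambda item: item[1])
print(top_k)  # [(2, 0.9), (4, 0.7), (5, 0.5)]
```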

4.5.2 Inexact top K document retrieval
Can dramatically lower the cost of computing the top K documents
In fact, cosine similarity is itself only a proxy for the user's perceived relevance
Good inexact retrieval methods therefore do not materially alter the user's perceived relevance of the top K results

The main idea of these methods
Eliminate a large number of documents without computing their cosine scores
Having a large number of documents in contention increases the cost of computing cosine similarities and selecting the winners in the final stage
Well-suited to free text queries, but not to Boolean or phrase queries

Two-step scheme
Step 1: Find a set A of documents that are contenders, where K < |A| << N. A does not necessarily contain the K top-scoring documents for the query, but is likely to have many documents with scores near those of the top K
Step 2: Return the K top-scoring documents in A
Methods
Index elimination
Champion lists
Static quality scores and ordering

4.5.2.1 Index elimination
For a multi-term query q, the basic algorithm already considers only documents containing at least one of the query terms
Index elimination goes further in two ways
Only consider documents containing terms whose idf exceeds a preset threshold
Because the postings lists of low-idf terms are generally long
Because low-idf terms are often treated as stop words and do not contribute to scoring
Only consider documents that contain many (and as a special case, all) of the query terms

The disadvantage
May end up with fewer than K candidate documents in the output

4.5.2.2 Champion lists
For each term t in the dictionary, we can precompute the set of the r documents with the highest weights for t (the value of r is chosen in advance)
For tf-idf weighting, these would be the r documents with the highest tf values for term t
How to do it
Given a query q, take the union of the champion lists of the terms comprising q, call this set A, and restrict the cosine computation to the documents in A

The disadvantages
A critical parameter in this scheme is the value r, which is highly application dependent
There is no reason to require the same value of r for all terms in the dictionary
r could be set higher for rarer terms than for common terms (a sketch follows)
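
A minimal sketch of champion-list construction and use; the index contents, the weights and r are illustrative assumptions.

```python
# Minimal champion lists sketch; weights and r are illustrative.
import heapq

# term -> {doc ID: tf-idf weight}
weights = {
    "car": {1: 0.5, 2: 0.1, 3: 0.8, 4: 0.3},
    "insurance": {2: 0.9, 4: 0.6, 5: 0.2},
}

r = 2  # champion list size, chosen in advance (could vary per term)
champions = {
    t: {d for d, _ in heapq.nlargest(r, w.items(), key=lambda kv: kv[1])}
    for t, w in weights.items()
}

def contenders(query_terms) -> set:
    """A = union of the champion lists of the query terms."""
    A = set()
    for t in query_terms:
        A |= champions.get(t, set())
    return A

print(contenders(["car", "insurance"]))  # cosine scores are computed only for these docs
```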

4.5.2.3 Static quality scores and ordering
In many search engines, we have an available measure of quality g(d) for each document d that is query-independent and thus static
For instance, in the context of news stories on the web, g(d) may be derived from the number of favorable reviews of the story by web surfers
A direct extension of champion lists
For each term t, maintain a global champion list of the r documents with the highest values of g(d) + tf-idft,d (a sketch follows)
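
A minimal sketch of a global champion list ordered by g(d) + tf-idf; the static scores g, the weights and r are illustrative assumptions.

```python
# Minimal global champion list sketch; g(d), weights and r are illustrative.
import heapq

g = {1: 0.2, 2: 0.7, 3: 0.1, 4: 0.4}                 # static, query-independent quality g(d)
weights = {"car": {1: 0.5, 2: 0.1, 3: 0.8, 4: 0.3}}  # tf-idf weights for each term

r = 2
global_champions = {
    t: heapq.nlargest(r, w, key=lambda d: g[d] + w[d])  # order docs by g(d) + tf-idf
    for t, w in weights.items()
}
print(global_champions["car"])  # [3, 2]: the docs with the highest g(d) + tf-idf
```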

Several common measures of static quality
Measures based on hyperlink structure
Measures based on Web user access models
Measures based on site traffic
Measures based on site authority
Measures based on user relevance feedback

Measures based on hyperlink structure
Link-based analysis holds that a popular page linked to by many other pages is more important and of higher quality
This approach has proven quite successful in practice; Google's PageRank, for example, weights pages in exactly this way (a sketch follows)
Disadvantages
Web pages lack the authority-vetting features of the traditional publishing environment (where literature earns recognition through peer review and similar processes)
It easily produces the so-called "rich-get-richer" effect, which is particularly unfair to newly published high-quality pages that cannot yet attract many in-links, compared with established high-quality pages
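
A minimal power-iteration sketch of PageRank; the link graph, the damping factor and the iteration count are illustrative assumptions.

```python
# Minimal PageRank power-iteration sketch; graph and damping factor are illustrative.
links = {1: [2, 3], 2: [3], 3: [1], 4: [3]}  # page -> pages it links to
N = len(links)
d = 0.85                                     # damping factor
pr = {p: 1.0 / N for p in links}             # start from a uniform distribution

for _ in range(50):
    nxt = {p: (1 - d) / N for p in links}
    for p, outs in links.items():
        share = d * pr[p] / len(outs)        # each page splits its rank among out-links
        for q in outs:
            nxt[q] += share
    pr = nxt

print(pr)  # page 3, with the most in-links, ends up with the highest rank
```
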
Measures based on Web user access models
This measure rests on the assumption that if users create hyperlinks to a page within a short time of browsing it, the page can be considered to be of high quality
Obviously, a single user's linking behavior does not necessarily reflect a page's quality, but aggregated over the majority of users, this statistical summary can objectively indicate page quality
One approach uses the change in PageRank per unit time, though it suffers from topic bias, i.e. it favors pages discussing currently popular topics

Measures based on site traffic
Pages are ranked by comparing the visit traffic of their sites
But sites covering popular topics usually attract more traffic
Some researchers have therefore proposed traffic rankings computed separately per topic

Measures based on site authority
Some researchers claim that a site's authority directly affects, to some degree, the quality of its information content
This authority has two sources: competence and trustworthiness
In addition, many services that evaluate Web information emphasize a site's content reputation, with concrete criteria including relevance, reliability, authority, quality of content, usability and objectivity
In recent years, ranking services from organizations such as the Global Information Infrastructure Awards have also begun to evaluate site quality, including the credentials of authors

Measures based on user relevance feedback
Information retrieval systems collect relevance feedback from users in two main ways
One is the explicit method, which asks users to actively mark the documents relevant to their query while searching; although its effect is clear, it increases the burden on users, and ordinary users are rarely willing to cooperate with this kind of data collection
The other is the implicit method, which generally requires no deliberate submission: it observes user behavior and uses it to indirectly assess the relevance of result documents. It rests on the assumption that users continually make implicit relevance judgments about the results while searching
In theory, the information obtained implicitly is not very accurate, but the implicit method has advantages the explicit method cannot match
It does not increase the burden on users
Because it depends on deliberate submission, the explicit method collects rather limited information; by comparison, the implicit method collects more, and more detailed, information
Even if errors exist, once enough data samples are collected, standard data analysis methods can filter out the noisy data quite well
Researchers have proposed many channels through which relevance feedback can be collected implicitly
Clicking on certain documents in the search result list
Scrolling within a page's text
Bookmarking a page
Printing a page
And the time spent browsing a page
Some of these indicators are disputed; some researchers argue that the time spent browsing a page does not reliably represent how relevant the user finds it, mainly because factors unrelated to relevance interfere: the task itself, the characteristics of the document collection and the retrieval environment all affect browsing time
Other researchers have proposed combined methods to improve the analysis of implicit feedback
For example, jointly considering browsing time, whether a page is printed or saved, scrolling and bookmarking behavior yields better results
Still others argue that, outside experimental settings, combining the number of clicks a user issues per query with the total time the user interacts with the retrieval system can effectively measure satisfaction with the result documents
Overall, however, the experimental results have not been very satisfactory, and even where the approach is feasible, the necessary data are hard to collect, sometimes impossible to collect in any quantity
Clickstream data has thus become one of the better data sources for implicit analysis: it is easy to collect outside experimental settings, and it is more accurate than the other sources of implicitly collected relevance feedback
It rests on the assumption that clicked documents should be more relevant than documents that were not clicked
In theory, clickstream analysis is a form of collaborative filtering

4.5.3 How to handle too few results
Multi-tiered postings lists
Maintain for each term t two postings lists consisting of disjoint sets of documents
The first list, which we call high, contains the m documents with the highest tf values for t
The second list, which we call low, contains all other documents containing t
When processing a query, we first scan only the high lists of the query terms, computing net scores for any document on the high lists of all (or more than a certain number of) query terms
If we obtain scores for K documents in the process, we terminate
If not, we continue the scanning into the low lists, scoring documents in these postings lists
In the example (from the accompanying figure), we set a tf threshold of 20 for tier 1 and 10 for tier 2
The tier 1 index only has postings entries with tf values exceeding 20
The tier 2 index only has postings entries with tf values exceeding 10 (a sketch of the two-tier case follows)
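
A minimal sketch of two-tier query processing with high and low tf tiers; the index contents and K are illustrative assumptions.

```python
# Minimal tiered-index sketch; index contents and K are illustrative.
K = 2

# term -> (high list, low list): disjoint doc-ID sets, split by a tf threshold
tiers = {
    "rates": ({1, 2}, {3, 4, 5}),
    "interest": ({2, 6}, {1, 7}),
}

def candidates(query_terms, K):
    # First scan only the high lists of the query terms
    docs = set.union(*(tiers[t][0] for t in query_terms))
    if len(docs) >= K:
        return docs
    # Too few results: continue the scan into the low lists
    return docs | set.union(*(tiers[t][1] for t in query_terms))

print(candidates(["rates", "interest"], K))  # only these docs are scored
```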

4.6 Components of an information retrieval system
How to treat free text queries
The answer of course depends on the user population, the query distribution and the collection of documents
A common process (for a query of, say, 3 terms, such as "rising interest rates"; a sketch follows)
Run the user-generated query string as a phrase query and rank the matching documents using vector space scoring
If fewer than K documents contain the phrase, run the two 2-term phrase queries rising interest and interest rates; rank these using vector space scoring as well
If we still have fewer than K results, run the vector space query consisting of the three individual query terms
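
A minimal sketch of this fallback cascade; the two search functions are illustrative stubs standing in for a real phrase index and a real vector space scorer.

```python
# Minimal fallback-cascade sketch; the search functions are illustrative stubs.
K = 10

def phrase_search(phrase: str) -> list:
    return []  # stub: would return docs containing the exact phrase, ranked by VSM score

def vector_space_rank(terms: list) -> list:
    return []  # stub: would return docs ranked by cosine score on the individual terms

def free_text_query(query: str, K: int = K) -> list:
    terms = query.split()
    results = phrase_search(query)                      # step 1: the full phrase
    if len(results) < K and len(terms) == 3:
        for bigram in (" ".join(terms[:2]), " ".join(terms[1:])):
            results += phrase_search(bigram)            # step 2: the two 2-term phrases
    if len(results) < K:
        results += vector_space_rank(terms)             # step 3: individual query terms
    return results[:K]

print(free_text_query("rising interest rates"))
```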
