[推荐]第三部分课堂讲义之三——容错性检索 - 信息检索原理课件 - 课外天地李树青

2 Basic information retrieval

PPT课件下载链接：

下载信息 [文件大小：下载次数： ]
点击浏览该文件:

2.3 Tolerant retrieval
Search structures for dictionaries
Wildcard queries
Spelling correction
Phonetic correction

2.3.1 Search structures for dictionaries
The dictionary and has two broad classes of solutions:
Hashing
Search trees

The keys of dictionary
The entries in the dictionary are often referred to as keys
The pair of key and values often exists in data structure

2.3.1.1 Hashing
Often used for dictionary lookup in some search engines
Each vocabulary term ( key ) is hashed into an integer over a large enough space that hash collisions are unlikely
java.security.MessageDigest

Advantage of hashing dictionary
At query time, we hash each query term separately and following a pointer to the corresponding postings
There is no easy way to find minor variants of a query term, which is guaranteed by Hash arithmetic
The speed is so fast and easy

Disadvantage of hashing dictionary
We cannot seek all terms beginning with specified prefix
In Web setting where the size of the dictionary keeps growing, a hash function designed for current needs may not suffice in a few years’ time

2.3.1.2 Search trees
Search trees overcome many issues of hash dictionary
It should be noted that unlike hashing, search trees demand that the characters used in the document collection have a prescribed ordering

Binary tree
The best-known search tree is the binary tree, each internal node represents a binary test

Advantage of binary tree dictionary
Permit us to enumerate all vocabulary terms beginning with given prefix
Efficient search (with a number of comparisons that is O(logM)) hinges on the tree being balanced

Disadvantage of binary tree dictionary
The principal issue here is that of rebalancing
As terms are inserted into or deleted from the binary search tree, it needs to be rebalanced

B-tree
To mitigate rebalancing, one approach is to allow the number of sub-trees under an internal node to vary in a fixed interval
B-tree is a search tree in which every internal node has a number of children in the interval [a, b] (a and b are positive integers)
A B-tree may be viewed as collapsing multiple levels of the binary tree into one

Advantage of B-tree dictionary
Especially advantageous when some of the dictionary is disk-resident, in which case this collapsing serves the function of pre-fetching imminent binary tests
In such cases, the integers a and b are determined by the sizes of disk blocks

2.3.2 Wildcard queries
Trailing wildcard query
Leading wildcard query
General wildcard query

Used in following situations
The user is uncertain of the spelling of a query term
e.g., Sydney vs. Sidney, which leads to the wildcard query S*dney
The user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any of the variants
e.g., color vs. colour
The user seeks documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine performs stemming
e.g., judicial vs. judiciary, leading to the wildcard query judicia*
The user is uncertain of the correct rendition of a foreign word or phrase
e.g., the query Universit* Stuttgart

Trailing wildcard query
A query such as mon* is known as it, because * is at the end of the search string
A search tree on the dictionary is a convenient way of handling it
Walk down the tree following the symbols m, o and n in turn
Use |W| lookups on the standard inverted index to retrieve all documents containing any term in W
|W| means the number of terms in W

Leading wildcard query
For example
queries of the form *mon
A reverse B-tree on the dictionary can be used
For example:
The term lemon would be represented by the path root-n-o-m-e-l

General wildcard query
For example
queries of the form se*mon

How to handle it
Using regular B-tree with reverse B-tree
Permuterm index
K-gram index

Using regular B-tree with reverse B-tree
For example：query se*mon
Use the regular B-tree to enumerate the set W of dictionary terms beginning with the prefix se
Use the reverse B-tree to enumerate the set R of terms ending with the suffix mon
Take the intersection W ∩
Use the standard inverted index to retrieve all documents containing any terms in this intersection

Permuterm index
First, we introduce a special symbol $ to mark the end of a term
The term hello is shown as the augmented term hello$
Then construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term
The set of rotated terms in the permuterm index is called the permuterm vocabulary
Its dictionary may be quite large, almost ten-fold space increasing
How to handle general wildcard query
The key is to rotate such a wildcard query so that the * symbol appears at the end of the string
So the query m*n can be rotated as n$m*
But what about a query such as fi*mo*er
First enumerate the terms which are in the permuterm index of er$fi*
Then filter these out, checking each candidate to see if it contains mo, which is exhaustive enumeration

K-gram index
A k-gram is a sequence of k characters
If consider $, the full set of 3-grams generated for castle is: $ca, cas, ast, stl, tle, le$
How does such an index help us with wildcard queries
For example: query re*ve
Run the Boolean query $re AND ve$
Look up in the 3-gram index and yield a list of matching terms
Each of these matching terms is then looked up in the standard inverted index to yield documents matching the query
There is however a difficulty with the use of k-gram index
Consider using the 3-gram index for the query red*
First issue the Boolean query $re AND red to the 3-gram index
Lead to a match on terms such as retired, which contain the conjunction of the two 3-grams $re and red, yet do not match the original wildcard query red*
To cope with this, we need to introduce a post-filtering step, in which the terms enumerated are checked individually against the original query red*
What is the appropriate semantics for such a query: re*d AND fe*ri
All documents that contain any term matching re*d and any term matching fe*ri
So the processing of a wildcard query can be quite expensive
A search engine may support such rich functionality, but most commonly, the capabilityis hidden behind an interface (say an “Advanced Query” interface) that most users never use
Exposing such functionality in the search interface often encourages users to invoke it even when they do not require it (say, by typing a prefix of their query followed by a *), increasing the processing load on the search engine
注意：
Google中的通配符检索（*）只能匹配词语，而不能匹配词语的组成部分

2.3.3 Spelling correction
We may wish to retrieve documents containing the term carrot when the user types the query carot
misspelling reported on Google
http://www.google.com/jobs/britney.html
Based on utility of this misspelling list, we can recommend the correct spelling if users input wrong query words

Google V.S. Baidu
Google prefers retrieval more relevant Web pages
Baidu prefers retrieval precise relevant Web pages
Google can correct English misspelling better than Baidu
Baidu can correct Chinese misspelling better than Google

Basic principles of spelling correction
Choose the “nearest” one
This demands that we have a notion of nearness or proximity between a pair of queries
For grnt, we choose not great but grant
Choose the one that is more common
Especially when two correctly spelled queries are tied
For grnt, we choose not grunt but grant
In order to judge which one is more common, we can
consider the number of occurrences of the term in the collection
consider the number of occurrences of the term in the queries typed in by other users, especially for Web search engine

How to expose the correct words to users
On the query carot always retrieve documents containing carot as well as any “spell-corrected” version of carot, including carrot and tarot
As in first above, but only when the query term carot is not in the dictionary
As in first above, but only when the original query returned fewer than a threshold (say fewer than five)
When the original query returns fewer than a threshold , the search interface presents a spelling suggestion to the end user. this suggestion consists of the spell-corrected query term(s)
Thus, the search engine might respond to the user: “Did you mean carrot?”

Solution to misspelling
Isolated-term correction
Methods based on edit distance
Methods based on k-gram overlap
context-sensitive correction

Isolated-term correction
In isolated-term correction, we attempt to correct a single query term at a time
But such isolated-term correction would fail to detect, for instance, that the query flew form Heathrow contains a misspelling of the term from

Edit distance
Sometimes known as Levenshtein distance
Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2
Most commonly, the edit operations allowed for this purpose are
Insert a character into a string
Delete a character from a string
Replace a character of a string by another character
For example, the edit distance between cat and dog is 3
Ref:
http://www.merriampark.com/ld.htm
In fact, the notion of edit distance can be generalized to allowing different weights for different kinds of edit operations
For instance a higher weight may be placed on replacing the character s by the character p, than on replacing it by the character a (the latter being closer to s on the keyboard)
But popular edit operations have the same weight

The compute of edit distance
The typical cell [i, j] has four entries formatted as a 2 × 2 cell
The lower right entry in each cell is the min of the other three
The other three entries are the three entries m[i-1,j-1] + 0 or 1 depending on whether s1[i]=s2[j],m[i-1,j]+1 and m[i,j-1]+1
Given a set S of terms and a query string q, we seek the string having the least edit distance from q in order to get the most similar term
However, this exhaustive search is inordinately expensive
The simplest such heuristic is to restrict the search to dictionary terms beginning with the same letter as the query string
The hope would be that spelling errors do not occur in the first character of the query
A more sophisticated variant of this heuristic is to use a version of the permuterm index, in which we omit the end-of-word symbol $
For instance, if q is mase and we consider the rotation r = sema, we would retrieve terms such as semantic and semaphore
Do not have a small edit distance to q
Would miss more pertinent dictionary terms such as mare and mane
For each rotation, we omit a suffix of L characters before performing the B-tree traversal ( L can be set to 2)
This ensures that each term includes a long substring in common with q

K-gram indexes for spelling correction
Further limit the set of vocabulary terms for which we compute edit distances to the query term
In fact, we will use the k-gram index to retrieve vocabulary terms that have many k-grams in common with the query
For example:
The bigram index shows the postings for the three bigrams in the query bord
We wanted to retrieve terms that contained at least two of these three bigrams
However, terms like boardroom are not an implausible “correction” of bord
We need Jaccard coefficient for measuring the overlap between the candidate term and query term
For example:
For query q = bord reaches term t = boardroom
The Jaccard coefficient to be 2/(8+3-2)

Context-sensitive correction
Isolated-term correction would fail to correct typographical errors such as flew form Heathrow where all three query terms are correctly spelled and the all is wrong
The simplest way is to enumerate corrections of each of the query terms, then try substitutions of each correction in the phrase
fled form Heathrow
flew fore Heathrow
…
The search returns the result having the maximal number of matching results
But can be expensive if find many corrections
A heuristics method retains only the most frequent combinations in the collection or in the query logs
In this example, the biword fled fore is likely to be rare compared to the biword flew from
Only attempt to extend the list of top biwords

2.3.4 Phonetic correction
Misspellings arise because the user types a query that sounds like the target term

Phonetic hash
Similar-sounding terms hash to the same value.
Especially applicable to searches on the names of people
The idea owes its origins to work in international police departments from the early 20th century
Now is mainly used to correct phonetic misspellings in proper nouns

Soundex algorithm
Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index
Do the same with query terms
When the query calls for a soundex match, search this soundex index

The approach of soundex algorithm
4-character code
With the first character being a letter of the alphabet
The other three being digits between 0 and 9
For example:
Hermann maps to H655

The assumption of this algorithm
Vowels are viewed as interchangeable in transcribing names
Consonants with similar sounds

Retain the first letter of the term
Change all occurrences of the following letters to ’0’:
’A’, E’, ’I’, ’O’, ’U’, ’H’, ’W’, ’Y’
Change letters to digits as follows:
B, F, P, V to 1
C, G, J, K, Q, S, X, Z to 2
D,T to 3
L to 4
M, N to 5
R to 6

Repeatedly remove one out of each pair of consecutive identical digits
Remove all zeros from the resulting string
Pad the resulting string with trailing zeros or return the first four positions

Such rules tend to be writing system dependent
Chinese names can be written in Wade-Giles or Pinyin transcription while soundex works for some of the differences in the two transcriptions
苏州：Suzhou/Soochow（Soochou）
北京：Beijing/Peking
青岛：Qingdao/Tsingtao
清华：Qinghua/Tsinghwa
中华：Zhonghua/Chonghwa
长江：Changjiang River/Yangtze River

[此贴子已经被作者于2010-12-14 08:52:55编辑过]

主题：[推荐]第三部分课堂讲义之三——容错性检索