Topic: [Recommended] Part 3 Lecture Notes (2): The Design of the Dictionary and Postings Lists

Posted by admin on 2008/3/16 13:52

2 Basic information retrieval

PPT slides download link: (attached file)
2.2 The term vocabulary and postings lists
Document delineation and character sequence decoding
Determining the vocabulary of terms
Skip pointers
Positional postings and phrase queries

Two methods of obtaining documents
Feeder: information repository
Crawler (spider, robot): search engine

2.2.1 Document delineation and character sequence decoding
Obtaining the character sequence in a document
Choosing a document unit
Processing digital documents
For an IR system, the first step in processing a digital document is to convert its byte sequence into a linear sequence of characters
For the case of plain English text in ASCII encoding, this is trivial.
But often things get much more complex

Difficulties
Need to determine the correct encoding
The sequence of characters may be encoded by one of various single-byte or multi-byte encoding schemes, such as Unicode UTF-8, or various national or vendor-specific standards
Even for plain text documents, additional decoding may need to be done. In XML documents, character entities, such as &amp;, need to be decoded to give the correct character, namely & for &amp;

Support a broad range of document types
The characters may have to be decoded out of some binary representation like Microsoft Word DOC files and/or a compressed format such as zip files

The idea that text is a linear sequence of characters is also called into question by some writing systems
Arabic text takes on some two dimensional and mixed order characteristics

But, despite some complicated writing system conventions, there is an underlying sequence of sounds being represented
Hence an essentially linear structure remains, and this is what is represented in the digital representation of Arabic

Choosing a document unit
We have assumed that documents are fixed units for the purposes of indexing

Different document levels
Email file folder, email file, attached file
Going in the opposite direction, PowerPoint (when published to the Web) takes things that you might regard as a single document and splits them into separate HTML pages for each slide or subsection, stored as separate files on the Web. In these cases, you might want to combine multiple files into a single document

For very long documents, the issue of indexing granularity arises
A precision and recall tradeoff for long documents (which can be alleviated by explicit or implicit proximity search)
The need to simultaneously index documents at multiple levels of granularity appears prominently in XML retrieval

2.2.2 Determining the vocabulary of terms
Tokenization
Dropping common terms: stop words
Normalization (equivalence classing of terms)
Stemming and lemmatization

2.2.2.1 Tokenization
Definition of Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation

Input: Friends, Romans, Countrymen, lend me your ears;
Output:
Friends
Romans
Countrymen
Lend
Me
Your
Ears
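As a minimal illustration (not the tokenizer of any particular system), the following Python sketch chops the character sequence into word tokens and discards punctuation, reproducing the output above:

    import re

    def simple_tokenize(text):
        """Chop a character sequence into tokens; punctuation is discarded."""
        # \w+ keeps runs of letters/digits; everything else acts as a separator
        return re.findall(r"\w+", text)

    print(simple_tokenize("Friends, Romans, Countrymen, lend me your ears;"))
    # ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']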

Definition of Token
These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction
A token is an instance of a sequence of characters in some particular document that is grouped together as a useful semantic unit for processing

Definition of Type
A type is the class of all tokens containing the same character sequence

Definition of Term
A term is a (perhaps normalized) type that is included in the IR system’s dictionary
The set of index terms could be entirely distinct from the tokens, for instance, they could be semantic identifiers in a taxonomy
But in practice, in modern IR systems terms are strongly related to the tokens in the document (that is, full-text retrieval)

For example, if the document to be indexed is “ to sleep perchance to dream”, then there are 5 tokens, but only 4 types (since there are 2 instances of to)
However, if to is omitted from the index (as a stop word), then there will be only 3 terms: sleep, perchance, and dream
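A tiny sketch of this count, assuming whitespace tokenization, lowercased types, and a one-word stop list:

    doc = "to sleep perchance to dream"
    tokens = doc.split()                          # instances: 5
    types = set(t.lower() for t in tokens)        # distinct character sequences: 4
    stop_words = {"to"}
    terms = types - stop_words                    # normalized types kept in the dictionary: 3

    print(len(tokens), len(types), len(terms))    # 5 4 3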

The major question of the tokenization
what are the correct tokens to emit?
For the example above, this looks fairly trivial: you chop on white space and throw away punctuation characters
This is a starting point, but even for English there are a number of tricky cases

Apostrophe for possession and contractions
Hyphenation
White space
Combined issues
Language-specific issues

Apostrophe
For example:
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing
For O’Neill, which of the following is the desired tokenization?
neill
oneill
o’neill
o’ neill
o neill
And for aren’t:
aren’t
arent
are n’t
aren t
A simple strategy is to just split on all non-alphanumeric characters, but while “o neill” looks okay, “aren t” looks intuitively bad
For all of them, the choices determine which Boolean queries will match
A query of “neill AND capital” will match in three cases but not the other two.
If no preprocessing of a query is done, then it would match in only one of the five cases

Hyphenation
In English, hyphenation is used for various purposes
Splitting up vowels in words (co-education)
Joining nouns as names (Hewlett-Packard)
Showing word grouping (the hold-him-back-and-drag-him-away maneuver)
It is easy to feel that the first example should be regarded as one token (and is indeed more commonly written as just coeducation)
The middle case is unclear
The last should be separated into words
Handling hyphens automatically can thus be complex
It can either be done as a classification problem
Or more commonly by some heuristic rules, such as allowing short hyphenated prefixes on words, but not longer hyphenated forms

White space
Conceptually, splitting on white space can also split what should be regarded as a single token
This occurs most commonly with names (San Francisco, Los Angeles), borrowed foreign phrases (au fait) and compounds that are sometimes written as a single word and sometimes space separated (such as white space vs. whitespace)
Splitting tokens on spaces can cause bad retrieval results, for example, if a search for York University mainly returns documents containing New York University

Combined issues
The problems of hyphens and non-separating white space can even interact
Advertisements for air fares frequently contain items like San Francisco-Los Angeles, where simply doing white space splitting would give unfortunate results
For example, users may want queries for all of lowercase, lower-case and lower case to return the same results
The last two can be handled by splitting on hyphens and using a phrase index
Getting the first case right would depend on knowing that it is sometimes written as two words and also indexing it in this way
However, many web search engines do not do this
But many professional retrieval systems provide an effective way to handle it
One effective strategy in practice, which is used by some Boolean retrieval systems such as Westlaw and Lexis-Nexis(律商联讯), is to encourage users to enter hyphens wherever they may be possible, and whenever there is a hyphenated form, the system will generalize the query to cover all three of the one word, hyphenated, and two word forms
So that a query for over-eager will search for over-eager OR “over eager” OR overeager
However, this strategy depends on user training, since if you query using either of the other two forms, you get no generalization
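A sketch of the query generalization described above (the actual Westlaw/Lexis-Nexis implementations are not public; this simply expands a hyphenated query term into its three surface variants):

    def expand_hyphenated(query_term):
        """Expand a hyphenated query term into its three surface variants."""
        if "-" not in query_term:
            return query_term
        one_word = query_term.replace("-", "")
        two_words = query_term.replace("-", " ")
        return f'{query_term} OR "{two_words}" OR {one_word}'

    print(expand_hyphenated("over-eager"))
    # over-eager OR "over eager" OR overeager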

Language-specific issues
These issues of tokenization are language-specific. It thus requires the language of the document to be known
Most languages have distinctive signature patterns
For most languages and particular domains within them there are unusual specific tokens that we wish to recognize as terms, such as
The programming languages C++ and C#
Aircraft names like B-52
A T.V. show name such as M*A*S*H
Computer technology has introduced new types of character sequences that a tokenizer should probably tokenize as a single token, including
Email addresses
Web URLs
Numeric IP addresses
Package tracking numbers
And more
One possible solution is simply to omit such tokens from the index
However, this comes at a large cost in restricting what people can search for
Each new language presents some new issues
French has a variant use of the apostrophe for a reduced definite article ‘the’ before a word beginning with a vowel (e.g., l’ensemble) and has some uses of the hyphen with postposed clitic pronouns in imperatives and questions (e.g., donne-moi ‘give me’)
You would want documents mentioning both l’ensemble and un ensemble to be indexed under ensemble
German writes compound nouns without spaces
Computerlinguistik ‘computational linguistics’
Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’
But retrieval systems for German greatly benefit from the use of a compound-splitter module, which is usually implemented by seeing if a word can be subdivided into multiple words that appear in a vocabulary
This phenomenon reaches its limit case with major East Asian Languages (e.g., Chinese, Taiwanese, Japanese, Korean, and Thai), where text is written without any spaces between words
Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words
The other approach is to abandon word-based indexing and to do all indexing via short subsequences of characters (character k-grams) or single Chinese characters, regardless of whether particular sequences cross word boundaries or not
Three reasons why the approach abandoning word-based indexing is appealing:
An individual Chinese character is more like a syllable than a letter and usually has some semantic content
Most words are short (the commonest length is 2 characters)
Given the lack of standardization of word breaking in the writing system, it is not always clear where word boundaries should be placed anyway
Even in English, some cases of where to put word boundaries are just orthographic conventions – think of notwithstanding vs. not to mention or into vs. on to – but people are educated to write the words with consistent use of spaces
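A sketch of the second approach above, indexing overlapping character k-grams instead of segmented words (character 2-grams are a common choice for Chinese; the example string is just an illustration):

    def char_kgrams(text, k=2):
        """Index by overlapping character k-grams, ignoring word boundaries."""
        chars = [c for c in text if not c.isspace()]
        if len(chars) < k:
            return ["".join(chars)] if chars else []
        return ["".join(chars[i:i + k]) for i in range(len(chars) - k + 1)]

    print(char_kgrams("南京师范大学"))
    # ['南京', '京师', '师范', '范大', '大学']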

Suggestion of the tokenization
1) For either Boolean or free text queries, you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer
This guarantees that a sequence of characters in a text will always match the same sequence typed in a query
2) The Boolean case is more complex: this tokenization may produce multiple terms from one query word. This can be handled by combining the terms with an AND or as a phrase query

2.2.2.2 Dropping common terms: stop words
The definition of stop words
Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely
These words are called stop words

The definition of stop list
The general strategy for determining a stop list:
To sort the terms by collection frequency (the total number of times each term appears in the document collection)
To take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing
Using a stop list significantly reduces the number of postings that a system has to store
Not indexing stop words does little harm
However, this is not true for phrase searches
The phrase query “President of the United States”, which contains two stop words, is more precise than President AND “United States”
The meaning of flights to London is likely to be lost if the word to is stopped out
A search for Vannevar Bush’s article As we may think will be difficult if the first three words are stopped out, and the system searches simply for documents containing the word think
Some song titles and well known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don’t want to be, . . . )
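A sketch of the stop-list construction strategy above (sort terms by collection frequency and take the most frequent ones as candidates; real lists are then hand-filtered for semantic content):

    from collections import Counter

    def candidate_stop_list(documents, n=25):
        """Return the n terms with the highest collection frequency."""
        cf = Counter()
        for doc in documents:
            cf.update(doc.lower().split())   # crude tokenization, enough for the sketch
        return [term for term, _ in cf.most_common(n)]

    docs = ["to be or not to be", "let it be", "as we may think"]
    print(candidate_stop_list(docs, n=5))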

The stop list in Web search engines
Web search engines generally do not use stop lists
Can exploit the statistics of language so as to be able to cope with common words in better ways
Good compression techniques greatly reduce the cost of storing the postings for common words
Modern IR systems with impact-sorted indexes can terminate scanning a postings list early when weights get small, and hence common words do not cause a large additional processing cost for the average query, even though postings lists for stop words are very long
So for most modern IR systems, such as web search engines, the additional cost of including stop words is not that big
Neither in terms of index size
Nor in terms of query processing time

2.2.2.3 Normalization
Also known as equivalence classing of terms
Often implemented in one of two ways:
Create equivalence classes
Maintain relations between unnormalized tokens

The definition of normalization
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
Question
There are many cases when two character sequences are not quite the same but you would like a match to occur
For instance, if you search for USA, you might hope to also match documents containing U.S.A.

Create equivalence classes
The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set
For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the latter, then searches for one term will retrieve documents that contain either
The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance
If two words differ substantially in surface form, this method does not work well
Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory
So this method is of limited power
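A sketch of implicit equivalence classing through such mapping rules (here only hyphens and periods are deleted; real rule sets are richer):

    def normalize(token):
        """Delete hyphens and periods so surface variants fall into one class."""
        return token.replace("-", "").replace(".", "")

    for t in ["anti-discriminatory", "antidiscriminatory", "U.S.A.", "USA"]:
        print(t, "->", normalize(t))
    # anti-discriminatory and antidiscriminatory both map to antidiscriminatory
    # U.S.A. and USA both map to USA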

Maintain relations between unnormalized tokens
This method can be extended to hand-constructed lists of synonyms such as car and automobile
These term relationships can be achieved in two ways
The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term. A query term is then effectively a disjunction of several postings lists
The alternative is to perform the expansion during index construction. When the document contains automobile, we index it under car as well (and, usually, also vice-versa)
Use of either of these methods is considerably less efficient than equivalence classing, as there are more postings to store and merge
The first method adds a query expansion dictionary and requires more processing at query time
The second method requires more space for storing postings.
Traditionally, expanding the space required for the postings lists was seen as more disadvantageous, but with modern storage costs, the increased flexibility that comes from distinct postings lists is appealing
These approaches are more flexible than equivalence classes because the expansion lists can overlap while not being identical
This means there can be an asymmetry in expansion
An example:
Query      Terms in documents that should be matched
Windows    Windows
windows    Windows, windows, window
window     window, windows
If the user enters windows, we wish to allow matches with the capitalized Windows operating system, but this is not plausible if the user enters window, even though it is plausible for this query to also match lowercase windows
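A sketch of the asymmetric expansion-list approach in the table above, applied at query time (the expansion dictionary here is hand-constructed for illustration):

    # Hand-constructed, asymmetric expansion lists: query term -> terms to match
    EXPANSION = {
        "Windows": ["Windows"],
        "windows": ["Windows", "windows", "window"],
        "window":  ["window", "windows"],
    }

    def expand_query_term(term):
        """A query term becomes a disjunction over its expansion list."""
        return EXPANSION.get(term, [term])

    print(expand_query_term("windows"))   # ['Windows', 'windows', 'window']
    print(expand_query_term("window"))    # ['window', 'windows']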

The comparison of two methods
The best amount of equivalence classing or query expansion to do is a fairly open question
For instance, equivalence-classing U.S.A. and USA to the latter by deleting periods from tokens might at first seem very reasonable
However, if a user enters C.A.T. as a query term, they might be rather upset if it matches every occurrence of the word cat in documents
In many cases they seem helpful, but they can also do harm

Common methods to normalization
Accents and diacritics
Capitalization/case-folding
Other issues in English
Issues in other languages

Accents and diacritics
This can be done by normalizing tokens to remove diacritics
cliché and cliche
naive and naïve
Occasionally words are distinguished only by their accents
In Spanish, peña is ‘a cliff’, while pena is ‘sorrow’
Nevertheless, the important question is usually not prescriptive or linguistic but is a question of how users are likely to write queries for these words
In many cases, users will enter queries for words without diacritics, whether for reasons of speed, laziness, limited software, or habits born of the days when it was hard to use non-ASCII text on many computer systems
In these cases, it might be best to equate all words to a form without diacritics
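A sketch of diacritic removal using Unicode decomposition (standard library only; combining accent marks are dropped after NFKD decomposition):

    import unicodedata

    def strip_diacritics(token):
        """Decompose characters and drop the combining accent marks."""
        decomposed = unicodedata.normalize("NFKD", token)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(strip_diacritics("naïve"))   # naive
    print(strip_diacritics("peña"))    # pena
    print(strip_diacritics("cliché"))  # cliche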

Capitalization/case-folding
A common strategy is to do case-folding by reducing all letters to lower case
It will also help on a web search engine when most of your users type in lowercase
On the other hand, such case folding can equate words that might better be kept apart
Many proper nouns are derived from common nouns and so are distinguished only by case, including companies (General Motors, The Associated Press), government organizations (the Fed vs. fed), person names (Bush, Black) and so on (it, IT)
For English, an alternative to making every token lowercase is to just make some tokens lowercase
The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase
Mid-sentence capitalized words are left as capitalized
Trying to get capitalization right in this way probably doesn’t help if your users usually use lowercase regardless of the correct case of words
Thus, lowercasing everything often remains the most practical solution, especially for Web search engines
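For completeness, a sketch of the selective case-folding heuristic mentioned above (lowercase sentence-initial tokens and everything in all-uppercase titles, keep mid-sentence capitalized tokens as they are):

    def selective_case_fold(tokens, title_is_all_caps=False):
        """Lowercase sentence-initial tokens (and all tokens of all-caps titles);
        leave mid-sentence capitalized tokens untouched."""
        folded = []
        sentence_start = True
        for tok in tokens:
            if title_is_all_caps or sentence_start:
                folded.append(tok.lower())
            else:
                folded.append(tok)
            sentence_start = tok in {".", "!", "?"}
        return folded

    print(selective_case_fold(
        ["General", "Motors", "recalled", "cars", ".", "The", "Fed", "met", "."]))
    # ['general', 'Motors', 'recalled', 'cars', '.', 'the', 'Fed', 'met', '.']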

Other issues in English
For instance, you might wish to equate ne’er and never or the British spelling colour and the American spelling color
Dates, times and similar items come in multiple formats presenting additional challenges
Synonym search in Google
The ~ operator asks Google to automatically expand a query term with its synonyms
This is not dictionary-based expansion but algorithmic expansion
Date search in Google
Dates in various formats can be recognized to a certain extent

Issues in other languages
English has maintained a dominant position on the WWW
Approximately 60% of web pages are in English (Gerrand 2007)
But that still leaves 40% of the web, and the non-English portion might be expected to grow over time, since less than one third of Internet users and less than 10% of the world’s population primarily speak English
And there are signs of change: Sifry (2007) reports that only about one third of blog posts are in English
When document collections contain multiple languages, a single index may have to contain terms of several languages
One option is to run a language identification classifier on documents and then to tag terms in the vocabulary for their language
When dealing with foreign or complex words, particularly foreign names, the spelling may be unclear or there may be variant transliteration standards giving different spellings (Beijing and Peking)
One way of dealing with this is to use heuristics to equivalence class or expand terms with phonetic equivalents. The traditional and best known such algorithm is the Soundex algorithm
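A simplified sketch of the Soundex code (keep the first letter, map remaining letters to consonant-class digits, drop vowels, collapse repeats, pad to four characters); full implementations add extra rules for h and w:

    SOUNDEX_DIGITS = {
        **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6",
    }

    def soundex(word):
        """Simplified Soundex: first letter + up to three consonant-class digits."""
        word = word.lower()
        first = word[0].upper()
        digits = [SOUNDEX_DIGITS.get(c, "") for c in word]
        code = ""
        prev = digits[0]
        for d in digits[1:]:
            if d and d != prev:      # skip vowels/ignored letters and repeated digits
                code += d
            prev = d
        return (first + code + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))   # R163 R163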

2.2.2.4 Stemming and lemmatization
In many situations, it seems as if it would be useful for a search for one of many words to return documents that contain another word in this set
organize
organizes
organizing

The definition of stemming
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes
cars → car

The definition of lemmatization
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma
saw → see or saw (depending on whether the token is a verb or a noun)

The goal of both methods
Reduce inflectional forms and sometimes derivationally related forms of a word to a common base form
am, are, is → be
car, cars, car’s, cars’ → car
The result of this mapping of text will be something like:
the boy’s cars are different colors → the boy car be differ color

Tools
Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process
A number of such components exist, both commercial and open-source

Stemming tools: stemmers
Porter’s algorithm (1980)
One-pass Lovins stemmer (1968)
Paice/Husk stemmer (1990)

Porter’s algorithm
The most common algorithm for stemming English
Porter’s algorithm consists of 5 phases of word reductions, applied sequentially
Within each phase there are various conventions to select rules
Sample text:
Such an analysis can reveal features that are not easily visible from the variations in the individual genes
Porter stemmer:
Such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene
Ref: http://tartarus.org/~martin/PorterStemmer/
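To give a flavor of such rule-based stemming, here is a sketch of just step 1a of Porter's algorithm (the real algorithm applies five phases of such rules; see the reference above):

    def porter_step_1a(word):
        """Step 1a of Porter's algorithm: SSES->SS, IES->I, SS->SS, S->(drop)."""
        if word.endswith("sses"):
            return word[:-2]
        if word.endswith("ies"):
            return word[:-2]
        if word.endswith("ss"):
            return word
        if word.endswith("s"):
            return word[:-1]
        return word

    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", porter_step_1a(w))
    # caresses -> caress, ponies -> poni, caress -> caress, cats -> cat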

Characteristics of stemmer
Use language-specific rules
Require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words
Particular domains may also require special stemming rules
The exact stemmed form does not matter, only the equivalence classes it forms

Characteristics of lemmatizer
A tool from Natural Language Processing which does full morphological analysis to accurately identify the lemma for each word

Dilemma
Stemming suits languages with much richer morphology
such as Spanish, German, and Finnish
Results in the European CLEF evaluations have repeatedly shown quite large gains from the use of stemmers
The same holds for compound splitting for languages like German
But stemming tends not to improve English information retrieval performance in aggregate
While it helps a lot for some queries, it equally hurts performance a lot for others
It increases recall while harming precision
For example, a document containing “operate and system” is a poor match for the query “operating AND system”, yet stemming equates them

2.2.3 Skip pointers
A way to increase the efficiency of using postings lists
The baseline merge walks through the two postings lists simultaneously, in time linear in the total number of postings entries
Can we do better than this?

Skip list
One way to do this is to use a skip list by augmenting postings lists with skip pointers at indexing time
Skip pointers are effectively shortcuts that allow us to avoid processing parts of the postings list that will not figure in the search results
The questions are
Where to place skip pointers
How to do efficient merging using skip pointers
It is a tradeoff
More skips
It means shorter skip spans, and that we are more likely to skip
But it also means lots of comparisons to skip pointers, and lots of space storing skip pointers
Fewer skips
It means few pointer comparisons
But it also means that there will be fewer opportunities to skip
A simple heuristic for placing skips, which has been found to work well in practice
For a postings list of length P, use √P evenly-spaced skip pointers
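A sketch of postings-list intersection with skip pointers under the √P placement heuristic (a simplified version of the textbook's IntersectWithSkips idea; skip pointers are stored here as index jumps):

    import math

    def add_skips(postings):
        """Place about sqrt(P) evenly spaced skip pointers.
        skips[i] = the index that the skip pointer at position i jumps to."""
        P = len(postings)
        span = max(int(math.sqrt(P)), 1)
        return {i: min(i + span, P - 1)
                for i in range(0, P, span)
                if min(i + span, P - 1) > i}

    def intersect_with_skips(p1, p2):
        """Intersect two sorted docID lists, following skips when they do not overshoot."""
        skips1, skips2 = add_skips(p1), add_skips(p2)
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                if i in skips1 and p1[skips1[i]] <= p2[j]:
                    while i in skips1 and p1[skips1[i]] <= p2[j]:
                        i = skips1[i]
                else:
                    i += 1
            else:
                if j in skips2 and p2[skips2[j]] <= p1[i]:
                    while j in skips2 and p2[skips2[j]] <= p1[i]:
                        j = skips2[j]
                else:
                    j += 1
        return answer

    print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                               [1, 2, 3, 5, 8, 41, 51, 60, 71]))
    # [2, 8]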

Characteristics of skip list
When one operand is the intermediate result of a complex query, the call hasSkip(p) will always return false; that is, intermediate results do not carry skip pointers
Presence of skip pointers only helps for AND queries, not for OR queries
It is harder if a postings list keeps changing because of updates

2.2.4 Positional postings and phrase queries
Biword indexes
Positional indexes
Combination schemes

As many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names), entered without use of double quotes
To support such queries, traditional postings lists that record only docIDs are no longer sufficient
A search engine should not only support phrase queries, but implement them efficiently
A more complex concept is term proximity weighting, where a document is preferred to the extent that the query terms appear close to each other in the text

2.2.4.1 Biword indexes
Example
For example the text
“NanJing Normal University”
It would generate the biwords:
NanJing Normal
Normal University
Longer phrases can be processed by breaking them down
For example the text
“stanford university palo alto”
It would generate the biwords:
“stanford university” AND “university palo” AND
“palo alto”
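A sketch of biword generation and of rewriting a longer phrase query as an AND of its biwords:

    def biwords(tokens):
        """Consecutive token pairs, each treated as a vocabulary term."""
        return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

    print(biwords("NanJing Normal University".split()))
    # ['NanJing Normal', 'Normal University']

    def phrase_to_biword_query(phrase):
        """Rewrite a longer phrase query as a Boolean AND over its biwords."""
        return " AND ".join(f'"{bw}"' for bw in biwords(phrase.split()))

    print(phrase_to_biword_query("stanford university palo alto"))
    # "stanford university" AND "university palo" AND "palo alto"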

Challenge
This query could be expected to work fairly well in practice
But there can and will be occasional false positives
False positives
False negatives
Because related nouns can often be divided from each other by various function words

Biword indexes with part-of-speech-tagging
Tokenize the text and perform part-of-speech tagging
Group terms into nouns, including proper nouns, (N) and function words, including articles and prepositions, (X), among other classes
Deem any string of terms of the form NX*N to be an extended biword
Each such extended biword is made a term in the vocabulary
For example:
renegotiation (N)   of (X)   the (X)   constitution (N)
Queries also need to be parsed into Ns and Xs in the same way
Not always work in an intuitively optimal manner when parsing longer queries into Boolean queries
For example:
cost overruns on a power plant
is parsed into
“cost overruns” AND “overruns power” AND “power plant”
The middle biword (“overruns power”) is not a desired phrase and would better be omitted
Many fairly accurate (c. 96% per-tag accuracy) part-of-speech taggers now exist, usually trained by machine learning methods on hand-tagged text (Manning and Schütze, 1999)
Indeed, searches for a single term are not naturally handled in a biword index
So we also need to have an index of single-word terms
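Returning to the extended-biword idea above, a minimal sketch of extracting N X* N patterns from a POS-tagged token sequence (the tagger itself is assumed to be an external component):

    def extended_biwords(tagged_tokens):
        """Extract N X* N spans from (token, tag) pairs, tags being 'N' or 'X'."""
        result = []
        last_noun = None
        for token, tag in tagged_tokens:
            if tag == "N":
                if last_noun is not None:
                    result.append(f"{last_noun} {token}")
                last_noun = token
            # an 'X' (function word) is skipped but keeps the previous noun alive
        return result

    print(extended_biwords([("renegotiation", "N"), ("of", "X"),
                            ("the", "X"), ("constitution", "N")]))
    # ['renegotiation constitution']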

Phrase index
The concept of a biword index can be extended to longer sequences of words
If the index includes variable length word sequences, it is generally referred to as a phrase index
While there is always a chance of false positive matches, the chance of a false positive match on indexed phrases of length 3 or more becomes very small indeed

The disadvantage of phrase index
Storing longer phrases has the potential to greatly expand the vocabulary size
Using a partial phrase index in a compound indexing scheme is a better idea

2.2.4.2 Positional indexes
For each term in the vocabulary, we store postings of the form docID: ⟨position1, position2, …⟩
Example
Suppose the query is “to be or not to be”
The postings lists to access are: to, be, or, not
We will examine intersecting the postings lists for to and be
We first look for documents that contain both terms. Then, we look for places in the lists where there is an occurrence of be with a token index one higher than a position of to, and then we look for another occurrence of each word with token index 4 higher than the first occurrence
In the above lists, the pattern of occurrences that is a possible match is:
to: ⟨…; 4: ⟨…, 429, 433⟩; …⟩
be: ⟨…; 4: ⟨…, 430, 434⟩; …⟩
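A sketch of checking a two-word phrase match against a positional index: intersect the docIDs first, then look for positions of the second term exactly one higher than a position of the first:

    def phrase_match(postings1, postings2):
        """postings are dicts docID -> sorted list of positions; return the docs
        where term2 occurs at a position exactly one higher than term1."""
        matches = {}
        for doc in postings1.keys() & postings2.keys():   # docID intersection
            pos2 = set(postings2[doc])
            hits = [p for p in postings1[doc] if p + 1 in pos2]
            if hits:
                matches[doc] = hits
        return matches

    to_postings = {4: [429, 433]}
    be_postings = {4: [430, 434]}
    print(phrase_match(to_postings, be_postings))   # {4: [429, 433]}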

k word proximity search
The same general method can be applied for within k word proximity searches

For example:
employment /3 place
Here, /k means “within k words of (on either side)”. Clearly, positional indexes can be used for such queries
biword indexes cannot
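A sketch of the /k proximity check on positional postings for a single document (this is exactly what a positional index enables and a biword index cannot answer):

    def within_k(positions1, positions2, k):
        """True if some occurrence of term1 is within k tokens of some occurrence
        of term2 (on either side), given position lists for one document."""
        return any(abs(p1 - p2) <= k for p1 in positions1 for p2 in positions2)

    # employment at positions [7, 42], place at positions [10, 90]: within /3 ?
    print(within_k([7, 42], [10, 90], 3))   # True, because |7 - 10| <= 3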

The disadvantage of positional indexes
Adopting a positional index expands required postings storage significantly
A positional index can be 2 to 4 times as large as a non-positional index
Even a compressed positional index can be about one third to one half the size of the raw text

2.2.4.3 Combination schemes
If users commonly query on particular phrases, such as Michael Jackson, it is quite inefficient to keep merging positional postings lists
A combination strategy uses a phrase index, or just a biword index, for certain queries and uses a positional index for other phrase queries

Combination strategy
Good queries to include in the phrase index are ones known to be common based on recent querying behavior
The most valuable phrases are ones where the individual words are common but the desired phrase is comparatively rare
“The Who” should be indexed as a phrase, whereas “Britney Spears” need not be
