-- Author: admin
-- Posted: 2008/3/16 13:52:09
-- [Recommended] Part 3 Lecture Notes (2): Design of the Dictionary and Postings Lists
2 Basic information retrieval
PPT slides download link:
2.2 The term vocabulary and postings lists
Document delineation and character sequence decoding
Determining the vocabulary of terms
Skip pointers
Positional postings and phrase queries
Two methods of getting documents:
Feeder: for an information repository
Crawler (spider, robot): for a search engine
2.2.1 Document delineation and character sequence decoding
Obtaining the character sequence in a document; choosing a document unit
Processing digital documents: for an IR system, the first step in processing a digital document is to convert its byte sequence into a linear sequence of characters. For plain English text in ASCII encoding this is trivial, but often things get much more complex.
Difficulties
Need to determine the correct encoding: the sequence of characters may be encoded by one of various single-byte or multi-byte encoding schemes, such as Unicode UTF-8, or various national or vendor-specific standards.
Even for plain text documents, additional decoding may need to be done. In XML documents, character entities such as &amp; need to be decoded to give the correct character, namely & for &amp;.
Support a broad range of document types: the characters may have to be decoded out of some binary representation, such as Microsoft Word DOC files, and/or a compressed format such as zip files.
The idea that text is a linear sequence of characters is also called into question by some writing systems: Arabic text takes on some two-dimensional and mixed-order characteristics.
But despite some complicated writing-system conventions, there is an underlying sequence of sounds being represented. Hence an essentially linear structure remains, and this is what is represented in the digital representation of Arabic.
Choosing a document unit We have assumed that documents are fixed units for the purposes of indexing
Different document levels: email file folder, email file, attached file.
Going in the opposite direction, tools such as PowerPoint split things that you might regard as a single document into separate HTML pages for each slide or subsection, stored as separate files on the Web. In these cases, you might want to combine multiple files into a single document.
For very long documents, the issue of indexing granularity arises:
A precision/recall tradeoff for long documents (which can be alleviated by explicit or implicit proximity search)
A need to simultaneously index documents at multiple levels of granularity, which appears prominently in XML retrieval
2.2.2 Determining the vocabulary of terms
Tokenization
Dropping common terms: stop words
Normalization (equivalence classing of terms)
Stemming and lemmatization
2.2.2.1 Tokenization
Definition of tokenization: given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears
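The baseline "chop on white space and throw away punctuation" tokenizer mentioned below can be sketched in a few lines (the function name and the letters-only regex are illustrative choices, not from the notes):

```python
import re

def tokenize(text):
    # Naive baseline: keep maximal runs of letters, discarding
    # whitespace and punctuation. Real tokenizers must handle
    # apostrophes, hyphens, numbers, URLs, etc. (see below).
    return re.findall(r"[A-Za-z]+", text)

print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
```

This reproduces the example above, but note how it already makes silent decisions (e.g., it would split aren't into aren and t).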
Definition of Token These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that is grouped together as a useful semantic unit for processing.
Definition of Type A type is the class of all tokens containing the same character sequence
Definition of Term A term is a (perhaps normalized) type that is included in the IR system's dictionary. The set of index terms could be entirely distinct from the tokens; for instance, they could be semantic identifiers in a taxonomy. But in practice, in modern IR systems, terms are strongly related to the tokens in the document (that is, full-text retrieval).
For example, if the document to be indexed is "to sleep perchance to dream", then there are 5 tokens but only 4 types (since there are 2 instances of to). However, if to is omitted from the index (as a stop word), then there will be only 3 terms: sleep, perchance, and dream.
The major question of tokenization is: what are the correct tokens to emit? For the example above, this looks fairly trivial: you chop on white space and throw away punctuation characters. This is a starting point, but even for English there are a number of tricky cases.
Apostrophes for possession and contractions
Hyphenation
White space
Combined issues
Language-specific issues
Apostrophes
For example: Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
For O'Neill, which of the following is the desired tokenization? neill, oneill, o'neill, o' neill, o neill
And for aren't: aren't, arent, are n't, aren t
A simple strategy is to just split on all non-alphanumeric characters, but while "o neill" looks okay, "aren t" looks intuitively bad.
For all of them, the choices determine which Boolean queries will match. A query of neill AND capital will match in three cases but not the other two. If no preprocessing of a query is done, then it would match in only one of the five cases.
Hyphenation
In English, hyphenation is used for various purposes:
Splitting up vowels in words (co-education)
Joining nouns as names (Hewlett-Packard)
Showing word grouping (the hold-him-back-and-drag-him-away maneuver)
It is easy to feel that the first example should be regarded as one token (and is indeed more commonly written as just coeducation); the middle case is unclear; the last should be separated into words.
Handling hyphens automatically can thus be complex. It can either be done as a classification problem, or more commonly by some heuristic rules, such as allowing short hyphenated prefixes on words, but not longer hyphenated forms.
White space
Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles), borrowed foreign phrases (au fait), and compounds that are sometimes written as a single word and sometimes space separated (such as white space vs. whitespace).
Splitting tokens on spaces can cause bad retrieval results; for example, a search for York University mainly returns documents containing New York University.
Combined issues
The problems of hyphens and non-separating white space can even interact. Advertisements for air fares frequently contain items like San Francisco-Los Angeles, where simply doing white space splitting would give unfortunate results.
We might like queries for all of lowercase, lower-case, and lower case to return the same results. The last two can be handled by splitting on hyphens and using a phrase index; getting the first case right would depend on knowing that it is sometimes written as two words and also indexing it in this way.
Many web search engines do not do this, but many professional retrieval systems provide an effective way to handle it. One effective strategy in practice, used by some Boolean retrieval systems such as Westlaw and LexisNexis, is to encourage users to enter hyphens wherever they may be possible; whenever there is a hyphenated form, the system generalizes the query to cover all three of the one-word, hyphenated, and two-word forms. So a query for over-eager will search for over-eager OR "over eager" OR overeager.
However, this strategy depends on user training: if you query using either of the other two forms, you get no generalization.
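The hyphen-generalization trick used by these Boolean systems can be sketched as a tiny query rewriter (a simplified illustration; the real systems' query syntax may differ):

```python
def expand_hyphenated(term):
    # Generalize a hyphenated query term to cover the one-word,
    # hyphenated, and two-word forms, joined by OR.
    if "-" not in term:
        return term
    spaced = term.replace("-", " ")   # two-word form, as a phrase
    joined = term.replace("-", "")    # one-word form
    return f'{term} OR "{spaced}" OR {joined}'

print(expand_hyphenated("over-eager"))
```

As the notes say, the rewrite only fires when the user actually types the hyphen; the other two input forms get no generalization.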
Language-specific issues
These issues of tokenization are language-specific; they thus require the language of the document to be known. Most languages have distinctive signature patterns.
For most languages, and particular domains within them, there are unusual specific tokens that we wish to recognize as terms, such as:
The programming languages C++ and C#
Aircraft names like B-52
A TV show name such as M*A*S*H
Computer technology has introduced new types of character sequences that a tokenizer should probably tokenize as a single token, including email addresses, web URLs, numeric IP addresses, package tracking numbers, and more.
One possible solution is to omit such tokens from indexing. However, this comes at a large cost in restricting what people can search for.
Each new language presents some new issues. French has a variant use of the apostrophe for a reduced definite article 'the' before a word beginning with a vowel (e.g., l'ensemble), and has some uses of the hyphen with postposed clitic pronouns in imperatives and questions (e.g., donne-moi 'give me'). You would want documents mentioning both l'ensemble and un ensemble to be indexed under ensemble.
German writes compound nouns without spaces: Computerlinguistik 'computational linguistics', Lebensversicherungsgesellschaftsangestellter 'life insurance company employee'. Retrieval systems for German greatly benefit from the use of a compound-splitter module, which is usually implemented by seeing if a word can be subdivided into multiple words that appear in a vocabulary.
This phenomenon reaches its limit case with major East Asian languages (e.g., Chinese, Japanese, Korean, and Thai), where text is written without any spaces between words. Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match, with some heuristics for unknown words, to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words.
The other approach is to abandon word-based indexing and to do all indexing via just short subsequences of characters (character k-grams), or a single-Chinese-character index, regardless of whether particular sequences cross word boundaries or not.
Three reasons why abandoning word-based indexing is appealing:
An individual Chinese character is more like a syllable than a letter and usually has some semantic content
Most words are short (the commonest length is 2 characters)
Given the lack of standardization of word breaking in the writing system, it is not always clear where word boundaries should be placed anyway
Even in English, some cases of where to put word boundaries are just orthographic conventions (think of notwithstanding vs. not to mention, or into vs. on to), but people are educated to write the words with consistent use of spaces.
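The character k-gram alternative described above is simple enough to show directly (a minimal sketch; the choice of k = 2 follows the note that most Chinese words are two characters long):

```python
def char_kgrams(text, k=2):
    # Index short overlapping character subsequences instead of
    # words, regardless of whether they cross word boundaries.
    return [text[i:i + k] for i in range(len(text) - k + 1)]

print(char_kgrams("信息检索", 2))  # overlapping character bigrams
```

No segmenter or vocabulary is needed, which is exactly why this approach is attractive for scripts written without spaces.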
Suggestions for tokenization
1) For either Boolean or free text queries, you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer. This guarantees that a sequence of characters in a text will always match the same sequence typed in a query.
2) The Boolean case is more complex: this tokenization may produce multiple terms from one query word. This can be handled by combining the terms with an AND or as a phrase query.
2.2.2.2 Dropping common terms: stop words
The definition of stop words: sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.
The definition of a stop list
The general strategy for determining a stop list:
Sort the terms by collection frequency (the total number of times each term appears in the document collection)
Take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing
Using a stop list significantly reduces the number of postings that a system has to store, and not indexing stop words does little harm. However, this is not true for phrase searches:
The phrase query "President of the United States", which contains two stop words, is more precise than President AND "United States"
The meaning of flights to London is likely to be lost if the word to is stopped out
A search for Vannevar Bush's article As we may think will be difficult if the first three words are stopped out and the system searches simply for documents containing the word think
Some song titles and well-known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don't want to be, ...)
Stop lists in web search engines
Web search engines generally do not use stop lists. They can exploit the statistics of language so as to cope with common words in better ways:
Good compression techniques greatly reduce the cost of storing the postings for common words
Modern IR systems with impact-sorted indexes can terminate scanning a postings list early when weights get small, so common words do not cause a large additional processing cost for the average query, even though postings lists for stop words are very long
So for most modern IR systems, such as web search engines, the additional cost of including stop words is not that big, neither in terms of index size nor in terms of query processing time.
2.2.2.3 Normalization
Normalization is equivalence classing of terms. It is often implemented in two ways:
Creating equivalence classes
Maintaining relations between unnormalized tokens
The definition of normalization: token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
The question: there are many cases when two character sequences are not quite the same but you would like a match to occur. For instance, if you search for USA, you might hope to also match documents containing U.S.A.
Creating equivalence classes
The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the latter, then searches for one term will retrieve documents that contain either.
The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing is implicit, rather than being fully calculated in advance.
If two words differ markedly at the surface level, this method does not work well. Since the equivalence classes are implicit, it is not obvious when you might want to add characters; for instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory. So this method is limited in power.
Maintaining relations between unnormalized tokens
This method can be extended to hand-constructed lists of synonyms such as car and automobile. These term relationships can be achieved in two ways:
The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term. A query term is then effectively a disjunction of several postings lists.
The alternative is to perform the expansion during index construction. When the document contains automobile, we index it under car as well (and, usually, also vice versa).
Use of either of these methods is considerably less efficient than equivalence classing, as there are more postings to store and merge. The first method adds a query expansion dictionary and requires more processing at query time; the second requires more space for storing postings. Traditionally, expanding the space required for the postings lists was seen as more disadvantageous, but with modern storage costs, the increased flexibility that comes from distinct postings lists is appealing.
These approaches are more flexible than equivalence classes because the expansion lists can overlap while not being identical. This means there can be an asymmetry in expansion. An example:
Query: Windows    Terms in documents that should be matched: Windows
Query: windows    Terms in documents that should be matched: Windows, windows, window
Query: window     Terms in documents that should be matched: window, windows
If the user enters windows, we wish to allow matches with the capitalized Windows operating system, but this is not plausible if the user enters window, even though it is plausible for this query to also match lowercase windows.
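The asymmetric Windows/windows/window example above can be expressed as a hand-constructed expansion table (a toy sketch of the first method: unnormalized index terms plus a query expansion list):

```python
# Hand-constructed, deliberately asymmetric expansion table:
# each query term maps to the index terms it should match.
EXPANSIONS = {
    "Windows": ["Windows"],
    "windows": ["Windows", "windows", "window"],
    "window":  ["window", "windows"],
}

def expand_query(term):
    # A query term becomes a disjunction of several postings lists;
    # unlisted terms simply match themselves.
    return EXPANSIONS.get(term, [term])

print(expand_query("windows"))
```

Note that "window" does not expand to "Windows": the table encodes exactly the asymmetry that implicit equivalence classes cannot.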
The comparison of the two methods
The best amount of equivalence classing or query expansion to do is a fairly open question. For instance, equivalence-classing U.S.A. and USA to the latter by deleting periods from tokens might at first seem very reasonable. However, if a user enters C.A.T. as a query term, he might be rather upset if it matches every appearance of the word cat in documents. In many cases these techniques seem helpful, but they can also do harm.
Common normalization issues
Accents and diacritics
Capitalization/case-folding
Other issues in English
Issues in other languages
Accents and diacritics
This can be done by normalizing tokens to remove diacritics: cliché and cliche; naive and naïve.
Occasionally words are distinguished only by their accents: in Spanish, peña is 'a cliff', while pena is 'sorrow'.
Nevertheless, the important question is usually not prescriptive or linguistic, but how users are likely to write queries for these words. In many cases, users will enter queries for words without diacritics, whether for reasons of speed, laziness, limited software, or habits born of the days when it was hard to use non-ASCII text on many computer systems. In these cases, it might be best to equate all words to a form without diacritics.
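One common way to implement diacritic removal (a sketch, using Unicode decomposition rather than a hand-built character table) is:

```python
import unicodedata

def remove_diacritics(s):
    # Decompose to NFD so each accented letter becomes a base
    # letter plus combining marks, then drop the combining marks.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

print(remove_diacritics("cliché"), remove_diacritics("naïve"))
```

As the notes warn, this conflates Spanish peña and pena, so it is a tradeoff driven by user query behavior, not linguistics.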
Capitalization/case-folding
A common strategy is to do case-folding by reducing all letters to lower case. This also helps on a web search engine, where most users type in lowercase.
On the other hand, such case-folding can equate words that might better be kept apart. Many proper nouns are derived from common nouns and so are distinguished only by case, including companies (General Motors, The Associated Press), government organizations (the Fed vs. fed), and person names (Bush, Black), and so on (it vs. IT).
For English, an alternative to making every token lowercase is to just make some tokens lowercase. The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase; mid-sentence capitalized words are left capitalized. Trying to get capitalization right in this way probably doesn't help if your users usually use lowercase regardless of the correct case of words. Thus, lowercasing everything often remains the most practical solution, especially for web search engines.
Other issues in English
For instance, you might wish to equate ne'er and never, or the British spelling colour and the American spelling color. Dates, times, and similar items come in multiple formats, presenting additional challenges.
Thesaurus retrieval in Google: the ~ operator requests automatic synonym expansion; this is algorithmic expansion, not dictionary expansion.
Date retrieval in Google: dates can be recognized to some extent.
Issues in other languages
English has maintained a dominant position on the WWW; approximately 60% of web pages are in English (Gerrand 2007). But that still leaves 40% of the web, and the non-English portion might be expected to grow over time, since less than one third of Internet users and less than 10% of the world's population primarily speak English. And there are signs of change: Sifry (2007) reports that only about one third of blog posts are in English.
When document collections contain multiple languages, a single index may have to contain terms of several languages. One option is to run a language identification classifier on documents and then tag terms in the vocabulary for their language.
When dealing with foreign or complex words, particularly foreign names, the spelling may be unclear, or there may be variant transliteration standards giving different spellings (Beijing and Peking). One way of dealing with this is to use heuristics to equivalence-class or expand terms with phonetic equivalents. The traditional and best-known such algorithm is the Soundex algorithm.
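A compact sketch of the classic Soundex phonetic code follows (simplified: h, w, and y are treated like vowels, so a few edge cases differ from the strictest variant of the algorithm):

```python
def soundex(name):
    # Classic four-character Soundex: keep the first letter, map
    # remaining consonants to digits, drop vowels, collapse runs
    # of the same digit, and pad/truncate to length 4.
    codes = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"),
                           ("3", "dt"), ("4", "l"),
                           ("5", "mn"), ("6", "r")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # same code: phonetic match
```

Names that sound alike, such as Robert and Rupert, collapse to the same code, which is exactly the equivalence-classing behavior wanted for variant transliterations.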
2.2.2.4 Stemming and lemmatization
In many situations, it seems as if it would be useful for a search for one word in a set to return documents that contain another word in the same set: organize, organizes, organizing.
The definition of stemming
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes: cars → car
The definition of lemmatization
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma: saw → see or saw (depending on whether the token is a verb or a noun)
The goal of both methods
Reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form:
am, are, is → be
car, cars, car's, cars' → car
The result of this mapping of text will be something like:
the boy's cars are different colors → the boy car be differ color
Tools Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process A number of such components exist, both commercial and open-source
Stemming tools (stemmers)
Porter's algorithm (1980)
One-pass Lovins stemmer (1968)
Paice/Husk stemmer (1990)
Porter's algorithm
The most common algorithm for stemming English. Porter's algorithm consists of 5 phases of word reductions, applied sequentially. Within each phase there are various conventions to select rules.
Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes
Porter stemmer: Such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene
Ref: http://tartarus.org/~martin/PorterStemmer/
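To make the "crude heuristic process" concrete, here is a toy suffix-stripping stemmer in the spirit of, but far simpler than, Porter's algorithm (the suffix list and minimum-stem length are illustrative, not Porter's actual rules):

```python
def crude_stem(word):
    # Strip the first matching suffix, longest candidates first,
    # but never leave a stem shorter than 3 characters. Only the
    # equivalence classes matter, not the exact stemmed form.
    for suffix in ("ational", "ization", "ingly", "edly",
                   "ies", "ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(crude_stem("organize"), crude_stem("organizes"), crude_stem("organizing"))
```

All three forms collapse to the same (non-word) stem, which is all a stemmer needs; a real implementation would use Porter's staged rules instead of a flat suffix list.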
Characteristics of a stemmer
Uses language-specific rules
Requires less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words
Particular domains may also require special stemming rules
The exact stemmed form does not matter, only the equivalence classes it forms
Characteristics of a lemmatizer
A tool from natural language processing that does full morphological analysis to accurately identify the lemma for each word
Dilemma
Stemming suits languages with much richer morphology, such as Spanish, German, and Finnish. Results in the European CLEF evaluations have repeatedly shown quite large gains from the use of stemmers, and from compound splitting for languages like German.
However, stemming tends not to improve English information retrieval performance in aggregate: while it helps a lot for some queries, it equally hurts performance a lot for others. It increases recall while harming precision; for example, "operating AND system" is not a good match for a query about operate and system.
2.2.3 Skip pointers
A way to increase the efficiency of using postings lists. In the basic merge, we walk through the two postings lists simultaneously, in time linear in the total number of postings entries. Can we do better than this?
Skip lists
One way to do better is to use a skip list, by augmenting postings lists with skip pointers at indexing time. Skip pointers are effectively shortcuts that allow us to avoid processing parts of the postings list that will not figure in the search results.
The questions are: where to place skip pointers, and how to do efficient merging using skip pointers. It is a tradeoff:
More skips means shorter skip spans, and that we are more likely to skip; but it also means lots of comparisons to skip pointers, and lots of space storing skip pointers
Fewer skips means fewer pointer comparisons, but also fewer opportunities to skip
A simple heuristic for placing skips, which has been found to work well in practice: for a postings list of length P, use √P evenly spaced skip pointers.
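A sketch of intersection with skip pointers, using the √P placement heuristic above (the representation of skips as a dict from position to jump target is an implementation choice, not from the notes):

```python
import math

def add_skips(postings):
    # Place about sqrt(P) evenly spaced skip pointers on a sorted
    # postings list; returns {index: jump_target_index}.
    P = len(postings)
    step = int(math.sqrt(P)) or 1
    return {i: i + step for i in range(0, P - step, step)}

def intersect_with_skips(p1, p2):
    # Merge two sorted docID lists; when p1's current docID is the
    # smaller one, follow a skip pointer if the skip target still
    # does not overshoot p2's current docID.
    skips = add_skips(p1)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            if i in skips and p1[skips[i]] <= p2[j]:
                i = skips[i]          # safe to jump ahead
            else:
                i += 1
        else:
            j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32, 64, 128],
                           [1, 2, 3, 5, 8, 13, 21, 34, 128]))
```

The answer is identical to the linear merge; the skips only change how many comparisons are spent reaching it.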
Characteristics of skip lists
For a complex query, the call hasSkip(p) will always return false; that is to say, complex queries do not use skip lists. The presence of skip pointers only helps for AND queries, not for OR queries. Skips are also harder to maintain if a postings list keeps changing because of updates.
2.2.4 Positional postings and phrase queries Biword indexes Positional indexes Combination schemes
As many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names) entered without double quotes. To support such queries, traditional postings lists are no longer sufficient. A search engine should not only support phrase queries but implement them efficiently. A more complex concept is term proximity weighting, where a document is preferred to the extent that the query terms appear close to each other in the text.
2.2.4.1 Biword indexes
Example: the text "NanJing Normal University" would generate the biwords NanJing Normal and Normal University.
Longer phrases can be processed by breaking them down: the query "stanford university palo alto" is run as "stanford university" AND "university palo" AND "palo alto".
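Generating the biword terms for a token sequence is a one-liner (a sketch; the space-joined string representation of a biword is an assumption):

```python
def biwords(tokens):
    # Each pair of consecutive tokens becomes one vocabulary term.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("stanford university palo alto".split()))
```

A longer phrase query is then answered by ANDing the postings of its biwords, as described above.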
Challenge
This query could be expected to work fairly well in practice, but there can and will be occasional errors: false positives and false negatives. This is because related nouns can often be divided from each other by various function words.
Biword indexes with part-of-speech tagging
Tokenize the text and perform part-of-speech tagging. Group terms into nouns, including proper nouns (N), and function words, including articles and prepositions (X), among other classes. Deem any string of terms of the form NX*N to be an extended biword, and make each such extended biword a term in the vocabulary.
For example: renegotiation of the constitution is tagged N X X N.
Queries also need to be parsed into Ns and Xs. This does not always work in an intuitively optimal manner when parsing longer queries into Boolean queries. For example, cost overruns on a power plant is parsed into "cost overruns" AND "overruns power" AND "power plant"; the middle biword can be omitted.
Many fairly accurate (c. 96% per-tag accuracy) part-of-speech taggers now exist, usually trained by machine learning methods on hand-tagged text (Manning and Schütze, 1999).
Indeed, searches for a single term are not naturally handled in a biword index, so we also need an index of single-word terms.
Phrase indexes
The concept of a biword index can be extended to longer sequences of words; if the index includes variable-length word sequences, it is generally referred to as a phrase index. While there is always a chance of false positive matches, the chance of a false positive match on indexed phrases of length 3 or more becomes very small indeed.
The disadvantage of phrase indexes
Storing longer phrases has the potential to greatly expand the vocabulary size. Using a partial phrase index in a compound indexing scheme is a better idea.
2.2.4.2 Positional indexes
For each term in the vocabulary, we store postings of the form docID: <position1, position2, ...>.
Example: suppose the query is "to be or not to be". The postings lists to access are those for to, be, or, and not. We will examine intersecting the postings lists for to and be.
We first look for documents that contain both terms. Then, we look for places in the lists where there is an occurrence of be with a token index one higher than a position of to, and then we look for another occurrence of each word with token index 4 higher than the first occurrence.
In the lists below, the pattern of occurrences that is a possible match is:
to: {...; 4: {..., 429, 433}; ...}
be: {...; 4: {..., 430, 434}; ...}
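The positional intersection for an exact phrase can be sketched as follows (a simplified illustration assuming the positional index is a dict mapping term → {docID: sorted position list}):

```python
def phrase_match(pos_index, phrase):
    # Return docIDs where the words of `phrase` occur at
    # consecutive positions.
    words = phrase.split()
    # Documents containing every word of the phrase.
    docs = set(pos_index[words[0]])
    for w in words[1:]:
        docs &= set(pos_index[w])
    hits = []
    for d in sorted(docs):
        # Candidate start positions: positions of word i, shifted
        # back by i, must agree for all words.
        starts = set(pos_index[words[0]][d])
        for offset, w in enumerate(words[1:], 1):
            starts &= {p - offset for p in pos_index[w][d]}
        if starts:
            hits.append(d)
    return hits

idx = {"to": {4: [429, 433], 7: [1]},
       "be": {4: [430, 434], 9: [2]}}
print(phrase_match(idx, "to be"))
```

This generalizes the to/be example above: document 4 matches because be occurs one position after to (twice), while documents containing only one of the words are filtered out first.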
k word proximity search The same general method can be applied for within k word proximity searches
For example: employment /3 place. Here, /k means "within k words of (on either side)". Clearly, positional indexes can be used for such queries; biword indexes cannot.
The disadvantage of positional indexes
Adopting a positional index expands required postings storage significantly: a positional index can be 2 to 4 times as large as a non-positional index, and even a compressed positional index can be about one third to one half the size of the raw text.
2.2.4.3 Combination schemes If users commonly query on particular phrases, such as Michael Jackson, it is quite inefficient to keep merging positional postings lists A combination strategy uses a phrase index, or just a biword index, for certain queries and uses a positional index for other phrase queries
Combination strategy
Good queries to include in the phrase index are ones known to be common based on recent querying behavior. The most valuable phrases are ones where the individual words are common but the desired phrase is comparatively rare: "The Who" should be indexed, whereas "Britney Spears" need not be.