-- Author: admin
-- Posted: 2008/3/25 6:07:52
-- [Recommended] Part 4 lecture notes: Indexing
3 Index PPT courseware download link:
About hardware basics
The design of indexing algorithms is governed by hardware constraints.
The principles of indexing:
- Use caching: keep as much data as possible in memory.
- Store data in fewer, larger chunks: seek time is about 5 ms on average. Transferring 10 MB from disk to memory takes about 0.2 s if the data is stored sequentially, but about 0.7 s if it is scattered over 100 chunks, since each chunk requires a seek: 0.2 + 100 × (5 × 0.001) = 0.7 s.
- Use buffers: reading a single byte from disk can take as much time as reading an entire block. Block sizes of 8 KB, 16 KB, 32 KB and 64 KB are common.
- Compress data: data transfers from disk to memory are handled by the system bus, not by the processor. Assuming an efficient decompression algorithm, the total time of reading and then decompressing compressed data is usually less than reading uncompressed data.
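The seek-versus-transfer arithmetic above can be checked directly. This is a back-of-the-envelope sketch using only the numbers stated in the notes (5 ms average seek, 0.2 s to transfer 10 MB), not measurements:

```python
# Illustrative calculation: reading 10 MB sequentially vs. scattered
# across 100 chunks, each chunk costing one disk seek.
SEEK_TIME = 5e-3      # 5 ms average seek time (from the notes)
TRANSFER_10MB = 0.2   # 0.2 s to move 10 MB from disk to memory

def read_time(n_seeks: int) -> float:
    """Total time = transfer time + one seek per non-contiguous chunk."""
    return TRANSFER_10MB + n_seeks * SEEK_TIME

print(f"sequential (1 seek):   {read_time(1):.3f} s")
print(f"scattered (100 seeks): {read_time(100):.3f} s")  # 0.2 + 100*0.005 = 0.7 s
```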
3.1 Index construction
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
3.1.1 Blocked sort-based indexing
The conventional process of indexing:
- First make a pass through the collection, assembling all term-docID pairs.
- Then sort the pairs with the term as the dominant key and docID as the secondary key.
- Finally, organize the docIDs for each term into a postings list and compute statistics like term and document frequency.
How to accelerate indexing for large collections:
- For small collections, all of this can be done in memory, either by mapping terms to termIDs on the fly while processing the collection, or in a two-pass approach: build the mapping in the first pass and construct the index in the second pass.
- Large collections cannot fit into memory even after compression, so the use of secondary storage is required: represent terms as termIDs and design efficient algorithms.
The Reuters-RCV1 collection:
- Covers a wide range of international topics.
- 1 GB of text: 0.8 million documents published during the one-year period between August 20, 1996, and August 19, 1997.
- 100 million tokens, 0.4 million terms, about 200 tokens per document, 7.5 bytes per term on average.
- Collecting all termID-docID pairs of the collection, using 4 bytes each for termID and docID, therefore requires 100 million × 8 bytes = 0.8 GB of storage.
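The 0.8 GB figure follows directly from the token count and the pair encoding. A quick sanity check:

```python
# Storage estimate for all termID-docID pairs of Reuters-RCV1.
tokens = 100_000_000     # 100 million tokens, one termID-docID pair each
bytes_per_pair = 4 + 4   # 4 bytes for termID + 4 bytes for docID
storage_gb = tokens * bytes_per_pair / 1e9
print(f"{storage_gb} GB")  # 0.8 GB, as stated in the notes
```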
Solution: use an external sorting algorithm that minimizes the number of random disk seeks during sorting.
- Blocked sort-based indexing algorithm (BSBI), similar to the Two-Phase Multiway Merge-Sort algorithm.
- Single-pass in-memory indexing algorithm (SPIMI).
Blocked sort-based indexing algorithm
Approach:
- Parse documents into a collection of termID-docID pairs.
- Segment the collection into parts of equal size.
- Sort the termID-docID pairs of each part in memory.
- Store intermediate sorted results on disk.
- Merge all intermediate results into the final index.
Phase I:
- Choose the block size to fit comfortably into memory, to permit a fast in-memory sort.
- The block is then inverted: sort the termID-docID pairs, then collect all pairs with the same termID into a postings list.
- The block is written to disk.
Phase II:
- Assume 1 GB of memory and a 100 GB collection, giving 100 blocks of 1 GB each.
- During the merge, read 0.01 GB (10 MB) from every block simultaneously on each pass.
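The two phases can be sketched in a few lines of Python. This is an illustrative toy, not a production implementation: the sorted runs are kept in memory rather than written to disk, and `heapq.merge` plays the role of reading from all blocks simultaneously during the merge.

```python
import heapq
from itertools import groupby

def bsbi_invert(pairs):
    """Phase I for one block: sort termID-docID pairs and group them
    into (termID, postings list) entries."""
    pairs.sort()  # sort by (termID, docID)
    return [(term, [doc for _, doc in grp])
            for term, grp in groupby(pairs, key=lambda p: p[0])]

def bsbi(pair_stream, block_size):
    """Blocked sort-based indexing over a stream of (termID, docID) pairs."""
    runs, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) == block_size:       # block full: invert and "write to disk"
            runs.append(bsbi_invert(block))
            block = []
    if block:
        runs.append(bsbi_invert(block))
    # Phase II: merge all sorted runs, reading from every run simultaneously.
    index = {}
    for term, postings in heapq.merge(*runs):
        index.setdefault(term, []).extend(postings)
    return {term: sorted(p) for term, p in index.items()}
```

For example, `bsbi(iter([(2, 1), (1, 2), (1, 1), (2, 3), (1, 3), (3, 1)]), block_size=2)` produces three sorted runs and merges them into `{1: [1, 2, 3], 2: [1, 3], 3: [1]}`.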
Characteristics of BSBI: it has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure will not fit into memory.
3.1.2 Single-pass in-memory indexing
- Uses terms instead of termIDs.
- Writes each block's dictionary to disk and then starts a new dictionary for the next block.
- Can index collections of any size, as long as there is enough disk space available.
Approach of SPIMI:
- Parse documents into a stream of term-docID pairs, called items.
- Items are processed one by one until the entire collection has been processed.
- When an item occurs for the first time, it is added to the dictionary and a new postings list is created. Its length is dynamic: allocate space for a short postings list initially and double the space each time it is full.
- When memory has been exhausted, sort the postings lists and write the block to disk.
- Finally, merge the blocks into the final inverted index.
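A minimal sketch of this approach, with the memory limit simulated by a maximum number of items per block; the function names and the in-memory list of blocks are illustrative, since a real implementation writes each block's sorted dictionary to disk:

```python
def spimi_invert(item_stream, max_items):
    """Consume up to max_items term-docID items into one block's dictionary."""
    dictionary = {}
    for n, (term, doc_id) in enumerate(item_stream, 1):
        # Terms are used directly as keys: no term-to-termID mapping needed.
        dictionary.setdefault(term, []).append(doc_id)  # list grows dynamically
        if n == max_items:      # "memory exhausted": finish this block
            break
    # Sort the terms only when the block is written out.
    return dict(sorted(dictionary.items()))

def spimi(items, max_items=4):
    """Index a collection block by block, then merge the blocks."""
    items = iter(items)
    blocks = []
    while True:
        block = spimi_invert(items, max_items)
        if not block:
            break
        blocks.append(block)    # a real SPIMI writes the block to disk here
    index = {}                  # final merge of all block dictionaries
    for block in blocks:
        for term, postings in block.items():
            index.setdefault(term, []).extend(postings)
    return index
```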
Characteristics of SPIMI:
- Faster, because no sorting of pairs is required while a block is built.
- Saves memory, since each postings list keeps track of the term it belongs to, so termIDs need not be stored.
- The postings and the dictionary terms can easily be compressed.
3.1.3 Distributed indexing
In a search engine, a distributed index is partitioned across several machines, either according to term or according to document.
Two basic strategies for index partitioning:
- According to document: in the local inverted file (IFL) organization, each node is responsible for a disjoint subset of pages in the collection. A search query is broadcast to all the nodes, each of which returns disjoint lists of page identifiers containing the search terms.
- According to term: the global inverted file (IFG) organization partitions on index terms, so that each query server stores inverted lists only for a subset of the terms in the collection. A search query asking for pages containing a given term only involves the node storing that term; for example, the query "process" only involves the node responsible for terms beginning with characters in the range [a-q].
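Term-partitioned routing can be illustrated in a few lines. The node names and letter ranges below are hypothetical, chosen only to mirror the [a-q] example above:

```python
# Hypothetical routing for a term-partitioned (IFG) index: each node
# stores postings for terms whose first letter falls in its range.
NODE_RANGES = {"node0": ("a", "q"), "node1": ("r", "z")}

def route(term: str) -> str:
    """Return the node responsible for this term's postings list."""
    first = term[0].lower()
    for node, (lo, hi) in NODE_RANGES.items():
        if lo <= first <= hi:
            return node
    raise ValueError(f"no node covers term {term!r}")

print(route("process"))  # node0: 'p' falls in the range [a-q]
```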
Comparison of the two methods: the IFL strategy is resilient to node failures and reduces network load. Performance studies also indicate that the IFL organization uses system resources effectively and provides good query throughput in most cases.
MapReduce
- A general cluster architecture for distributed computing designed by Google.
- Aims at solving large computing problems on cheap commodity machines, or nodes, built from standard parts.
One requirement for robust distributed indexing is therefore that we divide the work up into chunks that we can easily assign and, in case of failure, reassign. A master node directs the process of assigning and reassigning tasks to individual worker nodes.
An example of distributed indexing with MapReduce:
Approach of distributed indexing with MapReduce:
- The input data (documents) is split into n splits, where the size of a split is chosen to ensure that the work can be distributed evenly and efficiently. Evenly: chunks shouldn't be too large. Efficiently: the total number of chunks shouldn't be too large. 16 MB or 64 MB are good sizes. Splits are assigned by the master node.
- Ensure that the mapping from terms to termIDs is also distributed. A simple solution is to maintain a mapping for frequent terms that is copied to all nodes, and to use terms directly for infrequent terms.
- The map phase consists of parsers mapping splits of the input data (documents) to key-value pairs. Each parser writes its output to local intermediate files, the segment files.
- To minimize write times before the inverters reduce the data, each parser writes its segment files to its local disk. In the reduce phase, the master communicates to each inverter the locations of the relevant segment files, and the inverters bring all values for a given key close together.
- Finally, the list of values is sorted for each key and written to the final sorted postings list.
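The map and reduce phases above can be sketched as a single-process toy. The `partition` function and the sample documents are assumptions for illustration; in a real MapReduce deployment, parsers and inverters run on different machines and the segment files live on local disks:

```python
from collections import defaultdict

def map_phase(splits, partition):
    """Parsers: turn each split of documents into (term, docID) key-value
    pairs, written into per-partition segment 'files'."""
    segments = defaultdict(list)   # partition number -> segment file contents
    for doc_id, text in splits:
        for term in text.split():
            segments[partition(term)].append((term, doc_id))
    return segments

def reduce_phase(segments):
    """Inverters: for each partition, bring all values for a key together,
    sort them, and emit the final postings lists."""
    index = {}
    for part in segments.values():
        grouped = defaultdict(list)
        for term, doc_id in part:
            grouped[term].append(doc_id)
        for term, postings in grouped.items():
            index[term] = sorted(postings)
    return index

# Toy key partition (an assumption): term ranges a-i, j-q, r-z.
def partition(term):
    c = term[0]
    return 0 if c <= "i" else 1 if c <= "q" else 2

docs = [(1, "web search engine"), (2, "search index")]
index = reduce_phase(map_phase(docs, partition))
print(index["search"])  # [1, 2]
```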
3.1.4 Dynamic indexing
The only thing that does not change is change itself.
How to construct a dynamic index (option 1): periodically reconstruct the index from scratch. This works well if:
- only a small number of changes accumulate over time;
- a delay in making new documents searchable is acceptable;
- enough resources are available.
How to construct a dynamic index (option 2): maintain two indexes.
- A large main index and a small auxiliary index that stores new documents; the auxiliary index is kept in memory.
- Searches are run across both indexes and the results merged.
- Deletions are stored in an invalidation bit vector, which is used to filter out deleted documents from the results.
- Documents are updated by deleting and reinserting them.
Each time the auxiliary index becomes too large, we merge it into the main index.
- If each postings list is stored as a separate file, the merge simply consists of extending each postings list of the main index by the corresponding postings list of the auxiliary index.
- If the index is stored as one large file, logarithmic merging can be used to speed up merging. This is the dynamic indexing strategy of search engines; in fact, all aspects of an IR system (index maintenance, query processing, distribution, etc.) are more complex with logarithmic merging.
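The main-plus-auxiliary scheme can be sketched as follows, assuming in-memory dictionaries for both indexes and a simple size threshold that triggers the merge; a real system keeps the main index on disk and the threshold would be a memory budget:

```python
class DynamicIndex:
    """Sketch of the two-index scheme: small in-memory auxiliary index,
    large main index, and an invalidation set for deletions."""

    def __init__(self, merge_threshold=3):
        self.main = {}      # large main index: term -> postings list
        self.aux = {}       # small auxiliary index for new documents
        self.invalid = set()  # "invalidation bit vector" of deleted docIDs
        self.merge_threshold = merge_threshold

    def add(self, doc_id, terms):
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)
        if len(self.aux) >= self.merge_threshold:  # auxiliary index too large
            self._merge()

    def delete(self, doc_id):
        # Filter at query time instead of rewriting postings lists.
        self.invalid.add(doc_id)

    def _merge(self):
        # Extend each main postings list by the corresponding auxiliary list.
        for term, postings in self.aux.items():
            self.main.setdefault(term, []).extend(postings)
        self.aux = {}

    def search(self, term):
        # Run the query across both indexes and filter deleted documents.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in hits if d not in self.invalid]
```

A document update would be expressed as `delete(doc_id)` followed by `add(doc_id, new_terms)`, matching the delete-and-reinsert policy above.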
In practice, large search engines adopt a reconstruction-from-scratch strategy: query processing is then switched to the new index, and the old index is deleted.
[This post was edited by the author on 2010-12-14 09:17:02]