3 Index
About hardware basics
The design of indexing algorithms is governed by hardware constraints
The principle of index
Using cache
Keep as much data as possible in memory
Storing data in fewer chunks
Seek time is about 5ms on average
0.2 s to transfer 10 MB from disk to memory if data is stored sequentially, or else 0.7 s
0.2 + 100 × (5 × 0.001) = 0.7
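The arithmetic above can be checked with a short calculation (the seek and transfer figures are the assumed values from the bullets):

```python
# Assumed figures from above: 5 ms average seek time, and 0.2 s to
# transfer 10 MB from disk to memory when stored sequentially.
seek_time_s = 0.005
sequential_read_s = 0.2

# If the same 10 MB is scattered over 100 non-contiguous chunks,
# each chunk costs one extra seek on top of the transfer time:
n_chunks = 100
scattered_read_s = sequential_read_s + n_chunks * seek_time_s
print(round(scattered_read_s, 3))  # 0.7
```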
Using buffer
Reading a single byte from disk can take as much time as reading an entire block
Block sizes of 8 KB, 16 KB, 32 KB and 64 KB are common
Compressing data
Data transfers from disk to memory are handled by the system bus, not by the processor
Assuming an efficient decompression algorithm, the total time of reading and then decompressing compressed data is usually less than reading uncompressed data
3.1 Index construction
Blocked sort-based indexing
Single-pass in-memory indexing
Distributed indexing
Dynamic indexing
3.1.1 Blocked sort-based indexing
Conventional procedure of indexing
First make a pass through the collection assembling all term-docID pairs
Then sort the pairs with the term as the dominant key and docID as the secondary key
Finally organize the docIDs for each term into a postings list and compute statistics like term and document frequency
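The three steps above can be sketched for a tiny in-memory collection (the documents and tokenization are illustrative; docIDs are list positions):

```python
# Minimal sketch of the conventional procedure: collect term-docID
# pairs, sort them, then group docIDs into postings lists.
from collections import defaultdict

docs = ["new home sales", "home sales rise", "rise in sales"]

# Pass over the collection assembling all term-docID pairs
pairs = [(term, doc_id)
         for doc_id, text in enumerate(docs)
         for term in text.split()]

# Sort with term as the dominant key and docID as the secondary key
pairs.sort()

# Organize docIDs per term; document frequency is the list length
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

print(postings["sales"])  # [0, 1, 2] -- every document contains "sales"
```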
How to accelerate indexing for large collections
For small collections, all this can be done in memory
Mapping on the fly while processing the collection
In a two-pass approach
Compile the mapping in the first pass
Construct the index in the second pass
For large collections, collections cannot fit into memory even after compression, so the use of secondary storage is required
Represent terms as termIDs
Design efficient algorithms
Reuters-RCV1 collection
Covers a wide range of international topics
1 GB of text
0.8 million documents
Published during a one-year period between August 20, 1996 and August 19, 1997
100 million tokens
0.4 million terms
Collecting all termID-docID pairs of the collection using 4 bytes each for termID and docID will therefore require 0.8 GB of storage
200 tokens per document
7.5 bytes per term
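The 0.8 GB storage estimate follows directly from the token count and pair size:

```python
# Storage estimate for Reuters-RCV1: 100 million termID-docID pairs,
# with 4 bytes each for the termID and the docID.
n_pairs = 100_000_000
bytes_per_pair = 4 + 4

total_gb = n_pairs * bytes_per_pair / 10**9
print(total_gb)  # 0.8
```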
Solution
Use an external sorting algorithm that minimizes the number of random disk seeks during sorting
Blocked sort-based indexing algorithm (BSBI), similar to the two-phase multiway merge-sort algorithm
Single-pass in-memory indexing algorithm (SPIMI)
Blocked sort-based indexing algorithm
Approach
Parse documents into a collection of termID-docID pairs
Segment the collection into parts of equal size
Sort the termID-docID pairs of each part in memory
Store intermediate sorted results on disk
Merge all intermediate results into the final index
Blocked sort-based indexing algorithm
Phase I
Choose the block size to fit comfortably into memory to permit a fast in-memory sort
The block is then inverted
Sort the termID-docID pairs
Collect all termID-docID pairs with the same termID into a postings list
The block is written to disk
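The Phase I inversion of a single block can be sketched as follows (the in-memory part only; a real implementation would then write the result to disk):

```python
# Sketch of inverting one BSBI block, assuming its termID-docID
# pairs fit comfortably in memory.
def invert_block(pairs):
    """Sort termID-docID pairs and collect docIDs per termID."""
    inverted = {}
    for term_id, doc_id in sorted(pairs):  # termID primary, docID secondary
        inverted.setdefault(term_id, []).append(doc_id)
    return inverted  # in BSBI this block would now be written to disk

block = [(3, 2), (1, 1), (3, 1), (2, 2)]
print(invert_block(block))  # {1: [1], 2: [2], 3: [1, 2]}
```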
Phase II
Assume
Memory: 1 GB
Collection: 100 GB
Number of blocks: 100
Block size: 1 GB
Result
Read 0.01 GB (10 MB) from each of the 100 blocks at a time during the merge, refilling each buffer as it empties
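The Phase II merge combines the sorted runs with a priority queue; in this sketch `heapq.merge` stands in for the buffered 10 MB reads from each block on disk:

```python
# Sketch of the BSBI merge phase: each "block" is a sorted run of
# (termID, docID) pairs, merged into one globally sorted stream.
import heapq

block1 = [(1, 0), (2, 0), (3, 0)]
block2 = [(1, 1), (3, 1)]
block3 = [(2, 2)]

merged = list(heapq.merge(block1, block2, block3))
print(merged)  # [(1, 0), (1, 1), (2, 0), (2, 2), (3, 0), (3, 1)]
```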
Characteristics of BSBI
Has excellent scaling properties
But it needs a data structure for mapping terms to termIDs; for very large collections, this data structure will not fit into memory
3.1.2 Single-pass in-memory indexing
Use terms instead of termIDs
Write each block’s dictionary to disk and then start a new dictionary for the next block
Can index collections of any size as long as there is enough disk space available
Approach of SPIMI
Parse documents into a stream of term-docID pairs, called items
Items are processed one by one, until the entire collection has been processed
When a term occurs for the first time, it is added to the dictionary and a new postings list, whose length is dynamic, is created
Allocate space for a short postings list initially and double the space each time it is full
When memory has been exhausted, sort the posting lists, and write the block to disk
Finally merge the blocks into the final inverted index
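The per-block part of the procedure above can be condensed into a sketch, using terms directly as dictionary keys (no termIDs):

```python
# Condensed sketch of SPIMI-Invert for a single block.
def spimi_invert(token_stream):
    dictionary = {}
    for term, doc_id in token_stream:
        # First occurrence: add the term and create a new postings list.
        # (A real implementation grows the list by doubling, and writes
        # the block to disk when memory is exhausted.)
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # Terms are sorted only once, when the block is written out
    return dict(sorted(dictionary.items()))

stream = [("home", 1), ("sales", 1), ("home", 2), ("top", 2)]
print(spimi_invert(stream))  # {'home': [1, 2], 'sales': [1], 'top': [2]}
```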
Characteristics of SPIMI
Faster, as no sorting of termID-docID pairs is required while they are accumulated
Saves memory, since each postings list is attached directly to its term, so termIDs need not be stored
Easily compress the postings and the dictionary terms
3.1.3 Distributed indexing
Distributed index is partitioned across several machines in search engine, either according to term or according to document
Two basic strategies for index partitioning
According to document
In the local inverted file (IFL) organization, each node is responsible for a disjoint subset of pages in the collection
A search query is broadcast to all the nodes, each of which returns disjoint lists of page identifiers containing the search terms
According to term
The global inverted file (IFG) organization partitions the index by term, so that each query server stores inverted lists only for a subset of the terms in the collection
A search query for pages containing a given term only involves the nodes that store that term
A query for "process", for example, only involves the node responsible for terms beginning with characters in the range [a-q]
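Term-partitioned routing can be illustrated with a toy lookup; the two-node split and its range boundaries are illustrative, not from the source:

```python
# Toy sketch of IFG routing: each node owns a disjoint alphabetical
# range of the term space, and a query term is sent only to the node
# whose range contains it.
import bisect

# Each node owns terms up to (and including) its boundary letter:
# node 0 covers a-q, node 1 covers r-z.
boundaries = ["q", "z"]

def node_for(term):
    return bisect.bisect_left(boundaries, term[0])

print(node_for("process"))  # 0 -> the [a-q] node, as in the example
print(node_for("search"))   # 1 -> the [r-z] node
```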
The comparison of two methods
The IFL strategy is resilient to node failures and reduces network load
Performance studies also indicate that IFL organization uses system resources effectively and provides good query throughput in most cases
MapReduce
A general cluster architecture for distributed computing designed by Google
Aims at solving large computing problems on cheap commodity machines (nodes) built from standard parts
One requirement for robust distributed indexing is therefore that we divide the work up into chunks that we can easily assign and – in case of failure – reassign.
A master node directs the process of assigning and reassigning tasks to individual worker nodes
An example of distributed indexing with MapReduce
Approach of distributed indexing with MapReduce
The input data ( documents ) is split into n splits where the size of the split is chosen to ensure that the work can be distributed evenly and efficiently
Evenly: chunks shouldn’t be too large
Efficiently: the total number of chunks shouldn’t be too large
16MB or 64MB are good sizes
Splits are assigned by the master node
Be sure that the mapping from terms to termIDs is also distributed
A simple solution is to maintain a mapping for frequent terms that is copied to all nodes and to use terms directly for infrequent terms
The map phase consists of mapping splits of the input data ( documents ) to key-value pairs by parsers
Each parser writes its output to local intermediate files, the segment files
To minimize write times before inverters reduce the data, each parser writes its segment files to its local disk
In the reduce phase, the master communicates to each inverter the locations of the relevant segment files; the inverters then bring all values for a given key close together
Finally, the list of values is sorted for each key and written to the final sorted postings list
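The map and reduce phases above can be sketched in a single process (real MapReduce distributes the parsers and inverters across nodes; the documents are illustrative):

```python
# Single-process sketch of distributed indexing with MapReduce.
from collections import defaultdict

def map_phase(split):
    """Parser: emit (term, docID) key-value pairs for one input split."""
    return [(term, doc_id) for doc_id, text in split for term in text.split()]

def reduce_phase(pairs):
    """Inverter: collect and sort all values (docIDs) for each key (term)."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)

splits = [[(1, "caesar came"), (2, "caesar died")], [(3, "came last")]]
segments = [map_phase(s) for s in splits]              # parsers, in parallel
index = reduce_phase([p for seg in segments for p in seg])
print(index["caesar"])  # [1, 2]
```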
3.1.4 Dynamic indexing
The only thing that does not change is change itself
How to construct dynamic index
Periodically reconstruct the index from scratch
Requires that only a small number of changes occur over time
Delay in making new documents searchable is acceptable
Enough resources are available
How to construct dynamic index
Maintain two indexes
A large main index and a small auxiliary index that stores new documents
The auxiliary index is kept in memory
Searches are run across both indexes and results merged
Deletions are recorded in an invalidation bit vector, which is used to filter out deleted documents from search results
Documents are updated by deleting and reinserting them
Each time the auxiliary index becomes too large, we merge it into the main index
If each postings list is stored as a separate file, then the merge simply consists of extending each postings list of the main index by the corresponding postings list of the auxiliary index
If the postings lists are stored together in one large file, logarithmic merging can be used to speed up merging
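Logarithmic merging can be sketched as follows: keep indexes of sizes n, 2n, 4n, ... (one per generation), and whenever two indexes of the same size exist, merge them into the next generation, like carrying in binary addition. Indexes are modeled here as sorted lists of postings:

```python
# Sketch of logarithmic merging with indexes modeled as sorted lists.
def log_merge(generations, new_index, level=0):
    """Insert new_index at `level`, cascading merges like binary carries."""
    while level in generations:
        # A same-size index already exists: merge and carry upward
        new_index = sorted(generations.pop(level) + new_index)
        level += 1
    generations[level] = new_index
    return generations

gens = {}
for block in ([1, 5], [2, 9], [3, 4], [7, 8]):
    log_merge(gens, block)

# After 4 equal-size blocks, only one index remains, at level 2 (size 4n)
print(sorted(gens))  # [2]
```

Each of the n insertions touches at most O(log n) levels, which is where the speedup over repeatedly merging into one big index comes from.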
The dynamic indexing strategy of search engine
In fact, all aspects of an IR system (index maintenance, query processing, distribution, etc.) are more complex with logarithmic merging
So some large search engines adopt a reconstruction-from-scratch strategy
Query processing is then switched to the new index and the old index is deleted