3 Index
About hardware basics
The design of indexing algorithms is governed by hardware constraints
The principle of index
Using cache
Keep as much data as possible in memory
Storing data in fewer chunks
Seek time is about 5ms on average
0.2 s to transfer 10 MB from disk to memory if data is stored sequentially, or else 0.7 s
0.2 + 100 × (5 × 0.001) = 0.7
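The arithmetic above can be checked with a short calculation (the seek and transfer figures are the assumed values from the bullets):

```python
# Assumed figures from above: 5 ms average seek time, and 0.2 s to
# transfer 10 MB from disk to memory when stored sequentially.
seek_time_s = 0.005
sequential_read_s = 0.2

# If the same 10 MB is scattered over 100 non-contiguous chunks,
# each chunk costs one extra seek on top of the transfer time:
n_chunks = 100
scattered_read_s = sequential_read_s + n_chunks * seek_time_s
print(round(scattered_read_s, 3))  # 0.7
```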
Using buffer
Reading a single byte from disk can take as much time as reading an entire block
Block sizes of 8 KB, 16 KB, 32 KB and 64 KB are common
Compressing data
Data transfers from disk to memory are handled by the system bus, not by the processor
Assuming an efficient decompression algorithm, the total time of reading and then decompressing compressed data is usually less than reading uncompressed data
3.1 Index construction
Blocked sort-based indexing
Single-pass in-memory indexing
Distributed indexing
Dynamic indexing
3.1.1 Blocked sort-based indexing
Conventional procedure of indexing
First make a pass through the collection assembling all term-docID pairs
Then sort the pairs with the term as the dominant key and docID as the secondary key
Finally organize the docIDs for each term into a postings list and compute statistics like term and document frequency
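The three steps above can be sketched for a tiny in-memory collection (the documents and tokenization are illustrative; docIDs are list positions):

```python
# Minimal sketch of the conventional procedure: collect term-docID
# pairs, sort them, then group docIDs into postings lists.
from collections import defaultdict

docs = ["new home sales", "home sales rise", "rise in sales"]

# Pass over the collection assembling all term-docID pairs
pairs = [(term, doc_id)
         for doc_id, text in enumerate(docs)
         for term in text.split()]

# Sort with term as the dominant key and docID as the secondary key
pairs.sort()

# Organize docIDs per term; document frequency is the list length
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

print(postings["sales"])  # [0, 1, 2] -- every document contains "sales"
```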
How to accelerate indexing for large collections
For small collections, all this can be done in memory
Mapping on the fly while processing the collection
In a two-pass approach
Compile the mapping in the first pass
Construct the index in the second pass
For large collections, collections cannot fit into memory even after compression, so the use of secondary storage is required
Represent terms as termIDs
Design efficient algorithms
Reuters-RCV1 collection
Covers a wide range of international topics
1 GB of text
0.8 million documents
Published during a one-year period between August 20, 1996 and August 19, 1997
100 million tokens
0.4 million terms
Collecting all termID-docID pairs of the collection using 4 bytes each for termID and docID will therefore require 0.8 GB of storage
200 tokens per document
7.5 bytes per term
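The 0.8 GB storage estimate follows directly from the token count and pair size:

```python
# Storage estimate for Reuters-RCV1: 100 million termID-docID pairs,
# with 4 bytes each for the termID and the docID.
n_pairs = 100_000_000
bytes_per_pair = 4 + 4

total_gb = n_pairs * bytes_per_pair / 10**9
print(total_gb)  # 0.8
```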
Solution
Use an external sorting algorithm that minimizes the number of random disk seeks during sorting
Blocked sort-based indexing algorithm (BSBI), similar to the two-phase multiway merge-sort algorithm
Single-pass in-memory indexing algorithm (SPIMI)
Blocked sort-based indexing algorithm
Approach
Parse documents into a collection of termID-docID pairs
Segment the collection into parts of equal size
Sort the termID-docID pairs of each part in memory
Store intermediate sorted results on disk
Merge all intermediate results into the final index
Blocked sort-based indexing algorithm
Phase I
Choose the block size to fit comfortably into memory to permit a fast in-memory sort
The block is then inverted
Sort the termID-docID pairs
Collect all termID-docID pairs with the same termID into a postings list
The block is written to disk
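The Phase I inversion of a single block can be sketched as follows (the in-memory part only; a real implementation would then write the result to disk):

```python
# Sketch of inverting one BSBI block, assuming its termID-docID
# pairs fit comfortably in memory.
def invert_block(pairs):
    """Sort termID-docID pairs and collect docIDs per termID."""
    inverted = {}
    for term_id, doc_id in sorted(pairs):  # termID primary, docID secondary
        inverted.setdefault(term_id, []).append(doc_id)
    return inverted  # in BSBI this block would now be written to disk

block = [(3, 2), (1, 1), (3, 1), (2, 2)]
print(invert_block(block))  # {1: [1], 2: [2], 3: [1, 2]}
```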
Phase II
Assume
Memory: 1 GB
Collection: 100 GB
Number of blocks: 100
Block size: 1 GB
Result
Read 0.01 GB (10 MB) from each of the 100 blocks at a time during the merge, refilling each buffer as it empties
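The Phase II merge combines the sorted runs with a priority queue; in this sketch `heapq.merge` stands in for the buffered 10 MB reads from each block on disk:

```python
# Sketch of the BSBI merge phase: each "block" is a sorted run of
# (termID, docID) pairs, merged into one globally sorted stream.
import heapq

block1 = [(1, 0), (2, 0), (3, 0)]
block2 = [(1, 1), (3, 1)]
block3 = [(2, 2)]

merged = list(heapq.merge(block1, block2, block3))
print(merged)  # [(1, 0), (1, 1), (2, 0), (2, 2), (3, 0), (3, 1)]
```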
Characteristics of BSBI
Has excellent scaling properties
But it needs a data structure for mapping terms to termIDs; for very large collections, this data structure will not fit into memory
3.1.2 Single-pass in-memory indexing
Use terms instead of termIDs
Write each block’s dictionary to disk and then start a new dictionary for the next block
Can index collections of any size as long as there is enough disk space available
Approach of SPIMI
Parse documents into a stream of term-docID pairs, called items
Items are processed one by one, until the entire collection has been processed
When a term occurs for the first time, it is added to the dictionary and a new postings list, whose length is dynamic, is created
Allocate space for a short postings list initially and double the space each time it is full
When memory has been exhausted, sort the posting lists, and write the block to disk
Finally merge the blocks into the final inverted index
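The per-block part of the procedure above can be condensed into a sketch, using terms directly as dictionary keys (no termIDs):

```python
# Condensed sketch of SPIMI-Invert for a single block.
def spimi_invert(token_stream):
    dictionary = {}
    for term, doc_id in token_stream:
        # First occurrence: add the term and create a new postings list.
        # (A real implementation grows the list by doubling, and writes
        # the block to disk when memory is exhausted.)
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # Terms are sorted only once, when the block is written out
    return dict(sorted(dictionary.items()))

stream = [("home", 1), ("sales", 1), ("home", 2), ("top", 2)]
print(spimi_invert(stream))  # {'home': [1, 2], 'sales': [1], 'top': [2]}
```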
Characteristics of SPIMI
Faster, as no sorting of termID-docID pairs is required while they are accumulated
Saves memory, since each postings list is attached directly to its term, so termIDs need not be stored
Easily compress the postings and the dictionary terms
3.1.3 Distributed indexing
Distributed index is partitioned across several machines in search engine, either according to term or according to document
Two basic strategies for index partitioning
According to document
In the local inverted file (IFL) organization, each node is responsible for a disjoint subset of pages in the collection
A search query is broadcast to all the nodes, each of which returns disjoint lists of page identifiers containing the search terms
According to term
The global inverted file (IFG) organization partitions the index by term, so that each query server stores inverted lists only for a subset of the terms in the collection
A search query for pages containing a given term only involves the nodes that store that term
A query for "process", for example, only involves the node responsible for terms beginning with characters in the range [a-q]
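Term-partitioned routing can be illustrated with a toy lookup; the two-node split and its range boundaries are illustrative, not from the source:

```python
# Toy sketch of IFG routing: each node owns a disjoint alphabetical
# range of the term space, and a query term is sent only to the node
# whose range contains it.
import bisect

# Each node owns terms up to (and including) its boundary letter:
# node 0 covers a-q, node 1 covers r-z.
boundaries = ["q", "z"]

def node_for(term):
    return bisect.bisect_left(boundaries, term[0])

print(node_for("process"))  # 0 -> the [a-q] node, as in the example
print(node_for("search"))   # 1 -> the [r-z] node
```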
The comparison of two methods
The IFL strategy is resilient to node failures and reduces network load
Performance studies also indicate that IFL organization uses system resources effectively and provides good query throughput in most cases
MapReduce
A general cluster architecture for distributed computing designed by Google
Aims at solving large computing problems on cheap commodity machines (nodes) built from standard parts
One requirement for robust distributed indexing is therefore that we divide the work up into chunks that we can easily assign and – in case of failure – reassign.
A master node directs the process of assigning and reassigning tasks to individual worker nodes
An example of distributed indexing with MapReduce
Approach of distributed indexing with MapReduce
The input data ( documents ) is split into n splits where the size of the split is chosen to ensure that the work can be distributed evenly and efficiently
Evenly: chunks shouldn’t be too large
Efficiently: the total number of chunks shouldn’t be too large
16MB or 64MB are good sizes
Splits are assigned by the master node
Be sure that the mapping from terms to termIDs is also distributed
A simple solution is to maintain a mapping for frequent terms that is copied to all nodes and to use terms directly for infrequent terms
The map phase consists of mapping splits of the input data ( documents ) to key-value pairs by parsers
Each parser writes its output to local intermediate files, the segment files
To minimize write times before inverters reduce the data, each parser writes its segment files to its local disk
In the reduce phase, the master communicates to each inverter the locations of the relevant segment files; the inverters then bring all values for a given key close together
Finally, the list of values is sorted for each key and written to the final sorted postings list
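The map and reduce phases above can be sketched in a single process (real MapReduce distributes the parsers and inverters across nodes; the documents are illustrative):

```python
# Single-process sketch of distributed indexing with MapReduce.
from collections import defaultdict

def map_phase(split):
    """Parser: emit (term, docID) key-value pairs for one input split."""
    return [(term, doc_id) for doc_id, text in split for term in text.split()]

def reduce_phase(pairs):
    """Inverter: collect and sort all values (docIDs) for each key (term)."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)

splits = [[(1, "caesar came"), (2, "caesar died")], [(3, "came last")]]
segments = [map_phase(s) for s in splits]              # parsers, in parallel
index = reduce_phase([p for seg in segments for p in seg])
print(index["caesar"])  # [1, 2]
```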
3.1.4 Dynamic indexing
The only thing that does not change is change itself
How to construct dynamic index
Periodically reconstruct the index from scratch
Requires that only a small number of changes occur over time
Delay in making new documents searchable is acceptable
Enough resources are available
How to construct dynamic index
Maintain two indexes
A large main index and a small auxiliary index that stores new documents
The auxiliary index is kept in memory
Searches are run across both indexes and results merged
Deletions are recorded in an invalidation bit vector, which is used to filter out deleted documents from search results
Documents are updated by deleting and reinserting them
Each time the auxiliary index becomes too large, we merge it into the main index
If each postings list is stored as a separate file, then the merge simply consists of extending each postings list of the main index by the corresponding postings list of the auxiliary index
If the postings lists are stored together in one large file, logarithmic merging can be used to speed up merging
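Logarithmic merging can be sketched as follows: keep indexes of sizes n, 2n, 4n, ... (one per generation), and whenever two indexes of the same size exist, merge them into the next generation, like carrying in binary addition. Indexes are modeled here as sorted lists of postings:

```python
# Sketch of logarithmic merging with indexes modeled as sorted lists.
def log_merge(generations, new_index, level=0):
    """Insert new_index at `level`, cascading merges like binary carries."""
    while level in generations:
        # A same-size index already exists: merge and carry upward
        new_index = sorted(generations.pop(level) + new_index)
        level += 1
    generations[level] = new_index
    return generations

gens = {}
for block in ([1, 5], [2, 9], [3, 4], [7, 8]):
    log_merge(gens, block)

# After 4 equal-size blocks, only one index remains, at level 2 (size 4n)
print(sorted(gens))  # [2]
```

Each of the n insertions touches at most O(log n) levels, which is where the speedup over repeatedly merging into one big index comes from.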
The dynamic indexing strategy of search engine
In fact, all aspects of an IR system (index maintenance, query processing, distribution, etc.) are more complex with logarithmic merging
So some large search engines adopt a reconstruction-from-scratch strategy
Query processing is then switched to the new index and the old index is deleted