


Topic: [Recommended] Part 8 Lecture Notes II: Search Engines

Posted by admin on 2008/5/5 16:03:58

7 Web search and search engines

PPT slides download link:


7.2 Search engine
The components of a web search engine

7.2.1 Web crawling
7.2.1.1 Web crawler
Sometimes referred to as a spider or robot
The objective of a crawler is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them

Features of crawler
Features a crawler must provide
Robustness
Resilient to spider traps
Politeness
Features a crawler should provide
Distributed
Scalable
Performance and efficiency
Quality
Biased towards fetching “useful” pages first
Freshness
Crawl Web pages with a frequency that approximates the rate of change of pages
Extensible
Cope with new data formats, new fetch protocols

7.2.1.2 Web crawling
The process of Web crawling
Begins with one or more URLs that constitute a seed set
Picks a URL from this seed set, then fetches the web page at that URL
The fetched page is then parsed, to extract both the text and the links from the page
The extracted text is fed to a text indexer
The extracted links (URLs) are then added to a URL frontier
As pages are fetched, the corresponding URLs are deleted from URL frontier
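
A minimal single-threaded sketch of this loop in Python; `parse_page` and `index_text` are hypothetical placeholders for the parser and the text indexer, and the politeness, robots filtering and DNS caching discussed in the following sections are omitted:

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=100):
    """Simplified crawl loop: pick a URL, fetch, parse, index, enqueue new links."""
    frontier = deque(seed_urls)      # URL frontier, initialized with the seed set
    seen = set(seed_urls)            # URLs already added to the frontier
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()     # pick a URL from the frontier
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                 # skip pages that cannot be fetched
        fetched += 1
        text, links = parse_page(page)   # hypothetical parser: returns text and hrefs
        index_text(url, text)            # hypothetical hand-off to the text indexer
        for link in links:
            absolute = urljoin(url, link)    # normalize relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)    # add extracted URLs to the frontier
```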

Some spiders of famous search engines
Mercator=Altavista.com
ArchitextSpider=excite.com
Googlebot=Google.com
googlebot@googlebot.com=Google.com
google=Google.com
Slurp.so/1.0=Yahoo
slurp@inktomi.com=Yahoo
MSNBOT/0.1=MSN
baiduspider=Baidu

7.2.1.3 Crawler architecture
The basic crawler architecture
Multi-thread
Iterative process

URL frontier module
Containing URLs yet to be fetched in the current crawl
A previously fetched URL may be put back in the frontier for re-fetching

DNS resolution module
Determines the IP address of the web server from the URL's host name

Fetch module
Uses the http protocol to retrieve the web page

Parsing module
Extracts the text and set of links from a fetched web page

Duplicate elimination module

What is URL frontier?
Gives out a URL when a crawl thread requests one
Maintains the URLs in the frontier
Regurgitates them in some order
The priority of a page should be a function of both its change rate and its quality
Avoid repeated fetch requests to a host within a short time span
A common heuristic is to insert a gap between successive fetch requests

The sub-modules of URL frontier
F front queues
FIFO queues
Implement the prioritization
With assigned priority i, the URL is appended to the ith of the front queues
B back queues
FIFO queues
Implement the politeness
Each queue contains URLs from a single host
A heap
One entry for each back queue, holding the earliest time at which the host corresponding to that queue may be contacted again
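
A rough Python sketch of such a frontier, assuming a fixed politeness gap; the refill policy is simplified (Mercator biases the choice of front queue toward higher priorities rather than draining them all):

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

class MercatorFrontier:
    """Sketch of a Mercator-style frontier: priority front queues,
    one back queue per host, and a heap of earliest-contact times."""

    def __init__(self, num_front_queues=3, politeness_gap=10.0):
        self.front = [deque() for _ in range(num_front_queues)]  # F front queues
        self.back = {}      # host -> back queue (each queue holds URLs of a single host)
        self.heap = []      # (earliest contact time, host), one entry per back queue
        self.gap = politeness_gap   # seconds to wait between requests to the same host

    def add(self, url, priority=0):
        self.front[priority].append(url)   # prioritization: priority i -> i-th front queue

    def _refill_back_queues(self):
        # Move URLs from the front queues into per-host back queues.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).hostname or ""
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (0.0, host))   # contactable immediately
                self.back[host].append(url)

    def next_url(self):
        """Return the next URL whose host may be contacted, respecting politeness."""
        self._refill_back_queues()
        if not self.heap:
            return None
        earliest, host = heapq.heappop(self.heap)
        wait = earliest - time.time()
        if wait > 0:
            time.sleep(wait)                 # politeness: wait until the host is contactable
        url = self.back[host].popleft()
        if self.back[host]:                  # reschedule this host's back queue
            heapq.heappush(self.heap, (time.time() + self.gap, host))
        else:
            del self.back[host]
        return url
```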

What is DNS resolution?
DNS: Domain Name Service
Translating a URL's host name to an IP address by contacting DNS servers
DNS resolution is a well-known bottleneck in web crawling
May recursively call upon other DNS servers
The lookup implementations are generally synchronous

How to remedy?
Caching
URLs for which we have recently performed DNS lookups are likely to be found in the DNS cache
Using a custom DNS resolver
Most web crawlers implement their own DNS resolver as a component of the crawler
Mercator recommends on the order of five attempts
The time quantum of the wait increases exponentially with each of these attempts
Mercator started with one second and ended with roughly 90 seconds
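
A small sketch of a cached resolver with exponentially growing waits, in the spirit of the numbers quoted above; the retry constants are illustrative and the standard `socket.gethostbyname` call stands in for a custom resolver:

```python
import socket
import time

_dns_cache = {}   # host -> IP address, for hosts resolved recently

def resolve(host, attempts=5, first_wait=1.0):
    """Resolve a host name with a local cache and exponentially growing waits."""
    if host in _dns_cache:
        return _dns_cache[host]          # caching: recently resolved hosts hit the cache
    wait = first_wait
    for _ in range(attempts):            # on the order of five attempts
        try:
            ip = socket.gethostbyname(host)
            _dns_cache[host] = ip
            return ip
        except socket.gaierror:
            time.sleep(wait)
            wait *= 3                    # wait grows exponentially: 1 s, 3 s, 9 s, 27 s, 81 s
    raise RuntimeError("DNS resolution failed for " + host)
```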

What is Doc FP (document fingerprinting)?
Tests whether a web page with the same content has already been seen at another URL
Methods
Use a simple fingerprint such as a checksum
A more sophisticated test would use shingles
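
For exact duplicates, a checksum-style fingerprint test might look like the sketch below; detecting near-duplicates would require shingles instead:

```python
import hashlib

seen_fingerprints = set()   # fingerprints of page contents already indexed

def is_duplicate_content(page_text):
    """Doc FP test: identical content seen at another URL yields the same fingerprint."""
    fp = hashlib.md5(page_text.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```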

What is URL filter?
Determine whether the extracted URL should be excluded from the frontier
Exclude certain domains
Include specific domains (vertical search)
Comply with robots.txt

Robots Exclusion Protocol
Placing a file with the name robots.txt at the root of the URL hierarchy at the site
Declaring their websites off-limits to crawling
Robots filtering should be performed before adding such a URL to the frontier, but the robots.txt file may change between visits
In fact, maintaining a cache of robots.txt files is still highly effective

robots.txt
For example
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
No robot should visit any URL whose position in the file hierarchy starts with /yoursite/temp/, except for the robot called “searchengine”
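
Python's standard `urllib.robotparser` implements this protocol; feeding it the example file above gives the expected decisions (the example.com URLs are only for illustration):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())   # a crawler would fetch and cache /robots.txt per host

# Any robot is kept out of /yoursite/temp/, except the one called "searchengine".
print(rp.can_fetch("*", "http://www.example.com/yoursite/temp/page.html"))            # False
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html")) # True
```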

URL normalization
Relative links should be converted into absolute links
Subdir/onefile.html
../../othersubdir/otherfile.html
/rootfile.html
Eliminate URLs of other protocols
mailto:leeshuqing@yahoo.cn
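
Both steps can be done with the standard `urllib.parse` module; the base URL below is only an example:

```python
from urllib.parse import urljoin, urlsplit

base = "http://www.example.com/dir1/dir2/page.html"   # page the links were extracted from

print(urljoin(base, "subdir/onefile.html"))              # .../dir1/dir2/subdir/onefile.html
print(urljoin(base, "../../othersubdir/otherfile.html")) # .../othersubdir/otherfile.html
print(urljoin(base, "/rootfile.html"))                   # http://www.example.com/rootfile.html

# Drop URLs of other protocols (e.g. mailto:) before adding them to the frontier.
print(urlsplit("mailto:leeshuqing@yahoo.cn").scheme in ("http", "https"))   # False
```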

What is duplicate elimination?
If the URL is already in the frontier or (in the case of a non-continuous crawl) already crawled, we do not add it to the frontier

Other tasks in crawling
Performed by a dedicated thread that wakes up periodically
Log crawl progress statistics
Decide whether to terminate the crawl
Checkpoint the crawl for failure recovery

7.2.1.4 Distributed crawling
The basic distributed crawl architecture

How to crawl?
Each node crawls hosts “near” it
By domain
Do not always reflect geographic proximity
By hash value of URL

How to communicate and share URLs?
Use a host splitter to dispatch each surviving URL to the crawler node responsible for the URL
The URL set needs to receive URLs from other nodes

The difficulty in identifying the same content
Nothing can prevent the same (or highly similar) content from appearing on different web servers
The set of fingerprints/shingles must be partitioned across the nodes based on some property of the fingerprint/shingle
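
A possible host splitter, hashing the host name so that all pages of one host land on the same crawler node (the node count is illustrative); hashing full URLs instead would spread a host's pages over many nodes and complicate politeness:

```python
import hashlib
from urllib.parse import urlsplit

NUM_NODES = 4   # number of crawler nodes, chosen for illustration

def node_for_url(url):
    """Map a URL to a crawler node by hashing its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

# URLs from the same host always map to the same node:
print(node_for_url("http://www.example.com/a.html") ==
      node_for_url("http://www.example.com/b.html"))   # True
```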

7.2.2 Indexes
Based on inverted index
Distributing indexes

7.2.2.1 Distributing indexes
Partitioning by terms
Known as global index organization
Partitioning by documents
Known as local index organization
More common

Partitioning by terms
Known as global index organization
A query is routed to the nodes corresponding to its query terms
Allows greater concurrency
But multi-word queries require the sending of long postings lists between sets of nodes for merging
The implementation of dynamic indexing is more difficult

Partitioning by documents
Known as local index organization
Each node contains the index for a subset of all documents
Each query is distributed to all nodes, with the results from various nodes being merged before presentation to the user
One difficulty is that global statistics must be computed across the entire document collection in all nodes

How to decide the partition of documents to nodes?
One simple approach would be to assign all pages from a host to a single node
Unbalanced load
Assign pages based on hash value
More uniform distribution

How to get results?
At query time, the query is broadcast to each of the nodes, with the top k results from each node being merged to find the top k documents for the query
A common implementation heuristic is to partition the document collection into indexes of documents that are more likely to score highly on most queries and low-scoring indexes containing the remaining documents
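
A sketch of the merge step, assuming each node returns its local top-k hits as (score, doc_id) pairs and that scores are globally comparable (which is why global statistics must be shared across nodes):

```python
import heapq

def merged_top_k(per_node_results, k=10):
    """Merge the per-node top-k result lists into the global top-k for the query."""
    all_hits = [hit for node_hits in per_node_results for hit in node_hits]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

# Example: three nodes each return their local top-3 results.
node_results = [
    [(0.91, "d12"), (0.40, "d7"), (0.10, "d3")],
    [(0.85, "d44"), (0.62, "d20"), (0.05, "d9")],
    [(0.70, "d31"), (0.33, "d28"), (0.02, "d50")],
]
print(merged_top_k(node_results, k=3))   # [(0.91, 'd12'), (0.85, 'd44'), (0.7, 'd31')]
```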

7.2.2.2 Connectivity indexes
Connectivity queries
Which URLs link to a given URL?
Which URLs does a given URL link to?

Applications of connectivity indexes
Crawl control
Web graph analysis
Sophisticated crawl optimization
Link analysis

The size of connectivity indexes
Suppose that the Web had four billion pages, each with ten links to other pages
We would require 32 bits or 4 bytes to specify each end (source and destination) of each link
Total cost in space
4 × 10^9 pages × 10 links/page × 8 bytes/link = 3.2 × 10^11 bytes (about 300 GB)

Methods of decreasing memory requirement
Data compression
Must efficiently support connectivity queries
Adjacency table
It has a row for each web page, with the rows ordered by the corresponding integers
The row for any page contains a sorted list of integers, each corresponding to a web page that links to this page (or, in the companion table, a page that this page links to)
Nearly cuts the space taken by the naive representation by 50%
Use as few as 3 bits per link, on average – a dramatic reduction from the 64 required in the naive representation

More ideas in adjacency table
Similarity between lists
If we explicitly represent a prototype row for several similar rows, the remainder can be succinctly expressed in terms of the prototypical row
We use gap encodings in sorted lists: rather than store the destination of each link, we store the offset from the previous entry in the row
Locality
Many links from a page go to “nearby” pages ( pages on the same host, for instance)
We can often use small integers in encoding the destination of a link and thereby save space

An example of adjacency table
1: www.stanford.edu/alchemy
2: www.stanford.edu/biology
3: www.stanford.edu/biology/plant
4: www.stanford.edu/biology/plant/copyright
5: www.stanford.edu/biology/plant/people
6: www.stanford.edu/chemistry
...
1: 1, 2, 4, 8, 16, 32, 64
2: 1, 4, 9, 16, 25, 36, 49, 64
3: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
4: 1, 4, 8, 16, 25, 36, 49, 64
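
A small demonstration of gap encoding applied to row 3 of the table above; the stored offsets are generally much smaller than the absolute page numbers and therefore compress well:

```python
def gap_encode(row):
    """Store each destination as an offset from the previous entry in the sorted row."""
    return [row[0]] + [cur - prev for prev, cur in zip(row, row[1:])]

def gap_decode(gaps):
    """Recover the original sorted row by accumulating the gaps."""
    row, total = [], 0
    for g in gaps:
        total += g
        row.append(total)
    return row

row3 = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]   # row 3 from the table above
print(gap_encode(row3))                   # [1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
print(gap_decode(gap_encode(row3)) == row3)   # True
```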

