[推荐]第九部分课堂讲义——XML检索 - 信息检索原理课件 - 课外天地李树青

8 XML Retrieval
PPT课件下载链接：

下载信息 [文件大小：下载次数： ]
点击浏览该文件:

Two Retrieval Methods
Unstructured text retrieval
Handled by IR systems
Structured text retrieval
Relational data retrieval
Handled by DBMS
Structured documents retrieval
XML Retrieval

Relational data retrieval
For example:
select * from stu where name like ’李%’
Meaning that an inverted index is created and Boolean or vector space search enabled, effectively combining core database with information retrieval technologies

Semi-structured retrieval
The term structured retrieval is used for XML retrieval
The term semi-structured retrieval is mainly used for database querying

Structured documents retrieval
Applications of structured documents retrieval include:
Digital libraries
Patent databases
Blogs
Text in which entities like persons and locations have been tagged (in a process called named entity tagging)
Output from office suites like OpenOffice that save documents as marked up text

We want to query that combine textual criteria with structural criteria
For example:
Give me patents whose claims mention RSA public key encryption and that cite US patent 4,405,829
These three queries are structured queries that cannot be answered well by an unranked retrieval system
And these retrieved data is not in DBMS but in raw text documents

The comparison of these methods
There is no consensus yet as to which methods work best for structured retrieval although many researchers believe that XQuery will become the standard for structured queries

The comparison of XML retrieval and relational databases retrieval
The XML retrieval is characterized by:
Long text fields (e.g., sections of a document)
Inexact matching
Relevance-ranked results
Relational databases do not deal well with this use case

XML retrieval
XML retrieval is one standard retrieval methods for encoding structured documents
These retrieval method is also applicable to markup languages in general such as HTML etc.

The characteristics of XML retrieval
Aimed at text-centric XML
Not aimed at data-centric XML
Most data-centric XML is stored in relational databases in contrast to the inverted index-based methods for text-centric XML

8.1 Basic XML concepts
An example of XML document
<play>
<author>Shakespeare</author>
<title>Macbeth</title>
<act number="I">
<scene number="vii">
<title>Macbeth’s castle</title>
<verse>Will I with wine and wassail ...</verse>
</scene>
</act>
</play>

Document Object Model ( DOM )
This figure shows a simplified DOM object
DOM is the standard for accessing and processing XML documents
With a DOM API, we can process an XML document by starting at the root element and then descending down the tree from parents to children

XPath
Also called as XML contexts
Standard for enumerating paths in an XML document collection

The characteristics of XPath
The XPath expression node selects all nodes of that name
“act/scene” selects all scene elements whose parent is an act element
Double slashes indicate that an arbitrary number of elements can intervene on a path
“play//scene” selects all scene elements occurring in a play element

An initial slash starts the path at the root element
“/play/title” selects the play’s title
“/play//title” selects a set with two members
The final element of a path to be a vocabulary term and separate it from the element path by the symbol #
Only for notational convenience but not conforming to the XPath standard
“ title#"Macbeth" ” selects all titles containing the term Macbeth

NEXI
Narrowed Extended XPath I
A common format for XML queries
For example:
//article [.//yr = 2001 or .//yr = 2002] //section [about(.,summer holidays)]
Specifieing a search for a section about the summer holidays that are part of articles from 2001 or 2002
The dot in a clause in square brackets refers to the element the clause modifies
The about clause is a ranking constraint

8.2 Challenges in XML retrieval
8.2.1 How to choose the result unit
Users want to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval
If users query Shakespeare’s plays for Macbeth’s castle, they are probably looking for the scene, not the act or the entire play

Structured document retrieval principle
A system should always retrieve the most specific part of a document answering the query
One criterion for selecting the most appropriate part of a document
This principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level

One difficulty
However, it is not invariable and can be hard to implement this principle algorithmically
Consider the query title#"Macbeth"
The title of the tragedy, Macbeth, and the title of Act I, Scene vii, Macbeth’s castle, are both good hits because they contain the matching term Macbeth
But in this case, the title of the tragedy, the higher node, is preferred

8.2.2 How to choose the index unit
The difficulties also include deciding which parts of a document to index
In unstructured retrieval, it is usually clear what the right document unit is: files on your desktop, email messages, web pages on the web etc.
In structured retrieval, there are a number of different approaches to defining the indexing unit

In the example, books, chapters and sections have been designated to be indexing units, but without overlap
The disadvantage of this approach is that pseudo-documents may not make sense to the user because they are not coherent units

Top-down method
We can also use one of the largest elements as the indexing unit, for example, the book element in a collection of books
We can then postprocess search results to find for each book the subelement that is the best hit
For example, the query Macbeth’s castle may return the play Macbeth, which we can then postprocess to identify act I, scene vii as the best-matching subelement
Unfortunately, the relevance of the whole is often not a good predictor of the relevance of small subelements within it

Bottom-up method
We can also search all leaves, select the most relevant ones and then extend them to larger units in postprocessing
For the query Macbeth’s castle, we would retrieve the title Macbeth’s castle in the first pass and then decide in a postprocessing step whether to return the title, the scene, the act or the play
Unfortunately, the relevance of a leaf element is also often not a good predictor of the relevance of elements it is contained in

Indexing-all-elements method
The least restrictive approach is to index all elements
This is also problematic
Many XML elements are not meaningful search results, e.g., typographical elements like <b>definitely</b> or an ISBN number which cannot be interpreted without context
Also, indexing all elements means that search results will be highly redundant
For the query Macbeth’s castle, we would return all of the play, act, scene and title elements on the path between the root node and Macbeth’s castle

Restriction strategies in Indexing-all-elements method
Discard all small elements
Discard all element types that users do not look at (this requires a working XML retrieval system that logs this information)
Discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available)
Only keep element types that a system designer or librarian has deemed to be useful search results

8.2.3 How to reduce redundancy of result
In most of these approaches, result sets will still contain nested elements

The solution
Removing some elements in a postprocessing step
Collapsing several nested elements in the results list and use highlighting of query terms
An additional advantage of this approach is that the paragraph is presented together with its context section
This context may be helpful in interpreting the paragraph
If the user knows the schema of the collection and is able to specify the desired type of element, then the problem of redundancy is alleviated as few nested elements have the same type

8.2.4 How to distinguish different contexts of a term in compute term statistics for ranking
For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate
It makes little sense to compute a single document frequency for Gates in this example

The solution
Computing idf for XML-context/term pairs
For example, computing different idf weights for author#"Gates" and section#"Gates“
Unfortunately, this scheme will run into sparse data problems
Many XML-context pairs occur too rarely to reliably estimate df
A compromise is only to consider the parent node x of the term and not the rest of the path from the root to x to distinguish contexts
Unfortunately, we do not distinguish names of authors and names of corporations if both have the parent node name

8.2.5 How to handle schema heterogeneity
The XML documents in an IR application often come from more than one source
This phenomenon is called schema heterogeneity or schema diversity
An example of schema heterogeneity
Many elements may have different names
creator vs. author

The solution
Some form of approximate matching of element names in combination with semi-automatic matching of different document structures can help here
Human editing of correspondences of elements in different schemas will usually do better than automatic methods

The challenge for interface design
Another reason is that users often are not familiar with the element names and the structure of the schemas of collections they search as mentioned before
This poses a challenge for interface design in XML retrieval
Ideally, the user interface should expose the tree structure of the collection and allow users to specify the elements they are querying
Designing the query interface in structured retrieval is more complex than a search box for keyword queries in unstructured retrieval

Extended queries
supporting the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed
Some examples with pseudo-XPath notation
book//#"Gates"
book//section//#"Gates"

This is a case where we may want to interpret the structural constraints specified in the query
Elements that do not meet structural constraints perfectly should be ranked lower, but they should not be omitted from search results

8.3 A vector space model for XML retrieval
An example
We want a book entitled Julius Caesar to be a match for q1 and no match (or a lower weighted match) for q2

The necessary of VSM in XML retrieval
In unstructured retrieval, there would be a single dimension of the vector space for Caesar
In XML retrieval, we must separate the title word Caesar from the author name Caesar
One way of doing this is to have each dimension of the vector space encode a word together with its position within the XML tree

Lexicalized subtrees
The dimensions of the vector space
Not vocabulary terms
The subtrees that contain at least one vocabulary term
We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them

About space of VSM in XML retrieval
There is a tradeoff between the dimensionality of the space and accuracy of query results
A compromise is to index all paths that end in a single vocabulary term, in other words, all XML-context/term pairs
Such an XML-context/term pair is called by a structural term and denote it by <c, t>: a pair of XML-context c and vocabulary term t
This is why only seven structural term are shown in figure

The computation of VSM in XML retrieval
Because of users’ habit, we will therefore interpret all queries as extended queries
But we still prefer documents that match the query structure closely by inserting fewer additional nodes

A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path, respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes

SimNoMerge
The final score for a document is computed as a variant of the cosine measure which we call
We compute the weights using one of the weightings, e.g., idft·wft,d

8.4 Evaluation of XML Retrieval
The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program
INEX is a collaborative effort that has produced reference collections, sets of queries, and relevance judgments

INEX
The INEX 2002 collection consisted of about 12,000 articles from IEEE journals
The IEEE journal collection was expanded in 2005
Since 2006 INEX uses the much larger English Wikipedia as a test collection

The schema of INEX collection
Two types of information needs in INEX
Content-Only: CO topics
Content-And-Structure: CAS topics
Since CAS queries have both structural and content criteria, relevance assessments are more complicated than in unstructured retrieval

How to evaluate
INEX 2002 defined two orthogonal dimensions of relevance
Component coverage
Topical relevance

Component coverage
It evaluates whether the element retrieved is “structurally” correct, i.e., neither too low nor too high in the tree
We distinguish four cases:
Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information
Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information
Too large (L): The information sought is present in the component, but is not the main topic
No coverage (N): The information sought is not a topic of the component

Topical relevance
The topical relevance dimension also has four levels:
highly relevant (3)
fairly relevant (2)
marginally relevant (1)
nonrelevant (0)

The denotation of evaluation result
Components are judged on both dimensions and the judgments are then combined into a digit-letter code
2S is a fairly relevant component that is too small
3E is a highly relevant component that has exact coverage
In theory, there are 16 combinations of coverage and relevance, but many cannot occur
A nonrelevant component cannot have exact coverage, so the combination 3N is not possible

Evaluation scheme
The relevance-coverage combinations are then quantized as follows:
Binary relevance judgments, which are standard in unstructured information retrieval, are not appropriate for XML retrieval
A 2S component provides incomplete information and may be difficult to interpret without more context, but it does answer the query partially
The quantization function Q does not impose a binary choice relevant/nonrelevant and instead allows us to grade the component as partially relevant

The number of relevant components in a retrieved set A of components can then be computed as:

As an approximation, the standard definitions of precision, recall and F can be applied to this modified definition of relevant items retrieved

The disadvantage of this evaluation scheme
One flaw of measuring relevance this way is that overlap is not accounted for
Need to consider marginal relevance
Much of the recent focus at INEX has been on developing algorithms and evaluation measures that return non-redundant results lists and evaluate them properly

Why the effectiveness in XML retrieval is often lower?
XML retrieval is harder
Instead of just finding a document, we have to find the subpart of a document that is most relevant to the query
Graded judgments lower measured performance
Consider a system that returns a document with graded relevance 0.6 and binary relevance 1 at the top of the retrieved list. Then, interpolated precision at 0.00 recall is 1.0 on a binary evaluation, but can be as low as 0.6 on a graded evaluation

A comparison of content-only and full-structure search
In INEX 2003/2004
The evaluation metric is precision at k
The discretization function used for the evaluation maps highly relevant elements (roughly corresponding to the 3E elements defined for Q) to 1 and all other elements to 0

These results demonstrate the benefits of structured retrieval
The table shows that structure helps increase precision at the top of the results list
There is a large increase of precision at k = 5 and at k = 10
There is almost no improvement at k = 30
Structured retrieval imposes additional constraints on what to return and documents that pass the structural filter are more likely to be relevant
Recall may suffer at high levels because some relevant documents will be filtered out, but for precision-oriented tasks structured retrieval is superior

8.5 Some complicated XML retrieval
8.5.1 Data-centric XML retrieval
Data-centric XML mainly encodes numerical and non-text attribute- value data
When querying data-centric XML, we want to impose exact match conditions in most cases
An example of data-centric XML retrieval
Find employees whose salary is the same this month as it was 12months ago
This query requires
no ranking
exact matching

8.5.2 XML retrieval with joins and ordering constraints.
The query for employees with unchanged salary requires a join
The following query imposes an ordering constraint:
Retrieve the chapter of the book Introduction to algorithms that follows the chapter IR

About relational databases retrieval
Relational databases are better equipped to handle many structural constraints, particularly joins although ordering is also difficult in a database framework
The tuples of a relation in the relational calculus are not ordered
For this reason, most data-centric XML retrieval systems are extensions of relational databases
If text fields are short, exact matching meets user needs and retrieval results in form of unordered sets are acceptable, then using a relational database for XML retrieval is appropriate

XQuery
A query language proposed for standardization by the W3C
A powerful query language for XML that can handle numerical attributes, joins and ordering constraints
Due to its complexity, it is challenging to implement an XQuery-based ranked retrieval system
This is currently one of the most active areas of research in XML retrieval

[此贴子已经被作者于2010-12-14 09:28:23编辑过]

主题：[推荐]第九部分课堂讲义——XML检索