Semantic Thumbnails

(2005-02-01 13:40:47)

Semantic Thumbnails - Summarizing XML Documents and Collections

Extended Abstract

Keywords: Content Repurposing, Content Management, Conversion, Data representation, Graphic, Ontology, Search, Semantic Web, Document semantics, Document summarization, Thumbnails

Dr Mehmet Dalkilic
Assistant Professor
Indiana University
School of Informatics
Bloomington
Indiana
United States of America

Biography

Mehmet M. Dalkilic is an Assistant Professor in the School of Informatics at Indiana University Bloomington. He also works in the Center for Genomics and Bioinformatics. His primary interests are data mining, data integration in biology, motif discovery, and streams.

Dr Arijit Sengupta
Assistant Professor
Indiana University
Kelley School of Business
Bloomington
Indiana
United States of America

Biography

Dr. Sengupta has a Ph.D. in Computer Science from Indiana University. He served as the director of Educational Development in Computer Science at Indiana University, and Assistant Professor in Computer Information Systems at Georgia State University before joining the Kelley School of Business. His main area of research is databases for complex structured documents. He has authored and published many papers in this area. Dr. Sengupta is also the developer of DocBase, a system for document querying, as well as query languages DSQL (Document SQL) for querying XML documents using SQL, and QBT (Query By Templates) for querying XML documents using a graphical form. Dr. Sengupta has presented papers in many conferences including XML03, SGML95, SGML96, WITS2002, CAiSE2002, DSI 2002.


Abstract


The concept of thumbnails is common in image representation. A thumbnail is a highly compressed version of an image that provides a small, yet complete visual representation to the human eye. We propose the adaptation of the concept of thumbnails to the domain of documents, whereby a thumbnail of any document can be generated from its semantic content, providing an adequate amount of information about the documents. However, unlike image thumbnails, document thumbnails are mainly for the consumption of software such as search engines, and other content processing systems. With the advent of the semantic web, the requirement for machine processing of documents has become extremely important. We give particular attention to electronic documents in XML and in RDF/XML, with a view towards the processing of documents in the semantic web.


Table of Contents


1. Introduction
2. Literature Review
3. Semantic Thumbnails
4. System and Program Architecture
5. Evaluation
6. Conclusion and Future Work
Acknowledgements
Bibliography
Footnotes

1. Introduction

In the last few decades, improvements in technology have allowed us to collect and store both data and information [1] inexpensively and easily. The Internet has, similarly, allowed inexpensive and easy "publishing" of these data and information. The grim side of this new information space is that navigation or search is usually difficult, because it is conducted by moving through text--most often a serial list of phrases (and the respective documents in which they appear) that contain matches to keywords initially provided by the user [2]. The obvious motivation here is a chain of relationships: the semantic content of the document is related to certain keywords (syntactic elements) which are, in turn, related to the search terms provided by the user. Unfortunately, as we all have experienced, this connection provides a very high selectivity, but very low specificity. The user is forced to wade through many thousands, if not tens or hundreds of thousands, of documents, navigating by keywords alone.

A similar, smaller version of this problem exists on personal computers. Popular operating systems try to help by iconifying the data type, e.g., associating a cup and smokelike swirls with a Java program, an MS Word document with a sheet of paper and a 'W', and so forth. The idea motivating this is that the user can quickly scan different documents and choose the most meaningful. This is improved slightly by adding the file name to the icon--the user being able to make a better evaluation of the contents with little more overhead.

When searching for images instead of text, the visual element becomes even more useful and pronounced. An image is compressed into an image thumbnail (IT) that, in a very small footprint, can usually provide enough visual information to the user that a good guess as to its content can be reasonably made. Generating image thumbnails involves symmetric compression of pixels so that in spite of the loss in clarity, the thumbnail still keeps the basic shape and aspect of the original image. We believe that a document can likewise have a semantic thumbnail (ST): a "semantically" compressed representation of a document's content that would provide enough meaning to the user, so that it could be visually inspected quickly. We are interested in bringing together the best of both visual and textual search. Like an IT, an ST should have a small footprint, and present enough information to make navigation concerning the content simple. We are also motivated by a current project in bioinformatics that requires a significant amount of text searching. We decided to implement STs as part of this text search component and to develop that component as a system in its own right, called BioKnOT [Dalkilic and Costello, 2004].

There have been attempts to treat document thumbnailing as a graphical problem, but from our point of view, these do not provide enough semantics (discussed below). The work on the Resource Description Framework (RDF, discussed briefly below) provided us with the inspiration to create what amounts to mini-ontologies for the documents. As discussed more fully below, an ontology is a collection of entities (terms) and their interrelationships.

The concept of thumbnails has been extended for use with textual documents, potentially with embedded images. The research on document thumbnailing essentially treats the original document as an image representing the snapshot of the document when it is viewed, and uses the same compression techniques for image thumbnailing. This can be used for the purpose of quick summarization [Ogden et al, 1998], representation of search results [Ogden and Davis, 2000], enhanced browsing and scrolling using page thumbnails [Adobe 2003], understanding documents in other languages [Ogden, 1999], as well as for the purpose of interactive browsing [Lin and Hovy, 2002].

Treating documents as images, however, only summarizes layout, and not content. While this is adequate for the purpose of human viewing and browsing, this method of thumbnailing is not appropriate for deriving any semantic content from the thumbnail. In this paper, we adapt the concept of thumbnailing more for the purpose of capturing the semantics of the documents, rather than the layout.

The advent of the semantic web [Berners-Lee et al, 2001] provides an additional motivation for this work. Unlike the current world-wide web, documents in the semantic web are interlinked semantically, and search techniques for this new web will need to adequately, yet efficiently, use such embedded semantic information. For large document repositories, ontologies embedded in STs enable semantic applications (such as search engines) to quickly make retrieval decisions even without indexing.

Automated document keywording and summarizing is not an entirely new concept. Content analysis of documents is a common task for search engines, especially search engines that do not create full-text indices. Many word processing tools include facilities for automated summarization. In such approaches, frequently occurring keywords are identified and the sentences of the document are ranked. A summary is then generated by picking the sentences that contain the largest number of keywords. The problem with this method is that it generates summaries that are highly irregular, and although they give the appearance of being a readable document, they do not provide enough information for machine consumption.
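The keyword-frequency approach to summarization described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the stop-word list, the tokenizer, and the function names `tokenize` and `summarize` are hypothetical and do not correspond to any particular word processing tool.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "are", "for"}


def tokenize(text):
    """Return lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, num_keywords=3, num_sentences=1):
    """Pick the sentences that contain the most of the frequent keywords."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(tokenize(text))
    keywords = {word for word, _ in counts.most_common(num_keywords)}
    # Score each sentence by the number of distinct keywords it contains.
    scored = [(len(keywords & set(tokenize(s))), i, s) for i, s in enumerate(sentences)]
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:num_sentences]
    # Restore document order so the summary reads naturally.
    return [s for _, _, s in sorted(best, key=lambda t: t[1])]


text = ("Proteins bind genes. Genes encode proteins and proteins regulate genes. "
        "Cats sleep.")
summary = summarize(text, num_keywords=2, num_sentences=2)
print(summary)
```

The output is a list of sentences lifted verbatim from the text, which illustrates the limitation noted above: the result looks readable but carries no explicit structure that a program could exploit.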

The most important contribution of this work is in its implications for the next generation of semantic web systems where machines will be required to quickly process large sets of XML documents, often without the opportunity to index them ahead of time.

The rest of the paper is organized as follows. In Section 2, we discuss pertinent literature. In Section 3, we begin discussion of STs at a broad level and in Section 4, we present an overview of the system. Finally, we discuss proposed evaluation methods in Section 5, and conclude in Section 6.

2. Literature Review

Thumbnailing: Thumbnailing is primarily a visualization technique used for better interactive handling of large documents or document collections. Typically the thumbnail of an image representing the layout of the document is shown, potentially one image thumbnail (IT) for each page (e.g., [Adobe 2003]). The purpose of thumbnailing is primarily to retain the layout, since the thumbnails have no content information. The size of the thumbnails can be easily controlled by the user. Documents can also be thumbnailed by treating the layout of the documents as images (e.g., [Ogden and Davis, 2000], [Brin and Page, 1998]).

Summarization: Summarization is the process of extracting keywords or potentially complete sentences that capture the text of the document. No layout information is retained. Some of the semantics of the document is captured. Again, the size of the summaries can be controlled by the user. Several applications of summarization exist primarily in single and multiple document retrieval. See e.g., [Salton et al, 1994], [Salton and Yang, 1973], [Lin and Hovy, 2002].

Compression: Compression is the process of reducing the size of a document by algorithms that make use of the unused bit spaces and repetitions in the document. This is an altogether different dimension, since compression is usually lossless and reversible. The size of the compressed document depends on the algorithm used, and cannot be controlled by the user. Because of their textual nature, documents can be heavily compressed using standard compression methods (e.g., [Welch, 1984]). A particularly interesting observation is that because of the highly repetitive content of XML documents, compressing them can yield very high compression ratios [Liefke and Suciu, 2000], [Tolani and Haritsa, 2002].
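A small experiment illustrates why repetitive XML markup compresses so well. The sketch below uses Python's standard `zlib` module, which implements DEFLATE rather than the LZW method of [Welch, 1984] or the XML-specific compressors cited above; the record and its tag names are made up for illustration.

```python
import zlib

# Hypothetical record; the tag names are illustrative only.
record = "<entry><gene>BRCA1</gene><organism>Homo sapiens</organism></entry>"
xml_doc = ("<entries>" + record * 200 + "</entries>").encode("utf-8")

# Compress at the highest standard level.
compressed = zlib.compress(xml_doc, 9)
ratio = len(xml_doc) / len(compressed)

# Lossless and reversible: decompression restores the exact bytes.
restored = zlib.decompress(compressed)

print(f"{len(xml_doc)} bytes -> {len(compressed)} bytes (ratio {ratio:.1f})")
```

The compressed bytes are not readable by a person or searchable by a program without first being decompressed, which is the limitation that matters in this context.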

In order to properly motivate this research, we will consider current work being done in all three of the above areas. Compression is important because it plays a significant role in reducing bandwidth, although in this context it is less important since the compressed documents are not human readable. The goal of this work is to produce summaries of documents that are effectively readable by both human and machine, and that can capture a significant portion of document semantics. Figure 1 shows how the above three directions compare with our approach of semantic thumbnails. We briefly discuss some of the research in the above three areas below.

Figure 1 shows a graph that places the above three document reduction methods in a 2x2 quadrant. The graph shows that compression retains both the structure and the semantic information of the documents, but since compressed documents are not human readable, and require potentially processor-intensive decompression techniques to be usable, they are not suitable for fast searching and ranking. Document thumbnails are highly user-centric, and retain the document structure (layout), but they are not semantically rich, and cannot be used for machine-based automated retrieval. Document summarization provides an adequate amount of information for automated retrieval methods, but loses semantic knowledge embedded in the document. This leads to the conceptualization of STs that fill the void. STs provide semantically rich thumbnails of documents that can be used for user-centric, as well as machine-centric, retrieval, while retaining an adequate amount of the semantic information within the documents.

3. Semantic Thumbnails

As discussed above, STs have many potential applications, from visual searching by human agents to parsing, classification, and searching via machine agents. BioKnOT [Dalkilic and Costello, 2004] is a practical application of STs that illustrates how useful and effective they can be in the bioinformatics setting.

BioKnOT is an interactive document retrieval system that allows users to quickly and easily "drill down" on a topic. It implements the use of STs and also allows for iteration over document sets, which lets the user refine the specificity of a search. To aid discussion we present some notation. Formally, we have a set of documents D. By di we mean the ith document in D. Let Tf denote a set of terms from the documents of D, formally, [equation image img8.gif not available].

A semantic thumbnail for a document di in D is a directed, weighted graph Gi = <Vi, Ei> where Vi, the set of nodes, is a collection of terms, and an edge in Ei carries a pair of weights that reflect intra-sentence and inter-sentence significance. For this paper, we focus on the intra-sentence value (see Figure 2).
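One way to hold such a graph in memory is sketched below. The class name `SemanticThumbnail`, its fields, and the example terms and weights are our own illustrative choices, not part of BioKnOT; the only elements taken from the definition are the term nodes, the directed edges, and the pair of weights on each edge.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticThumbnail:
    """Directed, weighted graph for one document (hypothetical layout)."""
    doc_id: str
    nodes: set = field(default_factory=set)    # the set of terms
    edges: dict = field(default_factory=dict)  # (src, dst) -> (intra, inter)

    def add_edge(self, src, dst, intra, inter=0.0):
        """Add a directed edge carrying the (intra, inter) sentence weights."""
        self.nodes.update((src, dst))
        self.edges[(src, dst)] = (intra, inter)

    def neighbors(self, term):
        """Terms reachable from `term`, strongest intra-sentence weight first."""
        out = [(dst, w[0]) for (src, dst), w in self.edges.items() if src == term]
        return [dst for dst, _ in sorted(out, key=lambda pair: -pair[1])]


# Example with made-up terms and weights.
st = SemanticThumbnail("d1")
st.add_edge("apoptosis", "cell", 1.0, 0.3)
st.add_edge("apoptosis", "caspase", 2.5)
print(st.neighbors("apoptosis"))
```

Storing edges in a dictionary keyed by ordered term pairs keeps the direction explicit and makes the thumbnail easy to serialize, for instance as a list of triples.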

Since the STs are built dynamically and interactively centered on the user, we describe the process here and treat some of the important elements in detail further in the paper. STs are built by first identifying the important nodes by TFIDF, then by establishing the weight of the edges through analysis of nearness by looking through the corpus of selected documents.
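To make the node selection step concrete, the following sketch ranks the terms of one document by a textbook TFIDF score, tf * log(N / df). This weighting is an assumption on our part; the exact variant computed by the system is not specified here, and the function name `tfidf_top_terms` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, doc_index, k=3):
    """Rank the terms of docs[doc_index] by tf * log(N / df).

    `docs` is a list of token lists. Ties are broken alphabetically.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    tokens = docs[doc_index]
    tf = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


docs = [
    ["apoptosis", "cell", "death", "cell"],
    ["cell", "cycle"],
    ["gene", "cell"],
]
top = tfidf_top_terms(docs, 0, k=2)
print(top)
```

A term such as "cell" that occurs in every document receives a score of zero, so it is ranked below rarer terms even though it is the most frequent word in the first document.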

The generation of STs is initiated by a Boolean search provided by the user (see Figure 3). Those documents of D for which the Boolean function is true are used for Tf generation. From Tf, a scoring matrix [Korf, 2003] Sij is created that indicates numerically whether pairs of words i, j from Tf that occur within a certain reading frame of no more than 20 words (arrived at experimentally) are likely to be present other than by chance. The value is actually a log-odds ratio comparing observed frequency to, in this case, a random model, which is found from the set of documents that meet the Boolean search criteria. Scoring matrices are a universal tool in sequence alignment techniques: they allow disparate, though related, molecules to be substituted for one another in a sequence of molecules, and therefore allow non-identical sequences to be compared (see [Dalkilic and Costello, 2004] for a more complete discussion). The scoring matrix is used to compare STs generated from the abstracts of the documents with an ST constructed from user input. The relationships are captured in the spirit of ontologies.
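A log-odds score of this kind can be sketched as follows. We assume, for illustration only, a background model in which the two terms occur independently with their observed frequencies in the selected documents, and we use the base-2 form s = log2(q_ij / (p_i * p_j)) familiar from sequence-alignment scoring matrices. The function name `log_odds_matrix` and the sample data are hypothetical, and the actual random model used by the system may differ.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(docs, terms, window=20):
    """Score term pairs that co-occur within `window` words.

    `docs` is a list of token lists. Returns {(t1, t2): score} with the
    pair in alphabetical order.
    """
    terms = set(terms)
    pair_counts = Counter()
    term_counts = Counter()
    for tokens in docs:
        # Keep only the positions of the terms of interest.
        hits = [(pos, tok) for pos, tok in enumerate(tokens) if tok in terms]
        term_counts.update(tok for _, tok in hits)
        for (p1, t1), (p2, t2) in combinations(hits, 2):
            # Both words must fall inside one reading frame of `window` words.
            if t1 != t2 and p2 - p1 < window:
                pair_counts[tuple(sorted((t1, t2)))] += 1
    total_pairs = sum(pair_counts.values())
    total_terms = sum(term_counts.values())
    scores = {}
    for (t1, t2), n in pair_counts.items():
        observed = n / total_pairs
        # Assumed background: independent occurrence of the two terms.
        expected = (term_counts[t1] / total_terms) * (term_counts[t2] / total_terms)
        scores[(t1, t2)] = math.log2(observed / expected)
    return scores


docs = [
    ["caspase", "apoptosis"],
    ["caspase", "apoptosis"],
    ["caspase", "apoptosis"],
    ["kinase", "apoptosis"],
]
scores = log_odds_matrix(docs, {"caspase", "apoptosis", "kinase"})
print(scores)
```

A positive score means the pair appears together more often than the background model predicts, and pairs separated by 20 or more positions are never counted.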

4. System and Program Architecture

BioKnOT (Figure 4) consists of five core interfaces that interact with a document database. The mode of communication from client to server is CGI with Perl 5.8.0. The DBI Perl module was used to interact with the database.

Figure 4 shows an illustration of the flow of the program. First the user enters a query on the initial search page. (1) A TFIDF calculation is done on the initially searched documents and the user is asked to rank these terms on the filter page. Next (2), the user is asked to enter a few sentences stating what type of document is being searched for. (3) The scoring matrix is built and term relationships are constructed, and then the user is asked to supply these relationships with a score. (4) All the documents are scored and returned based on rank to the results page, which supplies the user with document data, an illustration of the term relationships, and the URL to the document itself. The results page also serves as the refinement page, which (5) allows the user to iterate over the search with a more specific set of data, based on selected documents instead of a random model.

The first web-based interface is the initial query page. Here the user enters search terms combined with Boolean logic, and a query is dynamically created to search the abstract and title fields of the document database.
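The dynamic construction of such a query can be sketched as below. The system itself uses Perl with the DBI module; this Python and SQLite version is only an analogue, and the table name `documents`, its columns, the function `build_query`, and the sample rows are all hypothetical. Parameter placeholders are used so that user-supplied terms are never spliced into the SQL text.

```python
import sqlite3


def build_query(terms, operator="AND"):
    """Return (sql, params) matching every/any term in title or abstract."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not terms:
        raise ValueError("at least one search term is required")
    clause = "(title LIKE ? OR abstract LIKE ?)"
    where = f" {operator} ".join([clause] * len(terms))
    # Two placeholders per term: one for the title, one for the abstract.
    params = [pattern for t in terms for pattern in (f"%{t}%", f"%{t}%")]
    return f"SELECT id FROM documents WHERE {where} ORDER BY id", params


# Hypothetical in-memory database with made-up records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
)
conn.executemany(
    "INSERT INTO documents (title, abstract) VALUES (?, ?)",
    [
        ("Apoptosis in yeast", "Caspase activity was measured."),
        ("Cell cycle", "No programmed death here."),
        ("Caspase review", "A survey of apoptosis pathways."),
    ],
)


def search(terms, operator="AND"):
    """Run the generated query and return the matching document ids."""
    sql, params = build_query(terms, operator)
    return [row[0] for row in conn.execute(sql, params)]


print(search(["apoptosis", "caspase"], "AND"))
```

The ids returned by such a query play the role of the set of documents that satisfy the Boolean search and feed the later TFIDF step.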

The random model for comparing documents consists of the set of documents that meet the Boolean search criteria, denoted Ds. The abstracts from set Ds are then pooled into a text file that will be used for the TFIDF calculations. This text file is passed to LUCAS. LUCAS is written in Java and communicates with BioKnOT through the SOAP protocol. Behind LUCAS is a term representation database, which is needed for the inverse document frequency calculations. The top 50 words returned by LUCAS, denoted Tf, are used to create the filter page.

The filter page, which places set Tf into an HTML form, asks the user to select from Tf the terms most relevant to the search. This set of user-selected terms, denoted Tu, is stored in hidden HTML fields and the user is given two options to proceed. First, the user can select the "Quick Search" option, which will bring the user directly to the results page, or second, the user can select the "Enter More User Input" option, which will prompt the user to enter more data for a more precise search.

The overall concept behind BioKnOT is to supply the user with an effective interactive way to find documents related to very specific search criteria. Knowing this, one of the most important features of BioKnOT is to allow the user to do an iterative search over the database with more strictly defined search criteria.

The results page also provides the user with a means to further narrow down a search by selecting documents that are related to the user's specifications. After these documents have been selected, LUCAS is called again, but instead of the large, broad sample set that was used on the first pass of the search, a very specific user-selected set of documents is used to create the term filter page.

The process is then started all over again and can be run indefinitely.

BioKnOT saves the state of the user's previous searches and passes that information along to the scoring of the documents. The user is supplied this information and can change it at any time during the search. Figure 3 shows a screen image of different parts of the system.

5. Evaluation

An initial pilot of BioKnOT was performed using data from the PubMed Life Sciences journals database (http://www.pubmedcentral.nih.gov/) and the Gene Ontology (GO) Consortium (http://www.geneontology.org). The results from the pilot study were encouraging, and demonstrated that even without the presence of XML tags, the semantic thumbnails contain an adequate amount of information regarding the document keywords. In addition, the relationships in the generated ontology add more semantic knowledge regarding the documents. With the presence of XML tags, however, the ontologies generated become much more precise. A full study of the quality of the STs generated is currently being prepared. In this study, novice users will be given a collection of retrieval tasks. We will perform a between-groups study with one group having the semantic thumbnail information, and one group with only generated keywords. The efficiency (task completion speed) and accuracy (task answer correctness) will then be statistically compared to measure differences. The results from the pilot indicate that potential differences in user perspective do exist.

6. Conclusion and Future Work

We have presented a framework for document summarization utilizing the semantic content embedded in documents. This summarization, which we call Semantic Thumbnails (STs), provides a means for visualizing and comparing document content at a high level. These thumbnails capture more semantic information from documents than purely graphical representations of search results or visual representations of the layout of the documents.

The generated thumbnails have a number of highly desirable properties. First of all, semantic thumbnailing is closed in the document format, e.g., the generated structure for an RDF document is valid RDF, although the summary documents do not correspond to the original RDF schema. The most important aspect of this summarization strategy is in its accuracy of recall for purely keyword-based searches.

For future work, we are investigating techniques other than term frequencies for deriving the semantic content. Also, we are implementing a method for automatically generating the document STs without user interaction. We are also in the process of developing STs for RDF/XML documents by utilizing the ontologies already embedded in such documents. We are also in the process of generating our own TFIDF repository (instead of LUCAS) -- presumably from bioinformatics documents to more closely reflect the domain. Lastly, we are implementing a temporal component of the Semantic Thumbnails to take into account the timeliness of the content of the documents.

Acknowledgements

We thank Dr. Dennis Groth and Dr. Javed Mostafa for their valuable comments during the process of developing the work.

Bibliography

[Adobe 2003]
Adobe Systems, San Jose, CA, USA. Adobe Reader 6.0 for Windows and Macintosh User Manual, 2003.
[Berners-Lee et al, 2001]
T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, May 2001.
[Brin and Page, 1998]
S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
[Dalkilic and Costello, 2004]
M. Dalkilic and J. Costello. BioKnOT: Biological knowledge through ontologies and TFIDF. In Proceedings, Workshop on Search and Discovery in Bioinformatics, SIGIR-Bio, 2004.
[Korf, 2003]
I. Korf, M. Yandell, and J. Bedell. Blast. O'Reilly and Associates, 2003.
[Liefke and Suciu, 2000]
H. Liefke and D. Suciu. XMill: an efficient compressor for XML data. In Proceedings, ACM SIGMOD 2000, SIGMOD RECORD 29(2), pages 153-164, 2000.
[Lin and Hovy, 2002]
C.-Y. Lin and E. Hovy. From single to multi-document summarization: a prototype system and its evaluation. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, USA, 2002.
[Ogden, 1999]
W. Ogden. Getting information from documents you cannot read: An interactive cross-language text retrieval and summarization system. In SIGIR/DL Workshop on Multilingual Information Discovery and Access, Aug. 1999.
[Ogden et al, 1998]
W. C. Ogden, M. W. Davis, and S. Rice. Document thumbnail visualization for rapid relevance judgments: When do they pay off? In Text REtrieval Conference, pages 528-534, 1998.
[Ogden and Davis, 2000]
W. C. Ogden and M. W. Davis. Improving cross-language text retrieval with human interactions. In Proceedings, HICSS, 2000.
[Salton et al, 1994]
G. Salton, J. Allan, C. Buckley, and A. Singhal. Automatic analysis, term generation and summarization of machine readable texts. Science, 264:1421-1426, June 1994.
[Salton and Yang, 1973]
G. Salton and C. Yang. On the specification of term values in automatic indexing. Journal of Documentation, 29:351-372, April 1973.
[Tolani and Haritsa, 2002]
P. Tolani and J. R. Haritsa. XGRIND: A query-friendly XML compressor. In ICDE, 2002.
[Welch, 1984]
T. Welch. A technique for high-performance data compression. IEEE Computer, 17(6):8-19, 1984.

Footnotes

  1. We draw a distinction between data, e.g., FASTA format for genomics data, and information, e.g., an experimental biology paper on, say, apoptosis (programmed cell death). Information is much richer, semi-structured or unstructured, and much more difficult to search.

  2. In this work we are not interested in examining how the documents themselves are ordered, say, through link analysis.
