Heflin, Jeff and Song, Dezhao. Ontology Instance Linking: Towards Interlinked Knowledge Graphs. The Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). Phoenix, AZ, USA. AAAI Press. February 2016.
Abstract:
Due to the decentralized nature of the Semantic Web, the same real-world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. In order to facilitate data utilization and consumption in the Semantic Web, without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process is sometimes referred to as Entity Coreference, i.e., finding which identifiers refer to the same real-world entity. In this paper, we first summarize state-of-the-art algorithms in detecting such coreference relationships between ontology instances. We then discuss various techniques in scaling entity coreference to large-scale datasets. Finally, we present well-adopted evaluation datasets and metrics, and compare the performance of the state-of-the-art algorithms on such datasets.
Priya, Sambhawa, Jiang, Guoqian, Dasari, Surendra, Zimmermann, Michael T., Wang, Chen, Heflin, Jeff and Chute, Christopher G. A Semantic Web-based System for Mining Genetic Mutations in Cancer Clinical Trials. AMIA 2015 Joint Summits on Translational Science (Clinical Research Informatics). San Francisco, CA. 2015.
Abstract:
Textual eligibility criteria in clinical trial protocols contain important information about potential clinically relevant pharmacogenomic events. Manual curation for harvesting this evidence is intractable, as it is error-prone and time-consuming. In this paper, we develop and evaluate a Semantic Web-based system that captures and manages mutation evidence and related contextual information from cancer clinical trials. The system has two main components: an NLP-based annotator and a Semantic Web ontology-based annotation manager. We evaluated the performance of the annotator in terms of precision and recall. We demonstrated the usefulness of the system by conducting case studies in retrieving relevant clinical trials using a collection of mutations identified from TCGA Leukemia patients and the Atlas of Genetics and Cytogenetics in Oncology and Haematology. In conclusion, our system using Semantic Web technologies provides an effective framework for the extraction, annotation, standardization and management of genetic mutations in cancer clinical trials.
Song, Dezhao, Kim, Edward, Huang, Xiaolei, Patruno, Joseph E., Munoz-Avila, Hector, Heflin, Jeff, Long, Rodney and Antani, Sameer. Multi-modal Entity Coreference for Cervical Dysplasia Diagnosis. IEEE Transactions on Medical Imaging (IEEE TMI), 34(1). IEEE. 2015. pp. 229-245.
Abstract:
Cervical cancer is the second most common type of cancer for women. Existing screening programs for cervical cancer, such as Pap Smear, suffer from low sensitivity. Thus, many patients who are ill are not detected in the screening process. Using images of the cervix as an aid in cervical cancer screening has the potential to greatly improve sensitivity, and can be especially useful in resource-poor regions of the world. In this work, we develop a data-driven computer algorithm for interpreting cervical images based on color and texture. We are able to obtain 74% sensitivity and 90% specificity when differentiating high-grade cervical lesions from low-grade lesions and normal tissue. On the same dataset, using Pap tests alone yields a sensitivity of 37% and specificity of 96%, and using the HPV test alone gives 57% sensitivity and 93% specificity. Furthermore, we develop a comprehensive algorithmic framework based on Multi-Modal Entity Coreference for combining various tests to perform disease classification and diagnosis. When integrating multiple tests, we adopt information gain and gradient-based approaches for learning the relative weights of different tests. In our evaluation, we present a novel algorithm that integrates cervical images, Pap, HPV and patient age, which yields 83.21% sensitivity and 94.79% specificity, a statistically significant improvement over using any single source of information alone.
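As a rough illustration of the information-gain weighting mentioned in this abstract, the sketch below computes each test's gain with respect to the diagnosis labels and normalizes the gains into relative weights. All data and names here are hypothetical; the paper's actual feature extraction and its gradient-based alternative are not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(test_results, labels):
    """Reduction in label entropy after splitting on a test's outcomes."""
    total = len(labels)
    split = {}
    for outcome, label in zip(test_results, labels):
        split.setdefault(outcome, []).append(label)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in split.values())
    return entropy(labels) - remainder

# Hypothetical per-patient outcomes for three tests and the true diagnosis.
labels = ['high', 'high', 'low', 'low', 'high', 'low']
tests = {
    'image': [1, 1, 0, 0, 1, 1],
    'pap':   [0, 1, 0, 0, 1, 0],
    'hpv':   [1, 0, 0, 1, 1, 0],
}

# Normalize the gains so they can serve as relative weights in a combined score.
gains = {name: information_gain(results, labels) for name, results in tests.items()}
total_gain = sum(gains.values())
weights = {name: g / total_gain for name, g in gains.items()}
print(weights)
```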
Priya, Sambhawa, Guo, Yuanbo, Spear, Michael and Heflin, Jeff. Partitioning OWL Knowledge Bases for Parallel Reasoning. Eighth IEEE International Conference on Semantic Computing (ICSC 2014). Newport Beach, CA. 2014.
Abstract:
The ability to reason over large scale data and return responsive query results is widely seen as a critical step to achieving the Semantic Web vision. We describe an approach for partitioning OWL Lite datasets and then propose a strategy for parallel reasoning about concept instances and role instances on each partition. The partitions are designed such that each can be reasoned over independently to find answers to each query subgoal, and when the results are unioned together, a complete set of results is found for that subgoal. Our partitioning approach has a polynomial worst case time complexity in the size of the knowledge base. In our current implementation, we partition semantic web datasets and execute reasoning tasks on partitioned data in parallel on independent machines. We implement a master-slave architecture that distributes a given query to the slave processes on different machines. All slaves run in parallel, each performing sound and complete reasoning to execute each subgoal of its query on its own set of partitions. As a final step, the master joins the results computed by the slaves. We study the impact of our parallel reasoning approach on query performance and show some promising results on LUBM data.
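The master-slave evaluation strategy described above can be summarized in a small sketch: workers answer a query subgoal against their own partitions, the per-partition answers are unioned (which, by the stated partitioning property, is complete for that subgoal), and the master joins the subgoal results. This is an illustrative toy, not the paper's implementation: threads stand in for independent machines, and the partitions, predicates, and `answer_subgoal` function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each maps a predicate to a set of (subject, object) pairs.
partitions = [
    {'advisor': {('s1', 'p1')}, 'takesCourse': {('s1', 'c1')}},
    {'advisor': {('s2', 'p1')}, 'takesCourse': {('s2', 'c2'), ('s1', 'c2')}},
]

def answer_subgoal(partition, predicate):
    """Stand-in for sound-and-complete reasoning over one partition."""
    return partition.get(predicate, set())

def evaluate_subgoal(predicate):
    """Union the per-partition answers; by construction of the partitioning,
    the union is a complete answer set for the subgoal."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(answer_subgoal, partitions, [predicate] * len(partitions))
    return set().union(*results)

# Master: evaluate each subgoal in parallel, then join on the shared variable.
advisors = evaluate_subgoal('advisor')
courses = evaluate_subgoal('takesCourse')
joined = {(s, p, c) for (s, p) in advisors for (s2, c) in courses if s == s2}
print(joined)
```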
Zhang, Xingjian, Song, Dezhao, Priya, Sambhawa, Daniels, Zachary, Reynolds, Kelly and Heflin, Jeff. Exploring Linked Data with Contextual Tag Clouds. Journal of Web Semantics, 24. January 2014. pp. 33-39.
Abstract:
In this paper we present the contextual tag cloud system: a novel application that helps users explore a large scale RDF dataset. Unlike folksonomy tags used in most traditional tag clouds, the tags in our system are ontological terms (classes and properties), and a user can construct a context with a set of tags that defines a subset of instances. Then in the contextual tag cloud, the font size of each tag depends on the number of instances that are associated with that tag and all tags in the context. Each contextual tag cloud serves as a summary of the distribution of relevant data, and by changing the context, the user can quickly gain an understanding of patterns in the data. Furthermore, the user can choose to include RDFS taxonomic and/or domain/range entailment in the calculations of tag sizes, thereby understanding the impact of semantics on the data. In this paper, we describe how the system can be used as a query building assistant, a data explorer for casual users, or a diagnosis tool for data providers. To resolve the key challenge of how to scale to Linked Data, we combine a scalable preprocessing approach with a specially-constructed inverted index, use three approaches to prune unnecessary counts for faster online computations, and design a paging and streaming interface. Together, these techniques enable a responsive system that, in particular, hosts a dataset with more than 1.4 billion triples and over 380,000 tags. Via experimentation, we show how much our design choices benefit the responsiveness of our system.
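A minimal sketch of the core computation, assuming a toy instance-to-tag mapping: the context selects the instances carrying every context tag, each remaining tag is counted over that set, and counts map to font sizes on a log scale. The index layout and the `font_size` scaling are assumptions for illustration, not the paper's pruning-and-paging infrastructure.

```python
import math
from collections import defaultdict

# Hypothetical data: each instance is annotated with ontological terms (tags).
instance_tags = {
    'inst1': {'Person', 'Athlete', 'birthPlace'},
    'inst2': {'Person', 'Politician', 'birthPlace'},
    'inst3': {'Person', 'Athlete'},
    'inst4': {'Place', 'areaTotal'},
}

# Inverted index: tag -> set of instances carrying it.
index = defaultdict(set)
for inst, tags in instance_tags.items():
    for tag in tags:
        index[tag].add(inst)

def contextual_counts(context):
    """For a context (a set of tags), count co-occurring instances per tag."""
    in_context = set.intersection(*(index[t] for t in context)) if context \
                 else set(instance_tags)
    counts = defaultdict(int)
    for inst in in_context:
        for tag in instance_tags[inst] - set(context):
            counts[tag] += 1
    return counts

def font_size(count, min_px=10, max_px=40, max_count=4):
    """Map a count to a font size, log-scaled as tag clouds typically are."""
    return min_px + (max_px - min_px) * math.log1p(count) / math.log1p(max_count)

for tag, n in contextual_counts({'Person'}).items():
    print(tag, n, round(font_size(n)))
```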
Song, Dezhao and Heflin, Jeff. Domain-Independent Entity Coreference for Linking Ontology Instances. ACM Journal of Data and Information Quality (ACM JDIQ), 4(2). ACM. 2013. pp. 1-29.
Abstract:
The objective of entity coreference is to determine if different mentions (e.g., person names, place names, database records, ontology instances, etc.) refer to the same real-world object. Entity coreference algorithms can be used to detect duplicate database records and to determine if two Semantic Web instances represent the same underlying real-world entity. The key issues in developing an entity coreference algorithm include how to locate context information and how to utilize the context appropriately. In this paper, we present a novel entity coreference algorithm for ontology instances. For scalability reasons, we select a neighborhood of each instance from an RDF graph. To determine the similarity between two instances, our algorithm computes the similarity between comparable property values in the neighborhood graphs. The similarity of distinct URIs and blank nodes is computed by comparing their outgoing links. In an attempt to reduce the impact of distant nodes on the final similarity measure, we explore a distance-based discounting approach. To provide the best possible domain-independent matches, we propose an approach to compute the discriminability of triples in order to assign weights to the context information. We evaluated our algorithm using different instance categories from five datasets. Our experiments show that the best results are achieved by including both our discounting and triple discrimination approaches.
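The sketch below illustrates the two ideas of distance-based discounting and discriminability weighting under simplifying assumptions (exact-match value similarity, a flat list of (depth, predicate, value) context triples); the paper's actual similarity functions and weight computation are more involved.

```python
# A hedged sketch: neighborhoods are lists of (depth, predicate, value) triples
# reachable from an instance in its RDF context graph. All data is hypothetical.
def similarity(nbhd_a, nbhd_b, weights, alpha=0.5):
    """Compare comparable property values; discount by distance from the root."""
    score, norm = 0.0, 0.0
    for depth_a, pred_a, val_a in nbhd_a:
        for depth_b, pred_b, val_b in nbhd_b:
            if pred_a != pred_b:
                continue  # only comparable (same-predicate) values are compared
            w = weights.get(pred_a, 1.0)               # triple discriminability weight
            discount = alpha ** max(depth_a, depth_b)  # distant nodes count less
            sim = 1.0 if val_a == val_b else 0.0       # stand-in for string similarity
            score += w * discount * sim
            norm += w * discount
    return score / norm if norm else 0.0

a = [(1, 'name', 'J. Smith'), (2, 'city', 'Bethlehem')]
b = [(1, 'name', 'J. Smith'), (2, 'city', 'Allentown')]
weights = {'name': 0.9, 'city': 0.4}  # e.g., names discriminate more than cities
print(similarity(a, b, weights))
```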
Tao, Cui, Song, Dezhao, Sharma, Deepak and Chute, Christopher G. Semantator: Semantic Annotator for Converting Biomedical Text to Linked Data. Journal of Biomedical Informatics (JBI), 46(5). Elsevier. 2013. pp. 882-893.
Abstract:
More than 80% of biomedical data is embedded in plain text. The unstructured nature of these text-based documents makes it challenging to easily browse and query the data of interest in them. One approach to facilitate browsing and querying biomedical text is to convert the plain text to a linked web of data, i.e., converting data originally in free text to structured formats with defined meta-level semantics. In this paper, we introduce Semantator (Semantic Annotator), a semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be either annotated manually or semi-automatically using plug-in information extraction tools. The annotated results will be stored in RDF and can be queried using the SPARQL query language. In addition, semantic reasoners can be directly applied to the annotated data for consistency checking and knowledge inference. Semantator has been released online and has been used by the biomedical ontology community, which provided positive feedback. Our evaluation results indicated that 1) Semantator can perform the annotation functionalities as designed; 2) Semantator can be adopted in real applications in clinical and translational research; and 3) the annotated results using Semantator can be easily used in semantic-web-based reasoning tools for further inference.
Zhang, Xingjian and Heflin, Jeff. Using Instance Texts to Improve Keyword-based Class Retrieval. 2013 IEEE/WIC/ACM International Conference on Web Intelligence (WI2013). Atlanta, GA. IEEE. November 2013.
Abstract:
In this paper we investigate the keyword-based class retrieval problem, which we define as how to identify ontological classes that best match a keyword-based query. Most previous applications use simple syntactic matching approaches on the class labels and/or comments, or expand the keyword query by using lexicons such as WordNet, but fail to retrieve relevant resources in many scenarios. Instead of relying on external sources, we investigate this problem by using the annotations of instances associated with classes in the knowledge base. We propose a general framework of this approach, which consists of two phases: the keyword query is first used to locate relevant instances; then we induce the classes given this list of weighted matched instances. If we identify sufficient text for the instances, then the first phase can be solved by a traditional information retrieval (IR) query; however, the second phase might be cast in different ways: as an additive value function, as an IR problem with instances as queries, or as an instance-based ontology alignment problem. With many applicable strategies initiated from different viewpoints, we find that some of them are mathematically equivalent or very similar. In the experiments we compare our proposed framework to simple syntactic approaches and evaluate different strategies.
Zhang, Xingjian, Song, Dezhao, Priya, Sambhawa and Heflin, Jeff. Infrastructure for Efficient Exploration of Large Scale Linked Data via Contextual Tag Clouds. International Semantic Web Conference. Sydney, Australia. 2013.
Zhang, Xingjian, Song, Dezhao, Priya, Sambhawa and Heflin, Jeff. Infrastructure for Efficient Exploration of Large Scale Linked Data via Contextual Tag Clouds. Technical Report LU-CSE-13-002. Dept. of Computer Science and Engineering, Lehigh University. 2013.
Abstract:
In this paper we present the infrastructure of the contextual tag cloud system which can execute large volumes of queries about the number of instances that use particular ontological terms. The contextual tag cloud system is a novel application that helps users explore a large scale RDF dataset: the tags are ontological terms (classes and properties), the context is a set of tags that defines a subset of instances, and the font sizes reflect the number of instances that use each tag. It visualizes the patterns of instances specified by the context a user constructs. Given a request with a specific context, the system needs to quickly find what other tags the instances in the context use, and how many instances in the context use each tag. The key question we answer in this paper is how to scale to Linked Data; in particular we use a dataset with 1.4 billion triples and over 380,000 tags. This is complicated by the fact that the calculation should, when directed by the user, consider the entailment of taxonomic and/or domain/range axioms in the ontology. We combine a scalable preprocessing approach with a specially-constructed inverted index and use three approaches to prune unnecessary counts for faster intersection computations. We compare our system with a state-of-the-art triple store, examine how pruning rules interact with inference and analyze our design choices.
Frye, Lisa, Cheng, Liang and Heflin, Jeff. An Ontology-Based System to Identify Complex Network Attacks. First IEEE International Workshop on Security and Forensics in Communication Systems, part of IEEE International Conference on Communications 2012. Ottawa, Canada. 2012.
Abstract:
Intrusion Detection Systems are tools used to detect attacks against networks. Many of these attacks are a sequence of multiple simple attacks. These complex attacks are more difficult to identify because (a) they are difficult to predict, (b) almost anything could be an attack, and (c) there are a huge number of possibilities. The problem is that the expertise of what constitutes an attack lies in the tacit knowledge of experienced network engineers. By providing an ontological representation of what constitutes a network attack, human expertise can be codified and tested. The details of this representation are explained. An implementation of the representation has been developed. Lastly, the use of the representation in an Intrusion Detection System for complex attack detection has been demonstrated using use cases.
Song, Dezhao, Chute, Christopher G. and Tao, Cui. Semantator: Annotating Clinical Narratives with Semantic Web Ontologies. AMIA Summit on Clinical Research Informatics (CRI2012). San Francisco, CA, USA. March 2012. pp. 47-56.
Abstract:
To facilitate clinical research, clinical data needs to be stored in a machine-processable and understandable way. Manually annotating clinical data is time-consuming. Automatic approaches (e.g., Natural Language Processing systems) have been adopted to convert such data into structured formats; however, the quality of such automatically extracted data may not always be satisfactory. In this paper, we propose Semantator, a semi-automatic tool for document annotation with Semantic Web ontologies. With a loaded free text document and an ontology, Semantator supports the creation/deletion of ontology instances for any document fragment, linking/disconnecting instances with the properties in the ontology, and also enables automatic annotation by connecting to the NCBO annotator and cTAKES. By representing annotations in Semantic Web standards, Semantator supports reasoning based upon the underlying semantics of the owl:disjointWith and owl:equivalentClass predicates. We present discussions based on user experiences of using Semantator.
Song, Dezhao and Heflin, Jeff. Accuracy vs. Speed: Scalable Entity Coreference on the Semantic Web with On-the-Fly Pruning. The 2012 IEEE/WIC/ACM International Conference on Web Intelligence (WI2012). Macau, China. IEEE. December 2012.
Abstract:
One challenge for the Semantic Web is to scalably establish high-quality owl:sameAs links between coreferent ontology instances in different data sources; traditional approaches that exhaustively compare every pair of instances do not scale well to large datasets. In this paper, we propose a pruning-based algorithm for reducing the complexity of entity coreference. First, we discard candidate pairs of instances that are not sufficiently similar to the same pool of other instances. A sigmoid-function-based thresholding method is proposed to automatically adjust the threshold for such commonality on-the-fly. In our prior work, each instance is associated with a context graph consisting of neighboring RDF nodes. In this paper, we speed up the comparison for a single pair of instances by pruning insignificant context in the graph; this is accomplished by evaluating its potential contribution to the final similarity measure. We evaluate our system on three Semantic Web instance categories. We verify the effectiveness of our thresholding and context pruning methods by comparing to nine state-of-the-art systems. We show that our algorithm frequently outperforms those systems with a runtime speedup factor of 18 to 24 while maintaining competitive F1-scores. For datasets of up to 1 million instances, this translates to as much as 370 hours improvement in runtime.
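The abstract does not give the exact thresholding function, so the sketch below only shows the shape of the idea: an adaptive commonality threshold driven by a sigmoid, used to discard candidate pairs on-the-fly. The constants and the `keep_pair` criterion are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def commonality_threshold(pool_size, base=0.6, k=0.05, midpoint=50):
    """Hypothetical on-the-fly threshold: the larger the pool of other
    instances both candidates are compared against, the stricter the
    required commonality."""
    return base * sigmoid(k * (pool_size - midpoint))

def keep_pair(shared, pool_size):
    """Keep a candidate pair only if the fraction of the pool that both
    instances are similar to clears the adaptive threshold."""
    return shared / pool_size >= commonality_threshold(pool_size)

print(keep_pair(shared=30, pool_size=40))   # small pool, lenient threshold: kept
print(keep_pair(shared=30, pool_size=200))  # large pool, strict threshold: pruned
```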
Song, Dezhao and Heflin, Jeff. A Pruning Based Approach for Scalable Entity Coreference. The 25th International Florida Artificial Intelligence Research Society Conference (FLAIRS-25). Marco Island, Florida, USA. AAAI. May 2012. pp. 98-103.
Abstract:
Entity coreference is the process of deciding which identifiers (e.g., person names, locations, ontology instances, etc.) refer to the same real-world entity. In the Semantic Web, entity coreference can be used to detect equivalence relationships between heterogeneous Semantic Web datasets to explicitly link coreferent ontology instances via the owl:sameAs property. Due to the large scale of Semantic Web data today, we propose two pruning techniques for scalably detecting owl:sameAs links between ontology instances by comparing the similarity of their context graphs. First, a sampling based technique is designed to estimate the potential contribution of each RDF node in the context graph and prune insignificant context. Furthermore, a utility function is defined to reduce the cost of performing such estimations. We evaluate our pruning techniques on three Semantic Web instance categories. We show that the pruning techniques enable the entity coreference system to run 10 to 35 times faster than without them while still maintaining comparably good F1-scores.
Song, Dezhao and Heflin, Jeff. Scalable and Domain-Independent Entity Coreference: Establishing High Quality Data Linkages Across Heterogeneous Data Sources. The 11th International Semantic Web Conference. Boston, MA, USA. Springer. November 2012.
Abstract:
Due to the decentralized nature of the Semantic Web, the same real-world entity may be described in various data sources and assigned syntactically distinct identifiers. In order to facilitate data utilization in the Semantic Web, without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process can also be referred to as Entity Coreference, i.e., finding which identifiers refer to the same real-world entity. This proposal will investigate algorithms to solve this entity coreference problem in the Semantic Web in several aspects. The essence of entity coreference is to compute the similarity of instance pairs. Given the diversity of domains of existing datasets, it is important that an entity coreference algorithm be able to achieve good precision and recall across domains represented in various ways. Furthermore, in order to scale to large datasets, an algorithm should be able to intelligently select what information to utilize for comparison and determine whether to compare a pair of instances to reduce the overall complexity. Finally, appropriate evaluation strategies need to be chosen to verify the effectiveness of the algorithms.
Kim, Edward, Huang, Xiaolei and Heflin, Jeff. A Visual Image Persons Search Using a Content Property Reasoner and Web Ontology. IEEE International Conference on Multimedia and Expo (ICME). Barcelona, Spain. 2011.
Abstract:
We present a semantic-based search tool, VIPs (Visual Image Persons Search), for the domain of VIPs, i.e., very important people. Our tool explores the possibilities of content based image search supported by ontological reasoning. Our framework integrates information from both image processing algorithms and semantic knowledge bases to perform interesting queries that would otherwise be impossible. We describe a novel property reasoner that is able to translate low level image features into semantically relevant object properties. Finally, we demonstrate interesting searches supported by our framework on the domain of people, the majority of whom are movie celebrities, using the properties translated by our system as well as existing ontologies available on the web.
Lamiroy, Bart, Lopresti, Daniel, Korth, Henry F. and Heflin, Jeff. How Carefully Designed Open Resource Sharing Can Help and Expand Document Analysis Research. Document Recognition and Retrieval XVIII. San Francisco, CA. 2011.
Abstract:
Making datasets available for peer reviewing of published document analysis methods or distributing large commonly used document corpora for benchmarking are extremely useful and sound practices and initiatives. This paper shows that they cover only a tiny segment of the uses that shared and commonly available research data may have. We develop a completely new paradigm for sharing and accessing common data sets, benchmarks and other tools that is based on a very open and free community based contribution model. The model is operational and has been implemented so that it can be tested on a broad scale. The new interactions that will arise from its use may spark innovative ways of conducting document analysis research on the one hand, but create very challenging interactions with other research domains as well.
Li, Yingjie and Heflin, Jeff. Handling Cyclic Axioms in Dynamic, Web-Scale Knowledge Bases. SSWS2011. Bonn, Germany. Springer. 2011.
Abstract:
In recent years, there has been an explosion of publicly available Semantic Web data. In order to effectively integrate millions of small, distributed data sources and quickly answer queries, we previously proposed a tree structure query optimization algorithm that uses source selectivity of each query subgoal as the heuristic to plan the query execution and uses the most selective subgoals to provide constraints that make other subgoals selective. However, this constraint propagation is incomplete when the relevant ontologies contain cyclic axioms. Here, we propose an improvement to this algorithm that is complete for cyclic axioms, yet still able to scale to millions of data sources.
Song, Dezhao and Heflin, Jeff. Automatically Generating Data Linkages Using a Domain-Independent Candidate Selection Approach. 10th International Semantic Web Conference. Bonn, Germany. LNCS 7031. Springer. November 2011.
Abstract:
One challenge for Linked Data is scalably establishing high-quality owl:sameAs links between instances (e.g., people, geographical locations, publications, etc.) in different data sources. Traditional approaches to this entity coreference problem do not scale because they exhaustively compare every pair of instances. In this paper, we propose a candidate selection algorithm for pruning the search space for entity coreference. We select candidate instance pairs by computing a character-level similarity on discriminating literal values that are chosen using domain-independent unsupervised learning. We index the instances on the chosen predicates' literal values to efficiently look up similar instances. We evaluate our approach on two RDF and three structured datasets. We show that the traditional metrics do not always accurately reflect the relative benefits of candidate selection, and propose additional metrics. We show that our algorithm frequently outperforms alternatives and is able to process 1 million instances in under one hour on a single Sun Workstation. Furthermore, on the RDF datasets, we show that the entire entity coreference process scales well by applying our technique. Surprisingly, this high recall, low precision filtering mechanism frequently leads to higher F-scores in the overall system.
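As a rough sketch of index-based candidate selection, the code below indexes each instance by the character bigrams of one discriminating literal and returns, for a given instance, the others whose bigram overlap (Dice coefficient) clears a threshold. The bigram granularity, the Dice measure, and the threshold are illustrative assumptions, not the paper's exact method.

```python
from collections import defaultdict

def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

# Hypothetical instances keyed by URI, each with a discriminating literal
# (e.g., a name chosen by the unsupervised predicate-selection step).
instances = {
    'ex:p1': 'Jeff Heflin',
    'ex:p2': 'Dezhao Song',
    'ex:p3': 'Jeff Hefling',   # near-duplicate spelling
}

# Inverted index from character bigram to the instances containing it.
index = defaultdict(set)
for uri, literal in instances.items():
    for bg in bigrams(literal):
        index[bg].add(uri)

def candidates(uri, min_overlap=0.6):
    """Return instances sharing enough bigrams (Dice coefficient) with `uri`."""
    mine = bigrams(instances[uri])
    counts = defaultdict(int)
    for bg in mine:
        for other in index[bg]:
            if other != uri:
                counts[other] += 1
    return {o for o, c in counts.items()
            if 2 * c / (len(mine) + len(bigrams(instances[o]))) >= min_overlap}

print(candidates('ex:p1'))  # {'ex:p3'}: only the near-duplicate survives
```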
Yu, Yang and Heflin, Jeff. Detecting Abnormal Data for Ontology Based Information Integration. International Workshop on Semantic Technologies for Information-Integrated Collaboration. Philadelphia, PA, USA. IEEE. May 2011. pp. 431-438.
Abstract:
To better support information integration on Semantic Web data with varying degrees of quality, this paper proposes an approach to detect triples which reflect some sort of error. In particular, erroneous triples may occur due to factual errors in the original data source, misuse of the ontology by the original data source, or errors in the integration process. Although diagnosing such errors is a difficult problem, we propose that the degree to which a triple deviates from similar triples can be an important heuristic for identifying errors. We detect such "abnormal triples" by learning probabilistic rules from the reference data and checking to what extent these rules agree with the triples. The system consists of two components for two types of abnormal relational descriptions that a Semantic Web statement could have, whether accidentally or maliciously: a statement could relate two resources that are unlikely to have anything in common or an inappropriate predicate could be used to describe the relation between the two resources. The classification technique is adopted to learn statistical characteristics for detecting a suspect resource pair, i.e. there is no significant relation between the subject and the object in the statement. For the suspect usages of a predicate, the system learns semantic patterns for each predicate from indirect semantic connections between the subject / object pairs.
Yu, Yang and Heflin, Jeff. Extending Functional Dependency to Detect Abnormal Data in RDF Graphs. The 10th International Semantic Web Conference. Bonn, Germany. Springer. November 2011.
Abstract:
Data quality issues arise in the Semantic Web because data is created by diverse people and/or automated tools. In particular, erroneous triples may occur due to factual errors in the original data source, the acquisition tools employed, misuse of ontologies, or errors in ontology alignment. We propose that the degree to which a triple deviates from similar triples can be an important heuristic for identifying errors. Inspired by functional dependency, which has shown promise in database data quality research, we introduce value-clustered graph functional dependency to detect abnormal data in RDF graphs. To better deal with Semantic Web data, this extends the concept of functional dependency in several respects. First, there is the issue of scale, since we must consider the whole data schema instead of being restricted to one database relation. Second, it deals with multi-valued properties without the explicit value correlations that are specified as tuples in databases. Third, it uses clustering to consider classes of values. Focusing on these characteristics, we propose a number of heuristics and algorithms to efficiently discover the extended dependencies and use them to detect abnormal data. Experiments have shown that the system is efficient on multiple data sets and also detects many quality problems in real world data.
Yu, Yang, Li, Yingjie and Heflin, Jeff. Detecting Abnormal Semantic Web Data Using Semantic Dependency. Fifth IEEE International Conference on Semantic Computing (ICSC 2011). Stanford University, Palo Alto, CA, USA. September 2011.
Abstract:
Data quality is a critical problem for the Semantic Web. We propose that the degree to which a triple deviates from similar triples can be an important heuristic for identifying errors. Inspired by data dependency, which has shown promise in database data quality research, we introduce semantic dependency to assess the quality of Semantic Web data. The system first builds a summary graph for finding candidate semantic dependencies. Each semantic dependency has a probability according to its instantiations and is subsequently adjusted based on the inconsistencies among them. Then triples can get a posterior probability of normality based on what semantic dependencies can support each of them. Repeating the iteration above, the proposed approach detects abnormal Semantic Web data. Experiments have shown that the system is efficient on a data set with 10M triples and achieves more than a ten percent F-score improvement over our previous system.
Yu, Yang, Zhang, Xingjian and Heflin, Jeff. Learning to Detect Abnormal Semantic Web Data. K-CAP '11: Proceedings of the Sixth International Conference on Knowledge Capture. Banff, Canada. ACM. June 2011. pp. 177-178.
Abstract:
As different tools are used to capture knowledge from various sources, it is essential that we have approaches to assess the quality of this data. In this paper we focus on Semantic Web data and identifying potential erroneous relational descriptions between objects in triples. In particular, erroneous triples may occur due to factual errors in the original data source, the acquisition tools employed, misuse of ontologies or errors in ontology alignment. Although diagnosing such errors is a difficult problem, we propose that the degree to which a triple deviates from similar triples can be an important heuristic for identifying errors. We detect such "abnormal triples" by learning probabilistic rules from the reference data and checking to what extent these rules agree with the context of triples. The context for two objects in a triple is represented in a vector space in which each element is a distinct semantic connection between the objects. To deal with the open world assumption underlying Semantic Web data, the system uses three mechanisms. First, inspired by instance mapping, the system enriches the context by exploring similar instance pairs. Second, the system defines a novel semantic similarity between contexts which considers partial similarity between different semantic connections. Third, the system uses an unsupervised learning model which interprets non-existing triples in reference data as missing values. Finally, to reduce the learning time, the experiments demonstrate that a proposed sampling method is applicable.
Zhang, Xingjian and Heflin, Jeff. Using Tag Clouds to Quickly Discover Patterns in Linked Data Sets. Second International Workshop on Consuming Linked Data (COLD2011). 2011.
Abstract:
Casual users usually have knowledge gaps that prevent them from using a Knowledge Base (KB) effectively. This problem is exacerbated by KBs for linked data sets because they cover ontologies with diverse domains and the data is often incomplete with regard to the ontologies. We believe providing visual summaries of how instances use ontological terms (classes and properties) is a promising route to reveal patterns in the KB and quickly familiarize users with it. In this paper we propose a novel contextual tag cloud system that treats the ontological terms as tags and uses the font size of tags to reflect the number of instances related to the tags. As opposed to traditional tag clouds, which have a single view over all the data, our system has a dynamically generated set of tag clouds each of which shows proportional relations to a context specified as a tag set of classes and properties. Furthermore, our tags have a precise semantics enabling inference of tags. We optimize the infrastructure to enable scalable online computation. We give several examples of discoveries made about DBpedia using our system.
Korth, Henry F., Song, Dezhao and Heflin, Jeff. Metadata for Structured Document Datasets. Ninth International Workshop on Document Analysis Systems (DAS 2010) (short papers). 2010.
Abstract:
In order for a large dataset of documents to be usable by document analysts, the dataset must be searchable on document features and on the results of prior analytic work. This paper describes a work-in-progress to develop such a document repository. We describe the types of data we plan to maintain regarding both the documents themselves and analyses performed on those documents. By storing the provenance of all metadata pertaining to documents, the repository will allow researchers to determine dependency relationships among document analyses. Our ultimate goal is to enable geographically separated teams of researchers to collaborate in large document analysis efforts.
Li, Yingjie and Heflin, Jeff. A Scalable Indexing Mechanism for Ontology-Based Information Integration. International Conference on Web Intelligence (WI10). 2010.
Abstract:
In recent years, there has been an explosion of publicly available RDF and OWL web pages. Typically, these pages are small, heterogeneous and prone to change frequently. In order to effectively integrate them, we propose to adapt a query reformulation algorithm and combine it with an information retrieval inspired index in order to select all sources relevant to a query. We treat each RDF document as a bag of URIs and literals and build an inverted index. Our system first reformulates the user's query into a set of subgoals and then translates these into Boolean queries against the index in order to determine which sources are relevant. Finally, the selected data sources and the relevant ontology mappings are used in conjunction with a description logic reasoner to provide an efficient query answering solution for the Semantic Web. We have evaluated our system using ontology mappings and ten million real world data sources.
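A toy version of the "bag of URIs and literals" idea described above: build an inverted index from terms to documents, answer each reformulated subgoal as a Boolean AND over its constants, and union the per-subgoal source sets. The example sources and query are hypothetical.

```python
from collections import defaultdict

# Hypothetical data sources, each treated as a bag of the URIs and literals it mentions.
sources = {
    'doc1.rdf': {'univ:Professor', 'univ:teacherOf', '"Jeff Heflin"'},
    'doc2.rdf': {'univ:Student', 'univ:advisor', '"Jeff Heflin"'},
    'doc3.rdf': {'univ:Professor', 'univ:advisor'},
    'doc4.rdf': {'univ:Course', 'univ:name'},
}

# Inverted index: term -> set of documents mentioning it.
index = defaultdict(set)
for doc, terms in sources.items():
    for term in terms:
        index[term].add(doc)

def select_sources(subgoal_terms):
    """Boolean AND query: a source is relevant to a subgoal only if it
    mentions every constant (URI or literal) in that subgoal."""
    return set.intersection(*(index[t] for t in subgoal_terms))

# Subgoals of a reformulated query, each listed by its constant terms.
subgoals = [
    {'univ:advisor', '"Jeff Heflin"'},   # advisor(x, "Jeff Heflin")
    {'univ:Professor'},                  # Professor(x)
]
relevant = set().union(*(select_sources(sg) for sg in subgoals))
print(relevant)  # doc4.rdf is pruned and never loaded into the reasoner
```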
Li, Yingjie and Heflin, Jeff. Query Optimization for Ontology-Based Information Integration. Nineteenth International Conference on Information and Knowledge Management (CIKM 10). 2010.
Abstract:
In recent years, there has been an explosion of publicly available RDF and OWL data sources. In order to effectively and quickly answer queries in such an environment, we present an approach to identifying the potentially relevant Semantic Web data sources using query rewritings and a term index. We demonstrate that such an approach must carefully handle query goals that lack constants; otherwise the algorithm may identify many sources that do not contribute to eventual answers. This is because the term index only indicates if URIs are present in a document, and specific answers to a subgoal cannot be calculated until the source is physically accessed - an expensive operation given disk/network latency. We present an algorithm that, given a set of query rewritings that accounts for ontology heterogeneity, incrementally selects and processes sources in order to maintain selectivity. Once sources are selected, we use an OWL reasoner to answer queries over these sources and their corresponding ontologies. We present the results of experiments using both a synthetic data set and a subset of the real-world Billion Triple Challenge data.
Li, Yingjie and Heflin, Jeff. Query Optimization for Ontology-Based Information Integration. Technical Report LU-CSE-10-002. Dept. of Computer Science and Engineering, Lehigh University. 2010.
Abstract:
In recent years, there has been an explosion of publicly available RDF and OWL data sources. In order to effectively and quickly answer queries in such an environment, we present an approach to identifying the potentially relevant Semantic Web data sources using query rewriting and a resource index. We demonstrate that such an approach must carefully handle query goals that lack constants; otherwise the algorithm may identify many sources that do not contribute to eventual answers. This is because the resource index only indicates if URIs are present in a document, and specific answers to a subgoal cannot be calculated until the source is physically accessed - an expensive operation given disk or network latency. We present an algorithm that, given a set of query rewritings that accounts for ontology heterogeneity, incrementally selects and processes sources in order to maintain selectivity. Once sources are selected, we use an OWL reasoner to answer queries over these sources and their corresponding ontologies. We present the results of experiments using both a synthetic data set and a subset of the real-world Billion Triple Challenge data.
Li, Yingjie and Heflin, Jeff. Using Reformulation Trees to Optimize Queries over Distributed Heterogeneous Sources. Ninth International Semantic Web Conference (ISWC 2010). 2010.
Abstract:
In order to effectively and quickly answer queries in environments with distributed RDF/OWL, we present a query optimization algorithm to identify the potentially relevant Semantic Web data sources using structural query features and a term index. This algorithm is based on the observation that the join selectivity of a pair of query triple patterns is often higher than the overall selectivity of these two patterns treated independently. Given a rule goal tree that expresses the reformulation of a conjunctive query, our algorithm uses a bottom-up approach to estimate the selectivity of each node. It then prioritizes loading of selective nodes and uses the information from these sources to further constrain other nodes. Finally, we use an OWL reasoner to answer queries over the selected sources and their corresponding ontologies. We have evaluated our system using both a synthetic data set and a subset of the real-world Billion Triple Challenge data.
Li, Yingjie, Qasem, Abir and Heflin, Jeff. A Scalable Indexing Mechanism for Ontology-Based Information Integration. Technical Report LU-CSE-10-001. Dept. of Computer Science and Engineering, Lehigh University. 2010.
Abstract:
In recent years, there has been an explosion of publicly available RDF and OWL web pages. Typically, these pages are small, heterogeneous and prone to change frequently. In order to effectively integrate them, we propose to adapt a query reformulation algorithm and combine it with an information retrieval inspired index in order to select all sources relevant to a query. We treat each RDF document as a bag of URIs and literals and build an inverted index. Our system first reformulates the user's query into a set of subgoals and then translates these into Boolean queries against the index in order to determine which sources are relevant. Finally, the selected data sources and the relevant ontology mappings are used in conjunction with a description logic reasoner to provide an efficient query answering solution for the Semantic Web. We have evaluated our system using ontology mappings and ten million real world data sources.
Li, Yingjie, Yu, Yang and Heflin, Jeff. A Multi-ontology Synthetic Benchmark for the Semantic Web. 1st International Workshop on Evaluation of Semantic Technologies (IWEST2010). 2010.
Abstract:
One important use case for the Semantic Web is the integration of data across many heterogeneous ontologies. However, most Semantic Web Knowledge Bases are evaluated using single-ontology benchmarks such as LUBM and UOBM. Therefore, there is a requirement to develop a benchmark system that is able to evaluate not only single but also federated ontology systems for different uses with different configurations of ontologies. To support such a need, based on our earlier work, we present a multi-ontology synthetic benchmark system that takes a two-level profile as input to generate user-customized ontologies together with related mappings and data sources. Meanwhile, a graph-based query generation algorithm and an owl:sameAs generation mechanism are also proposed. By using this benchmark, Semantic Web systems can be evaluated against complex ontology configurations using the standard metrics of loading time, repository size, query response time and query completeness and soundness.
Pan, Zhengxiang, Qasem, Abir and Heflin, Jeff. Semantic Integration: The Hawkeye Approach. In Sheu, Phillip, Yu, Heather, Ramamoorthy, C. V., Joshi, Arvind K. and Zadeh, Lotfi A. (Eds.), Semantic Computing. IEEE Press/Wiley. Cambridge, MA. 2010.
Abstract:
At present, the Semantic Web consists of numerous independent ontologies. We put forward that the Web Ontology Language (OWL) can be used to integrate these ontologies and thereby integrate the data sources that commit to them. In this chapter we briefly survey approaches to semantic integration and then present the Hawkeye knowledge base, in which we have loaded more than 166 million facts from a diverse set of real-world data sources. In order to support Hawkeye, we extended our DLDB knowledge base system with additional reasoning capabilities. DLDB is a system that, given sufficient OWL descriptions, can answer queries that span heterogeneous data sources. We use the Hawkeye knowledge base to demonstrate realistic integration queries in e-government and academic scenarios. For example, our system can produce answers that integrate Citeseer and DBLP, which are knowledge bases about computer science publications. We achieve this integration in a declarative way by only using OWL. These queries cannot be answered by traditional search engines. Furthermore, we show that many complex queries have response times under 1 minute and that simple queries can be answered in seconds.
Song, Dezhao and Heflin, Jeff. Domain-Independent Entity Coreference in RDF Graphs. In CIKM2010: Proceedings of the 19th ACM Conference on Information and Knowledge Management. 2010.
Abstract:
In this paper, we present a novel entity coreference algorithm for Semantic Web instances. The key issues include how to locate context information and how to utilize the context appropriately. To collect context information, we select a neighborhood (consisting of triples) of each instance from the RDF graph. To determine the similarity between two instances, our algorithm computes the similarity between comparable property values in the neighborhood graphs. The similarity of distinct URIs and blank nodes is computed by comparing their outgoing links. To provide the best possible domain-independent matches, we examine an appropriate way to compute the discriminability of triples. To reduce the impact of distant nodes, we explore a distance-based discounting approach. We evaluated our algorithm using different instance categories in two datasets. Our experiments show that the best results are achieved by including both our triple discrimination and discounting approaches.
Song, Dezhao and Heflin, Jeff. Domain-Independent Entity Coreference in RDF Graphs. Technical Report LU-CSE-10-004. Department of Computer Science and Engineering, Lehigh University. 2010.
Abstract:
In this paper, we present a novel entity coreference algorithm for Semantic Web instances. The key issues include how to locate context information and how to utilize the context appropriately. To collect context information, we select a neighborhood (consisting of triples) of each instance from the RDF graph. To determine the similarity between two instances, our algorithm computes the similarity between comparable property values in the neighborhood graphs. The similarity of distinct URIs and blank nodes is computed by comparing their outgoing links. To provide the best possible domain-independent matches, we examine an appropriate way to compute the discriminability of triples. To reduce the impact of distant nodes, we explore a distance-based discounting approach. We evaluated our algorithm using different instance categories in two datasets. Our experiments show that the best results are achieved by including both our triple discrimination and discounting approaches.
Zhang, Xingjian and Heflin, Jeff. Calculating Word Sense Probability Distributions for Semantic Web Applications. Fourth IEEE International Conference on Semantic Computing (ICSC 2010). 2010.
Abstract:
Researchers have found that Word Sense Disambiguation (WSD) is useful for tasks such as ontology alignment. Many other Semantic Web applications could also be enhanced with WSD results of Semantic Web documents. A system that can provide reusable intermediate WSD results is desirable. Compared to the top sense or a rank of senses, an output of meaningful scores of each possible sense informs subsequent processes of the certainty in results, and facilitates the application of other knowledge in choosing the correct sense. We propose that probabilistic models, which have proved successful in many other fields, can also be applied to WSD. Based on such observations, we focus on the problem of calculating probability distributions of senses for terms. In this paper we propose our novel WSD approach with our probability model, decompose the problem formula into small computable pieces, and propose ways to estimate the values of these pieces.
Brophy, Matt and Heflin, Jeff. OWL-PL: A Presentation Language for Displaying Semantic Data on the Web. Technical Report LU-CSE-09-002. Department of Computer Science and Engineering, Lehigh University. 2009.
Abstract:
Current systems for displaying semantic data on the Web generate insufficient domain-independent views or require the adoption of complex ontological languages for display definition. We present OWL-PL, an easy-to-use, XSLT-inspired language for transforming RDF/OWL into XHTML for display on the Web. OWL-PL includes data selector tags inside an XHTML file, and thus encourages heavy reuse of existing technologies, such as CSS, JavaScript, and AJAX. This coupling with XHTML reduces the learning curve for traditional Web designers, while allowing for creation of the same rich interfaces that Web users have become accustomed to. The use of data selector tags in OWL-PL promotes a clean separation of raw data from document structure, similar to the separation of formatting and document structure provided by Cascading Style Sheets. OWL-PL permits the reuse of view information across ontologies using semantic inference and allows the end-user to switch between all available views on the fly. We implement an OWL-PL parser and demonstrate a functional example showing that it is easy and intuitive to create XHTML views of semantic data as well as re-create views of existing static HTML pages.
Dimitrov, Dimitre A., Pundaleeka, Roopa, Qasem, Abir and Heflin, Jeff. ISENS: A System for Information Integration, Exploration, and Querying of Multi-Ontology Data Sources. Third IEEE International Conference on Semantic Computing (ICSC 2009). 2009.
Abstract:
Separate data sources on related domains of knowledge generally contain different but complementary information. There are queries that can only be answered with pieces of information from some of the separate sources. Thus, it is of considerable interest to enable query answering based on searching the information in an integrated collection of sources. However, independently developed and evolved data sources generally use different schemas to represent their data. This makes it difficult to search the sources in an integrated way. To address this problem, we have developed an end-to-end information integration system that is based on Semantic Web technologies, algorithms for efficient source selection, and a Web-based user interface to construct queries and search multi-ontology data. We describe the system architecture, handling of information integration in a multi-ontology environment, and the user interface capabilities for ontology visualization, query construction, and presentation of results. Moreover, we have developed a multi-ontology real-world data setup and measured the performance of the different ISENS components on query answering that involves information integration from the different data sources. The results indicate that the source selection and the logical reasoning parts of query processing are dominated by the time to transfer data from the selected sources and to load the triples in the reasoner. However, the query answering time is often of the order of a second or less, allowing ISENS to be used efficiently for such semantic applications.
Pan, Zhengxiang, Li, Yingjie and Heflin, Jeff. A Semantic Web Knowledge Base System that Supports Large Scale Data Integration. 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS'09). 2009.
Abstract:
A true Semantic Web knowledge base system must scale both in terms of number of ontologies and quantity of data. It should also support reasoning using different points of view about the meanings and relationships of concepts and roles. We present our DLDB3 system that supports large scale data integration, and is provably sound and complete on a fragment of OWL DL when answering extensional conjunctive queries. By delegating TBox reasoning to a DL reasoner, we focus on the design of the table schema, database views, and algorithms that achieve essential ABox reasoning over an RDBMS. The ABox inferences from cyclic axioms are materialized at load time, while other inferences are computed at query time. Instance data are directly loaded into the database tables. We evaluate the system using synthetic benchmarks and compare performance with other systems. We also validate our approach to data integration using multiple ontologies and data sources.
Qasem, Abir, Dimitrov, Dimitre A. and Heflin, Jeff. Towards Scalable Information Integration with Instance Coreferences. Seventh Workshop on Information Integration and the Web, IJCAI 2009. Pasadena, CA. 2009.
Abstract:
Identifying data instances that come from different sources but denote the same entity is necessary for effective integration of Semantic Web data. This "instance coreference" identification problem has gained attention in recent years. Although this is a critical aspect of the overall information integration problem in the Semantic Web, we put forward that information integration algorithms also need to be extended in order to work effectively and efficiently in the presence of these coreferenced entities (whether they are discovered by a tool or explicitly stated with an owl:sameAs assertion). We describe such an extension to our Goal Node Search algorithm for Semantic Web information integration.
Yu, Yang, Hillman, Don, Setio, Basuki and Heflin, Jeff. A Case Study in Integrating Multiple E-commerce Standards via Semantic Web Technology. Eighth International Semantic Web Conference (ISWC2009). Washington, D.C. Springer. 2009. pp. 909-924.
Abstract:
Internet business-to-business transactions present great challenges in merging information from different sources. In this paper we describe a project to integrate four representative commercial classification systems with the Federal Cataloging System (FCS). The FCS is used by the US Defense Logistics Agency to name, describe and classify all items under inventory control by the DoD. Our approach uses the ECCMA Open Technical Dictionary (eOTD) as a common vocabulary to accommodate all different classifications. We create a semantic bridging ontology between each classification and the eOTD to describe their logical relationships in OWL DL. The essential idea is that since each classification has formal definitions in a common vocabulary, we can use subsumption to automatically integrate them, thus mitigating the need for pairwise mappings. Furthermore our system provides an interactive interface to let users choose and browse the results and more importantly it can translate catalogs that commit to these classifications using compiled mapping results.
Pan, Zhengxiang, Zhang, Xingjian and Heflin, Jeff. DLDB2: A Scalable Multi-Perspective Semantic Web Repository. International Conference on Web Intelligence (WI 08). IEEE Computer Society Press. 2008. pp. 489-495.
Abstract:
A true Semantic Web repository must scale both in terms of number of ontologies and quantity of data. It should also support reasoning using different points of view about the meanings and relationships of concepts and roles. Our DLDB2 system has these features. Our system is sound and complete on a sizable subset of Description Horn Logic when answering extensional conjunctive queries, but more importantly also computes many entailments from OWL DL. By delegating TBox reasoning to a DL reasoner, we focus on the design of the table schema, database views, and algorithms that achieve essential ABox reasoning over an RDBMS. We evaluate the system using synthetic benchmarks as well as real-world data and queries.
Qasem, Abir, Dimitrov, Dimitre A. and Heflin, Jeff. Efficient Selection and Integration of Data Sources for Answering Semantic Web Queries. Second IEEE International Conference on Semantic Computing (ICSC 08). IEEE Computer Society Press. 2008. pp. 245-252.
Abstract:
In this work we adapt an efficient information integration algorithm to identify the minimal set of potentially relevant Semantic Web data sources for a given query. The vast majority of these sources are files written in RDF or OWL format, and must be processed in their entirety. Our adaptation includes enhancing the algorithm with taxonomic reasoning, defining and using a mapping language for the purpose of aligning heterogeneous Semantic Web ontologies, and introducing a concept of source relevance to reduce the number of sources that we need to consider for a given query. After the source selection process, we load the selected sources into a Semantic Web reasoner to get a sound and complete answer to the query. We have conducted an experiment using synthetic ontologies and data sources which demonstrates that our system performs well over a wide range of queries. A typical response time for a substantial workload of 50 domain ontologies, 80 map ontologies and 500 data sources is less than 2 seconds. Furthermore, our system returned correct answers to 200 randomly generated queries in several workload configurations. We have also compared our adaptation with a basic implementation of the original information integration algorithm that does not do any taxonomic reasoning. In the most complex configuration with 50 domain ontologies, 100 map ontologies and 1000 data sources, our system returns complete answers to all the queries whereas the basic implementation returns complete answers to only 28% of the queries.
Qasem, Abir,
Dimitrov, Dimitre A. and Heflin, Jeff
. Efficient Selection and Integration of Data Sources for Answering Semantic Web Queries.
Technical Report
LU-CSE-08-006.
Department of Computer Science and Engineering, Lehigh University.
2008.
Abstract
|
Full paper
| OWL annotation
In this work we adapt an efficient information integration algorithm to identify the minimal set of potentially relevant Semantic Web data sources for a given query. The vast majority of these sources are files written in RDF or OWL format, and must be processed in their entirety. Our adaptation includes enhancing the algorithm with taxonomic reasoning, defining and using a mapping language for the purpose of aligning heterogeneous Semantic Web ontologies, and introducing a concept of source relevance to reduce the number of sources that we need to consider for a given query. After the source selection process, we load the selected sources into a Semantic Web reasoner to get a sound and complete answer to the query. We have conducted an experiment using synthetic ontologies and data sources which demonstrates that our system performs well over a wide range of queries. A typical response time for a substantial workload of 50 domain ontologies, 80 map ontologies and 500 data sources is less than 2 seconds. Furthermore, our system returned correct answers to 200 randomly generated queries in several workload configurations. We have also compared our adaptation with a basic implementation of the original information integration algorithm that does not do any taxonomic reasoning. In the most complex configuration with 50 domain ontologies, 100 map ontologies and 1000 data sources, our system returns complete answers to all the queries whereas the basic implementation returns complete answers to only 28% of the queries.
Qasem, Abir,
Dimitrov, Dimitre A. and Heflin, Jeff
.
Goal Node Search for Semantic Web Source Selection.
International Conference on Web Intelligence (WI 08).
IEEE Computer Society Press.
2008.
pp.566-569.
Abstract
|Full paper local copy
| OWL annotation
We present an efficient search approach for selecting all potentially relevant data sources for a conjunctive Semantic Web query. We use map ontologies to align heterogeneous domain ontologies. This allows us to select data sources that may be relevant to the query but generally do not describe their data directly in terms of the ontology of the query. The "Goal Node Search" algorithm is a significant improvement on our original source selection algorithm. The new algorithm allows a more expressive knowledge representation language to describe domain ontologies and it is about three times more efficient than the original source selection algorithm when performing similar tasks.
Qasem, Abir,
Dimitrov, Dimitre A. and Heflin, Jeff
. Goal Node Search for Semantic Web Source Selection.
Technical Report
LU-CSE-08-010.
Department of Computer Science and Engineering, Lehigh University.
2008.
Abstract
|
Full paper
| OWL annotation
We present an efficient search approach for selecting all potentially relevant data sources for a given query and a Semantic Web Space. We use map ontologies to align heterogeneous domain ontologies. This allows us to select data sources that may be relevant to the query but generally do not describe their data directly in terms of the ontology of the query. The knowledge representation language we use to describe the Semantic Web Space is a subset of OWL that is compatible with Local-as-View (LAV) and Global-as-View (GAV) rules. In our approach, we first translate the Semantic Web Space into a set of LAV/GAV rules. Given a query, the "Goal Node Search" algorithm identifies all possible paths that can be found by applying the LAV/GAV rules to each query subgoal or its expansions. We have incorporated the algorithm in OBII, a Semantic Web query answering system, and evaluated its performance versus OBII's original algorithm. The new algorithm is a significant improvement on the original source selection algorithm of OBII. First, it allows a more expressive knowledge representation language to describe domain ontologies than the original algorithm. Second, it is about three times more efficient than the original source selection algorithm when performing similar tasks. In addition, the new algorithm is conceptually simpler than the original source selection algorithm.
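A rough sketch of the goal-node idea (rules, predicates, and URLs below are illustrative, not OBII's implementation): search proceeds breadth-first from a query predicate through LAV/GAV-style rewritings until it bottoms out at predicates that some source directly provides.

```python
# Sketch of a goal-node style search over mapping rules: starting
# from a query subgoal, rules are applied breadth-first until "goal
# nodes" (predicates directly provided by some source) are reached.
from collections import deque

rules = {  # head predicate -> alternative bodies (LAV/GAV style)
    "q:advisor": [["ont1:supervises"], ["ont2:mentorOf"]],
}
source_index = {  # predicate -> sources providing it
    "ont1:supervises": ["http://a.example/s1.owl"],
    "ont2:mentorOf": ["http://b.example/s2.owl"],
}

def goal_node_search(subgoal):
    selected, seen = set(), {subgoal}
    frontier = deque([subgoal])
    while frontier:
        pred = frontier.popleft()
        selected.update(source_index.get(pred, []))   # goal node hit
        for body in rules.get(pred, []):              # rewrite subgoal
            for p in body:
                if p not in seen:
                    seen.add(p)
                    frontier.append(p)
    return selected

print(goal_node_search("q:advisor"))  # both sources are relevant
```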
Qasem, Abir,
Dimitrov, Dimitre A. and Heflin, Jeff
.
ISENS: A Multi-ontology Query System for the Semantic Deep Web.
The Semantic Web meets the Deep Web Workshop, In Proc. of IEEE CEC'08 and EEE'08.
Washington, DC.
IEEE.
2008.
pp.396-399.
Abstract
|Full paper local copy
| OWL annotation
We present ISENS, a distributed, end-to-end, ontology-based information integration system. In response to a user's query, our system is capable of retrieving facts from data sources that are found in the surface Semantic Web as well as in the Semantic Deep Web. Furthermore, it retrieves facts from sources whose data is not directly described in terms of the query ontology; instead, each source's ontology can be translated from the query ontology using mapping axioms. In our solution, we use the concept of source relevance to summarize the content of a data source. Our system can then use this information to select the sources needed to answer a given query. Source relevance is general enough that it can be used with both the surface Semantic Web and the Semantic Deep Web. In this paper, we show how we have incorporated three particular Deep Web data sources into our system to enable answering queries by composing information from the integrated sources.
Chitnis, Amit,
Qasem, Abir and Heflin, Jeff
.
Benchmarking Reasoners for Multi-Ontology Applications.
Workshop on Evaluation of Ontologies and Ontology-Based Tools, ISWC 07.
Busan, Korea.
2007.
Abstract
|Full paper local copy
| OWL annotation
We describe an approach to creating a synthetic workload for large-scale extensional query answering experiments. The workload comprises multiple interrelated domain ontologies, data sources which commit to these ontologies, synthetic queries, and map ontologies that specify a graph over the domain ontologies. Some of the important parameters of the system are the average number of classes and properties of the source ontology that are mapped to terms of the target ontology, and the number of data sources per ontology. The ontology graph is described by parameters such as its diameter, the number of ontologies, and the average out-degree of an ontology node. These parameters give a significant degree of control over the graph topology. This graph of ontologies is the central component of our synthetic workload and effectively represents a web of data.
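As a sketch of the central component, the following generates a random ontology graph under two of the parameters mentioned above (names and structure are illustrative; the real generator also controls diameter and data sources per ontology):

```python
# Sketch of a synthetic workload's ontology graph: nodes are domain
# ontologies, edges are map ontologies. For simplicity every node
# gets exactly `out_degree` outgoing edges, which fixes the average.
import random

def ontology_graph(n_ontologies, out_degree, seed=0):
    rng = random.Random(seed)
    nodes = [f"ont{i}" for i in range(n_ontologies)]
    edges = []
    for src in nodes:
        targets = rng.sample([n for n in nodes if n != src],
                             k=min(out_degree, n_ontologies - 1))
        edges.extend((src, f"map_{src}_{dst}", dst) for dst in targets)
    return nodes, edges

nodes, edges = ontology_graph(n_ontologies=5, out_degree=2)
for src, map_ont, dst in edges:
    print(f"{map_ont}: maps terms of {src} to {dst}")
```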
Guo, Yuanbo and Heflin, Jeff
.
Document-Centric Query Answering for the Semantic Web.
2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI'07).
2007.
pp.409-415.
Abstract
|Full paper local copy
| OWL annotation
In this paper, we propose document-centric query answering, a novel form of query answering for the Semantic Web. We discuss how we have built a knowledge base system to support the new queries. In particular, we describe the key techniques used in the system to address scalability issues. In addition, we show encouraging experimental results.
Guo, Yuanbo,
Qasem, Abir,
Pan, Zhengxiang and Heflin, Jeff
.
A Requirements Driven Framework for Benchmarking Semantic Web Knowledge Base Systems.
IEEE Transactions on Knowledge and Data Engineering.
19(
2)
2007.
pp.297-309.
Abstract
|Full paper local copy
| OWL annotation
A key challenge for the Semantic Web is to acquire the capability to effectively query large knowledge bases. As there will be several competing systems, we need benchmarks that will objectively evaluate these systems. Development of effective benchmarks in an emerging domain is a challenging endeavor. In this paper, we propose a requirements driven framework for developing benchmarks for Semantic Web Knowledge Base Systems (SW KBSs). We make two major contributions. First, we provide a list of requirements for SW KBS benchmarks, which can serve as an unbiased guide to both benchmark developers and personnel responsible for systems acquisition and benchmarking. Second, we provide an organized collection of techniques and tools needed to develop such benchmarks. In particular, the collection contains a detailed guide for generating benchmark workload, defining performance metrics and interpreting experimental results.
Pan, Zhengxiang,
Qasem, Abir,
Kanitkar, Sudhan,
Prabhakar, Fabiana and Heflin, Jeff
.
Hawkeye: A Practical Large Scale Demonstration of Semantic Web Integration.
The 3rd International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS'07).
Springer-Verlag.
2007.
pp.1115-1124.
Abstract
|Full paper local copy
| OWL annotation
We discuss our DLDB knowledge base system and evaluate its capability in processing a very large set of real-world Semantic Web data. Using DLDB, we have constructed the Hawkeye knowledge base, in which we have loaded more than 166 million facts from a diverse set of real-world data sources. We use this knowledge base to demonstrate realistic integration queries in e-government and academic scenarios. In order to support Hawkeye, we extended DLDB with additional reasoning capabilities. At present, the Semantic Web consists of numerous independent ontologies. We demonstrate that OWL can be used to integrate these ontologies and thereby integrate the data sources that commit to them. In terms of performance, we show that the load time of our system is linear in the number of triples loaded. Furthermore, we show that many complex queries have response times under one minute, and that simple queries can be answered in seconds.
Pan, Zhengxiang,
Qasem, Abir,
Kanitkar, Sudhan,
Prabhakar, Fabiana and Heflin, Jeff
. Hawkeye: A Practical Large Scale Demonstration of Semantic Web Integration.
Technical Report
LU-CSE-07-006.
Department of Computer Science and Engineering, Lehigh University.
2007.
Abstract
|
Full paper
| OWL annotation
At present, the Semantic Web consists of numerous independent ontologies. We put forward that OWL can be used to integrate these ontologies and thereby integrate the data sources that commit to them. In this paper we present the Hawkeye knowledge base, in which we have loaded more than 166 million facts from a diverse set of real-world data sources. In order to support Hawkeye, we extended our DLDB knowledge base system with additional reasoning capabilities. DLDB is a system that, given sufficient OWL descriptions, can answer queries that span heterogeneous data sources. We use the Hawkeye knowledge base to demonstrate realistic integration queries in e-government and academic scenarios. For example, our system can produce answers that integrate CiteSeer and DBLP. We achieve this integration in a declarative way, using only OWL. These queries cannot be answered by traditional search engines. Furthermore, we show that many complex queries have response times under one minute, and that simple queries can be answered in seconds.
Qasem, Abir,
Dimitrov, Dimitre A. and Heflin, Jeff
. An Efficient and Complete Distributed Query Answering System for Semantic Web Data.
Technical Report
LU-CSE-07-007.
Department of Computer Science and Engineering, Lehigh University.
2007.
Abstract
|
Full paper
| OWL annotation
In this work we consider the problem of answering queries using distributed Semantic Web data sources. We define a mapping language that is a subset of OWL for the purpose of aligning heterogeneous ontologies. In order to answer queries we provide a two-step solution. First, given a query we identify potentially relevant sources, which we call the source selection problem. We adapt an information integration algorithm to provide a complete solution to this problem in polynomial time. Second, we load these selected sources into an OWL reasoner, and thereby achieve complete answers to the queries. Since the time to load sources is a dominating factor in performance, and our system identifies the minimal set of potentially relevant sources, it is very efficient. We have conducted an experiment using synthetic ontologies and data sources which demonstrates that our system performs well over a wide range of queries. A typical response time for a given workload of 20 domain ontologies, 20 map ontologies and 400 data sources is just a little over 1 second.
Qasem, Abir,
Dimitrov, Dimitre A. and Heflin, Jeff
.
Efficient Selection and Integration of Data Sources for Answering Semantic Web Queries.
Workshop on New forms of reasoning for the Semantic Web: scaleable, tolerant and dynamic, ISWC 07.
Busan, Korea.
2007.
Abstract
|Full paper local copy
| OWL annotation
We present an approach to identifying the minimal set of potentially relevant Semantic Web data sources for a given query. Our solution involves the adaptation of an efficient information integration algorithm that has polynomial time complexity. We then use these selected sources and an OWL reasoner to answer queries on the Semantic Web. We introduce a concept of source relevance expressed in OWL to reduce the number of sources needed to get the answers to a query. As the Semantic Web is an autonomous entity, some of the data sources may contain data that are not described directly in terms of a given query ontology. In our solution we define and use a mapping language that is a subset of OWL for the purpose of aligning heterogeneous ontologies. Our implemented system supports a subset of SPARQL queries, simple OWL ontologies and data sources that commit to them. Since the time to load sources is a dominating factor in performance, and our system identifies the minimal set of potentially relevant sources, it is very efficient. We have conducted an experiment using synthetic ontologies and data sources which demonstrates that our system performs well over a wide range of queries. A typical response time for a given workload of 20 domain ontologies, 20 map ontologies and 400 data sources is approximately 1 second. Furthermore, our system returned correct answers to 200 randomly generated queries in three different data configurations.
Dimitrov, Dimitre A.,
Heflin, Jeff,
Qasem, Abir and Wang, Nanbor
.
Information Integration via an End-to-End Distributed Semantic Web System.
Fifth International Semantic Web Conference (ISWC 2006).
Athens, Georgia.
Springer.
2006.
pp.764-777.
Abstract
|Full paper local copy
| OWL annotation
A distributed, end-to-end information integration system that is based on the Semantic Web architecture is of considerable interest to both commercial and government organizations. However, there are a number of challenges that have to be resolved to build such a system given the currently available Semantic Web technologies. We describe here the ISENS prototype system we designed, implemented, and tested (on a small scale) to address this problem. We discuss certain system limitations (some coming from underlying technologies used) and future ISENS development to resolve them and to enable an extended set of capabilities.
Guo, Yuanbo and Heflin, Jeff
.
A Scalable Approach for Partitioning OWL Knowledge Bases.
In Proc. of the 2nd International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2006).
Athens, Georgia.
2006.
Abstract
|Full paper local copy
| OWL annotation
We describe an approach to partitioning a large OWL ABox with respect to a TBox so that specific kinds of reasoning can be performed separately on each partition and the results trivially combined in order to achieve complete answers. The main features of our approach include: a reasonable tradeoff between the complexity of the task and the granularity of partitioning; worst-case polynomial time complexity; and the ability to handle problems that are too large for main memory. In addition, we show promising experimental results on both the Lehigh University Benchmark data and the real world FOAF data. This work could contribute to the development of scalable Semantic Web systems that need to deal with large amounts of data.
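The flavor of the approach can be conveyed with a union-find sketch (a simplification: the actual method also consults the TBox to decide which assertions may interact): individuals connected by role assertions end up in the same partition, and each partition can then be reasoned over independently.

```python
# Sketch of ABox partitioning by connected individuals, a simplified
# stand-in for the paper's TBox-aware procedure. A union-find
# structure groups individuals linked by role assertions.
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# ABox: (subject, property, object) role assertions between individuals
abox = [
    ("alice", "knows", "bob"),
    ("bob", "worksFor", "lehigh"),
    ("carol", "knows", "dave"),
]
for s, _, o in abox:
    union(s, o)

partitions = defaultdict(list)
for s, p, o in abox:
    partitions[find(s)].append((s, p, o))
print(list(partitions.values()))  # two independently reasonable partitions
```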
Guo, Yuanbo,
Qasem, Abir and Heflin, Jeff
.
Large Scale Knowledge Base Systems: An Empirical Evaluation Perspective.
Twenty First National Conference on Artificial Intelligence (AAAI 2006).
Boston, Mass..
2006.
pp.1617-1620.
Abstract
|Full paper local copy
| OWL annotation
In this paper, we discuss how our work on evaluating Semantic Web knowledge base systems (KBSs) contributes to address some broader AI problems. First, we show how our approach provides a benchmarking solution to the Semantic Web, a new application area of AI. Second, we discuss how the approach is also beneficial in a more traditional AI context. We focus on issues such as scalability, performance tradeoffs, and the comparison of different classes of systems.
Pan, Zhengxiang and Heflin, Jeff
. A Model Theoretic Semantics for Distributed Ontologies that Accounts for Versioning.
Technical Report
LU-CSE-06-026.
Dept. of Computer Science and Engineering, Lehigh University.
2006.
Abstract
|
Full paper
| OWL annotation
We show that the Semantic Web needs a formal semantics for the various kinds of links between ontologies and other documents. We provide a model theoretic semantics that takes into account ontology extension and ontology versioning. Since the Web is the product of a diverse community, as opposed to a single agent, this semantics accommodates different viewpoints by having different entailment relations for different ontology perspectives. We extend our previous work to support deprecation in ontologies and to support retrospective as well as prospective versioning.
Pan, Zhengxiang,
Qasem, Abir and Heflin, Jeff
.
An Investigation into the Feasibility of the Semantic Web.
Twenty First National Conference on Artificial Intelligence (AAAI 2006).
Boston, MA.
2006.
pp.1394-1399.
Abstract
|Full paper local copy
| OWL annotation
We investigate the challenges that must be addressed for the Semantic Web to become a feasible enterprise. Specifically we focus on the query answering capability of the Semantic Web. We put forward that two key challenges we face are heterogeneity and scalability. We propose a flexible and decentralized framework for addressing the heterogeneity problem and demonstrate that sufficient reasoning is possible over a large dataset by taking advantage of database technologies and making some tradeoff decisions. As a proof of concept, we collect a significant portion of the available Semantic Web data; use our framework to resolve some heterogeneity and reason over the data as one big knowledge base. In addition to demonstrating the feasibility of a real Semantic Web, our experiments have provided us with some interesting insights into how it is evolving and the type of queries that can be answered.
Pan, Zhengxiang,
Qasem, Abir and Heflin, Jeff
. An Investigation into the Feasibility of the Semantic Web.
Technical Report
LU-CSE-06-025.
Dept. of Computer Science and Engineering, Lehigh University.
2006.
Abstract
|
Full paper
| OWL annotation
We investigate the challenges that must be addressed for the Semantic Web to become a feasible enterprise. Specifically we focus on the query answering capability of the Semantic Web. We put forward that two key challenges we face are heterogeneity and scalability. We propose a flexible and decentralized framework for addressing the heterogeneity problem and demonstrate that sufficient reasoning is possible over a large dataset by taking advantage of database technologies and making some tradeoff decisions. As a proof of concept, we collect a significant portion of the available Semantic Web data; use our framework to resolve some heterogeneity and reason over the data as one big knowledge base. In addition to demonstrating the feasibility of a real Semantic Web, our experiments have provided us with some interesting insights into how it is evolving and the type of queries that can be answered.
Guo, Yuanbo and Heflin, Jeff
.
On Logical Consequence for Collections of OWL Documents.
In Proc. of the 4th International Semantic Web Conference (ISWC2005).
Galway, Ireland.
2005.
pp.338-352.
Abstract
|Full paper local copy
| OWL annotation
In this paper, we investigate the (in)dependence among OWL documents with respect to logical consequence when they are combined, in particular the inference of concept and role assertions about individuals. On the one hand, we present a systematic approach to identifying those documents that affect the inference of a given fact. On the other hand, we consider ways for fast detection of independence. First, we demonstrate several special cases in which two documents are independent of each other. Second, we introduce an algorithm for checking independence in the general case. In addition, we describe two applications in which the above results have allowed us to develop novel approaches to overcome some difficulties with reasoning on large scale OWL data. Both applications demonstrate the usefulness of this work for improving the scalability of a practical Semantic Web system that relies on reasoning about individuals.
Guo, Yuanbo,
Pan, Zhengxiang and Heflin, Jeff
.
LUBM: A Benchmark for OWL Knowledge Base Systems.
Web Semantics.
3(
2)
July
2005.
pp.158-182.
Abstract
|Full paper local copy
| OWL annotation
We describe our method for benchmarking Semantic Web knowledge base systems with respect to use in large OWL applications. We present the Lehigh University Benchmark (LUBM) as an example of how to design such benchmarks. The LUBM features an ontology for the university domain, synthetic OWL data scalable to an arbitrary size, fourteen extensional queries representing a variety of properties, and several performance metrics. The LUBM can be used to evaluate systems with different reasoning capabilities and storage mechanisms. We demonstrate this with an evaluation of two memory-based systems and two systems with persistent storage.
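For a sense of what an extensional LUBM-style query looks like in practice, here is a hedged rdflib sketch (the data file name is a placeholder for one generated LUBM file; note that plain rdflib performs no OWL inference, so answers that require reasoning would be incomplete):

```python
# Sketch: running a LUBM-style extensional query over one generated
# data file with rdflib. The ub: prefix is the benchmark's
# univ-bench ontology namespace.
from rdflib import Graph, Namespace

UB = Namespace("http://swat.cse.lehigh.edu/onto/univ-bench.owl#")

g = Graph()
g.parse("University0_0.owl", format="xml")  # placeholder file name

q = """
    SELECT ?student ?course WHERE {
        ?student a ub:GraduateStudent ;
                 ub:takesCourse ?course .
    }
"""
for student, course in g.query(q, initNs={"ub": UB}):
    print(student, course)
```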
Misra, Upmanyu,
Pan, Zhengxiang and Heflin, Jeff
.
Adding Support for Dynamic Ontologies to Existing Knowledge bases.
In Proc. of the International Conference on Enterprise Information Systems.
Miami.
2005.
pp.97-104.
Abstract
|Full paper local copy
| OWL annotation
An ontology version needs to be created when changes are to be made to an ontology while keeping its basic structure more or less intact. It has been shown that an Ontology Perspective theory can be applied to a set of ontology versions. In this paper, we present a Virtual Perspective Interface (VPI) based on this theory that ensures that old data remains accessible through ontology modifications and can be accessed using new ontologies, in addition to the older ontologies which may still be in use by legacy applications. We begin by presenting the problems that must be dealt with when such an infrastructure needs to be created. Then we present possible solutions that may be used to tackle these problems. Finally, we provide an analysis of these solutions to support the one that we have implemented.
Wang, Sui-Yu,
Guo, Yuanbo,
Qasem, Abir and Heflin, Jeff
.
Rapid Benchmarking for Semantic Web Knowledge Base Systems.
In Proc. of the 4th International Semantic Web Conference (ISWC2005).
Galway, Ireland.
2005.
pp.758-772.
Abstract
|Full paper local copy
| OWL annotation
We present a method for rapid development of benchmarks for Semantic Web knowledge base systems. At the core, we have a synthetic data generation approach for OWL that is scalable and models the real world data. The data-generation algorithm learns from real domain documents and generates benchmark data based on the extracted properties relevant for benchmarking. We believe that this is important because relative performance of systems will vary depending on the structure of the ontology and data used. However, due to the novelty of the Semantic Web, we rarely have sufficient data for benchmarking. Our approach helps overcome the problem of having insufficient real world data for benchmarking and allows us to develop benchmarks for a variety of domains and applications in a very time efficient manner. Based on our method, we have created a new Lehigh BibTeX Benchmark and conducted an experiment on four Semantic Web knowledge base systems. We have verified our hypothesis about the need for representative data by comparing the experimental result to that of our previous Lehigh University Benchmark. The difference in both experiments has demonstrated the influence of ontology and data on the capability and performance of the systems and thus the need of using a representative benchmark for the intended application of the systems. Finally, we evaluated the technique by comparing our synthetic data to real world data and proved that it is a reasonable substitute when sufficient data is not available.
Wang, Sui-Yu,
Guo, Yuanbo,
Qasem, Abir and Heflin, Jeff
. Rapid Benchmarking for Semantic Web Knowledge Base Systems.
Technical Report
LU-CSE-05-026.
Dept. of Computer Science, Lehigh University.
2005.
Abstract
|
Full paper
| OWL annotation
We present a method for rapid development of benchmarks for Semantic Web knowledge base systems. At the core, we have a synthetic data generation approach for OWL that is scalable and models the real world data. The data-generation algorithm learns from real domain documents and generates benchmark data based on the extracted properties relevant for benchmarking. We believe that this is important because relative performance of systems will vary depending on the structure of the ontology and data used. However, due to the novelty of the Semantic Web, we rarely have sufficient data for benchmarking. Our approach helps overcome the problem of having insufficient real world data for benchmarking and allows us to develop benchmarks for a variety of domains and applications in a very time efficient manner. Based on our method, we have created a new Lehigh BibTeX Benchmark and conducted an experiment on four Semantic Web knowledge base systems. We have verified our hypothesis about the need for representative data by comparing the experimental result to that of our previous Lehigh University Benchmark. The difference in both experiments has demonstrated the influence of ontology and data on the capability and performance of the systems and thus the need of using a representative benchmark for the intended application of the systems. Finally, we evaluated the technique by comparing our synthetic data to real world data and proved that it is a reasonable substitute when sufficient data is not available.
Guo, Yuanbo and Heflin, Jeff
.
An Initial Investigation into Querying an Inconsistent and Untrustworthy Web.
In Workshop on Trust, Security, and Reputation on the Semantic Web, ISWC 2004.
2004.
Abstract
|Full paper local copy
| OWL annotation
The Semantic Web is bound to be untrustworthy and inconsistent. In this paper, we present an initial approach for obtaining useful information in such an environment. In particular, we replace the question of whether an assertion is entailed by the entire Semantic Web with two other queries. The first asks if a specific statement is entailed given an identification of the trusted documents. The second asks for the document sets that entail a specific statement. We propose a mechanism for efficiently computing and representing the contexts of the statements and managing inconsistency. This system could be seen as a component in an overall trust system.
Guo, Yuanbo,
Pan, Zhengxiang and Heflin, Jeff
.
An Evaluation of Knowledge Base Systems for Large OWL Datasets.
Third International Semantic Web Conference, Hiroshima, Japan.
Springer.
2004.
pp.274-288.
Abstract
|Full paper local copy
| OWL annotation
In this paper, we present an evaluation of four knowledge base systems (KBS) with respect to use in large OWL applications. To our knowledge, no experiment has been done with the scale of data used here. The smallest dataset used consists of 15 OWL files totaling 8MB, while the largest dataset consists of 999 files totaling 583MB. We evaluated two memory-based systems (OWLJessKB and memory-based Sesame) and two systems with persistent storage (database-based Sesame and DLDB-OWL). We describe how we have performed the evaluation and what factors we have considered in it. We show the results of the experiment and discuss the performance of each system. In particular, we have concluded that existing systems need to place a greater emphasis on scalability.
Guo, Yuanbo,
Pan, Zhengxiang and Heflin, Jeff
. An Evaluation of Knowledge Base Systems for Large OWL Datasets.
Technical Report
LU-CSE-04-012.
Dept. of Computer Science and Engineering, Lehigh University.
2004.
Abstract
|
Full paper
| OWL annotation
In this paper, we present our work on evaluating knowledge base systems with respect to use in large OWL applications. To this end, we have developed the Lehigh University Benchmark (LUBM). The benchmark is intended to evaluate knowledge base systems with respect to extensional queries over a large dataset that commits to a single realistic ontology. LUBM features an OWL ontology modeling university domain, synthetic OWL data generation that can scale to an arbitrary size, fourteen test queries representing a variety of properties, and a set of performance metrics. We describe the components of the benchmark and some rationale for its design. Based on the benchmark, we have conducted an evaluation of four knowledge base systems (KBS). To our knowledge, no experiment has been done with the scale of data used here. The smallest dataset used consists of 15 OWL files totaling 8MB, while the largest dataset consists of 999 files totaling 583MB. We evaluated two memory-based systems (OWLJessKB and memory-based Sesame) and two systems with persistent storage (database-based Sesame and DLDB-OWL). We show the results of the experiment and discuss the performance of each system. In particular, we have concluded that existing systems need to place a greater emphasis on scalability.
Guo, Yuanbo,
Pan, Zhengxiang and Heflin, Jeff
.
Choosing the Best Knowledge Base System for Large Semantic Web Applications.
Thirteenth International World Wide Web Conference (WWW2004).
2004.
pp.302-303.
Abstract
|Full paper local copy
| OWL annotation
We present an evaluation of four knowledge base systems with respect to use in large Semantic Web applications. We discuss the performance of each system. In particular, we show that existing systems need to place a greater emphasis on scalability.
Heflin, Jeff and Munoz-Avila, Hector
. Integrating HTN Planning and Semantic Web Ontologies for Efficient Information Integration.
Technical Report
LU-CSE-04-002.
Dept. of Computer Science and Engineering, Lehigh University.
2004.
Abstract
|
Full paper
| OWL annotation
We integrate HTN planning and Semantic Web ontologies for efficient information integration. HTNs are hierarchical plan representations that refine high-level tasks into simpler tasks. In the context of information integration, high-level tasks indicate complex queries whereas low-level tasks indicate concrete information-gathering actions such as requests to an information source. Semantic Web ontologies allow software agents to intelligently process and integrate information in distributed and heterogeneous environments such as the World Wide Web. The integration of HTNs and Semantic Web ontologies allows agents to answer complex queries by processing and integrating information in such environments. We also propose to use local closed world (LCW) information to assist these agents. LCW information can be obtained by accessing sources that are described in a Semantic Web language with LCW extensions, or by executing operators that provide exhaustive information. We demonstrate how the Semantic Web language SHOE can be augmented with the ability to state LCW information.
Heflin, Jeff and Pan, Zhengxiang
.
A Model Theoretic Semantics for Ontology Versioning.
Third International Semantic Web Conference, Hiroshima, Japan.
Springer.
2004.
pp.62-76.
Abstract
|Full paper local copy
| OWL annotation
We show that the Semantic Web needs a formal semantics for the various kinds of links between ontologies and other documents. We provide a model theoretic semantics that takes into account ontology extension and ontology versioning. Since the Web is the product of a diverse community, as opposed to a single agent, this semantics accommodates different viewpoints by having different entailment relations for different ontology perspectives. We discuss how this theory can be practically applied to RDF and OWL and provide a theorem that shows how to compute perspective-based entailment using existing logical reasoners. We illustrate these concepts using examples and conclude with a discussion of future work.
Pan, Zhengxiang and Heflin, Jeff
. DLDB: Extending Relational Databases to Support Semantic Web Queries.
Technical Report
LU-CSE-04-006.
Dept. of Computer Science and Engineering, Lehigh University.
2004.
Abstract
|
Full paper
| OWL annotation
We present DLDB, a knowledge base system that extends a relational database management system with additional capabilities for DAML+OIL inference. We discuss a number of database schemas that can be used to store RDF data and discuss the tradeoffs of each. Then we describe how we extend our design to support DAML+OIL entailments. The most significant aspect of our approach is the use of a description logic reasoner to precompute the subsumption hierarchy. We describe a lightweight implementation that makes use of a common RDBMS (MS Access) and the FaCT description logic reasoner. Surprisingly, this simple approach provides good results for extensional queries over a large set of DAML+OIL data that commits to a representative ontology of moderate complexity. As such, we expect such systems to be adequate for personal or small-business usage.
Qasem, Abir and Heflin, Jeff
.
Efficient Knowledge Management by Extending the Semantic Web with Local Completeness Reasoning.
AIS SIGSEMIS Bulletin.
1(
2)
2004.
pp.25-28.
Abstract
|Full paper local copy
| OWL annotation
In our work we have adapted LCW, a formalism commonly used to find relevant answers from an incomplete database, to characterize redundant information on the Semantic Web. We postulate that this representation will increase the efficiency of knowledge management on the Semantic Web. We have built a proof-of-concept system to explore the feasibility of this approach.
Qasem, Abir,
Heflin, Jeff and Munoz-Avila, Hector
.
Efficient Source Discovery and Service Composition for Ubiquitous Computing Environments.
In Workshop on Semantic Web Technology for Mobile and Ubiquitous Applications, ISWC 2004.
2004.
Abstract
|Full paper local copy
| OWL annotation
To be truly pervasive, the devices in a ubiquitous computing environment have to be able to form a "coalition" without human intervention. The Semantic Web provides the infrastructure for discovery and composition of device functionalities. AI planning has been a popular technology for automatic service discovery and composition in the Semantic Web. However, because the Web is so vast and changes so rapidly, a planning agent cannot make a closed-world assumption. This condition makes it difficult for an agent to know when it has gathered all relevant information or when additional searches may be redundant. To avoid redundancy, we incorporate Local Closed World reasoning with HTN planning to compose Semantic Web services. In addition, when performing information gathering tasks on the Semantic Web, we use Local Closed World reasoning and a concept of "source relevance" to control the search process. We also describe a prototype agent that we have developed.
Guo, Yuanbo,
Heflin, Jeff and Pan, Zhengxiang
.
Benchmarking DAML+OIL Repositories.
Second International Semantic Web Conference, ISWC 2003.
LNCS 2870.
Springer.
2003.
pp.613-627.
Abstract
|Full paper local copy
| OWL annotation
We present a benchmark that facilitates the evaluation of DAML+OIL repositories in a standard and systematic way. This benchmark is intended to evaluate the performance of DAML+OIL repositories with respect to extensional queries over a large data set that commits to a single realistic ontology. It consists of the ontology, customizable synthetic data, a set of test queries, and several performance metrics. Main features of the benchmark include simulated data for the university domain, a repeatable data set that can be scaled to an arbitrary size, and an approach for measuring the degree to which a repository returns complete query answers. We also show a benchmark experiment for the evaluation of DLDB, a DAML+OIL repository that extends a relational database management system with description logic inference capabilities.
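The completeness metric mentioned above reduces to a simple set ratio; a minimal sketch (argument names are illustrative):

```python
# Sketch of a query-answer completeness metric: the fraction of the
# reference answer set that a repository actually returns.

def completeness(returned, reference):
    """Degree to which a repository returns complete query answers."""
    if not reference:
        return 1.0
    return len(set(returned) & set(reference)) / len(set(reference))

print(completeness(returned={"a", "b"}, reference={"a", "b", "c"}))  # ~0.667
```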
Heflin, Jeff,
Hendler, James and Luke, Sean
.
SHOE: A Blueprint for the Semantic Web.
In
Fensel, Dieter,
Hendler, James,
Lieberman, Henry and Wahlster, Wolfgang
(Eds.).
Spinning the Semantic Web.
MIT Press.
Cambridge, MA.
2003.
Abstract
|Full paper local copy
| OWL annotation
The term Semantic Web was coined by Tim Berners-Lee to describe his proposal for "a web of meaning," as opposed to the "web of links" that currently exists on the Internet. To achieve this vision, we need to develop languages and tools that enable machine understandable web pages. The SHOE project, begun in 1995, was one of the first efforts to explore these issues. In this paper, we describe our experiences developing and using the SHOE language. We begin by describing the unique features of the World Wide Web and how they must influence potential Semantic Web languages. Then we present SHOE, a language which allows web pages to be annotated with semantics, describe its syntax and semantics, and discuss our approaches to handling the problems of interoperability in distributed environments and ontology evolution. Finally, we provide an overview of a suite of tools for the Semantic Web, and discuss the application of the language and tools to two different domains.
Kogut, Paul and Heflin, Jeff
.
Semantic Web Technologies for Aerospace.
IEEE Aerospace Conference, Big Sky, MT.
2003.
Abstract
|Full paper local copy
| OWL annotation
Emerging Semantic Web technology such as the DARPA Agent Markup Language (DAML) will support advanced semantic interoperability in the next generation of aerospace architectures. The basic idea of DAML is to mark up artifacts (e.g., documents, sensors, databases, legacy software) so that software agents can interpret and reason with the information. DAML will support the representation of ontologies (which include taxonomies of terms and semantic relations) via extensions to XML. XML alone is not sufficient for agents because it provides only syntactic interoperability that depends on implicit semantic agreements. DAML is the official starting point for the Web Ontology Language, an emerging standard from the World Wide Web Consortium. This paper will cover promising aerospace applications and significant challenges for Semantic Web technologies. Potential applications include higher-level information fusion, collaboration in both operational and engineering environments and rapid systems integration. The challenges that will be discussed include the complexity of ontology development, automation of markup, semantic mismatch between current object-oriented models and Semantic Web ontologies, scalability issues related to reasoning with large knowledge bases and technology transition issues. The paper will explain ongoing research that is focused on addressing these challenges.
Pan, Zhengxiang and Heflin, Jeff
.
DLDB: Extending Relational Databases to Support Semantic Web Queries.
In Workshop on Practical and Scalable Semantic Web Systems, ISWC 2003.
2003.
pp.109-113.
Abstract
|Full paper local copy
| OWL annotation
We present DLDB, a knowledge base system that extends a relational database management system with additional capabilities for DAML+OIL inference. We discuss a number of database schemas that can be used to store RDF data and discuss the tradeoffs of each. Then we describe how we extend our design to support DAML+OIL entailments. The most significant aspect of our approach is the use of a description logic reasoner to precompute the subsumption hierarchy. We describe a lightweight implementation that makes use of a common RDBMS (MS Access) and the FaCT description logic reasoner. Surprisingly, this simple approach provides good results for extensional queries over a large set of DAML+OIL data that commits to a representative ontology of moderate complexity. As such, we expect such systems to be adequate for personal or small-business usage.
Heflin, Jeff and Munoz-Avila, Hector
.
LCW-Based Agent Planning for the Semantic Web.
Ontologies and the Semantic Web. Papers from the 2002 AAAI Workshop WS-02-11.
AAAI Press.
Menlo Park, CA.
2002.
pp.63-70.
Abstract
|Full paper local copy
| OWL annotation
The Semantic Web has the potential to allow software agents to intelligently process and integrate the Web's wealth of information. These agents must plan how to achieve their goals in light of the information available. However, because the Web is so vast and changes so rapidly, the agent cannot make a closed-world assumption. This condition makes it difficult for an agent to know when it has gathered all relevant information or when additional searches may be redundant. We propose to use local closed world (LCW) information to assist these agents. LCW information can be obtained by accessing sources that are described in a Semantic Web language with LCW extensions, or by executing operators that provide exhaustive information. In this paper, we demonstrate how two Semantic Web languages (DAML+OIL and SHOE) can be augmented with the ability to state LCW information. We also show that DAML+OIL can represent many kinds of LCW information even without additional language features. Finally, we describe how ordered task decomposition can be used with LCW information to efficiently plan in distributed information environments.
Copyright Notice
The copyright for many of these publications has been transferred to the publisher. These publishers have given us permission to post preprint versions of the papers. Please see the following copyright notices for restrictions on use of these papers:
- IEEE papers:
Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The official version of the paper can be found at IEEE Xplore.
- Springer papers:
The copyright is held by Springer. The original publication is available at www.springerlink.com.
- AAAI Press papers:
The copyright is held by AAAI.