Analysis and Visualization of Biological Publication Data

Lang, Maren

ISBN: 978-3-8366-0868-8

Printed by Diplomica® Verlag GmbH, Hamburg, 2008

Also: Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany, bachelor thesis, 2006

This work is protected by copyright. The rights thereby established, in particular those of translation, reprinting, public presentation, extraction of figures and tables, broadcasting, microfilming or reproduction by other means, and storage in data processing systems, remain reserved, even where only excerpts are used. Reproduction of this work or of parts of it is permitted, even in individual cases, only within the limits of the statutory provisions of the Copyright Act of the Federal Republic of Germany in its currently valid version. It is in principle subject to remuneration. Violations are subject to the penal provisions of copyright law.

The use of common names, trade names, product designations etc. in this work does not, even without special marking, justify the assumption that such names are to be regarded as free in the sense of trademark and brand protection legislation and may therefore be used by anyone.

The information in this work has been compiled with care. Nevertheless, errors cannot be completely ruled out, and the Diplomarbeiten Agentur, the authors or translators assume no legal responsibility or any liability for any remaining incorrect information and its consequences.

http://www.diplom.de, Hamburg 2008

Printed in Germany

Abstract

The content of today's World Wide Web is not well structured semantically. Everything is built for people, so the data is machine-readable but not machine-understandable.

The Semantic Web provides a solution to this problem through a new form of content structure. One technology for developing the Semantic Web is the Resource Description Framework (RDF). RDF is a language for representing information about resources in the World Wide Web and is particularly intended for representing metadata about Web resources. RDF therefore provides "interoperability" between applications that exchange machine-understandable information on the Web.

In this work, existing biological publication data stored in an object-relational database is transformed into data represented in RDF. The newly created RDF model supports a new kind of query: not only keyword searches but also queries that take semantics into account. An additional advantage of this representation is that it can be expressed not only as triples or in an XML structure but also as directed graphs.

Contents

1 Introduction
  1.1 Semantic Web
  1.2 RDF
    1.2.1 Schema
    1.2.2 RDF Support in Oracle RDBMS
      A Queries under Oracle
    1.2.3 Visualisation of RDF
  1.3 Goal of this work
2 Implementation
  2.1 Part I: PL/SQL package for use of Oracle RDF
    2.1.1 Short instructions for the utilisation of the package RDF_TABLE
      A Overview RDF_TABLE
    2.1.2 Details of Implementation
      A The RDF table
      B The model belonging to a table
      C Inserting triples
      D A table for rules
      E Rulebases
      F The rules index
      G Queries
  2.2 Part II: Connection to the PL/SQL package with Perl
    2.2.1 Perl package rdfmaker
      A Predefined Queries
    2.2.2 Details of implementation
      A Connection to a database
      B SQL statements in Perl
      C Converting results to triple format
      D Converting results to XML format
      E An example: Searching coauthors and their publications
  2.3 Part III: GUI written as Java application
    2.3.1 Instructions for the usage of RdfGui
    2.3.2 Details of implementation
      A Description of the implemented classes
      B Graphical display of results
3 Conclusions
A Schema
B Query
  B.1 Query: searching for keywords of pub1
    B.1.0.1 triples format
    B.1.0.2 xml format

1 Introduction

1.1 Semantic Web

The World Wide Web provides documents that are built for human use. Formats such as HTML and SVG, together with extensions such as JavaScript or Java applets, are designed for presenting information. The content is not well structured semantically: these documents are structured for presentation and are meant for people rather than for computers that process data and information automatically ([TBL01]). Everything is built for people, so the data is machine-readable but not machine-understandable ([OL99]).

The Semantic Web provides a solution to this problem through a new form of structuring the content of the Web. It is not a separate Web but an extension of the existing one. Besides the documents of the Web, there is well-defined additional information that the computer can exploit automatically. This allows search engines to return more selective results for user queries. Current search engines usually return a large number of results, many of which have little or nothing to do with what the user was looking for. Their criterion for assigning a document to the set of relevant documents is the occurrence of one or several keywords. The results could be more precise if additional information relevant to the question were taken into account. For example, if somebody searches for a document by Mr. Miller, the search engine could take into account that the user is looking for an article whose author is named Miller, instead of selecting every article in which the word Miller appears. Adding logic to the Web also means using rules to make inferences, choose courses of action and answer questions, which is the task before the Semantic Web community at the moment.
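The difference between keyword matching and a search that uses author metadata can be illustrated with a minimal Python sketch. The documents, the field names and both functions are invented for illustration; they are not part of the work described here.

```python
# Hypothetical mini-corpus; all field names and documents are illustrative.
documents = [
    {"id": "doc1", "author": "Miller", "text": "Mitochondria in yeast cells"},
    {"id": "doc2", "author": "Smith", "text": "A reply to Miller on mitochondria"},
    {"id": "doc3", "author": "Jones", "text": "The miller and the grain trade"},
]


def keyword_search(docs, word):
    """Return ids of all documents in which the word occurs anywhere."""
    needle = word.lower()
    return [
        d["id"]
        for d in docs
        if needle in d["text"].lower() or needle in d["author"].lower()
    ]


def author_search(docs, name):
    """Return ids of documents whose author metadata equals the name."""
    return [d["id"] for d in docs if d["author"] == name]


print(keyword_search(documents, "Miller"))  # every document mentioning the word
print(author_search(documents, "Miller"))   # only documents written by Miller
```

The keyword search returns all three documents, while the metadata search returns only the one whose author field is Miller, which is the more selective behaviour the paragraph describes.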

One technology for developing the Semantic Web is the eXtensible Markup Language (XML). XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean ([TBL01]). The Semantic Web, however, is meant to enable machines to comprehend semantic documents and data, not human speech and writing. Meaning is expressed by the Resource Description Framework (RDF), which provides the basic building blocks for supporting the Semantic Web.

1.2 RDF

RDF is intended for situations in which information needs to be processed by applications rather than only being displayed to people. It was developed under the auspices of the World Wide Web Consortium (W3C) as a foundation for processing metadata, that is, 'data about data', such as a library catalog that describes publications ([OL99]). RDF is a language for representing information about resources in the World Wide Web and is particularly intended for representing metadata about Web resources, such as the title or author of a Web page ([Fra04]). RDF therefore provides interoperability between applications that exchange machine-understandable information on the Web. It is used in numerous application areas: in resource discovery, to provide more targeted and sophisticated search engine capabilities; in cataloging, to describe the content and content relationships available at a particular Web site or digital library; and by intelligent software agents, to facilitate knowledge exchange. Further applications are content rating, describing collections of pages that represent a single logical document, describing intellectual property rights of Web pages, and expressing the privacy policies of a Web site as well as the privacy preferences of a user. Thus, areas as different as Life Sciences, Digital Libraries, Intelligence, E-Commerce and Personal Information Management can benefit from RDF ([AR05] and [OL99]).

RDF provides a model for describing resources. A resource is any object that can be uniquely identified by a URI (Uniform Resource Identifier). The URI is only an identifier: it does not have to specify a particular protocol, nor does the identified object have to exist physically on the Web. Resources are described through statements, which are represented as triples consisting of a subject (resource), a predicate (property) and an object. The subject can be a document, a part of a document or a person. Subjects are identified by Web identifiers or by a blank node ([Mil98]). A blank node is used if a subject or object is unknown or if the relationship between a subject and an object is unknown. A blank node identifier starts with an underscore, as in _:reference. A property is a characteristic, attribute or relation that describes the subject; it has to be a URI. Properties associated with subjects are identified by property types, which express the relationships of values associated with resources. The value of the property is the object of the triple and can be a URI, a literal or a blank node.

An example of a statement is

<http://www.ontoverse.org/publications/pub1>
<http://www.ontoverse.org/publications/keyword>
<Mitochondria>
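A statement of this kind can be held in a program as a simple three-part record. The following Python sketch is only an illustration: the Triple class and the helper functions are hypothetical, the keyword is treated as a literal (an assumption, since the statement above writes it in angle brackets), and the escaping is simplified compared with the full N-Triples syntax.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI or blank node label such as "_:b1"
    predicate: str  # always a URI
    obj: str        # URI, blank node label, or literal text
    obj_is_literal: bool = False


def term(value: str, is_literal: bool = False) -> str:
    """Format one term in N-Triples style (simplified escaping)."""
    if is_literal:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    if value.startswith("_:"):
        return value  # blank nodes are written as-is
    return f"<{value}>"


def to_ntriples(t: Triple) -> str:
    """Serialize a triple as one N-Triples line."""
    return f"{term(t.subject)} {term(t.predicate)} {term(t.obj, t.obj_is_literal)} ."


BASE = "http://www.ontoverse.org/publications/"
statement = Triple(BASE + "pub1", BASE + "keyword", "Mitochondria", True)
print(to_ntriples(statement))
```

Keeping the literal flag next to the object makes the three possible object kinds (URI, literal, blank node) explicit without needing separate classes.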

Triples can be modelled as a directed graph. Subject and object are nodes of the graph, and the predicate is the directed link from subject to object. Figure 1 shows an example of a triple displayed as a graph.

Figure 1: Triple. The subject is shown as an ellipse, the object as a rectangle, and the link is always directed towards the object.
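The graph view of triples can be sketched in a few lines of Python. The short node names and the title value are made up for readability; the function names are hypothetical and only show one possible way to hold the edges.

```python
from collections import defaultdict

# Illustrative triples; pub1 and the keyword come from the example above,
# the title value is made up.
triples = [
    ("pub1", "keyword", "Mitochondria"),
    ("pub1", "title", "An example title"),
    ("pub2", "keyword", "Mitochondria"),
]


def build_graph(statements):
    """Map each subject node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return dict(graph)


def incoming(statements, node):
    """List (subject, predicate) pairs whose edge points at the node."""
    return [(s, p) for s, p, o in statements if o == node]


graph = build_graph(triples)
print(graph["pub1"])
print(incoming(triples, "Mitochondria"))
```

Because every edge points from subject to object, following outgoing edges answers "what is known about pub1", while collecting incoming edges answers "which publications share this keyword".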

RDF uses the Extensible Markup Language (XML) to represent RDF statements in a machine-processable way. XML lets authors define their own document formats, and RDF defines a specific XML format called RDF/XML. XML is not the only option, however ([W3C]): RDF data can also be represented in Notation3 or N-Triples, formats that consist of triples.
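To give an impression of the RDF/XML form, the following Python sketch writes one statement as an rdf:Description element using only the standard library. The function name and the pub prefix are assumptions chosen for this illustration; the RDF namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PUB_NS = "http://www.ontoverse.org/publications/"


def to_rdfxml(subject_uri, properties):
    """Build an rdf:RDF document with one rdf:Description element."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("pub", PUB_NS)  # prefix name is an assumption
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject_uri}
    )
    for name, literal in properties:
        # Each property becomes a child element holding a literal value.
        ET.SubElement(description, f"{{{PUB_NS}}}{name}").text = literal
    return ET.tostring(root, encoding="unicode")


xml_text = to_rdfxml(PUB_NS + "pub1", [("keyword", "Mitochondria")])
print(xml_text)
```

The subject becomes the rdf:about attribute, the predicate becomes the element name and the object becomes the element content, so the same triple is recoverable from either notation.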

1.2.1 Schema

RDF schema statements define the vocabulary for a certain domain. RDF statements alone do not make assumptions about any particular application domain, nor do they define the semantics of any domain. RDF Schema (RDFS) therefore lets the user determine a syntax for common data exchange. Properties and relations of the resources that exist in the domain can also be stipulated, so that the statements given in an RDF data model can be interpreted. This information is represented using the concepts of classes, properties and attributes ([BR00]). Schemas themselves may be written in RDF. RDF schemas can be viewed as entity-relationship diagrams ([OL99]). Schemas only describe the relations among terms; they cannot provide an "environment filled with life". For that, statements together with an appropriate schema are required.
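How a schema and statements work together can be hinted at with a small Python sketch. The class names, properties and resource types are hypothetical. Note also that the check is a simplification: RDFS itself uses domain and range declarations to infer the types of resources rather than to reject data.

```python
# Hypothetical schema: property name -> (domain class, range class).
SCHEMA = {
    "creator": ("Publication", "Person"),
    "keyword": ("Publication", "Literal"),
}

# Hypothetical typing of resources; literals are not listed here.
TYPES = {"pub1": "Publication", "miller": "Person"}


def violations(statements):
    """Return the statements that do not fit the schema."""
    bad = []
    for subject, predicate, obj in statements:
        if predicate not in SCHEMA:
            bad.append((subject, predicate, obj))
            continue
        domain, range_ = SCHEMA[predicate]
        subject_ok = TYPES.get(subject) == domain
        # Anything without a declared type is treated as a literal.
        object_ok = TYPES.get(obj, "Literal") == range_
        if not (subject_ok and object_ok):
            bad.append((subject, predicate, obj))
    return bad


statements = [
    ("pub1", "creator", "miller"),
    ("pub1", "keyword", "Mitochondria"),
    ("miller", "keyword", "Mitochondria"),  # wrong domain
    ("pub1", "publisher", "Somebody"),      # property not in the schema
]
print(violations(statements))
```

The schema alone contains no publications, and the statements alone carry no notion of which properties make sense; only the combination allows the statements to be interpreted.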

1.2.2 RDF Support in Oracle RDBMS

The Oracle Spatial Network Data Model (NDM) is one of the features provided with Oracle Spatial 10g ([Mur05]). The NDM provides a solution for storing, managing and analysing directed and undirected networks or graphs in the database. This allows RDF data to be managed as objects and analysed as networks. RDF graphs are modelled as a directed logical network in NDM ([Mur05]). In this network, the subjects and objects of triples are mapped to nodes, and predicates are mapped to links directed towards the object nodes ([AR05]). All RDF triples of all RDF models are parsed and stored in global tables under a central schema. Nodes are stored only once, regardless of the number of triples in which they participate. Whenever a new triple is inserted, only references (IDs) to it are stored in the user-defined application tables ([AR05]).
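The storage idea, nodes kept once centrally and application tables holding only references, can be mimicked in a toy Python class. This is a sketch of the principle under simplifying assumptions, not a description of Oracle's internal tables; all names are hypothetical.

```python
class TripleStore:
    """Toy store: each node value is kept once, triples hold only IDs."""

    def __init__(self):
        self.node_ids = {}   # node value -> node ID
        self.nodes = []      # node ID -> node value
        self.links = []      # (subject ID, predicate, object ID)

    def _intern(self, value):
        """Return the ID of a node, creating it on first use."""
        if value not in self.node_ids:
            self.node_ids[value] = len(self.nodes)
            self.nodes.append(value)
        return self.node_ids[value]

    def insert(self, subject, predicate, obj):
        """Store a triple and return its position as a reference ID."""
        link = (self._intern(subject), predicate, self._intern(obj))
        self.links.append(link)
        return len(self.links) - 1

    def resolve(self, triple_id):
        """Turn a stored reference back into a readable triple."""
        s, p, o = self.links[triple_id]
        return (self.nodes[s], p, self.nodes[o])


store = TripleStore()
# Stands in for a user-defined application table that keeps only references.
application_table = [
    store.insert("pub1", "keyword", "Mitochondria"),
    store.insert("pub2", "keyword", "Mitochondria"),
]
print(len(store.nodes))                    # "Mitochondria" is stored once
print(store.resolve(application_table[1]))
```

Although the keyword occurs in two triples, the node list contains it a single time, which is the space-saving effect of storing nodes uniquely.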

An RDF triple (subject, property, object) is treated as one database object. Oracle offers two object types for representing RDF data. The first, SDO_RDF_TRIPLE, represents RDF data directly in triple format but is unsuitable for this work because it does not store the data persistently. The second, SDO_RDF_TRIPLE_S (the S stands for storage), was chosen for this work because it stores RDF data persistently in the database. It holds only references to the data, which are stored, as mentioned above, in the central RDF schema ([Mur05]). Further database objects are models, rulebases and the rules index.

A model is an RDF graph that consists of a set of triples and is associated with one base table. The Oracle RDF data model allows users to create rules that refine and improve the definition of relationships between subjects and objects, for example rules for hierarchical relationships in the data. Rules are not written against concrete triples but against the RDF Schema supported by Oracle, so that they apply generally to all triples. A rule consists of an IF side pattern for the antecedents, an optional filter condition that further restricts the subgraphs matched by the IF side pattern, and a THEN side pattern for the consequents. Rulebases contain these user-defined rules. In addition, Oracle provides two predefined rulebases, RDF and RDFS. A rules index is an entailed graph created on an RDF dataset consisting of an RDF model and a rulebase. It has to be created before the rulebase can be used in a query.
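The interplay of IF pattern, filter and THEN pattern, and the idea of an entailed graph, can be sketched in Python as a small forward-chaining loop. The rule shown (two different creators of one publication are coauthors) and all names are hypothetical; the sketch does not reproduce how Oracle evaluates rulebases or builds a rules index.

```python
def match(pattern, triple, binding):
    """Extend a variable binding so the pattern equals the triple, or None."""
    result = dict(binding)
    for p, value in zip(pattern, triple):
        if p.startswith("?"):
            if result.setdefault(p, value) != value:
                return None
        elif p != value:
            return None
    return result


def apply_rule(triples, if_patterns, then_pattern, keep=lambda b: True):
    """Return the consequents of one IF / filter / THEN rule."""
    bindings = [{}]
    for pattern in if_patterns:
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match(pattern, t, b)) is not None
        ]
    return {
        tuple(b.get(term, term) for term in then_pattern)
        for b in bindings
        if keep(b)
    }


def entail(triples, rules):
    """Apply the rules until nothing new appears (the entailed graph)."""
    graph = set(triples)
    while True:
        new = set()
        for rule in rules:
            new |= apply_rule(graph, *rule) - graph
        if not new:
            return graph
        graph |= new


# Hypothetical rule: two different creators of one publication are coauthors.
coauthor_rule = (
    [("?p", "creator", "?a"), ("?p", "creator", "?b")],  # IF side pattern
    ("?a", "coauthorOf", "?b"),                          # THEN side pattern
    lambda b: b["?a"] != b["?b"],                        # filter condition
)
data = [("pub1", "creator", "Miller"), ("pub1", "creator", "Smith")]
entailed = entail(data, [coauthor_rule])
print(sorted(entailed - set(data)))
```

Computing the entailed graph once and querying it afterwards corresponds to the role the rules index plays: the inferred triples are available to queries without re-running the rules each time.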

Queries under Oracle

For querying RDF statements, Oracle provides a table function called SDO_RDF_MATCH. It is invoked as follows:

SELECT x, y FROM TABLE(SDO_RDF_MATCH('(?x :predicate ?y)',
    SDO_RDF_Models('publications'),
    SDO_RDF_Rulebases('RDFS', 'publications_rb'), NULL))

This statement returns all subjects and objects that are connected by the predicate :predicate. The subject or the object can be replaced by a known concrete resource to get a more specific result.
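What such a pattern query does can be imitated on an in-memory list of triples. The Python function below is a hypothetical stand-in that handles a single pattern only; it ignores models, rulebases and aliases and is not Oracle's implementation.

```python
def rdf_match(pattern, triples):
    """Return one dict of variable bindings per triple matching the pattern."""
    rows = []
    for triple in triples:
        row = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds to the value; a repeated variable must agree.
                if row.setdefault(term[1:], value) != value:
                    break
            elif term != value:
                break
        else:
            rows.append(row)
    return rows


# Illustrative data; the title and the second keyword are made up.
triples = [
    (":pub1", ":keyword", "Mitochondria"),
    (":pub2", ":keyword", "Yeast"),
    (":pub1", ":title", "An example title"),
]

# Both positions open: every subject/object pair for the predicate.
print(rdf_match(("?x", ":keyword", "?y"), triples))
# Subject fixed to a concrete resource: a more specific result.
print(rdf_match((":pub1", ":keyword", "?y"), triples))
```

Each variable in the pattern becomes a column of the result, which mirrors how the x and y of the SQL statement appear in the select list.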

1.2.3 Visualisation of RDF

As mentioned before, RDF data can be visualized as graphs. Several tools available on the World Wide Web produce graphs from triples or from XML files written in RDF. These graphs are directed, with nodes for resources and literals and edges that link these nodes and represent the properties. One example of such a tool is IsaViz. It provides an environment for visualizing RDF models as graphs, with resources drawn as ellipses and literals as rectangles, and supports zooming and navigation in the graph. It accepts input files written in Notation 3, N-Triples or RDF/XML.

RDF-Gravity stands for RDF GRAph Visualization Tool and is a tool for visualizing directed graphs that are built in RDF or OWL. It has zoom and selection functions like IsaViz, with the additional advantage that global and local filters can be applied to get more specific views of a graph ([SG]). It accepts RDF models only as XML files.

1.3 Goal of this work

The motivation for this work comes from the Ontoverse project, which develops ontologies for the Life Sciences ([OV006]) and seeks solutions for a new network of information and knowledge management. One part of it is an object-relational database into which publications fetched from PubMed ([PM]) are inserted. PubMed is part of the National Center for Biotechnology Information ([NCB]) and is the best-known database for biological publication data. Publication data describes a publication; it comprises creators, title, date of publication, PubMed ID, references to other publications and keywords. Neither PubMed nor the object-relational database of Ontoverse is written in RDF. Because Ontoverse is a project about ontologies and therefore uses OWL, the Web Ontology Language, which includes RDF vocabulary, the question was whether the object-relational database could be enhanced with RDF functionality. Data could be taken from the object-relational database and used to build a schema for the RDF model. This schema was developed according to Dublin Core, a metadata vocabulary for describing resources on the Internet that enables more intelligent information discovery systems ([Hil] and [DC006]). Figure 2 shows the RDF model; the corresponding schema written in XML can be found in the Appendix (Section A). The RDF model and schema contain one attribute that is not drawn directly from the original publication data: the row references is split into two predicates for the RDF table, references and citatedBy. The y of (:pub3 :references ?y) is a publication that references publication 3, whereas the y of (:pub3 :citatedBy ?y) is a publication that is referenced by publication 3.

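[Figure 2 graphic: the resource http://www.ontoverse.org/publications/pub1 with edges to its attributes. Legible labels: Keyword, Title, Group, Creators, Pubmed-id; the remaining labels (Key…rds, Journal-…, Referen…, Citate…) are truncated.]

Figure 2: RDF model. These are the attributes that describe a publication. Later, these will be the predicates.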

In this work, possibilities of improving the object-relational database by adding an RDF model are to be evaluated, in particular enhanced data acquisition and interpretation and the possibility of displaying query results. Oracle supports RDF through the Oracle Spatial network data model, but it is not easy to become familiar with all the individual functions, so their use needs to be simplified. Therefore a PL/SQL package has to be implemented for better management of data in RDF. It shall enable the user to create an RDF model and an associated table and to insert RDF triples from an existing database. The package also has to contain procedures that build rulebases and the corresponding rules indexes for better querying of the data. Called from a Perl program, the package shall also create files with the results of the queries: one with triples and one with the same data in XML form. The whole package can be started from a graphical user interface written in Java, which can also illustrate the results by calling a graph visualization program.
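The core step, turning rows of an existing database into triples, can be outlined in Python. The row layout, column names and values below are invented for illustration and do not reflect the actual Ontoverse tables; only the base URI is taken from the statement example in Section 1.2.

```python
BASE = "http://www.ontoverse.org/publications/"

# Hypothetical relational rows; column names and values are illustrative.
rows = [
    {"id": "pub1", "title": "An example title", "keywords": ["Mitochondria", "Yeast"]},
    {"id": "pub2", "title": "Another title", "keywords": []},
]


def row_to_triples(row):
    """Turn one publication row into (subject, predicate, object) triples."""
    subject = BASE + row["id"]
    triples = [(subject, BASE + "title", row["title"])]
    for keyword in row["keywords"]:
        # A multi-valued column yields one triple per value.
        triples.append((subject, BASE + "keyword", keyword))
    return triples


all_triples = [t for row in rows for t in row_to_triples(row)]
for s, p, o in all_triples:
    print(f'<{s}> <{p}> "{o}" .')
```

Each column becomes a predicate and each row becomes a subject, so a single row with several keywords expands into several triples that share the same subject.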

2 Implementation

The publication data exists in a relational schema and has to be transformed into RDF form. For this purpose, a package is implemented in PL/SQL that uses the Oracle RDF functions. It provides functionality for creating a new RDF model as well as new rules, and for running queries over this new dataset. The second part connects the PL/SQL package to a Perl program. It makes selecting procedures of the PL/SQL package more convenient and connects to the database with several example queries. It also allows the use of the Perl module RDF::Notation3::XML, with which files containing triples can be converted into XML files. The third part of the implementation is a graphical user interface written in Java for more comfortable use of the whole program, including graphical views of the results. Each package can be used independently together with the packages beneath it: the PL/SQL package can be used directly in the database; adding the Perl program yields a command-line version; and all three parts together offer the most comfortable version.

2.1 Part I: PL/SQL package for use of Oracle RDF

PL/SQL stands for Procedural Language/Structured Query Language. PL/SQL is closely integrated with the SQL language, yet it adds programming constructs that are not native to SQL. It is a procedural extension of Oracle SQL that offers language constructs similar to those in imperative programming languages. It allows users and designers to develop complex database applications that require control structures and procedural elements such as procedures, functions and modules. Functions and procedures can be grouped into packages. PL/SQL code can be run as anonymous blocks or stored under a name in Oracle databases as so-called stored procedures. Stored procedures are efficient because they are compiled once and stored in executable form, and they are cached and shared among users ([SU04]).

PL/SQL is a block-structured language. The basic program unit is the block, and every PL/SQL program consists of at least one block. The only mandatory part of a block is the executable section, which contains the PL/SQL statements; in its minimal form it is a single statement framed by BEGIN and END. There are two optional sections. The first is the declaration section, which lists the variables used in the block together with the types of data they hold. Cursors are declared in this section, too. The second is the exception-handling section, which traps errors raised during program execution.

Thus, the structure looks as follows:

CREATE OR REPLACE PROCEDURE x IS
    -- declaration section: constants, variables, cursors,
    -- user-defined exceptions
BEGIN
    -- PL/SQL statements
    NULL;
EXCEPTION
    WHEN OTHERS THEN
        -- exception handling
        RAISE;
END;
/

PL/SQL supports both static and dynamic SQL. The syntax of static SQL statements is known at precompile time and they are prepared before runtime, whereas the syntax of dynamic SQL statements is not known until runtime. Dynamic SQL is a programming technique that makes an application more flexible and versatile, because the program can be written without knowing, for example, table names in advance. This is important for procedures in which the user is asked to supply such names ([SU04]).
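The distinction can be demonstrated outside PL/SQL as well. The Python sketch below uses the standard sqlite3 module as a stand-in database, which is an assumption made purely for illustration; the table and function names are hypothetical. The first query has a fixed statement text, the second builds its text from a table name that is only known at run time.

```python
import re
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE publications (id TEXT, title TEXT)")
connection.execute("INSERT INTO publications VALUES ('pub1', 'An example title')")

# Static SQL: the statement text is fixed when the program is written.
static_count = connection.execute("SELECT COUNT(*) FROM publications").fetchone()[0]


def count_rows(conn, table_name):
    """Dynamic SQL: the table name is only known at run time."""
    # Identifiers cannot be bound as parameters, so validate before
    # building the statement text.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"invalid table name: {table_name!r}")
    return conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]


print(static_count)
print(count_rows(connection, "publications"))
```

Because a user-supplied name ends up inside the statement text, checking it before use is advisable whenever dynamic SQL is built from user input.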

Details

Publication form: Original edition
Year of publication: 2006
ISBN (eBook): 9783836608688
DOI: 10.3239/9783836608688
File size: 1.3 MB
Language: English
Institution / University: Heinrich-Heine-Universität Düsseldorf – Faculty of Mathematics and Natural Sciences, Computer Science
Publication date: 2008 (January)
Grade: 2.0
Keywords: semantic support biological publication data package
Product safety: Diplom.de