
Analysis and Visualization of Biological Publication Data

©2006 Bachelor's thesis, 48 pages

Summary

Abstract:
The content of today’s World Wide Web is not semantically well structured. Everything is built for people, so the data is machine-readable but not machine-understandable. The Semantic Web provides a solution to this problem through a new form of content structure. One technology for developing the Semantic Web is the Resource Description Framework (RDF).
RDF is a language for representing information about resources in the World Wide Web and is particularly intended for representing metadata about Web resources. RDF therefore provides ‘interoperability’ between applications that exchange machine-understandable information on the Web. In this work, existing biological publication data, which is stored in an object-relational database, is transformed into data represented in RDF. The newly created RDF model makes a new kind of query possible: not only keyword searches but also queries that carry semantic meaning. An additional advantage of this representation is that it can be expressed not only as triples or in an XML structure but also as directed graphs.

Excerpt

Table of Contents


Maren Lang
Analysis and Visualization of Biological Publication Data
ISBN: 978-3-8366-0868-8
Printed by Diplomica® Verlag GmbH, Hamburg, 2008
Also: Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany, bachelor's thesis, 2006
This work is protected by copyright. The rights established thereby, in particular those of translation, reprinting, public presentation, extraction of figures and tables, broadcasting, microfilming or reproduction by other means, and storage in data processing systems, remain reserved, even where only excerpts are used. Reproduction of this work or of parts of this work is permitted, even in individual cases, only within the limits of the statutory provisions of the Copyright Act of the Federal Republic of Germany in its currently applicable version. It is in principle subject to remuneration. Violations are subject to the penal provisions of copyright law.
The reproduction of common names, trade names, product designations, etc. in this work does not, even without special identification, justify the assumption that such names are to be regarded as free within the meaning of trademark and brand protection legislation and may therefore be used by anyone.
The information in this work was compiled with care. Nevertheless, errors cannot be completely ruled out, and the Diplomarbeiten Agentur, the authors or translators assume no legal responsibility or any liability for any remaining incorrect information and its consequences.
© Diplomica Verlag GmbH
http://www.diplom.de, Hamburg 2008
Printed in Germany

Abstract
The content of today's World Wide Web is not semantically well structured. Everything is built for people, so the data is machine-readable but not machine-understandable.
The Semantic Web provides a solution to this problem through a new form of content structure. One technology for developing the Semantic Web is the Resource Description Framework (RDF). RDF is a language for representing information about resources in the World Wide Web and is particularly intended for representing metadata about Web resources. RDF therefore provides "interoperability" between applications that exchange machine-understandable information on the Web.
In this work, existing biological publication data, which is stored in an object-relational database, is transformed into data represented in RDF. The newly created RDF model makes a new kind of query possible: not only keyword searches but also queries that carry semantic meaning. An additional advantage of this representation is that it can be expressed not only as triples or in an XML structure but also as directed graphs.

Contents
1 Introduction
  1.1 Semantic Web
  1.2 RDF
    1.2.1 Schema
    1.2.2 RDF Support in Oracle RDBMS
      A Queries under Oracle
    1.2.3 Visualisation of RDF
  1.3 Goal of this work
2 Implementation
  2.1 Part I: PL/SQL package for use of Oracle RDF
    2.1.1 Short instructions for the utilisation of the package RDF_TABLE
      A Overview RDF_TABLE
    2.1.2 Details of Implementation
      A The RDF table
      B The model belonging to a table
      C Inserting triples
      D A table for rules
      E Rulebases
      F The rules index
      G Queries
  2.2 Part II: Connection to the PL/SQL package with Perl
    2.2.1 Perl package rdfmaker
      A Predefined Queries
    2.2.2 Details of implementation
      A Connection to a database
      B SQL statements in Perl
      C Converting results to triple format
      D Converting results to XML format
      E An example: Searching coauthors and their publications
  2.3 Part III: GUI written as Java application
    2.3.1 Instructions for the usage of RdfGui
    2.3.2 Details of implementation
      A Description of the implemented classes
      B Graphical display of results
3 Conclusions
A Schema
B Query
  B.1 Query: searching for keywords of pub1
    B.1.0.1 triples format
    B.1.0.2 xml format

1 Introduction
1.1 Semantic Web
The World Wide Web provides documents that are built for human use. Formats such as HTML and SVG, and extensions such as JavaScript or Java applets, are made for presenting information. The content is not semantically well structured: these documents are structured for presentation and are meant for people rather than for computers that process data and information automatically ([TBL01]). Everything is built for people, so the data is machine-readable but not machine-understandable ([OL99]).
The Semantic Web provides a solution to this problem through a new way of structuring the content of the Web. It is not a separate Web but an extension of the existing one. Besides the documents of the Web there is well-defined additional information that a computer can exploit automatically. This will let search engines return more selective results in answer to user queries. Current search engines normally return a large number of results, many of which have little or nothing to do with what the user was looking for. Their criterion for assigning a document to the set of relevant documents is the occurrence of one or several keywords. The results could be more precise if additional information concerning the question were considered. For example, if somebody searches for a document by Mr. Miller, the search engine could take into account that the user is looking for an article with an author named Miller, rather than selecting all articles in which the word Miller appears. Adding logic to the Web also means using rules to make inferences, choose courses of action and answer questions, which is the task before the Semantic Web community at the moment.
One technology for developing the Semantic Web is the eXtensible Markup Language (XML). XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean ([TBL01]). The Semantic Web, however, is meant to enable machines to comprehend semantic documents and data, not human speech and writing. Meaning is expressed by the Resource Description Framework (RDF), which provides the basic building blocks for supporting the Semantic Web.
1.2 RDF
RDF is intended for situations in which information needs to be processed by applications rather than only displayed to people. It was developed under the auspices of the World Wide Web Consortium (W3C) and is a foundation for processing metadata, that is, 'data about data', like a library catalog that describes publications ([OL99]). RDF is a language for representing information about resources in the World Wide Web and is particularly intended for representing metadata about Web resources, such as the title or author of a Web page ([Fra04]). RDF therefore provides interoperability between applications that exchange machine-understandable information on the Web. RDF is used in numerous application areas. In resource discovery it provides more targeted and sophisticated search engine capabilities. In cataloging it describes the content and the content relationships available at a particular Web site or digital library. Intelligent software agents use it to facilitate knowledge exchange. Further applications are content rating, describing collections of pages that represent a single logical document, describing intellectual property rights of Web pages, and expressing the privacy policies of a Web site as well as the privacy preferences of a user. Thus, areas as different as the life sciences, digital libraries, intelligence, e-commerce and personal information management can benefit from RDF ([AR05] and [OL99]).
RDF provides a model for describing resources. A resource is any object that can be uniquely identified by a Uniform Resource Identifier (URI). The URI is only an identifier: it does not have to specify a particular protocol, nor does the identified object have to exist physically on the Web. Resources are described through statements, which are represented as triples consisting of a subject (resource), a predicate (property) and an object. The subject can be a document, a part of a document or a person. Subjects are identified by Web identifiers or by a blank node ([Mil98]). A blank node is used if a subject or object is unknown or if the relationship between a subject and an object is unknown. A blank node label starts with an underscore, as in _:reference. A property is a characteristic, attribute or relation that can be used to describe the subject. It has to be a URI. Properties associated with subjects are identified by property types, which express the relationships of values associated with resources. The value of the property is the object of the triple and can be a URI, a literal or a blank node.
An example of a statement is
<http://www.ontoverse.org/publications/pub1>
<http://www.ontoverse.org/publications/keyword>
<Mitochondria>
Triples can be modelled as a directed graph. Subject and object are nodes of the graph, and the predicate is the directed link from subject to object.
Figure 1 shows an example of a triple displayed as a graph.
Figure 1: Triple. The subject is shown as an ellipse and the object as a rectangle; the link is always directed towards the object.
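The mapping from triples to a directed graph can be sketched in a few lines of Python. This is only an illustration and not part of the thesis implementation; the function name build_graph and the second sample triple (the title predicate and its value) are assumptions made for the example.

```python
from collections import defaultdict

BASE = "http://www.ontoverse.org/publications/"

# The first triple is the statement from the text; the second is assumed sample data.
triples = [
    (BASE + "pub1", BASE + "keyword", "Mitochondria"),
    (BASE + "pub1", BASE + "title", "An example title"),
]


def build_graph(statements):
    """Map each subject node to its outgoing edges (predicate, object)."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        # The predicate labels a link that points from the subject to the object.
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(triples)
```

Here graph has one subject node, pub1, with two outgoing links, one per predicate.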
RDF uses the Extensible Markup Language (XML) to represent RDF statements in a machine-processable way. XML was designed for writing documents in self-defined formats, and RDF defines a specific XML format called RDF/XML. XML is not the only option, however ([W3C]): RDF data can also be represented in Notation3 or N-Triples, which are formats that consist of triples.
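As a rough illustration of a triple-based format, the following Python sketch writes one statement as an N-Triples line: URIs are enclosed in angle brackets, literals in double quotes, and the line ends with a full stop. The helper name to_ntriples is hypothetical, and the escaping covers only backslashes and double quotes, which is an assumption that suffices for simple literals.

```python
def to_ntriples(subject, predicate, obj, literal=True):
    """Serialise one statement as a single N-Triples line."""
    if literal:
        # Minimal escaping only; line breaks and non-ASCII text are not handled here.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = '"' + escaped + '"'
    else:
        obj_term = "<" + obj + ">"
    return "<%s> <%s> %s ." % (subject, predicate, obj_term)


line = to_ntriples(
    "http://www.ontoverse.org/publications/pub1",
    "http://www.ontoverse.org/publications/keyword",
    "Mitochondria",
)
```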

1.2.1 Schema
RDF statements alone make no assumptions about any particular application domain, nor do they define the semantics of any domain; a schema is needed to define the vocabulary for a given domain. RDF Schema (RDFS) therefore lets the user fix a common syntax for data exchange. The properties and relations of the resources that exist in the domain can also be stipulated, so that the statements given in an RDF data model can be interpreted. This information is represented using the concepts of classes, properties and attributes ([BR00]). Schemas themselves may be written in RDF. RDF schemas resemble entity-relationship diagrams ([OL99]). A schema only describes the relations among terms; it cannot provide an "environment filled with life". For that, statements together with an appropriate schema are required.
1.2.2 RDF Support in Oracle RDBMS
The Oracle Spatial Network Data Model (NDM) is one of the features provided with Oracle Spatial 10g ([Mur05]). The NDM provides a solution for storing, managing and analysing directed and undirected networks or graphs in the database. This allows RDF data to be managed as objects and analysed as networks. RDF graphs are modelled as a directed logical network in NDM ([Mur05]). In this network, the subjects and objects of triples are mapped to nodes, and predicates are mapped to links directed towards the object nodes ([AR05]). All RDF triples of all RDF models are parsed and stored in global tables under a central schema. Nodes are stored only once, regardless of the number of triples in which they participate. Whenever a new triple is inserted, only references (IDs) to the triple are stored in the user-defined application tables ([AR05]).
An RDF triple (subject, property, object) is treated as one database object. Oracle offers two object types for representing RDF data. The first, SDO_RDF_TRIPLE, represents RDF data directly in triple format, but it is unsuitable for this work because the data is not stored persistently. The second, SDO_RDF_TRIPLE_S (the S stands for storage), was chosen for this work because it stores RDF data persistently in the database. It holds only references to the data, which is stored, as mentioned above, in the central RDF schema ([Mur05]). Further database objects are models, rulebases and rules indexes.
A model is an RDF graph that consists of a set of triples and is associated with one base table. The Oracle RDF data model allows rules to be created that refine and improve the definition of relationships between subjects and objects, for example rules for hierarchical relationships in the data. Rules are not written against concrete triples but against the RDF Schema vocabulary supported by Oracle, so that they apply generally to all triples. A rule consists of an IF-side pattern for the antecedents, an optional filter condition that further restricts the subgraphs matched by the IF-side pattern, and a THEN-side pattern for the consequents. Rulebases contain these user-defined rules. Oracle provides two predefined rulebases, RDF and RDFS. A rules index is an entailed graph that is created on an RDF dataset consisting of an RDF model and a rulebase. It has to be created before the rulebase can be used in a query.
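How a rule with an IF-side pattern, an optional filter and a THEN-side pattern yields entailed triples can be sketched in plain Python. This is a simplified illustration with a single antecedent pattern and does not reproduce Oracle's rule engine; the predicates :creator and :authorOf, the sample data and the function names are assumptions.

```python
def match(pattern, triple):
    """Return the variable bindings if the triple fits the pattern, else None."""
    binding = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must be bound to the same value everywhere it occurs.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def apply_rule(triples, if_pattern, then_pattern, filter_fn=None):
    """Return the triples entailed by one rule that are not already present."""
    inferred = set()
    for triple in triples:
        binding = match(if_pattern, triple)
        if binding is None:
            continue
        if filter_fn is not None and not filter_fn(binding):
            continue
        # Substitute the bound variables into the THEN-side pattern.
        inferred.add(tuple(binding.get(term, term) for term in then_pattern))
    return inferred - set(triples)


# Hypothetical data and rule: IF (?x :creator ?y) THEN (?y :authorOf ?x)
data = [(":pub1", ":creator", ":Miller"), (":pub1", ":keyword", "Mitochondria")]
entailed = apply_rule(data, ("?x", ":creator", "?y"), ("?y", ":authorOf", "?x"))
```

The set entailed plays the role that the rules index plays in the text: it holds the derived statements so that a later query can use them alongside the stored triples.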

Queries under Oracle
For querying RDF statements, Oracle provides a table function called SDO_RDF_MATCH. It is constructed as follows:
SELECT x, y FROM TABLE(SDO_RDF_MATCH(
'(?x :predicate ?y)',
SDO_RDF_Models('publications'),
SDO_RDF_Rulebases('RDFS', 'publications_rb'),
NULL));
This statement returns all subjects and objects that are connected by the predicate :predicate. The subject or the object can be replaced by a known concrete resource to get a more specific result.
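The idea behind such a pattern query can be illustrated without a database. The Python sketch below matches one triple pattern against an in-memory list of triples and returns one row of variable bindings per match. It mimics only the pattern matching, not Oracle's table function; the name rdf_match and the sample triples are assumptions.

```python
def rdf_match(pattern, triples):
    """Return one dict of variable bindings for every triple that fits the pattern."""
    rows = []
    for triple in triples:
        row = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if row.setdefault(term, value) != value:
                    break  # the same variable would need two different values
            elif term != value:
                break  # a concrete term does not match
        else:
            rows.append(row)
    return rows


# Hypothetical sample data
triples = [
    (":pub1", ":keyword", "Mitochondria"),
    (":pub2", ":keyword", "Apoptosis"),
    (":pub1", ":title", "An example title"),
]

all_keywords = rdf_match(("?x", ":keyword", "?y"), triples)
pub1_keywords = rdf_match((":pub1", ":keyword", "?y"), triples)
```

Replacing ?x with the concrete resource :pub1 narrows the result to the keywords of that one publication, which corresponds to the more specific query described above.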
1.2.3 Visualisation of RDF
As mentioned before, RDF data can be visualised as graphs. Several tools available on the World Wide Web produce graphs from triples or from XML files written in RDF. These graphs are directed, with nodes for resources and literals and with edges that link these nodes and represent the properties. One such tool is IsaViz. It provides an environment for visualising RDF models as graphs, with resources drawn as ellipses and literals as rectangles, and it supports zooming and navigation in the graph. Possible input formats are Notation3, N-Triples and RDF/XML.
RDF-Gravity, short for RDF Graph Visualization Tool, visualises directed graphs that are built in RDF or OWL. It has zoom and selection functions like IsaViz, with the additional advantage that global and local filters can be applied to get more specific views of a graph ([SG]). XML files are the only accepted input format for RDF models.
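The drawing convention described here (ellipses for resources, rectangles for literals, labelled directed edges for properties) can also be written down as Graphviz DOT text. The sketch below uses DOT only as a stand-in to make the convention concrete; it is not the input format of IsaViz or RDF-Gravity. The function name to_dot and the shortened node names are assumptions, and double quotes inside names are not escaped.

```python
def to_dot(statements):
    """Render (subject, predicate, object, is_literal) tuples as Graphviz DOT text."""
    lines = ["digraph rdf {"]
    for subject, predicate, obj, is_literal in statements:
        shape = "box" if is_literal else "ellipse"
        lines.append('  "%s" [shape=ellipse];' % subject)
        lines.append('  "%s" [shape=%s];' % (obj, shape))
        # The edge points from the subject to the object and carries the predicate.
        lines.append('  "%s" -> "%s" [label="%s"];' % (subject, obj, predicate))
    lines.append("}")
    return "\n".join(lines)


dot = to_dot([("pub1", "keyword", "Mitochondria", True)])
```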
1.3 Goal of this work
The motivation for this work comes from the Ontoverse project, which develops ontologies for the life sciences ([OV006]) and seeks solutions for a new form of networked information and knowledge management. One part of it is an object-relational database into which publications fetched from PubMed ([PM]) are inserted. PubMed is a service of the National Center for Biotechnology Information ([NCB]) and the best-known database for biological publication data. Publication data describes a publication and comprises its creators, title, date of publication, PubMed ID, references to other publications and keywords. Neither PubMed nor the object-relational database of Ontoverse is written in RDF. Because Ontoverse is a project about ontologies and therefore uses OWL, the Web Ontology Language, which includes RDF vocabulary, the question was whether the object-relational database could be enhanced by adding RDF functionality. Data could be taken from the object-relational database and used to build a schema for the RDF model. This schema was developed according to Dublin Core, a metadata vocabulary for describing resources on the Internet that enables more intelligent information discovery systems ([Hil] and [DC006]). Figure 2 shows the RDF model; the corresponding schema written in XML can be seen in the Appendix (Section A). In the RDF model and schema there is one attribute that is not drawn directly from the original publication data: the references attribute is split into two predicates for the RDF table, references and citatedBy. The y of (:pub3 :references ?y) expresses a publication that references publication 3, whereas the y of (:pub3 :citatedBy ?y) describes a publication that is referenced by publication 3.
Figure 2: RDF model for the publication http://www.ontoverse.org/publications/pub1, with the attributes ID, Title, Keywords, Creators, Group, Journal-id, Pubmed-id, Reference and Citated. These are the attributes that describe a publication. Later, these will be the predicates.
This work evaluates how the object-relational database can be improved by adding an RDF model that enhances data acquisition and interpretation and allows query results to be displayed. Oracle supports RDF through the Oracle Spatial network data model, but it is not easy to become familiar with all of the individual functions, so their use needs to be simplified. A PL/SQL package is therefore implemented for better management of data in RDF. It shall enable the user to create an RDF model and an associated table and to insert RDF triples taken from an existing database. The package also has to contain procedures that build rulebases and the corresponding rules indexes for better querying of the data. Called from a Perl program, the package shall also create files with the results of the queries, one with triples and one with the same data in XML. The whole package can be started from a graphical user interface written in Java, which can also display the results by calling a graph visualisation program.

2 Implementation
The publication data exists in a relational schema and has to be transformed into RDF. For this purpose, a package is implemented in PL/SQL that uses the Oracle RDF functions. It provides functionality for creating a new RDF model and new rules and for querying the new dataset. The second part connects the PL/SQL package to a Perl program. The Perl program makes it more convenient to select procedures of the PL/SQL package and connects to the database with several example queries. It also allows the use of the Perl module RDF::Notation3::XML, with which files containing triples can be converted into XML files. The third part is a graphical user interface written in Java, which makes the whole program more comfortable to use and creates graphical views of the results. Each part can be used on its own together with the parts beneath it: the PL/SQL package can be used directly in the database, adding the Perl program gives a command-line version, and all three parts together offer the most comfortable version.
2.1 Part I: PL/SQL package for use of Oracle RDF
PL/SQL stands for Procedural Language/Structured Query Language. PL/SQL is closely integrated into the SQL language, yet it adds programming constructs that are not native to SQL. It is a procedural extension of Oracle SQL that offers language constructs similar to those in imperative programming languages. It allows users and designers to develop complex database applications that require control structures and procedural elements such as procedures, functions and modules. Functions and procedures can be grouped into packages. PL/SQL code can be stored and used in Oracle databases as anonymous blocks or, in named form, as so-called stored procedures. Stored procedures are efficient because they are compiled once and stored in executable form, and they are cached and shared among users ([SU04]).
PL/SQL is a block-structured language. The basic program unit is called a block, and every PL/SQL program consists of at least one block. The mandatory part of a block is the executable section containing the PL/SQL statements; the minimum structure is a single statement framed by BEGIN and END. There are two additional optional sections. The first is the declaration section, which lists the variables used in the block along with the types of data they will hold. Cursors are declared in this section, too. The second is the exception-handling section, which traps errors that are raised during program execution.
Thus, the structure looks as follows:

CREATE PROCEDURE x IS
-- declaration section (optional); an anonymous block starts it with DECLARE instead
constants
variables
cursors
user defined exceptions
BEGIN
PL/SQL statements
EXCEPTION
Exception handling
END;
PL/SQL supports both static and dynamic SQL. The full text of a static SQL statement is known at compile time, so it is prepared before runtime, whereas the text of a dynamic SQL statement is not known until runtime. Dynamic SQL is a programming technique that makes an application more flexible and versatile, because the program can be written without knowing, for example, table names in advance. This is important for procedures in which the user is asked to supply such names ([SU04]).
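The difference can be illustrated outside Oracle with Python's built-in sqlite3 module, used here purely as a stand-in for a database. The table publications, its columns and the function count_rows are assumptions for this sketch. The statement text is fixed in the static case and assembled at runtime from a user-supplied table name in the dynamic case.

```python
import re
import sqlite3


def count_rows(conn, table_name):
    """Dynamic SQL: the table name is only known at runtime."""
    # Identifiers cannot be bound as parameters, so validate before building the text.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError("invalid table name: %r" % table_name)
    return conn.execute("SELECT COUNT(*) FROM " + table_name).fetchone()[0]


conn = sqlite3.connect(":memory:")
# Static SQL: the statement text is fixed when the program is written.
conn.execute("CREATE TABLE publications (id INTEGER, title TEXT)")
conn.execute("INSERT INTO publications VALUES (?, ?)", (1, "An example title"))

total = count_rows(conn, "publications")
```

Checking the name before it is concatenated into the statement is a design choice of this sketch; without such a check, a user-supplied name could inject arbitrary SQL.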

Details

Pages: 48
Type of edition: Original edition
Year: 2006
ISBN (eBook): 9783836608688
DOI: 10.3239/9783836608688
File size: 1.3 MB
Language: English
Institution / University: Heinrich-Heine-Universität Düsseldorf, Mathematisch-Naturwissenschaftliche Fakultät, Informatik
Publication date: 2008 (January)
Grade: 2.0
Keywords: semantisches support biological publication data package