Semi-automatic ontology engineering and ontology supported document indexing in a multilingual environment

Lauser, Boris

Summary

Abstract: Introduction:
The management of large amounts of information and knowledge is of ever-increasing importance in today's large organisations. With the growing ease of supplying information online, especially in corporate intranets and knowledge bases, finding the right information becomes an increasingly difficult task. Today's search tools perform rather poorly in the sense that information access is mostly based on keyword searching or even mere browsing of topic areas. This unfocused approach often leads to undesired results. The following example illustrates the problem more clearly: an agricultural scientist would like to find out which organisation established the Agreement on Agriculture. A simple search for "establish Agreement on Agriculture" might return a huge list of documents containing these words, none of which actually contains the desired answer: WTO or World Trade Organisation. The problem becomes even worse if the answer appears only in a foreign-language document.
Semantically annotated documents, i.e. documents that are indexed with ontological terms and concepts instead of simple keywords, provide several advantages. First, the ontological abstraction provides robustness against changes in the document. In the above example, the wording of the document might change to use the term "Agricultural Agreement" instead of "Agreement on Agriculture". However, since the document has been annotated with the ontological semantics, this will not affect the search results. Second, since the ontology used for annotating the document in this example is domain-specific, the meanings and interpretations of keywords are bound to that domain, and retrieval is therefore likely to be more efficient. A term can have several meanings in different domains. By first mapping the keyword to its semantic representation in a specific ontology and then using the ontology's linked knowledge structure, a much more focused search approach can be taken. Third, document-specific representations no longer affect the search. This is extremely important in the case of multilingual representations. Keywords in several languages are mapped to the same concept in an ontology and are therefore given the same meaning. Multilingual search portals can thus be built to produce the same results no matter which language is used for retrieval.
An important task in knowledge management that facilitates the search scenario described above is […]
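The concept-based indexing described in the summary can be illustrated with a minimal sketch. It is not the system built in the thesis: the concept identifiers, the labels, the single relation, and the function names `annotate`, `search`, and `subjects_of` are all hypothetical and chosen only to mirror the Agreement on Agriculture example. Documents and queries are both mapped to concepts, so a change of wording or language does not change the result.

```python
import re

# Hypothetical mini-ontology: concept id -> labels in several languages.
LABELS = {
    "agreement_on_agriculture": [
        "Agreement on Agriculture",   # English
        "Agricultural Agreement",     # English, alternative wording
        "Accord sur l'agriculture",   # French
    ],
    "wto": ["WTO", "World Trade Organisation", "OMC"],
}

# Hypothetical relation triples: (subject, relation, object).
RELATIONS = [("wto", "established", "agreement_on_agriculture")]


def annotate(text):
    """Return the set of concept ids whose labels occur in the text."""
    found = set()
    for concept, labels in LABELS.items():
        for label in labels:
            pattern = r"\b" + re.escape(label) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                found.add(concept)
                break
    return found


def search(query, index):
    """Return ids of documents that share at least one concept with the query."""
    wanted = annotate(query)
    return sorted(doc_id for doc_id, concepts in index.items() if concepts & wanted)


def subjects_of(relation, obj):
    """Follow a relation backwards, e.g. who established a given agreement."""
    return [s for s, r, o in RELATIONS if r == relation and o == obj]


# Made-up sample documents, used only to exercise the sketch.
documents = {
    "d1": "The Agricultural Agreement was negotiated during the Uruguay Round.",
    "d2": "L'Accord sur l'agriculture est entré en vigueur en 1995.",
    "d3": "Rainfall statistics for the 2002 growing season.",
}

# The index stores concepts per document, not keywords.
index = {doc_id: annotate(text) for doc_id, text in documents.items()}
```

With this index, the query "Agreement on Agriculture" retrieves both the reworded English document and the French one, and `subjects_of("established", "agreement_on_agriculture")` answers the scientist's question through the ontology's relation rather than through keyword overlap.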

Excerpt

Table of contents

ID 6905

Lauser, Boris: Semi-automatic ontology engineering and ontology supported document indexing in a multilingual environment

Hamburg: Diplomica GmbH, 2003

Also submitted as: Fachhochschule Südwestfalen, Technische Universität, diploma thesis, 2003

This work is protected by copyright. All rights arising from it, in particular those of translation, reprinting, public presentation, extraction of figures and tables, broadcasting, microfilming or reproduction by other means, and storage in data processing systems, remain reserved, even where only excerpts are used. Reproduction of this work or of parts of it is permitted, even in individual cases, only within the limits of the statutory provisions of the Copyright Act of the Federal Republic of Germany in its currently valid version. Such reproduction is in principle subject to remuneration. Violations are subject to the penal provisions of copyright law.

The use of common names, trade names, product designations, etc. in this work does not, even without special marking, justify the assumption that such names are to be regarded as free within the meaning of trademark and brand protection legislation and may therefore be used by anyone.

The information in this work was compiled with care. Nevertheless, errors cannot be ruled out completely, and the Diplomarbeiten Agentur, the authors, and the translators assume no legal responsibility or any liability for any remaining incorrect information and its consequences.

Diplomica GmbH

http://www.diplom.de, Hamburg 2003

Printed in Germany

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Motivation
1.2 Approach
1.3 Outline

2 THE PROJECT ENVIRONMENT
2.1 FAO and the AOS
2.2 Information management at the FAO
2.2.1 Resources and metadata
2.2.2 The information management system
2.2.3 AGROVOC Thesaurus and Document Indexing
2.3 Problems with the current system and proposal

3 SEMANTIC WEB
3.1 The idea
3.2 Ontologies
3.2.1 Introduction
3.2.2 Types of ontologies
3.2.3 Ontology representation languages
3.2.4 KAON
3.2.5 Ontology Engineering

4 INTRODUCTION OF ONTOLOGY BASED INFORMATION MANAGEMENT SYSTEM AT THE FAO
4.1 The prototype project
4.2 Requirements regarding the AOS
4.3 Ontology Engineering Framework
4.3.1 Overview
4.3.2 Initialisation of the cycle
4.3.3 The 5 phases of the framework
4.4 Ontology Browser
4.5 Representation of AGROVOC in KAON
4.6 Related Work and positioning
4.7 Current status and Further Work

5 THE ONTOLOGY PRUNER
5.1 Introduction to the pruning approach
5.2 Adaptation of the ontology pruner
5.3 Evaluation
5.3.1 Resources: Document corpus and source ontology
5.3.2 Hypotheses for evaluation
5.3.3 Evaluation plan
5.4 Results and Discussion
5.4.1 Pruner Trie vs. Pruner
5.4.2 Dependency of the statistics on different parameter settings
5.4.3 Generic Document Set 1 (Gen) vs. Generic Document Set 2 (AG)
5.4.4 Empirical evaluation
5.5 Summary

6 AUTOMATIC CLASSIFICATION
6.1 Introduction
6.1.1 What is text categorisation?
6.1.2 Motivation within the project context
6.2 Basic definitions
6.2.1 Using Support Vector Machines for Multi-label Document Indexing
6.2.2 Evaluation measures
6.3 Adaptation of the classifier
6.3.1 Multi-label vs. single-label Indexing
6.3.2 Multiple Languages
6.3.3 Integration of background knowledge
6.3.4 Multi-class problem and class hierarchy
6.4 Set of training and test documents
6.5 Evaluation
6.5.1 Single-label vs. multi-label classification
6.5.2 Multilingual classification
6.5.3 Integration of domain specific background knowledge
6.6 Related Work
6.7 Summary and Outlook

7 CONCLUSION
7.1 Summary
7.2 Outlook

REFERENCES

A KAON RDFS REPRESENTATION OF THE ONTOLOGY ON FOOD SAFETY, ANIMAL AND PLANT HEALTH (EXTRACT)
B COMPLETE LIST OF WEB SITES OUTPUT BY THE FOCUSED CRAWLER
C AGROVOC CATEGORIES
D RESULTS OF ONTOLOGY INTEGRATION INTO AUTOMATIC TEXT CLASSIFICATION

TABLE OF FIGURES

Figure 1: Ontology example excerpt
Figure 2: Information management system at the FAO
Figure 3: AGROVOC thesaurus: a sample extract showing a descriptor and a non-descriptor
Figure 4: XML serialisation of RDF, example
Figure 5: Ontology types
Figure 6: Ontology representation languages and their expressiveness, taken from [CG00]
Figure 7: RDF Schema example model
Figure 8: Lexical OI-Model
Figure 9: Spanning Object Example
Figure 10: The ontology engineering framework
Figure 11: The Focused Crawler
Figure 12: Evaluation of the ontology
Figure 13: Communication between the CDS system and the ontology browsing interface
Figure 14: Screenshot of the adapted KAON portal
Figure 15: Mapping of AGROVOC thesaurus to ontology structure
Figure 16: Modelling of AGROVOC […]

Details

Edition type: Original edition
Year of publication: 2003
ISBN (eBook): 9783832469054
ISBN (Paperback): 9783838669052
File size: 1.9 MB
Language: English
Institution / University: Karlsruher Institut für Technologie (KIT) – Wirtschaftsingenieurwesen, Angewandte Informatik
Publication date: 2014 (April)
Grade: 1.3
Keywords: classification, pruning, multi-label classification, multilingual, thesaurus
Product safety: Diplom.de

Author

Boris Lauser (author)