
Strings of Natural Languages

Unsupervised Analysis and Segmentation on the Expression Level

©2006, Diploma Thesis (Diplomarbeit), 146 pages

Abstract

Learning a second language is often difficult. One major reason for this is the way we learn: We try to translate the words and concepts of the other language into those of our own language. As long as the languages are fairly similar, this works quite well. However, when the languages differ to a great degree, problems are bound to appear. For example, to someone whose first language is French, English is not difficult to learn. In fact, he can pick up any English book and at the very least recognize words and sentences. But if he is tasked with reading a Japanese text, he will be completely lost: No familiar letters, no whitespace, and only occasionally a glyph that looks similar to a punctuation mark appears.
Nevertheless, anyone can learn any language. Correct pronunciation and understanding alien utterances may be hard for the individual, but as soon as the words are transcribed to some kind of script, they can be studied and - given some time - understood. The script thus offers itself as a reliable medium of communication.
Sometimes the script can be very complex, though. For instance, the Japanese language is not much more difficult than German - but the Japanese script is. If someone untrained in the language is given a Japanese book and told to create a list of its vocabulary, he will most likely fail at the task.
Or will he? Are there perhaps ways to analyze the text despite his unfamiliarity with this type of script and language? Should there not be characteristics shared by all languages which can be exploited?
This thesis assumes the point of view of such a person, and shows how to segment a corpus in an unfamiliar language while employing as little previous knowledge as possible.
To this end, a methodology for the analysis of unknown languages is developed. The single requirement made is that a large corpus in electronic form which underwent only a minimum of preprocessing is available. Analysis is limited strictly to the expression level; semantics are purposefully left out of consideration. This distinguishes this work clearly from other works, limits comparability to some extent, and may make detection of some kinds of language features hard or even impossible.
Only unsupervised analysis is admissible, and no specific information on grammatical rules, ways to segment the text, what separators look like etc. is employed. Furthermore, no parameters such as absolute thresholds or selection […]



Markus Stengel
Strings of Natural Languages
Unsupervised Analysis and Segmentation on the Expression Level
ISBN: 978-3-8366-0627-1
Printed by Diplomica® Verlag GmbH, Hamburg, 2008
Also published as: Diplomarbeit (diploma thesis), Eberhard-Karls-Universität Tübingen, Tübingen, Germany, 2006
This work is protected by copyright. The rights thereby established, in particular those of translation, reprinting, recitation, extraction of illustrations and tables, broadcasting, microfilming or reproduction by other means, and storage in data processing systems, remain reserved, even where only excerpts are used. Any reproduction of this work or of parts of this work is, even in individual cases, permitted only within the limits of the statutory provisions of the copyright law of the Federal Republic of Germany in its currently valid version, and is always subject to remuneration. Violations fall under the penal provisions of copyright law.
The use of common names, trade names, product designations etc. in this work does not justify the assumption, even in the absence of specific markings, that such names are to be regarded as free under trademark and brand protection legislation and may therefore be used by anyone.
The information in this work was compiled with care. Nevertheless, errors cannot be ruled out completely, and the Diplomarbeiten Agentur, the authors and translators assume no legal responsibility or any liability for possibly remaining incorrect statements and their consequences.
© Diplomica Verlag GmbH
http://www.diplomica.de, Hamburg 2008
Printed in Germany

CONTENTS

Table of Contents   iii
List of Figures   vi
List of Tables   viii
List of Algorithms   ix
List of Abbreviations   xi
Introduction   1
1 Language   3
  1.1 Definitions   3
  1.2 Languages   5
    1.2.1 English   6
    1.2.2 German   10
    1.2.3 Hebrew   13
    1.2.4 Japanese   14
2 Categorization   23
  2.1 Definitions   23
  2.2 Sample Application   24
  2.3 Conclusion   26
3 Analysis Methods and Techniques   27
  3.1 Level of Abstraction   27
  3.2 Data Compression   27
    3.2.1 Overview   27
    3.2.2 Information content and its quantification   28
    3.2.3 Kinds of data compression   30
    3.2.4 Run length encoding   31
    3.2.5 Dictionary-based data compression: LZ78, LZW, LZMW   33
    3.2.6 LZMW78   37
    3.2.7 Sample application   39
  3.3 Longest Common Subsequence   40
    3.3.1 Overview   40
    3.3.2 Application   41
  3.4 Statistics: N-Gram and Term Frequency   43
    3.4.1 Definitions   43
    3.4.2 Limited applicability of published statistics   44
    3.4.3 The challenges of collecting statistics   45
    3.4.4 Fixed term size   46
    3.4.5 Variable term size   48
    3.4.6 Suffix tree   48
    3.4.7 Suffix array   52
  3.5 Cryptology   56
    3.5.1 Motivation   56
    3.5.2 Character frequency   57
    3.5.3 Index of coincidence   58
    3.5.4 Patterns   59
4 Tasks and Results   61
  4.1 Experimental Setup   61
    4.1.1 Corpora   61
    4.1.2 Preprocessing   63

LIST OF FIGURES

3.1 Types of data compression   31
3.2 Suffix tree examples   50
3.3 Suffix array structures after sorting   54
4.1 Entropy images for the corpora G1, H1 and J3   68
4.2 Index of coincidence difference plots for the corpora E1, J4, G4 and G5   70
5.1 The chain of tools developed in this work   98
A.1 Language tree   107
B.1 Index of coincidence plot: G1   109
B.2 Index of coincidence plot: G1   109
B.3 Index of coincidence plot: G1   110
B.4 Index of coincidence plot: G1   110
B.5 Index of coincidence plot: G2   111
B.6 Index of coincidence plot: G3   111
B.7 Index of coincidence plot: G4   112
B.8 Index of coincidence plot: G5   112
B.9 Index of coincidence plot: H1   113
B.10 Index of coincidence plot: H2   113
B.11 Index of coincidence plot: J1   114
B.12 Index of coincidence plot: J2   114
B.13 Index of coincidence plot: J3   115
B.14 Index of coincidence plot: J4   115
B.15 Index of coincidence plot: J5   116
B.16 Suffixes, prefixes and reduced SUs for E3   119
B.17 Suffixes, prefixes and reduced SUs for G3   120
B.18 Suffixes, prefixes and reduced SUs for H1   121
B.19 Suffixes, prefixes and reduced SUs for H2   122
B.20 Suffixes, prefixes and reduced SUs for J3   123
B.21 Segmentation results for H1 and H2   124
B.22 Segmentation results for G3   124

LIST OF TABLES

1.1 Language families and languages   5
1.2 Frequencies of constituent orders   6
2.1 Sample tokenization categories   24
2.2 Sample text tokenization   25
2.3 Results of sample text categorization   25
2.4 Dictionary built from sample text categorization   25
2.5 Sample text reencoded to the dictionary built from categorization   26
3.1 Sample encoding with LZ78   34
3.2 Dictionary of an LZW sample compression   35
3.3 Encoding steps of an LZW sample compression   36
3.4 Dictionary of an LZMW sample compression   36
3.5 Sample encoding with LZMW78   38
3.6 LZMW78 dictionary after encoding the vector (3,1,4,1,0,1,5,2)   39
3.7 LCS examples   41
3.8 Various categorizations exploitable by LCS   42
3.9 Frequencies of English and German letters   45
3.10 Possible terms dependent on alphabet size   46
3.11 Prefixes and suffixes of the string `BANANA$'   49
3.12 Suffix array illustration   53
3.13 Suffix array and prefix tables   54
3.14 Order of letter frequency   57
3.15 Index of coincidence for various languages   58
4.1 Corpora used in this work   62
4.2 Experimental Setup: system specifications and implementation   64
4.3 Sample rating results and rankings   65
4.4 Sample meta-rating results   66
4.5 Maximum ratio of IC differences to IC for the individual corpora   71
4.6 Character order of the corpora   72
4.7 Pangram-ending character order   74
4.8 LCS and compression results: syntactic separators   78
4.9 Sample problem dictionary for prefix tables   80
4.10 Aligner results for biblical corpora   81
4.11 Suffixes, prefixes and reduced SUs for biblical corpora   85
4.12 Suffixes, prefixes and reduced SUs for J3   87
4.13 Compound detection results for biblical corpora   90
4.14 Most frequent meta-rated terms   94
A.1 hiragana   104
A.2 katakana   105
A.3 Hebrew alphabet   106
B.1 Aligner results for various corpora   117
B.2 Aligner results for various corpora   118
B.3 Detected compound samples for E3   125
B.4 Detected compound samples for G3   126
B.5 Detected compound samples for H1   127
B.6 Detected compound samples for H2   128
B.7 Detected compound samples for J3   129

LIST OF ALGORITHMS

1 Entropy image creation   67
2 Index of coincidence computation   69
3 Detection of syntactic separators with LCS and LZMW78   76
4 Detect syntactic separators by aligning at selected strings   80
5 Detection of prefixes and suffixes   83
6 Detect and split compounds   89

List of Abbreviations

AIC   algorithmic information content
IC   index of coincidence
ID   identification (number)
KCC   Kolmogorov-Chaitin complexity
MR   meta-rating
PSR   prefix, suffix, and/or reduced segmentation unit
RLE   run-length encoding
SIC   Shannon information content
SOV   subject-object-verb (sentence structure)
SU   segmentation unit
SVO   subject-verb-object (sentence structure)
TF   term frequency
LCS   longest common subsequence
LZW   Lempel-Ziv-Welch (compression)
LZMW   Lempel-Ziv-Miller-Wegman (compression)
LZMW78   my modification of LZMW (compression)

Introduction
The limits of my language are the limits of my mind. All I
know is what I have words for. – Ludwig Wittgenstein
Everyone who has ever learned a second language knows how hard it is. There are
always differences: Some are glaringly obvious, and others are so subtle that even their
concepts are difficult to understand. One major reason for this is the way we learn: We try
to translate the words and concepts of the other language into those of our own language
which we are comfortable with.
As long as the languages are fairly similar, this works quite well. However, when the
languages differ to a great degree, problems are bound to appear. For example, to someone
whose first language is French, English is not difficult to learn. In fact, he can pick up any
English book and at the very least recognize words and sentences. But if he is tasked with
reading a Japanese text, he will be completely lost: No familiar letters, no whitespace, and
only occasionally a glyph that looks similar to a punctuation mark appears.
Nevertheless, anyone can learn any language. Correct pronunciation and understanding
alien utterances may be hard for the individual, but as soon as the words are transcribed
to some kind of script, they can be studied and - given some time - understood. The script
thus offers itself as a reliable medium of communication.
Sometimes the script can be very complex, though. For instance, the Japanese language
is not much more difficult than German - but the Japanese script is. If someone untrained
in the language is given a Japanese book and told to create a list of its vocabulary, he will
most likely fail at the task.
Or will he? Are there perhaps ways to analyze the text despite his unfamiliarity
with this type of script and language? Should there not be characteristics shared by all
languages which can be exploited?

This thesis assumes the point of view of such a person, and attempts to find ways to
segment a corpus in an unfamiliar language while employing as little previous knowledge
as possible.
To this end, a methodology for the analysis of unknown languages is developed. The
single requirement made is that a large corpus in electronic form which underwent only a
minimum of preprocessing is available. Analysis is limited strictly to the expression level;
semantics are purposefully left out of consideration. This distinguishes this work clearly
from other works, limits comparability to some extent, and is expected to make detection
of some kinds of language features hard or even impossible.
Only unsupervised analysis is admissible, and no information on grammatical rules,
ways to segment the text, what separators look like etc. is employed. Furthermore,
no parameters such as absolute thresholds or selection of the n-best candidates are allowed; all parameters and evaluation criteria must be relative and justifiable, not based on experimental results. Though this makes this thesis' task harder, it also offers the advantage that absolute parameters are not required, and thus need not be adjusted or optimized to fit a corpus or language.
Chapter one gives an overview of the languages examined in this work: English, German,
Hebrew and Japanese. It also argues their choice, suitability and representativeness.
Chapter two introduces categorization, a key concept in this work. Categorization is used
for segmentation, classification and other tasks. Furthermore, some sample categorizations
exemplify application of this concept.
Chapter three covers the technical basis of this work. Methods and techniques from
various fields are introduced, namely data compression, bioinformatics, statistics and
cryptology. The methods developed in this work employ chiefly the algorithms and
concepts introduced in this chapter.
Chapter four states the tasks tackled in this work and reports results and devised
methods. It starts with the experimental setup, and continues with an introduction to the
evaluation and rating methodology of this thesis. Then two ways to automatically create
excerpts from a corpus follow. The detection of syntactic separators and segmentation of
text conclude the chapter.
Finally, chapter five summarizes this work's achievements, and chapter six gives an
outlook on possible and promising future work.

CHAPTER 1
Language
This chapter gives an overview of the languages examined in this work: English, German,
Hebrew and Japanese. It starts with some required definitions. Then short introductions
to the individual languages and their scripts follow, and their choice, suitability and
representativeness is argued.
1.1 Definitions
The definitions given in this chapter will be used henceforth in this work.
Definition 1. Syntax is limited to the expression level: terms have no semantic function, i.e. all word forms, punctuation marks etc. are treated as strings.
This limitation is motivated and argued by Schweizer (2006, pp. 203 et seqq.). It is an
important distinction between this work and others and has far-reaching consequences.
For instance, some relationships between words, e.g. that between the singular and the
plural form of a word (see 1.2.2.2 for an example), cannot be detected. Furthermore, due
to the strict limitation to the expression level, the results of this thesis cannot be compared
directly with those of other works which use a traditional understanding of syntax (as a
conglomerate of observations of external variations and of semantic insights).
Definition 2. An alphabet is a finite set of symbols (Cormen et al. 2001, pp. 975-976).
The symbols that constitute an alphabet are also referred to as characters.
Note that this definition of alphabet includes whitespace characters; namely blanks,
tabulators, line feeds etc.
Definition 3. A formal language L over an alphabet Σ is any set of strings made up of symbols from Σ. ∅ is the empty formal language and ε the empty string. A language or natural language L_nat is any subset of L deemed acceptable by its speakers and writers.
Definition 4. A separator is used to segment data into smaller units. A syntactic separator is a member of L which does not carry a semantic meaning by itself. In contrast, a semantic separator is a member of L_nat which has no strictly syntactic function.
Typical syntactic separators are punctuation marks or whitespace symbols, while
examples of semantic separators are particles or word elements that function as affixes.
Definition 5. A word form is a lexeme together with zero, one or more of its potential affixes. For example, `stay' and `stayed' are two different word forms.
Definition 6. A token is a block of one or more contiguous characters or bytes extracted
from a text or data stream.
A token does not need to be a word form, but it can be one.
Definition 7. Tokenization is the process of extracting tokens from a given text or data
stream by some specific method.
For instance, a standard way of extracting word forms from a text is to make use of
syntactic separators enclosing them. Besides the process itself, the result is also sometimes
referred to as tokenization.
In this work, segmentation is almost the same as tokenization. However, when this
term is used, it is expected that the resulting segments are likely to be words or similar
meaningful units.
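
To make definitions 6 and 7 concrete, here is a minimal tokenization sketch. It is written in Python purely for illustration, and the separator set passed in is an assumption: supplying such a set up front is exactly the kind of prior knowledge this work sets out to avoid.

    def tokenize(text, separators):
        # Cut the character stream at separator characters; every maximal
        # run of non-separator characters becomes one token (definition 6).
        tokens, current = [], []
        for char in text:
            if char in separators:
                if current:
                    tokens.append("".join(current))
                    current = []
            else:
                current.append(char)
        if current:
            tokens.append("".join(current))
        return tokens

    print(tokenize("The boy gave the girl a ring.", set(" .,;:!?")))
    # ['The', 'boy', 'gave', 'the', 'girl', 'a', 'ring']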
Definition 8. A word form, affix, compound or the like is called correct or natural if
it is an element of its natural language.
Definition 9. A word form, affix, compound or the like is called a candidate if it is not (yet) certain to be a word form, affix, compound or the like, but has a high likelihood of being one.
Definition 10. A segmentation unit (SU) is a word form candidate.
Definition 11. An affix is a string that is attached to another string S and fulfills the following criterion: The affix occurs more than once at the same position with respect to S. An affix preceding S is called a prefix; one that is appended to S is called a suffix. The affix is a part of an SU.
Definition 12. Let S_i, S_j and I denote strings of lengths greater than or equal to one. If the pattern S_i I S_j occurs more than once, then I is an infix.
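
As a toy illustration of definition 11, candidate suffixes can be collected by counting how often each short trailing string closes a segmentation unit; any string that does so more than once qualifies as a candidate. The sketch below is hypothetical: the function name, the sample SUs and the length bound are all invented for illustration.

    from collections import Counter

    def suffix_candidates(sus, max_len=4, min_count=2):
        # Count trailing substrings over all segmentation units; a string
        # occurring at the same (final) position more than once is a
        # suffix candidate in the sense of definition 11.
        counts = Counter()
        for su in sus:
            for k in range(1, min(max_len, len(su) - 1) + 1):
                counts[su[-k:]] += 1
        return {s: c for s, c in counts.items() if c >= min_count}

    print(suffix_candidates(["stayed", "walked", "stays", "walks"]))
    # {'d': 2, 'ed': 2, 's': 2}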
Definition 13. A dictionary holds strings of a formal language L.
This definition is similar to that of Salomon (2004, p. 165) and allows dictionaries to
comprise correct word forms of natural languages as well as arbitrary combinations of symbols from their alphabets. For example, both `stay' and `!xychzta?' are valid entries.

1.2 Languages
For this work, four languages were examined: English, German, Hebrew and Japanese.
They belong to the three different language families listed in table 1.1. As the term family
implies, there must be some characteristics of languages which make it possible to distinguish among them and form groups, so-called language families. Diakonoff (2002, p. 722) states one way in which this can be done:
language family                 languages
Afro-Asiatic (Hamito-Semitic)   Hebrew
Altaic                          Japanese
Indo-European                   English, German

Table 1.1: Language families and languages
The only real criterion for classifying certain languages together as a family
is the common origin of their most ancient vocabulary as well as of the word
elements used to express grammatical relations. A common source language
is revealed by a comparison of words from the supposedly related languages
expressing notions common to all human cultures (and therefore not as a rule
likely to have been borrowed from a group speaking another language) and also
by a comparison of the inflectional forms (for tense, voice, case, or whatever).
There are problems with this statement, though. "[E]xpressing notions common to all
human cultures (and therefore not as a rule likely to have been borrowed from a group
speaking another language)" is irrelevant to this work as it is not a syntactic criterion.
Moreover, it is hard to define these "notions common to all human cultures" and to make
them available for analysis.
For these reasons, the highlighted part above is excluded from the consideration as
to what a language family is. The definition of a language family is thus henceforth
based purely on syntactic criteria. These are still enough to justify calling the set of
languages that make up a language family by its assigned name: The omission of the
non-syntactic criterion does not invalidate the fact that the syntactic requirements are
still met. Furthermore, the syntactic criteria are still sufficient to raise expectations that
the scrutinized languages in this work are diverse enough to expose any technique that is
optimized too much for specific language characteristics.
Tomlin (1986, p. 1) gives further motivation for the choice of languages:
If one reads through older grammatical descriptions of languages, especially
grammars written in the late nineteenth and early twentieth centuries, one
often finds only a very brief section devoted to syntax, which deals almost
exclusively with word order. In a similar vein, the new second language learner
often is intrigued as much by word order differences in the new language as by
any other feature except, perhaps, phonology. Word order, thus, represents the
most overtly noticeable feature of cross-linguistic syntax, yet at the same time
it remains a tantalizing problem, both to describe the pertinent facts of word
order variability and to provide some explanation for the great diversity one
can see cross-linguistically.
Since this work is limited to the expression level, `word order' is understood differently
from Tomlin: What he terms `word order' is the order of certain semantic functions.
Therefore, Tomlin requires semantic information for his analysis, i.e. the meaning of words,
or he could not use terms like subject, verb and object. By contrast, this work must not
utilize this kind of information.
Nevertheless, Tomlin's findings offer a further way to evaluate the representativeness
of languages: He gives an account of the "frequencies of basic constituent orders in a
representative sample of the languages of the world" (Tomlin 1986, p. 22). They are listed
in table 1.2 along with the languages scrutinized in this work. As can be seen, the three
most common orders with a total frequency of 95.77% are covered. Therefore, it can be
claimed that there is a certain degree of generality to the results.
constituent order           frequency   examined language
subject-object-verb (SOV)   44.78%      Japanese
subject-verb-object (SVO)   41.79%      English, German
verb-subject-object (VSO)    9.20%      Hebrew
verb-object-subject (VOS)    2.99%      –
object-verb-subject (OVS)    1.24%      –
object-subject-verb (OSV)    0.00%      –

Table 1.2: Frequencies of the orderings of the three nuclear constituents of a transitive clause (Tomlin 1986, p. 22)
The following introductions will focus on those characteristics which are relevant to this
work and are in no way meant to be a complete introduction to the languages. However,
sometimes supplementary information deemed noteworthy or useful to understand why
each language was selected is given as well.
1.2.1 English
According to Potter (2002) and Encyclopedia Britannica (2002a), English belongs to the
West-Germanic branch of the Indo-European language family. It is the mother tongue of
approximately 350 million people and ranks number one as second language. Furthermore,
it is the most widely taught foreign language.

1.2.1.1 Overview
The English alphabet consists of 26 characters, each in two variants: 26 upper case letters with their corresponding 26 lower case letters. Besides these, Arabic numerals [1], various punctuation markers and whitespace characters, the blank in particular, are also included and essential to the script [2].
Though closely related to German, another of the languages scrutinized in this work,
English has lost most of the system of inflections that German still retains from their
common ancestral language. Nowadays English is relatively uninflected and relies chiefly
on two mechanisms to achieve the same effects inflections are used for in other languages:
affixation and composition. Furthermore, flexibility of function, word order and openness
of vocabulary compensate for what these two morphological processes cannot accomplish.
English has a relatively strict subject-verb-object (SVO) sentence structure. Although
nouns, pronouns, and verbs are inflected, adverbs, prepositions, interjections and
conjunctions are invariable.
Affixation in English comprises suffixes as well as prefixes. Remarkable about this process is the stickiness of suffixes: Once a suffix has been attached to a stem, the result is likely to be added to the language vocabulary as a full-fledged word in its own right, e.g. `study' and `ent' are combined to `student'. Its addition increases the dictionary of word forms by more than one since the resulting word is a noun and thus is subject to noun affixes, e.g. `-s' (`students') or `-like' (`student-like').
Composition is achieved by joining two or more word forms. For example, `fire' and `work' are combined to `firework', just as `free' and `loader' result in `freeloader'. However,
while joining the words, letters may be dropped. Therefore, `all' and `ready' result
in `already'. It is noteworthy that composition is not limited to nouns or verbs, but
almost any kind of combination is possible: `breakwater' (verb-noun), `icebreaker'
(noun-verb), `blackbird' (adjective-noun), `sugar-sweet' (noun-adjective), etc. As with
affixes, compounds are subject to further modifications, e.g. `test-drive' may be extended
to `test-driver'.
The loss of most of its inflections allows English to employ a mechanism termed
flexibility of function: Verbs and nouns can often be used both as nouns or verbs. This is
not possible with most other languages, especially not in other Indo-European languages,
since inflections cause verbs and nouns in those languages to have different endings. This is easily illustrated by the compound example `test-drive' above: Both `to test-drive' and `a test-drive' are valid. While this introduces a great deal of flexibility into the language, it has its share of disadvantages. For instance, it cannot be surmised from its form alone whether `roadkill' is a verb or a noun. Though the use of `to roadkill' is theoretically acceptable, none of my dictionaries lists it.

[1] Though English comprises Roman numerals as well, they are not distinct entities of the alphabet but combinations of letters.
[2] Symbols like `$' or `£' are not explicitly noted since their usage is regionally dependent, and for this work they are not considered to be part of the natural language. Furthermore, they differ from the other symbols since they are basically abbreviations for `dollars' and `pounds' and are therefore replaceable.

Word order is another concept which reflects English being a rather uninflected language. In order to avoid ambiguities, English is much more inflexible than inflected languages in terms of possible constructions. For example, `The boy gave the girl a ring.' may also be written
as `The boy gave a ring to the girl.', but `The girl got a ring from the boy.' cannot be
rewritten to `The girl got from the boy a ring.' Languages with inflections, e.g. English's
close relative German, allow sentence patterns like this.
Finally, openness of vocabulary refers to "the free admission of words from other
languages and the ready creation of compounds and derivatives. English adopts (without
change) or adapts (with slight change) any word really needed to name some new object
or to denote some new process" (Potter 2002, p. 654). As a consequence, "English has the
largest vocabulary of any language in the world" (Encyclopedia Britannica 2002a, p. 500).
1.2.1.2 Problems and challenges
One of the great challenges of the English language is its large vocabulary. Though this affects rather the technical side, namely the implementation, it should not be neglected.
The size of a language's vocabulary, and thus its dictionary, directly affects statistical
analysis. For example, if one is interested in the frequency of all possible combinations
of three words, numeric stability and memory usage can quickly prove themselves to be
considerable obstacles. Especially the latter may contribute to or worsen the already
considerable time required to complete extensive analysis.
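
To sketch the scale problem just described: counting word trigrams with a plain in-memory table is straightforward, but the table can hold up to |V|^3 distinct keys for a vocabulary V, which is where memory becomes the obstacle. A minimal sketch; the names are illustrative, not taken from the thesis.

    from collections import Counter

    def trigram_frequencies(tokens):
        # One counter entry per distinct contiguous three-word combination;
        # worst-case memory grows with |V|**3, hence the concern above.
        return Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))

    tokens = "the boy gave the girl a ring".split()
    print(trigram_frequencies(tokens).most_common(2))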
While the lack of many differing inflections reduces the number of word forms
and thus the size of the dictionary, it makes the detection of affixes much more difficult. It
can be argued that it is not important to detect all affixes English possesses, and since
there are so few their existence may as well be ignored, treating them as separate head
words. However, this work is concerned with an unbiased system, and as it attempts to
detect potential affixes it might detect incorrect, that is unnatural, ones. Further analysis,
based on the wrong affixes, would identify incorrect stems. Though results like these are
interesting in their own right, an evaluation of the effectiveness and usefulness of such an
automatic analysis system would prove difficult.
Affixation further poses the problem that sometimes additional characters are injected
between the stem and an affix. For example `sin' and `er' result in `sinner' and thus in a
doubling of the letter n. Analysis might have created the - correct - hypothesis that `er' is
a common suffix. But when the dictionary is checked, it finds that `sinn', the reduced form of `sinner', does not exist. This reduces the probability of `er' being a suffix. Thus the corroboration of the hypothesis becomes harder.
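
A minimal sketch of this dictionary check, under the assumption that a dictionary of known word forms is available: the suffix hypothesis is corroborated for a word only if the reduced stem itself exists, which fails for doubled consonants as in `sinner'.

    def supports_suffix(word, suffix, dictionary):
        # The hypothesis "suffix is an affix of word" is corroborated only
        # if stripping it leaves a stem that exists as a word form.
        return word.endswith(suffix) and word[:-len(suffix)] in dictionary

    dictionary = {"sin", "sing", "stay"}
    print(supports_suffix("singer", "er", dictionary))  # True: `sing' exists
    print(supports_suffix("sinner", "er", dictionary))  # False: `sinn' does not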
On a more abstract level the lack of inflections makes classification and grouping
of words much more difficult. Due to the flexibility of function, division of word forms into
groups such as verbs or nouns becomes nearly impossible. This hinders further analysis to
a great extent.
The problems with compounds are of similar nature, though those compounds
that have omitted characters cause additional problems. For example, neither `already' nor
`fortnight' would be detected. Thus a reduction of the dictionary by splitting compounds
into their components could not be achieved. Additionally, in practice the injection
of characters as simple as `-' causes problems: For instance, from a purely syntactic
point of view, the compound `test-drive' exemplified above needs to be split into three
parts; namely `test', `-' and `drive'. As long as it is known that `-' is typically used for composition, this poses no problem. However, without this information decomposition into `test-' and `drive', or `test' and `-drive' might be attempted. Since likely neither `test-' nor `-drive'
exist in the dictionary, none of the combinations can be segmented completely; and as a
result, `test-drive' is not detected as a compound.
Therefore, an automatic decomposition system needs to be able to split a compound
into several parts at once. Furthermore, a single character may form a possible segment,
provided it exists independently as a segmentation unit. But then not only `-' becomes a
candidate, but also `a' and `I' which exist as separate word forms. Word forms such as
`Infatuation' (at the start of a sentence) then require considerably more time to analyze.
This kind of problem naturally arises in any language which has one-character words,
e.g. the Spanish `y'. Though this is - again - rather a technical problem, it does have
considerable effect on how extensive an analysis can be afforded (see 4.5.2.1 on page 88 for
more details).
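
The requirement to split a compound into several parts at once can be met with a standard word-break recursion over a dictionary of SUs. A sketch under that assumption; the memoization and the toy dictionary are illustrative, not the thesis' actual implementation.

    from functools import lru_cache

    def decompositions(word, dictionary):
        # Enumerate every way to split `word' into dictionary entries,
        # allowing any number of parts, including one-character segments
        # such as `-' if they exist independently as segmentation units.
        @lru_cache(maxsize=None)
        def split(rest):
            if not rest:
                return [[]]
            results = []
            for i in range(1, len(rest) + 1):
                if rest[:i] in dictionary:
                    results.extend([rest[:i]] + tail for tail in split(rest[i:]))
            return results
        return split(word)

    dictionary = {"test", "-", "drive", "driver"}
    print(decompositions("test-drive", dictionary))  # [['test', '-', 'drive']]

Note that adding one-character word forms such as `a' or `I' to the dictionary multiplies the number of candidate splits, which is precisely the cost discussed above.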
1.2.1.3 Summary
The English script is not complex, but characteristics of the English language make up for
it. This affects especially the implementation of an automatic analysis system. Moreover,
due to its world-wide pervasiveness as a language and as an object of research, the results
of this thesis can be compared with those of other works.

1.2.2 German
German belongs to the West Germanic family of languages (Encyclopedia Britannica 2002b,
p. 210) and is closely related to English (Potter 2002, p. 654). According to Encyclopedia
Britannica it is the mother tongue of 90 million people and thus ranks 6th among the
languages of the world. Even though there are many different dialects, the only German
taught at school is called `Hochdeutsch', or `High German'.
It should be noted that there was a major overhaul of official German orthography which went into effect in August 2006. However, it does not affect this work for two reasons: Firstly, all German texts which were analyzed complied with the old orthography [3]. Secondly, though some irregularities and grammatical rules which make purely syntactic analysis difficult were removed, most of the problems listed in 1.2.2.2 are not affected by the reform.
1.2.2.1 Overview
The German alphabet basically comprises that of the English language, but adds a few extra characters called `Umlaute' (`umlauts'): `Ä' and `ä', `Ö' and `ö', `Ü' and `ü'. These are vowel alterations of `A' and `a', `O' and `o', `U' and `u' respectively, and `ß' denotes a sharp s in lower case. In contrast to the other letters it has no explicit upper case variant. Therefore, two accepted ways of writing it in upper case exist: Either use `SS', or retain the `ß'.
German is an inflected language. In that regard it differs from English, though both share a common protolanguage (Potter 2002, p. 654). In German, pronouns, nouns and adjectives have four cases of declension, and one of three genders: masculine, feminine and neuter. Furthermore, verbs conjugate according to first, second and third person, in singular as well as plural. The sentence structure is generally subject-verb-object (SVO) [4].
One of the main features of German is that it makes heavy use of capitalization, i.e. every word that functions as a noun starts with an uppercase letter. This makes it quite easy to differentiate between verbs and nouns, e.g. `to hear' translates to `hören', whereas `hearing' translates to `Hören'.
Capitalization has a strong effect on composition in German.
Whereas in English
the creation of compounds simply requires joining the words, conversion to lower case may
be necessary in German, particularly when two nouns are joined. For example, joining
`Dampf' (`steam') and `Schiff' (`boat') results in `Dampfschiff' (`steamboat'), i.e. the `S' of `Schiff' was converted to the lower case `s'.

[3] There is not nearly as much written material available which complies with the new set of rules as with the old one.
[4] One example where one might argue that German occasionally switches to subject-object-verb (SOV) is the perfect tense: In English SVO is strictly preserved, e.g. `The boy gives the girl a ring' becomes `The boy has given the girl a ring'. In German though, `Der Junge gibt dem Mädchen einen Ring' (`The boy gives the girl a ring') becomes `Der Junge hat dem Mädchen einen Ring gegeben' (`The boy has the girl a ring given'). However, as the auxiliary verb `hat' (`has') is conjugated, the sentence is considered to be SVO in this work.

German allows arbitrarily long compounds. One of the longest words actually in use is "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" (Landtag Mecklenburg-Vorpommern 2000), which means `beef labeling regulation and delegation of supervision law'. Though compounds this long are rare and not used frequently, this example illustrates to which extent one may take composition in German.
Another speciality of compounding in German is due to the genders: The gender of a
compound is determined by its last joined component. For example, while `Dampf' is
masculine and `Schiff' is neuter, `Dampfschiff' is neuter. Furthermore, `Fahrt' (`ride') is
feminine, thus `Dampfschiffahrt' (`steamboat ride') is feminine. Note that the third `f' in `Dampfschifffahrt' was dropped: As in English, letters are sometimes omitted [5]. Besides
these specialities, composition works in German as in English, though sometimes specific
inflections, or their removals, are required.
There are many more inflections in German than there are in English.
Most of them serve more than one purpose. For example, the suffix `e' is used to denote the plural form in `Boote' (`boats') and the first person form `(ich) sehe' (`(I) see') of `sehen' (`to see'). But it may also be part of a word stem, e.g. `Freude' (`joy').
Sometimes not only the inflections change, but the stems as well. For instance, the
past tense of `heißen' (`to be called') is `hieß', and `pfeifen' (`to whistle') becomes `pfiff'
(`whistled').
Finally, the plural form often triggers vowel alteration: `Haus' (`house') becomes `Häuser' (`houses'), `Fuß' (`foot') becomes `Füße' (`feet'), `Storch' (`stork') becomes `Störche' (`storks') etc. These alterations are not governed by rules but need to be learned by heart. For instance, the plural form of `Pause' (`break') is not `Päuser' but `Pausen' (`breaks').
1.2.2.2 Problems and challenges
The main obstacles to an efficient and extensive analysis of German are alterations, letter
rearrangements, multi-functional suffixes, capitalization, composition and gender.
Vowel alteration makes it very hard, if not impossible, to detect the correct plural forms of many words. In the example of `Haus' and `Häuser' shown above, both word forms have only three letters in common. There is a high probability that `Hause', as in `zu Hause' (`at home'), would be considered the plural form, especially since `e' is actually used elsewhere to denote plurality (example `Boote' above).

[5] In the reformed writing the `f' is preserved, resulting in `Dampfschifffahrt'.

However, it does not matter if such plural forms are not recognized. Since this thesis explicitly limits itself to the expression level (see definition 1 on page 3), it is expected and acceptable that certain semantic relationships cannot be detected.
In fact, since many common affixes fulfill multiple functions, it seems inadvisable to use common affixes as a means of classification. Furthermore, the large number of inflections greatly increases the dictionary and thus slows down analysis.
The modifications of stems, chiefly visible in the rearrangement of letters, have the
same effect. But as with vowel alteration, it seems difficult to detect such a change and
the correct relations. In particular, there are no rules as to when a stem is modified. For
example, judging from the verb `laufen' (`to run') whose past tense is `lief' (`ran') it could
be assumed that the past tense of `kaufen' (`to buy') is `kief'. However, this is incorrect:
Its past tense is `kaufte' (`bought').
Capitalization introduces even more problems. In this work, no information on this
is provided, therefore the relationship between uppercase letters and lowercase letters
is unknown. Hence, nouns cannot be grouped together automatically, and this obvious
syntactic criterion cannot be used.
However, the biggest problem by far is the composition mechanism of German. Due to capitalization, many compounds will not be detected as such, since not all of their components will be found individually, e.g. `schiff' in `Dampfschiff' above. The omission of letters (`Dampfschiffahrt') worsens this further. Though more restricted than in the English language, it is also admissible to insert a `-' between the components of compounds, which causes the same problems as stated for English above. Since compounds may be very long in German, their decomposition requires a long time (see 4.5.2).
Finally, gender also impacts statistical analysis. For instance, assume that `Dampfschiff' was detected to consist of `Dampf' and `Schiff', which are masculine and neuter respectively. Then the compound is decomposed and removed from the dictionary
to reduce its size and thus speed up analysis. Statistical analysis now faces the problem
that the number of possible combinations of word forms with `Dampf' has increased: In German, pronouns and articles show agreement in gender. In English the decomposition of
`steamboat' into `steam' and `boat' poses no problem if `the steamboat' is encountered and
reduced to `the steam', since `the steam' likely already exists in the dictionary. However,
in German the reduction of `das Dampfschiff' to `das Dampf' differs from `der Dampf'. As
a result, a new combination is created.
1.2.2.3 Summary
Since German is my native language, it is a natural candidate for language analysis
in this work. Furthermore, though closely related to English, it shows several distinct
characteristics which promise to make it difficult in other aspects.

1.2.3 Hebrew
According to Diakonoff (2002) Hebrew belongs to the Northern Central Semitic group of
the Hamito-Semitic languages. The Hamito-Semitic language family is the main language
family of southwestern Asia and northern Africa. It includes languages such as Arabic,
Hebrew, Amharic and Hausa. Though there is disagreement, the prevalent scholarly opinion is that this family is not related to the Indo-European languages. There are about 2.6 million speakers of Hebrew in Israel at present.
As I am not proficient in Hebrew, I had to rely on secondary literature for this
introduction. The sources, in descending order of importance, are Neef (2003, pp. 1-6),
Diakonoff (2002), Tomlin (1986, pp. 22, 188) and Wikipedia (2006a).
1.2.3.1 Overview
In this work, `Hebrew' does not refer to the modern Hebrew called `Ivrit', but to biblical
Hebrew. Its importance is rooted in it being the language of the Old Testament, with
the exception of the Aramaic parts (Neef 2003, p. 2). When vowels are not included, its
alphabet consists of 23 characters and the blank, which functions as a syntactic separator. The characters have no variants; the script is not case-sensitive. In that regard Hebrew differs greatly from English and German. With a total of only 24 characters [6], it is also the smallest alphabet of all languages scrutinized in this work. If vowels are included, the alphabet size increases to 30, since there are six different symbols used to denote vowels.
The list of Hebrew characters is given in table A.3 on page 106, along with the
transcription system used in this work. Furthermore, although Hebrew is written from right to left, examples are given exclusively transcribed into the English alphabet and in left-to-right order. This way they are easier to read for non-Hebrew speakers. Besides, all my data files were in that format.
Hebrew is an inflected language with a rich set of affixes which can form very complex
affix compound structures. In transcribed biblical Hebrew only the consonants are denoted, forming what is called the root of a word (Neef 2003, p. 3). By means of vowel infixation the meaning is further specified. In this work, the script with and without vowels is examined.
Vowel infixation is a remarkable characteristic with far-reaching consequences. For instance, the adoption of loanwords is greatly hindered by this mechanism. Nouns are not affected as much by this; therefore, some loanwords exist. On the other hand, verbs can be subjected to numerous modifications. Furthermore, as vowels were not denoted in the ancient script, ambiguities may occur.
Biblical Hebrew has two genders, masculine and feminine, and three types of number:
singular, dual and plural. They are marked by suffixes (Neef 2003, pp. 54-59). Though
dual is yet another typical feature of Hebrew, it does not matter to this work as it is not achieved through infixation. Thus it has no consequences for syntactic analysis.

[6] The late Masoretic characters are excluded.
Finally, the sentence structure of Hebrew is verb-subject-object (VSO) which is rare
among languages (Tomlin 1986, pp. 22, 188).
1.2.3.2 Problems and challenges
Biblical Hebrew shows a few characteristics which might tempt one into expecting the
language to be hard to analyze. However, not all of them have an effect on the script.
For example, vowel infixation does not matter if vowels are not denoted. If they are denoted, then they increase the vocabulary and hamper analysis which expects relationships between words to be expressed by contiguous strings, i.e. equal word stems. Of course, this also only applies to those stems which actually change.
Ambiguities of meaning are irrelevant to a strictly syntactic analysis. Blanks separate
word forms and decompose compounds, and since the language is inflected, it should prove
easier to group words into categories such as verbs than e.g. in English.
1.2.3.3 Summary
Biblical Hebrew is an interesting contrast to the other languages scrutinized in this work. In terms of the size of its alphabet it is at the opposite end from Japanese (see the introduction to Japanese below), with English and German in between. It is an inflected language
allowing for classification, and since it has comparatively few loanwords (Diakonoff 2002,
p. 727) and compounds are practically pre-decomposed, there should not be too many
irregularities increasing the dictionary.
Nevertheless, the language is different enough to have the potential for unexpected
results. And last but not least, as I am not familiar with the language it forced me to look
at the results in a purely syntactic, unbiased way.
1.2.4 Japanese
Japanese is considered to have an extraordinarily complicated script.
For instance,
Backhouse writes that "there can be no doubt that the Japanese writing system is the
most complex in the world, and that its mastery requires an enormous investment of time
and effort on the part of learners" (Backhouse 1993, p. 38). Even though this work is only
interested in the syntactic challenges analyzing the Japanese script offers, it faces a plethora
of difficulties which will be outlined in this section. For a more thorough and in-depth
analysis the interested reader is referred to Eschbach-Szabo (2002). A short introduction
to the Japanese language and its script are presented below to explain why this language
was selected for analysis and what outcome is to be expected.

The following introduction is based on my own knowledge of the Japanese language,
but also borrows heavily from Backhouse (1993, pp. 38-63) and Shibatani (2002). Further
information was taken from Makino and Tsutsui (2002a, pp. 16-60) and Schneider (1998).
1.2.4.1 Overview
Japanese is the native language of more than 120 million speakers and thus ranks in the
top ten languages of the world. However, it is rarely spoken outside of Japan, and its
expansion in the study as a foreign language, caused by Japan's economic influence, is still
a relatively recent development: Japanese remains very much the language of Japan.
Though scrutinized closely, the origin of the Japanese language, as of the Japanese
people, remains obscure; only the relationship with the languages of the Ryukyu Islands
to the south of Japan is established. They are so similar that these languages are commonly considered dialects of Japanese, rather than separate
languages. Beyond that, its heritage is unknown, though it is established that it is not
related to Chinese, which makes the use of the Chinese script (see below) the more
surprising.
Although there are many striking similarities to Korean in terms of phonetics,
accentuation and grammar, they do not suffice to establish a common heritage between
the two languages. Furthermore, other components of the Japanese language hint at
Austronesian languages. Nowadays, the prevalent assumption is that Japanese belongs to
the group of Altaic languages (Shibatani 2002, p. 732).
Japanese is a polysyllabic, agglutinative language with a strict subject-object-verb (SOV)
sentence structure. Syntactical elements lack independence and are appended as suffixes
and postpositions to independent words which carry meaning. Verbs and adjectives
conjugate with endings, and case distinctions are marked by enclitic particles. Nouns
neither decline nor indicate number or gender. Since modifiers are placed before the
modified, relative clauses and adjectives precede the modified nouns and adverbs come
before verbs. Finally, topic is a key concept: Once a topic has been introduced, or set,
it may be omitted henceforth until the topic changes. This makes very short sentences
possible, which may consist of as little as a single word.
There is a widespread notion of Japanese being a very difficult language.
This
can be attributed to its complicated system of honorifics which is used to establish the
hierarchic relationship between speakers, and its highly complex script (Schneider 1998,
pp. 474-476). Even though this work is concerned with syntactic analysis and thus affected
primarily by the latter, honorifics have considerable influence on any kind of analysis since
they introduce additional prefixes, suffixes, inflections, verbs and nouns.

1.2.4.2 Writing systems
To ease reading and understanding, transcriptions of Japanese scripts according to the
Hepburn system will be shown alongside the original Japanese symbols [7]. Roman letters are called `ローマ字' (`rōmaji') in Japanese. This term is also used when referring to the transcription of Japanese to Roman letters. Except for the long vowels, a full list of transcriptions is given in tables A.1 and A.2. Long vowels such as `aa' or `ee' are indicated by a horizontal line over the Roman letter, i.e. `ā' or `ē'.
What makes the analysis, and thus also the comprehension, of Japanese writing
such a challenging task can be demonstrated quite readily with a short example:
[Japanese original omitted] (`1998 nen-no WHO-no kaigi-de wa, daiokushin-no ichi nichi atari-no kyoyōsesshukijunryō-ga, jūrai-no taijū ichi kiroguramu atari 10 pikoguramu-kara ichi-yon pikoguramu (pikoguramu-wa ichi oku bun-no ichi) ni hikisagerareta.') (Nitsu and Sato 2003, p. 27)
The most striking difference to western languages like English or German is the lack of
whitespace characters such as a blank. Judging from its frequency and distribution in
the sample text above, one might suppose that the character `の' (`no') is the functional
equivalent of a blank. However, this is not the case: It is a particle marking the genitive
case.
Other characters look more familiar to non-Japanese speakers, though: There are Roman letters (`WHO') and Arabic numerals (`1998', `10', etc.). Furthermore, the characters `。' and `、' look very similar to the period `.' and comma `,' punctuation marks in English.
As a matter of fact, the sample text above contains five of the six commonly used
scripts in Japanese writing: hiragana, katakana, kanji, Roman letters and Arabic numerals.
Lacking are the also commonly used Roman numerals. All these scripts are used freely in
combination.
Though the traditional notation of Japanese is top-to-bottom and right-to-left, the use of the left-to-right, top-to-bottom notation has increased, especially in electronic communications or data storage.
kanji are often referred to as `Chinese characters', which is not completely accurate.
Being the oldest Japanese script, it developed from the Chinese script brought to Japan
about 1500 years ago. Since that time it has undergone significant modifications, chiefly
simplifications of their shapes and strokes, and limitation of the number of kanji used in daily writing. As Chinese underwent similar though different modifications, it seems more fitting to consider present-day Chinese characters and Japanese kanji to be cousins.

[7] Sometimes no transcriptions are shown. This is usually the case if no reading exists, i.e. it is not a member of the natural language, or not enough space is available.
Originally, Chinese characters were used to write Chinese, which led to the introduction of vast numbers of Japanized approximations of Chinese words into the Japanese language. Naturally, they were written in their original Chinese characters. For example, the Chinese morpheme for `water' (pronounced `shui') was borrowed as `sui' and written with its regular character `水'. Therefore, from the viewpoint of the kanji, `水' has the Sino-Japanese reading, or on-reading, `sui'. As readings from different epochs and different regions of China were assigned to the kanji, each of them can have multiple on-readings.
At the same time, the characters were extended to represent Japanese morphemes as well. In this case, `水' was also used to represent the native morpheme `mizu' (`(cold) water'). As with the on-readings, there can be multiple Japanese readings, or kun-readings, assigned to a kanji.
Finally, kanji can also be read phonetically: The on or kun sound of the character is used to represent a Japanese syllable, while the meaning is abstracted away. Over time, this led to the development of the kana syllabaries hiragana and katakana. In the modern Japanese writing system, the phonetic use of kanji is restricted largely to certain names.
Typically, one to eight kanji form a lexical unit, potentially augmented by the
other Japanese scripts. Excluding the few punctuation marks, there are no rules on how to
segment text and extract its word forms, so the reader must make use of his knowledge of
vocabulary to deduce the most likely segmentation. Since usually several segmentations
are possible, this means jumping back and forth to rule out unlikely combinations.
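To make this ambiguity concrete, consider a reader who does possess a vocabulary: even then, every way of covering a string with known word forms must be considered. The following sketch enumerates all such analyses over a toy Latin-alphabet vocabulary; both the vocabulary and this dictionary-driven procedure are purely illustrative, and the latter is exactly what the present work must do without:

    def segmentations(text: str, vocab: set) -> list:
        """Enumerate every way to cover `text` with items from `vocab`."""
        if not text:
            return [[]]                 # exactly one way to segment nothing
        results = []
        for i in range(1, len(text) + 1):
            head = text[:i]
            if head in vocab:           # a known word form starts the string
                for rest in segmentations(text[i:], vocab):
                    results.append([head] + rest)
        return results

    # Even a three-symbol string over a five-item vocabulary is ambiguous:
    print(segmentations("abc", {"a", "ab", "b", "c", "bc"}))
    # [['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c']]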
Surveys indicate that about 4000-5000 kanji are in current use. Various official recommendations have been made restricting the number of kanji approved for official use in government publications, education and the mass media. There are currently 1945 sanctioned `general use kanji' (jōyō kanji), and an additional 284 kanji are authorized for use in personal names.
Kanji are most commonly used to write nouns and the stems of verbs and adjectives. Furthermore, Japanese names are usually written in kanji, though recently the use of hiragana and katakana has increased.
hiragana is a sound-based syllabic script. When they adapted the Chinese script for their own language, the Japanese faced certain difficulties based on the different grammatical structures of the two languages: While Chinese is an uninflected language, Japanese has both inflections and a large number of grammatical particles. Since the characters were not suited to represent such elements, they were used purely phonetically. For example, `ru' was conventionally represented phonetically by characters such as `留' (`stop, remain') and `流' (`flow'). Over time, simplified versions which require fewer strokes developed, e.g. `る'.
Modern-day hiragana consists of 46 basic symbols, though variants may inflate this number to 82. The full list is given in table A.1, along with the Hepburn romanizations.
Generally, the primary function of hiragana is to write grammatical elements. This includes prefixes, suffixes, particles, demonstrative words, the copula, grammatical nouns, common verbs and Sino-Japanese items whose kanji have been proscribed from general use.
The role of hiragana, as that of katakana, is slowly changing, though: Just a few decades ago the ratio of kanji to kana was roughly 70:30 (Eschbach-Szabo 2002, p. 312). Now it is about 30:70, and the most frequently used kana are hiragana. The cause of this development is the sheer number and complexity of the kanji: There are at most 82 different hiragana and 83 different katakana symbols (see tables A.1 and A.2), but several thousand kanji. Therefore, more and more commonly used words are written in kana instead of kanji.
Theoretically, Japanese can be written entirely in kana, hiragana as well as katakana, and books for young children are written this way. Paradoxically, Japanese written in this form is often much more difficult to read, since the vocabulary is then the sole source of information about how to segment the text into meaningful units. But even if the full required vocabulary is known, a certain ambiguity remains: Due to its homophonous nature, Japanese words written in kana can often have various meanings, e.g. `kumo' (`くも') can refer to `spider' (`蜘蛛') as well as to `cloud' (`雲'). Therefore, it is unlikely that kana, or any other kind of script, will replace kanji anytime soon.
katakana were often derived from the same kanji as hiragana. They developed from diverse systems of priestly shorthand that aided the reading of Chinese texts and Buddhist scriptures by supplying, in the form of abbreviated kanji strokes, the Japanese particles and endings missing in Chinese. Furthermore, they were also used to denote the phonetics of words.
One of the words written in katakana in the introductory example above is `ダイオキシン' (`daiokishin'), `dioxin'. For a full list see table A.2, where the katakana are shown along with their Hepburn romanizations. Their appearance is stiffer and more angular than that of their hiragana counterparts. katakana are often compared to the use of italics in printed Western languages, i.e. they are used for items which are in some way unusual or for some particular special effect such as emphasis. Thus their primary use is in representing loanwords other than those from classical Chinese, particularly from English and other European languages, in writing names other than Chinese or Korean ones, and in writing onomatopoeic words. Since other languages have more and different sounds, more combinations of katakana than of hiragana are admissible in order to approximate the foreign sounds. Finally, katakana may also be used to avoid complicated kanji or to make long stretches of hiragana and kanji easier to read, as they aid in visually segmenting them.
Besides the Japanese scripts, Japanese writing in general comprises three further scripts: Roman letters, Arabic numerals and Roman numerals.
Roman letters comprise the letters from `A' to `Z', uppercase as well as lowercase, though the latter are rarely used. Roman letters are regularly used for abbreviations and acronyms, which are commonly based on English and thus constitute a special group of loanwords. It is important to note that these are often only based on English vocabulary and are not genuinely English words. Many of them have been coined in Japan (e.g. `skinship') and incorporated into the Japanese language.
Though the kanji script has its own numerals, their use is largely restricted to traditional vertical writing. In horizontal writing, Arabic numerals are the rule, and they are frequently encountered in vertical writing as well. In comparison, Roman numerals have a much more restricted role: They are usually employed to denote order, i.e. to number chapters or sections in books. They could be considered part of the set of Roman letters; but since they are kept distinct in the literature, they are listed separately here as well.
1.2.4.3 Problems and challenges
As noted above, there is no spacing between word forms to aid in the segmentation of a text. Japanese has various punctuation marks, but except for the equivalents of the period (`。', `maru'), the comma (`、', `ten') and the quotation brackets (`「...」', `『...』'), they are used rarely. Comma- and period-type punctuation marks indicate phrase and sentence divisions, but within these units the symbols of the various scripts follow each other without any whitespace or separators.
Hence, the first task when dealing with the Japanese language is to find some way to segment a given text into tokens of a size suitable for further analysis. But this task, quite easy in languages like German or English, is very difficult here. Since there are no whitespace characters, their use as syntactic separators to tokenize the text is not an option. Thus the common approach is to employ a dictionary comprising the vocabulary, plus a set of grammatical rules, to extract word forms. But as this work's purpose is to use as little information as possible, this avenue is barred.[8]
[8] On a side note, morphological parsers such as chasen, which use dictionaries and grammatical rules to parse Japanese text, have not yet shown themselves to be well suited to the task of text segmentation. Though they are useful and achieve acceptable results, I have not yet seen a non-trivial text longer than a page which was segmented without error.
Still, it is possible to make use of the punctuation marks, namely the period and the comma. However, experiments show that the resulting tokens can still be as long as a hundred symbols. Thus this can only be a starting point; afterwards, different techniques, chiefly based on statistical analysis, are required.
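A minimal sketch of this pre-segmentation step, assuming the corpus is available as a single Unicode string and splitting only at the maru and ten introduced above (the function name is illustrative):

    import re

    def presegment(corpus: str) -> list:
        """Split a raw corpus at the period- and comma-type marks only."""
        return [chunk for chunk in re.split("[。、]", corpus) if chunk]

    # The resulting chunks are guaranteed to end at phrase or sentence
    # boundaries, but they may still be up to a hundred symbols long.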
Statistical analysis needs to deal with the very core of Japanese writing: orthographic variation. There is no standardized orthography in Japanese, and the Japanese and non-Japanese scripts are used freely in combination wherever the writer sees fit. As explained above, there is not only one Japanese script but three: kanji, hiragana and katakana. The kana, that is hiragana and katakana, may be used as a replacement for kanji or in their own right. For example, the kanji compound `水圧' (`suiatsu'), `water pressure', can also be written in hiragana as `すいあつ', or in katakana as `スイアツ'. Though it is not customary, one may also mix the scripts and write `すい圧', `水あつ' etc.
This becomes even more complicated with word forms which comprise more than one script. For example, `取る' (`toru'), `to take', or `アメリカ人' (`amerikajin'), `American (person)', could also be represented as `とる' and `あめりかじん', or `トル' and `アメリカじん', respectively. The acceptable but uncustomary kanji variant of the last one, `亜米利加人', could also be encountered. Of course, combining non-Japanese scripts with Japanese scripts is also possible: `AMERIKA人', `AMERIKAJIN', `AmerikaJin', `amerikajin', `AMERIKAじん' etc. are also valid.
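At least the variation between the two kana syllabaries can be neutralized cheaply: in Unicode, each katakana letter lies exactly 0x60 code points above its hiragana counterpart. The following sketch folds katakana onto hiragana; whether and when to apply such a normalization is a design decision rather than part of the method developed here, and it leaves kanji variants such as `水圧' untouched:

    def fold_kana(text: str) -> str:
        """Map katakana letters onto their hiragana counterparts (offset 0x60)."""
        return "".join(
            chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
            for ch in text
        )

    # `スイアツ' and `すいあつ' now coincide, raising the variant's frequency:
    print(fold_kana("スイアツ") == "すいあつ")    # True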
It might be argued that, from a purely syntactic point of view, these objections are invalid, as the various ways of writing can be considered variants or word forms and deemed acceptable in their own right. This perspective, though understandable, ignores the practical problems of a statistical analysis. For instance, having many different variants reduces the frequency of a word form and thus the likelihood of recognizing it as a distinct unit; such a unit in turn, by further segmenting a given text, might be used to discover further units. The following example illustrates the problem:
Example (Japanese original in mixed kanji and kana; transcription): `basu-ni notta onna-ga nokoru. basu-ni notta no wa ureshī kara.'
The sentence above means `Because she is happy that she got on the bus, the woman who got on the bus stays behind.' Here attention is directed towards the two variant writings of `notta' (`got on'): once with the kanji stem, `乗った', and once purely in hiragana, `のった'. A simple statistical analysis based on term frequency would find the recurring strings, such as `basu-ni' and the shared ending `-tta', to have a higher than average occurrence and thus consider them word form candidates. This could then lead to the isolation of the remaining `no' and `tta', making them word form candidates on their own. Since this was already an incorrect segmentation, further incorrect tokens could be created, which in turn would lead to further incorrect segmentations etc. Note that this statistical analysis approach is overly simplistic, but the example shows what difficulties the lack of a standardized orthography can induce.
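The pitfall can be reproduced with the simplest conceivable analysis: counting all substrings up to some length and proposing the most frequent ones as word form candidates. The sketch below is exactly this naive counter; the length bound and any frequency threshold applied to its output are arbitrary illustration parameters, not values used by the method developed in this work:

    from collections import Counter

    def substring_counts(corpus: str, max_len: int = 4) -> Counter:
        """Count every substring of 1..max_len symbols in the corpus."""
        counts = Counter()
        for i in range(len(corpus)):
            for j in range(i + 1, min(i + max_len, len(corpus)) + 1):
                counts[corpus[i:j]] += 1
        return counts

    # With variant spellings, the occurrences of one word form are split
    # across several strings, so none of them may pass a candidate
    # threshold, while fragments shared by the variants pass instead.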
Besides these effects on frequency statistics, particles prove to be a further hindrance to a successful automatic segmentation. As mentioned before, `の' can function as a particle marking the genitive case. But it can also function as a nominalizer, or be part of a name or of an inflection. The same holds true for `から' (`kara'), a sentence-final particle denoting causality: For example, `かかる' (`kakaru'), `to need', is conjugated to `かからない' (`kakaranai'), `to not need', which contains `から'.
As Japanese is an agglutinative language, it makes extensive use of suffixes. Theoretically, these should be helpful for analysis, e.g. for the classification of verbs according to their tenses or their potential inflections. However, the lack of whitespace makes it hard to identify the beginning and the end of a word form. For instance, `ない' (`nai') is a suffix denoting negation. But it can also mean `within' or be the first part of the following word form.
In view of the examples above, it might be thought that concentrating on the kanji would solve most of these problems. However, this procedure discards a significant percentage of Japanese script and effectively limits the analysis to nouns. But even if this were an option for this work, not only does it not solve all problems, it even introduces new ones. For example, the phrase `アメリカ人見た' (`amerikajin mita'), `I saw an American', is reduced to `人見', a combination which does not exist in the Japanese vocabulary. The consequences are the same as for the wrong segmentations above.
Finally, concentrating on the kanji does not solve problems arising from composition and abbreviation in Japanese. For example, the compound `繰り返す' (`kurikaesu'), `to repeat (again and again)', contains the hiragana characters `り' (`ri') and `す' (`su'). Ignoring them results in `繰返', a term which is not in the Japanese dictionary. This is a problem similar to the example of `アメリカ人見た' (`amerikajin mita') above. Likewise, `外国銀行' (`gaikokuginkō'), `foreign bank', may be abbreviated (`gaikō'), a term which will most likely not be encountered outside of the text passage for which it is defined.
1.2.4.4 Summary
Japanese has a highly complex script and differs greatly in style and grammar from the other candidate languages. The characteristics and problems outlined above make this language extraordinarily hard to analyze with an automatic analysis system, and this was the reason why it was selected for this work.