
Strings of Natural Languages

Unsupervised Analysis and Segmentation on the Expression Level

©2006, Diploma Thesis (Diplomarbeit), 146 pages

Abstract

Learning a second language is often difficult. One major reason for this is the way we learn: We try to translate the words and concepts of the other language into those of our own language. As long as the languages are fairly similar, this works quite well. However, when the languages differ to a great degree, problems are bound to appear. For example, to someone whose first language is French, English is not difficult to learn. In fact, he can pick up any English book and at the very least recognize words and sentences. But if he is tasked with reading a Japanese text, he will be completely lost: No familiar letters, no whitespace, and only occasionally a glyph that looks similar to a punctuation mark appears.
Nevertheless, anyone can learn any language. Correct pronunciation and understanding alien utterances may be hard for the individual, but as soon as the words are transcribed to some kind of script, they can be studied and - given some time - understood. The script thus offers itself as a reliable medium of communication.
Sometimes the script can be very complex, though. For instance, the Japanese language is not much more difficult than German - but the Japanese script is. If someone untrained in the language is given a Japanese book and told to create a list of its vocabulary, he will most likely fail at the task.
Or will he? Are there perhaps ways to analyze the text despite his unfamiliarity with this type of script and language? Should there not be characteristics shared by all languages which can be exploited?
This thesis assumes the point of view of such a person, and shows how to segment a corpus in an unfamiliar language while employing as little previous knowledge as possible.
To this end, a methodology for the analysis of unknown languages is developed. The single requirement made is that a large corpus in electronic form which underwent only a minimum of preprocessing is available. Analysis is limited strictly to the expression level; semantics are purposefully left out of consideration. This distinguishes this work clearly from other works, limits comparability to some extent, and may make detection of some kinds of language features hard or even impossible.
Only unsupervised analysis is admissible, and no specific information on grammatical rules, ways to segment the text, what separators look like etc. is employed. Furthermore, no parameters such as absolute thresholds or selection […]



Markus Stengel
Strings of Natural Languages
Unsupervised Analysis and Segmentation on the Expression Level
ISBN: 978-3-8366-0627-1
Printed by Diplomica® Verlag GmbH, Hamburg, 2008
Also published as: Diplomarbeit (diploma thesis), Eberhard-Karls-Universität Tübingen, Tübingen, Germany, 2006
This work is protected by copyright. The rights thereby established, in particular those of translation, reprinting, recitation, extraction of illustrations and tables, broadcasting, microfilming or reproduction by other means, and storage in data processing systems, remain reserved, even where only excerpts are used. Any reproduction of this work or of parts of this work is, even in individual cases, permitted only within the limits of the statutory provisions of the copyright law of the Federal Republic of Germany in its currently valid version, and is always subject to remuneration. Violations fall under the penal provisions of copyright law.
The use of common names, trade names, product designations etc. in this work does not justify the assumption, even in the absence of specific markings, that such names are to be regarded as free under trademark and brand protection legislation and may therefore be used by anyone.
The information in this work was compiled with care. Nevertheless, errors cannot be ruled out completely, and the Diplomarbeiten Agentur, the authors and translators assume no legal responsibility or any liability for possibly remaining incorrect statements and their consequences.
© Diplomica Verlag GmbH
http://www.diplomica.de, Hamburg 2008
Printed in Germany

CONTENTS

Table of Contents   iii
List of Figures   vi
List of Tables   viii
List of Algorithms   ix
List of Abbreviations   xi
Introduction   1
1 Language   3
  1.1 Definitions   3
  1.2 Languages   5
    1.2.1 English   6
    1.2.2 German   10
    1.2.3 Hebrew   13
    1.2.4 Japanese   14
2 Categorization   23
  2.1 Definitions   23
  2.2 Sample Application   24
  2.3 Conclusion   26
3 Analysis Methods and Techniques   27
  3.1 Level of Abstraction   27
  3.2 Data Compression   27
    3.2.1 Overview   27
    3.2.2 Information content and its quantification   28
    3.2.3 Kinds of data compression   30
    3.2.4 Run length encoding   31
    3.2.5 Dictionary-based data compression: LZ78, LZW, LZMW   33
    3.2.6 LZMW78   37
    3.2.7 Sample application   39
  3.3 Longest Common Subsequence   40
    3.3.1 Overview   40
    3.3.2 Application   41
  3.4 Statistics: N-Gram and Term Frequency   43
    3.4.1 Definitions   43
    3.4.2 Limited applicability of published statistics   44
    3.4.3 The challenges of collecting statistics   45
    3.4.4 Fixed term size   46
    3.4.5 Variable term size   48
    3.4.6 Suffix tree   48
    3.4.7 Suffix array   52
  3.5 Cryptology   56
    3.5.1 Motivation   56
    3.5.2 Character frequency   57
    3.5.3 Index of coincidence   58
    3.5.4 Patterns   59
4 Tasks and Results   61
  4.1 Experimental Setup   61
    4.1.1 Corpora   61
    4.1.2 Preprocessing   63

LIST OF FIGURES

3.1 Types of data compression   31
3.2 Suffix tree examples   50
3.3 Suffix array structures after sorting   54
4.1 Entropy images for the corpora G1, H1 and J3   68
4.2 Index of coincidence difference plots for the corpora E1, J4, G4 and G5   70
5.1 The chain of tools developed in this work   98
A.1 Language tree   107
B.1 Index of coincidence plot: G1   109
B.2 Index of coincidence plot: G1   109
B.3 Index of coincidence plot: G1   110
B.4 Index of coincidence plot: G1   110
B.5 Index of coincidence plot: G2   111
B.6 Index of coincidence plot: G3   111
B.7 Index of coincidence plot: G4   112
B.8 Index of coincidence plot: G5   112
B.9 Index of coincidence plot: H1   113
B.10 Index of coincidence plot: H2   113
B.11 Index of coincidence plot: J1   114
B.12 Index of coincidence plot: J2   114
B.13 Index of coincidence plot: J3   115
B.14 Index of coincidence plot: J4   115
B.15 Index of coincidence plot: J5   116
B.16 Suffixes, prefixes and reduced SUs for E3   119
B.17 Suffixes, prefixes and reduced SUs for G3   120
B.18 Suffixes, prefixes and reduced SUs for H1   121
B.19 Suffixes, prefixes and reduced SUs for H2   122
B.20 Suffixes, prefixes and reduced SUs for J3   123
B.21 Segmentation results for H1 and H2   124
B.22 Segmentation results for G3   124

LIST OF TABLES

1.1 Language families and languages   5
1.2 Frequencies of constituent orders   6
2.1 Sample tokenization categories   24
2.2 Sample text tokenization   25
2.3 Results of sample text categorization   25
2.4 Dictionary built from sample text categorization   25
2.5 Sample text reencoded to the dictionary built from categorization   26
3.1 Sample encoding with LZ78   34
3.2 Dictionary of an LZW sample compression   35
3.3 Encoding steps of an LZW sample compression   36
3.4 Dictionary of an LZMW sample compression   36
3.5 Sample encoding with LZMW78   38
3.6 LZMW78 dictionary after encoding the vector (3,1,4,1,0,1,5,2)   39
3.7 LCS examples   41
3.8 Various categorizations exploitable by LCS   42
3.9 Frequencies of English and German letters   45
3.10 Possible terms dependent on alphabet size   46
3.11 Prefixes and suffixes of the string `BANANA$'   49
3.12 Suffix array illustration   53
3.13 Suffix array and prefix tables   54
3.14 Order of letter frequency   57
3.15 Index of coincidence for various languages   58
4.1 Corpora used in this work   62
4.2 Experimental Setup: system specifications and implementation   64
4.3 Sample rating results and rankings   65
4.4 Sample meta-rating results   66
4.5 Maximum ratio of IC differences to IC for the individual corpora   71
4.6 Character order of the corpora   72
4.7 Pangram-ending character order   74
4.8 LCS and compression results: syntactic separators   78
4.9 Sample problem dictionary for prefix tables   80
4.10 Aligner results for biblical corpora   81
4.11 Suffixes, prefixes and reduced SUs for biblical corpora   85
4.12 Suffixes, prefixes and reduced SUs for J3   87
4.13 Compound detection results for biblical corpora   90
4.14 Most frequent meta-rated terms   94
A.1 hiragana   104
A.2 katakana   105
A.3 Hebrew alphabet   106
B.1 Aligner results for various corpora   117
B.2 Aligner results for various corpora   118
B.3 Detected compound samples for E3   125
B.4 Detected compound samples for G3   126
B.5 Detected compound samples for H1   127
B.6 Detected compound samples for H2   128
B.7 Detected compound samples for J3   129

LIST OF ALGORITHMS

1 Entropy image creation   67
2 Index of coincidence computation   69
3 Detection of syntactic separators with LCS and LZMW78   76
4 Detect syntactic separators by aligning at selected strings   80
5 Detection of prefixes and suffixes   83
6 Detect and split compounds   89

List of Abbreviations

AIC   algorithmic information content
IC   index of coincidence
ID   identification (number)
KCC   Kolmogorov-Chaitin complexity
MR   meta-rating
PSR   prefix, suffix, and/or reduced segmentation unit
RLE   run-length encoding
SIC   Shannon information content
SOV   subject-object-verb (sentence structure)
SU   segmentation unit
SVO   subject-verb-object (sentence structure)
TF   term frequency
LCS   longest common subsequence
LZW   Lempel-Ziv-Welch (compression)
LZMW   Lempel-Ziv-Miller-Wegman (compression)
LZMW78   my modification of LZMW (compression)

Introduction
The limits of my language are the limits of my mind. All I
know is what I have words for. – Ludwig Wittgenstein
Everyone who has ever learned a second language knows how hard it is. There are
always differences: Some are glaringly obvious, and others are so subtle that even their
concepts are difficult to understand. One major reason for this is the way we learn: We try
to translate the words and concepts of the other language into those of our own language
which we are comfortable with.
As long as the languages are fairly similar, this works quite well. However, when the
languages differ to a great degree, problems are bound to appear. For example, to someone
whose first language is French, English is not difficult to learn. In fact, he can pick up any
English book and at the very least recognize words and sentences. But if he is tasked with
reading a Japanese text, he will be completely lost: No familiar letters, no whitespace, and
only occasionally a glyph that looks similar to a punctuation mark appears.
Nevertheless, anyone can learn any language. Correct pronunciation and understanding
alien utterances may be hard for the individual, but as soon as the words are transcribed
to some kind of script, they can be studied and - given some time - understood. The script
thus offers itself as a reliable medium of communication.
Sometimes the script can be very complex, though. For instance, the Japanese language
is not much more difficult than German - but the Japanese script is. If someone untrained
in the language is given a Japanese book and told to create a list of its vocabulary, he will
most likely fail at the task.
Or will he? Are there perhaps ways to analyze the text despite his unfamiliarity
with this type of script and language? Should there not be characteristics shared by all
languages which can be exploited?

This thesis assumes the point of view of such a person, and attempts to find ways to
segment a corpus in an unfamiliar language while employing as little previous knowledge
as possible.
To this end, a methodology for the analysis of unknown languages is developed. The
single requirement made is that a large corpus in electronic form which underwent only a
minimum of preprocessing is available. Analysis is limited strictly to the expression level;
semantics are purposefully left out of consideration. This distinguishes this work clearly
from other works, limits comparability to some extent, and is expected to make detection
of some kinds of language features hard or even impossible.
Only unsupervised analysis is admissible, and no information on grammatical rules,
ways to segment the text, what separators look like etc. is employed. Furthermore,
no parameters such as absolute thresholds or selection of the n-best candidates are allowed; all parameters and evaluation criteria must be relative and justifiable, not based on experimental results. Though this makes this thesis' task harder, it also offers the advantage that absolute parameters are not required, and thus need not be adjusted or optimized to fit a corpus or language.
Chapter one gives an overview of the languages examined in this work: English, German,
Hebrew and Japanese. It also argues their choice, suitability and representativeness.
Chapter two introduces categorization, a key concept in this work. Categorization is used
for segmentation, classification and other tasks. Furthermore, some sample categorizations
exemplify application of this concept.
Chapter three covers the technical basis of this work. Methods and techniques from
various fields are introduced, namely data compression, bioinformatics, statistics and
cryptology. The methods developed in this work employ chiefly the algorithms and
concepts introduced in this chapter.
Chapter four states the tasks tackled in this work and reports results and devised
methods. It starts with the experimental setup, and continues with an introduction to the
evaluation and rating methodology of this thesis. Then two ways to automatically create
excerpts from a corpus follow. The detection of syntactic separators and segmentation of
text conclude the chapter.
Finally, chapter five summarizes this work's achievements, and chapter six gives an
outlook on possible and promising future work.

CHAPTER 1
Language
This chapter gives an overview of the languages examined in this work: English, German,
Hebrew and Japanese. It starts with some required definitions. Then short introductions
to the individual languages and their scripts follow, and their choice, suitability and
representativeness is argued.
1.1 Definitions
The definitions given in this chapter will be used henceforth in this work.
Definition 1. Syntax is limited to the expression level: terms have no semantic function, i.e. all word forms, punctuation marks etc. are treated as strings.
This limitation is motivated and argued by Schweizer (2006, pp. 203 et seqq.). It is an
important distinction between this work and others and has far-reaching consequences.
For instance, some relationships between words, e.g. that between the singular and the
plural form of a word (see 1.2.2.2 for an example), cannot be detected. Furthermore, due
to the strict limitation to the expression level, the results of this thesis cannot be compared
directly with those of other works which use a traditional understanding of syntax (as a
conglomerate of observations of external variations and of semantic insights).
Definition 2. An alphabet is a finite set of symbols (Cormen et al. 2001, pp. 975-976).
The symbols that constitute an alphabet are also referred to as characters.
Note that this definition of alphabet includes whitespace characters; namely blanks,
tabulators, line feeds etc.
Definition 3. A formal language L over an alphabet Σ is any set of strings made up of symbols from Σ. ∅ is the empty formal language and ε the empty string. A language or natural language L_nat is any subset of L deemed acceptable by its speakers and writers.
Definition 4. A separator is used to segment data into smaller units. A syntactic separator is a member of L which does not carry a semantic meaning by itself. In contrast, a semantic separator is a member of L_nat which has no strictly syntactic function.
Typical syntactic separators are punctuation marks or whitespace symbols, while
examples of semantic separators are particles or word elements that function as affixes.
Definition 5. A word form is a lexeme together with zero, one or more of its potential affixes. For example, `stay' and `stayed' are two different word forms.
Definition 6. A token is a block of one or more contiguous characters or bytes extracted
from a text or data stream.
A token does not need to be a word form, but it can be one.
Definition 7. Tokenization is the process of extracting tokens from a given text or data
stream by some specific method.
For instance, a standard way of extracting word forms from a text is to make use of
syntactic separators enclosing them. Besides the process itself, the result is also sometimes
referred to as tokenization.
In this work, segmentation is almost the same as tokenization. However, when this
term is used, it is expected that the resulting segments are likely to be words or similar
meaningful units.
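
To make definitions 6 and 7 concrete, here is a minimal tokenization sketch. It is written in Python purely for illustration, and the separator set passed in is an assumption: supplying such a set up front is exactly the kind of prior knowledge this work sets out to avoid.

    def tokenize(text, separators):
        # Cut the character stream at separator characters; every maximal
        # run of non-separator characters becomes one token (definition 6).
        tokens, current = [], []
        for char in text:
            if char in separators:
                if current:
                    tokens.append("".join(current))
                    current = []
            else:
                current.append(char)
        if current:
            tokens.append("".join(current))
        return tokens

    print(tokenize("The boy gave the girl a ring.", set(" .,;:!?")))
    # ['The', 'boy', 'gave', 'the', 'girl', 'a', 'ring']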
Definition 8. A word form, affix, compound or the like is called correct or natural if
it is an element of its natural language.
Definition 9. A word form, affix, compound or the like is called a candidate if it is not (yet) certain to be a word form, affix, compound or the like, but has a high likelihood of being one.
Definition 10. A segmentation unit (SU) is a word form candidate.
Definition 11. An affix is a string that is attached to another string S and fulfills the following criterion: The affix occurs more than once at the same position with respect to S. An affix preceding S is called a prefix; one that is appended to S is called a suffix. The affix is a part of an SU.
Definition 12. Let S_i, S_j and I denote strings of lengths greater than or equal to one. If the pattern S_i I S_j occurs more than once, then I is an infix.
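
As a toy illustration of definition 11, candidate suffixes can be collected by counting how often each short trailing string closes a segmentation unit; any string that does so more than once qualifies as a candidate. The sketch below is hypothetical: the function name, the sample SUs and the length bound are all invented for illustration.

    from collections import Counter

    def suffix_candidates(sus, max_len=4, min_count=2):
        # Count trailing substrings over all segmentation units; a string
        # occurring at the same (final) position more than once is a
        # suffix candidate in the sense of definition 11.
        counts = Counter()
        for su in sus:
            for k in range(1, min(max_len, len(su) - 1) + 1):
                counts[su[-k:]] += 1
        return {s: c for s, c in counts.items() if c >= min_count}

    print(suffix_candidates(["stayed", "walked", "stays", "walks"]))
    # {'d': 2, 'ed': 2, 's': 2}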
Definition 13. A dictionary holds strings of a formal language L.
This definition is similar to that of Salomon (2004, p. 165) and allows dictionaries to
comprise correct word forms of natural languages as well as arbitrary combinations of symbols from their alphabets. For example, both `stay' and `!xychzta?' are valid entries.

1.2 Languages
For this work, four languages were examined: English, German, Hebrew and Japanese.
They belong to the three different language families listed in table 1.1. As the term family
implies, there must be some characteristics of languages which make it possible to distinguish among them and form groups, so-called language families. Diakonoff (2002, p. 722) states one way in which this can be done:
language family                 languages
Afro-Asiatic (Hamito-Semitic)   Hebrew
Altaic                          Japanese
Indo-European                   English, German

Table 1.1: Language families and languages
The only real criterion for classifying certain languages together as a family
is the common origin of their most ancient vocabulary as well as of the word
elements used to express grammatical relations. A common source language
is revealed by a comparison of words from the supposedly related languages
expressing notions common to all human cultures (and therefore not as a rule
likely to have been borrowed from a group speaking another language) and also
by a comparison of the inflectional forms (for tense, voice, case, or whatever).
There are problems with this statement, though. "[E]xpressing notions common to all
human cultures (and therefore not as a rule likely to have been borrowed from a group
speaking another language)" is irrelevant to this work as it is not a syntactic criterion.
Moreover, it is hard to define these "notions common to all human cultures" and to make
them available for analysis.
For these reasons, the highlighted part above is excluded from the consideration as
to what a language family is. The definition of a language family is thus henceforth
based purely on syntactic criteria. These are still enough to justify calling the set of
languages that make up a language family by its assigned name: The omission of the
non-syntactic criterion does not invalidate the fact that the syntactic requirements are
still met. Furthermore, the syntactic criteria are still sufficient to raise expectations that
the scrutinized languages in this work are diverse enough to expose any technique that is
optimized too much for specific language characteristics.
Tomlin (1986, p. 1) gives further motivation for the choice of languages:
If one reads through older grammatical descriptions of languages, especially
grammars written in the late nineteenth and early twentieth centuries, one
often finds only a very brief section devoted to syntax, which deals almost
exclusively with word order. In a similar vein, the new second language learner
often is intrigued as much by word order differences in the new language as by
any other feature except, perhaps, phonology. Word order, thus, represents the
most overtly noticeable feature of cross-linguistic syntax, yet at the same time
it remains a tantalizing problem, both to describe the pertinent facts of word
order variability and to provide some explanation for the great diversity one
can see cross-linguistically.
Since this work is limited to the expression level, `word order' is understood differently
from Tomlin: What he terms `word order' is the order of certain semantic functions.
Therefore, Tomlin requires semantic information for his analysis, i.e. the meaning of words,
or he could not use terms like subject, verb and object. By contrast, this work must not
utilize this kind of information.
Nevertheless, Tomlin's findings offer a further way to evaluate the representativeness
of languages: He gives an account of the "frequencies of basic constituent orders in a
representative sample of the languages of the world" (Tomlin 1986, p. 22). They are listed
in table 1.2 along with the languages scrutinized in this work. As can be seen, the three
most common orders with a total frequency of 95.77% are covered. Therefore, it can be
claimed that there is a certain degree of generality to the results.
constituent order           frequency   examined language
subject-object-verb (SOV)   44.78%      Japanese
subject-verb-object (SVO)   41.79%      English, German
verb-subject-object (VSO)    9.20%      Hebrew
verb-object-subject (VOS)    2.99%      –
object-verb-subject (OVS)    1.24%      –
object-subject-verb (OSV)    0.00%      –

Table 1.2: Frequencies of the orderings of the three nuclear constituents of a transitive clause (Tomlin 1986, p. 22)
The following introductions will focus on those characteristics which are relevant to this
work and are in no way meant to be a complete introduction to the languages. However,
sometimes supplementary information deemed noteworthy or useful to understand why
each language was selected is given as well.
1.2.1 English
According to Potter (2002) and Encyclopedia Britannica (2002a), English belongs to the
West-Germanic branch of the Indo-European language family. It is the mother tongue of
approximately 350 million people and ranks number one as second language. Furthermore,
it is the most widely taught foreign language.

1.2.1.1 Overview
The English alphabet consists of 26 characters, each in two variants: 26 upper case letters with their corresponding 26 lower case letters. Besides these, Arabic numerals [1], various punctuation markers and whitespace characters, the blank in particular, are also included and essential to the script [2].
Though closely related to German, another of the languages scrutinized in this work,
English has lost most of the system of inflections that German still retains from their
common ancestral language. Nowadays English is relatively uninflected and relies chiefly
on two mechanisms to achieve the same effects inflections are used for in other languages:
affixation and composition. Furthermore, flexibility of function, word order and openness
of vocabulary compensate for what these two morphological processes cannot accomplish.
English has a relatively strict subject-verb-object (SVO) sentence structure. Although
nouns, pronouns, and verbs are inflected, adverbs, prepositions, interjections and
conjunctions are invariable.
Affixation in English comprises suffixes as well as prefixes. Remarkable about this process is the stickiness of suffixes: Once a suffix has been attached to a stem, the result is likely to be added to the language vocabulary as a full-fledged word in its own right, e.g. `study' and `ent' are combined to `student'. Its addition increases the dictionary of word forms by more than one since the resulting word is a noun and thus is subject to noun affixes, e.g. `-s' (`students') or `-like' (`student-like').
Composition is achieved by joining two or more word forms. For example, `fire' and `work' are combined to `firework', just as `free' and `loader' result in `freeloader'. However,
while joining the words, letters may be dropped. Therefore, `all' and `ready' result
in `already'. It is noteworthy that composition is not limited to nouns or verbs, but
almost any kind of combination is possible: `breakwater' (verb-noun), `icebreaker'
(noun-verb), `blackbird' (adjective-noun), `sugar-sweet' (noun-adjective), etc. As with
affixes, compounds are subject to further modifications, e.g. `test-drive' may be extended
to `test-driver'.
The loss of most of its inflections allows English to employ a mechanism termed
flexibility of function: Verbs and nouns can often be used both as nouns or verbs. This is
not possible with most other languages, especially not in other Indo-European languages,
since inflections cause verbs and nouns in those languages to have different endings. This is easily illustrated by the compound example `test-drive' above: Both `to test-drive' and `a test-drive' are valid. While this introduces a great deal of flexibility into the language, it has its share of disadvantages. For instance, it cannot be surmised from its form alone whether `roadkill' is a verb or a noun. Though the use of `to roadkill' is theoretically acceptable, none of my dictionaries lists it.

[1] Though English comprises Roman numerals as well, they are not distinct entities of the alphabet but combinations of letters.
[2] Symbols like `$' or `£' are not explicitly noted since their usage is regionally dependent, and for this work they are not considered to be part of the natural language. Furthermore, they differ from the other symbols since they are basically abbreviations for `dollars' and `pounds' and are therefore replaceable.

Word order is another concept which reflects English being a rather uninflected language. In order to avoid ambiguities, English is much more inflexible than inflected languages in terms of possible constructions. For example, `The boy gave the girl a ring.' may also be written
as `The boy gave a ring to the girl.', but `The girl got a ring from the boy.' cannot be
rewritten to `The girl got from the boy a ring.' Languages with inflections, e.g. English's
close relative German, allow sentence patterns like this.
Finally, openness of vocabulary refers to "the free admission of words from other
languages and the ready creation of compounds and derivatives. English adopts (without
change) or adapts (with slight change) any word really needed to name some new object
or to denote some new process" (Potter 2002, p. 654). As a consequence, "English has the
largest vocabulary of any language in the world" (Encyclopedia Britannica 2002a, p. 500).
1.2.1.2 Problems and challenges
One of the great challenges of the English language is its large vocabulary. Though this affects rather the technical side, namely the implementation, it should not be neglected.
The size of a language's vocabulary, and thus its dictionary, directly affects statistical
analysis. For example, if one is interested in the frequency of all possible combinations
of three words, numeric stability and memory usage can quickly prove themselves to be
considerable obstacles. Especially the latter may contribute to or worsen the already
considerable time required to complete extensive analysis.
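
To sketch the scale problem just described: counting word trigrams with a plain in-memory table is straightforward, but the table can hold up to |V|^3 distinct keys for a vocabulary V, which is where memory becomes the obstacle. A minimal sketch; the names are illustrative, not taken from the thesis.

    from collections import Counter

    def trigram_frequencies(tokens):
        # One counter entry per distinct contiguous three-word combination;
        # worst-case memory grows with |V|**3, hence the concern above.
        return Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))

    tokens = "the boy gave the girl a ring".split()
    print(trigram_frequencies(tokens).most_common(2))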
While the lack of many differing inflections reduces the number of word forms
and thus the size of the dictionary, it makes the detection of affixes much more difficult. It
can be argued that it is not important to detect all affixes English possesses, and since
there are so few their existence may as well be ignored, treating them as separate head
words. However, this work is concerned with an unbiased system, and as it attempts to
detect potential affixes it might detect incorrect, that is unnatural, ones. Further analysis,
based on the wrong affixes, would identify incorrect stems. Though results like these are
interesting in their own right, an evaluation of the effectiveness and usefulness of such an
automatic analysis system would prove difficult.
Affixation further poses the problem that sometimes additional characters are injected
between the stem and an affix. For example `sin' and `er' result in `sinner' and thus in a
doubling of the letter n. Analysis might have created the - correct - hypothesis that `er' is
a common suffix. But when the dictionary is checked, it finds that `sinn', the reduced form of `sinner', does not exist. This reduces the probability of `er' being a suffix. Thus the corroboration of the hypothesis becomes harder.
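
A minimal sketch of this dictionary check, under the assumption that a dictionary of known word forms is available: the suffix hypothesis is corroborated for a word only if the reduced stem itself exists, which fails for doubled consonants as in `sinner'.

    def supports_suffix(word, suffix, dictionary):
        # The hypothesis "suffix is an affix of word" is corroborated only
        # if stripping it leaves a stem that exists as a word form.
        return word.endswith(suffix) and word[:-len(suffix)] in dictionary

    dictionary = {"sin", "sing", "stay"}
    print(supports_suffix("singer", "er", dictionary))  # True: `sing' exists
    print(supports_suffix("sinner", "er", dictionary))  # False: `sinn' does not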
On a more abstract level the lack of inflections makes classification and grouping
of words much more difficult. Due to the flexibility of function, division of word forms into
groups such as verbs or nouns becomes nearly impossible. This hinders further analysis to
a great extent.
The problems with compounds are of similar nature, though those compounds
that have omitted characters cause additional problems. For example, neither `already' nor
`fortnight' would be detected. Thus a reduction of the dictionary by splitting compounds
into their components could not be achieved. Additionally, in practice the injection
of characters as simple as `-' causes problems: For instance, from a purely syntactic
point of view, the compound `test-drive' exemplified above needs to be split into three
parts; namely `test', `-' and `drive'. As long as it is known that `-' is typically used for composition, this poses no problem. However, without this information decomposition into `test-' and `drive', or `test' and `-drive' might be attempted. Since likely neither `test-' nor `-drive'
exist in the dictionary, none of the combinations can be segmented completely; and as a
result, `test-drive' is not detected as a compound.
Therefore, an automatic decomposition system needs to be able to split a compound
into several parts at once. Furthermore, a single character may form a possible segment,
provided it exists independently as a segmentation unit. But then not only `-' becomes a
candidate, but also `a' and `I' which exist as separate word forms. Word forms such as
`Infatuation' (at the start of a sentence) then require considerably more time to analyze.
This kind of problem naturally arises in any language which has one-character words,
e.g. the Spanish `y'. Though this is - again - rather a technical problem, it does have
considerable effect on how extensive an analysis can be afforded (see 4.5.2.1 on page 88 for
more details).
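
The requirement to split a compound into several parts at once can be met with a standard word-break recursion over a dictionary of SUs. A sketch under that assumption; the memoization and the toy dictionary are illustrative, not the thesis' actual implementation.

    from functools import lru_cache

    def decompositions(word, dictionary):
        # Enumerate every way to split `word' into dictionary entries,
        # allowing any number of parts, including one-character segments
        # such as `-' if they exist independently as segmentation units.
        @lru_cache(maxsize=None)
        def split(rest):
            if not rest:
                return [[]]
            results = []
            for i in range(1, len(rest) + 1):
                if rest[:i] in dictionary:
                    results.extend([rest[:i]] + tail for tail in split(rest[i:]))
            return results
        return split(word)

    dictionary = {"test", "-", "drive", "driver"}
    print(decompositions("test-drive", dictionary))  # [['test', '-', 'drive']]

Note that adding one-character word forms such as `a' or `I' to the dictionary multiplies the number of candidate splits, which is precisely the cost discussed above.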
1.2.1.3 Summary
The English script is not complex, but characteristics of the English language make up for
it. This affects especially the implementation of an automatic analysis system. Moreover,
due to its world-wide pervasiveness as a language and as an object of research, the results
of this thesis can be compared with those of other works.

1.2.2 German
German belongs to the West Germanic family of languages (Encyclopedia Britannica 2002b,
p. 210) and is closely related to English (Potter 2002, p. 654). According to Encyclopedia
Britannica it is the mother tongue of 90 million people and thus ranks 6th among the
languages of the world. Even though there are many different dialects, the only German
taught at school is called `Hochdeutsch', or `High German'.
It should be noted that there was a major overhaul of official German orthography which went into effect in August 2006. However, it does not affect this work for two reasons: Firstly, all German texts which were analyzed complied with the old orthography [3]. Secondly, though some irregularities and grammatical rules which make purely syntactic analysis difficult were removed, most of the problems listed in 1.2.2.2 are not affected by the reform.
1.2.2.1 Overview
The German alphabet basically comprises that of the English language, but adds a few extra characters called `Umlaute' (`umlauts'): `Ä' and `ä', `Ö' and `ö', `Ü' and `ü'. These are vowel alterations of `A' and `a', `O' and `o', `U' and `u' respectively, and `ß' denotes a sharp s in lower case. In contrast to the other letters it has no explicit upper case variant. Therefore, two accepted ways of writing it in upper case exist: Either use `SS', or retain the `ß'.
German is an inflected language. In that regard it differs from English, though both share a common protolanguage (Potter 2002, p. 654). In German, pronouns, nouns and adjectives have four cases of declension, and one of three genders: masculine, feminine and neuter. Furthermore, verbs conjugate according to first, second and third person, in singular as well as plural. The sentence structure is generally subject-verb-object (SVO) [4].
One of the main features of German is that it makes heavy use of capitalization, i.e. every word that functions as a noun starts with an uppercase letter. This makes it quite easy to differentiate between verbs and nouns, e.g. `to hear' translates to `hören', whereas `hearing' translates to `Hören'.
Capitalization has a strong effect on composition in German.
Whereas in English
the creation of compounds simply requires joining the words, conversion to lower case may
be necessary in German, particularly when two nouns are joined. For example, joining
`Dampf' (`steam') and `Schiff' (`boat') results in `Dampfschiff' (`steamboat'), i.e. the `S' of `Schiff' was converted to the lower case `s'.

[3] There is not nearly as much written material available which complies with the new set of rules as with the old one.
[4] One example where one might argue that German occasionally switches to subject-object-verb (SOV) is the perfect tense: In English SVO is strictly preserved, e.g. `The boy gives the girl a ring' becomes `The boy has given the girl a ring'. In German though, `Der Junge gibt dem Mädchen einen Ring' (`The boy gives the girl a ring') becomes `Der Junge hat dem Mädchen einen Ring gegeben' (`The boy has the girl a ring given'). However, as the auxiliary verb `hat' (`has') is conjugated, the sentence is considered to be SVO in this work.

German allows arbitrarily long compounds. One of the longest words actually in use is "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" (Landtag Mecklenburg-Vorpommern 2000), which means `beef labeling regulation and delegation of supervision law'. Though compounds this long are rare and not used frequently, this example illustrates to which extent one may take composition in German.
Another speciality of compounding in German is due to the genders: The gender of a
compound is determined by its last joined component. For example, while `Dampf' is
masculine and `Schiff' is neuter, `Dampfschiff' is neuter. Furthermore, `Fahrt' (`ride') is
feminine, thus `Dampfschiffahrt' (`steamboat ride') is feminine. Note that the third `f' in `Dampfschifffahrt' was dropped: As in English, letters are sometimes omitted [5]. Besides
these specialities, composition works in German as in English, though sometimes specific
inflections, or their removals, are required.
There are many more inflections in German than there are in English.
Most of them serve more than one purpose. For example, the suffix `e' is used to denote the plural form in `Boote' (`boats') and the first person form `(ich) sehe' (`(I) see') of `sehen' (`to see'). But it may also be part of a word stem, e.g. `Freude' (`joy').
Sometimes not only the inflections change, but the stems as well. For instance, the
past tense of `heißen' (`to be called') is `hieß', and `pfeifen' (`to whistle') becomes `pfiff'
(`whistled').
Finally, the plural form often triggers vowel alteration: `Haus' (`house') becomes `Häuser' (`houses'), `Fuß' (`foot') becomes `Füße' (`feet'), `Storch' (`stork') becomes `Störche' (`storks') etc. These alterations are not governed by rules but need to be learned by heart. For instance, the plural form of `Pause' (`break') is not `Päuser' but `Pausen' (`breaks').
1.2.2.2 Problems and challenges
The main obstacles to an efficient and extensive analysis of German are alterations, letter
rearrangements, multi-functional suffixes, capitalization, composition and gender.
Vowel alteration makes it very hard, if not impossible, to detect the correct plural forms of many words. In the example of `Haus' and `Häuser' shown above, both word forms have only three letters in common. There is a high probability that `Hause', as in `zu Hause' (`at home'), would be considered the plural form, especially since `e' is actually used elsewhere to denote plurality (example `Boote' above).

[5] In the reformed writing the `f' is preserved, resulting in `Dampfschifffahrt'.

However, it does not matter if such plural forms are not recognized. Since this thesis explicitly limits itself to the expression level (see definition 1 on page 3), it is expected and acceptable that certain semantic relationships cannot be detected.
In fact, since many common affixes fulfill multiple functions, it seems inadvisable to use common affixes as a means of classification. Furthermore, the large number of inflections greatly increases the dictionary and thus slows down analysis.
The modifications of stems, chiefly visible in the rearrangement of letters, have the
same effect. But as with vowel alteration, it seems difficult to detect such a change and
the correct relations. In particular, there are no rules as to when a stem is modified. For
example, judging from the verb `laufen' (`to run') whose past tense is `lief' (`ran') it could
be assumed that the past tense of `kaufen' (`to buy') is `kief'. However, this is incorrect:
Its past tense is `kaufte' (`bought').
Capitalization introduces even more problems. In this work, no information on this
is provided, therefore the relationship between uppercase letters and lowercase letters
is unknown. Hence, nouns cannot be grouped together automatically, and this obvious
syntactic criterion cannot be used.
However, the biggest problem by far is the composition mechanism of German. Due to capitalization, many compounds will not be detected as such, since not all of their components will be found individually, e.g. `schiff' in `Dampfschiff' above. The omission of letters (`Dampfschiffahrt') worsens this further. Though more restricted than in the English language, it is also admissible to insert a `-' between the components of compounds, which causes the same problems as stated for English above. Since compounds may be very long in German, their decomposition requires a long time (see 4.5.2).
Finally, gender also impacts statistical analysis. For instance, assume that `Dampfschiff' was detected to consist of `Dampf' and `Schiff', which are masculine and neuter respectively. Then the compound is decomposed and removed from the dictionary
to reduce its size and thus speed up analysis. Statistical analysis now faces the problem
that the number of possible combinations of word forms with `Dampf' has increased: In German, pronouns and articles show agreement in gender. In English the decomposition of
`steamboat' into `steam' and `boat' poses no problem if `the steamboat' is encountered and
reduced to `the steam', since `the steam' likely already exists in the dictionary. However,
in German the reduction of `das Dampfschiff' to `das Dampf' differs from `der Dampf'. As
a result, a new combination is created.
1.2.2.3 Summary
Since German is my native language, it is a natural candidate for language analysis
in this work. Furthermore, though closely related to English, it shows several distinct
characteristics which promise to make it difficult in other aspects.

1.2.3 Hebrew
According to Diakonoff (2002) Hebrew belongs to the Northern Central Semitic group of
the Hamito-Semitic languages. The Hamito-Semitic language family is the main language
family of southwestern Asia and northern Africa. It includes languages such as Arabic,
Hebrew, Amharic and Hausa. Though there is disagreement, the prevalent scholarly opinion is that this family is not related to the Indo-European languages. There are about 2.6 million speakers of Hebrew in Israel at present.
As I am not proficient in Hebrew, I had to rely on secondary literature for this
introduction. The sources, in descending order of importance, are Neef (2003, pp. 1-6),
Diakonoff (2002), Tomlin (1986, pp. 22, 188) and Wikipedia (2006a).
1.2.3.1 Overview
In this work, `Hebrew' does not refer to the modern Hebrew called `Ivrit', but to biblical
Hebrew. Its importance is rooted in it being the language of the Old Testament, with
the exception of the Aramaic parts (Neef 2003, p. 2). When vowels are not included, its
alphabet consists of 23 characters and the blank, which functions as a syntactic separator. The characters have no variants; the script is not case-sensitive. In that regard Hebrew differs greatly from English and German. With a total of only 24 characters [6], it is also the smallest alphabet of all languages scrutinized in this work. If vowels are included, the alphabet size increases to 30, since there are six different symbols used to denote vowels.
The list of Hebrew characters is given in table A.3 on page 106, along with the
transcription system used in this work. Furthermore, although Hebrew is written from right to left, examples are given exclusively transcribed into the English alphabet and in left-to-right order. This way they are easier to read for non-Hebrew speakers. Besides, all my data files were in that format.
Hebrew is an inflected language with a rich set of affixes which can form very complex
affix compound structures. In transcribed biblical Hebrew only the consonants are denoted, forming what is called the root of a word (Neef 2003, p. 3). By means of vowel infixation the meaning is further specified. In this work, the script with and without vowels is examined.
Vowel infixation is a remarkable characteristic with far-reaching consequences. For instance, the adoption of loanwords is greatly hindered by this mechanism. Nouns are not affected as much by this; therefore, some loanwords exist. On the other hand, verbs can be subjected to numerous modifications. Furthermore, as vowels were not denoted in the ancient script, ambiguities may occur.
Biblical Hebrew has two genders, masculine and feminine, and three types of number:
singular, dual and plural. They are marked by suffixes (Neef 2003, pp. 54-59). Though
dual is yet another typical feature of Hebrew, it does not matter to this work as it is not achieved through infixation. Thus it has no consequences for syntactic analysis.

[6] The late Masoretic characters are excluded.
Finally, the sentence structure of Hebrew is verb-subject-object (VSO) which is rare
among languages (Tomlin 1986, pp. 22, 188).
1.2.3.2 Problems and challenges
Biblical Hebrew shows a few characteristics which might tempt one into expecting the
language to be hard to analyze. However, not all of them have an effect on the script.
For example, vowel infixation does not matter if vowels are not denoted. If they are denoted, then they increase the vocabulary and hamper analysis which expects relationships between words to be expressed by contiguous strings, i.e. equal word stems. Of course, this also only applies to those stems which actually change.
Ambiguities of meaning are irrelevant to a strictly syntactic analysis. Blanks separate
word forms and decompose compounds, and since the language is inflected, it should prove
easier to group words into categories such as verbs than e.g. in English.
1.2.3.3 Summary
Biblical Hebrew is an interesting contrast to the other languages scrutinized in this work. In terms of the size of its alphabet it is at the opposite end from Japanese (see the introduction to Japanese below), with English and German in between. It is an inflected language
allowing for classification, and since it has comparatively few loanwords (Diakonoff 2002,
p. 727) and compounds are practically pre-decomposed, there should not be too many
irregularities increasing the dictionary.
Nevertheless, the language is different enough to have the potential for unexpected
results. And last but not least, as I am not familiar with the language it forced me to look
at the results in a purely syntactic, unbiased way.
1.2.4 Japanese
Japanese is considered to have an extraordinarily complicated script.
For instance,
Backhouse writes that "there can be no doubt that the Japanese writing system is the
most complex in the world, and that its mastery requires an enormous investment of time
and effort on the part of learners" (Backhouse 1993, p. 38). Even though this work is only
interested in the syntactic challenges analyzing the Japanese script offers, it faces a plethora
of difficulties which will be outlined in this section. For a more thorough and in-depth
analysis the interested reader is referred to Eschbach-Szabo (2002). A short introduction
to the Japanese language and its script are presented below to explain why this language
was selected for analysis and what outcome is to be expected.

The following introduction is based on my own knowledge of the Japanese language,
but also borrows heavily from Backhouse (1993, pp. 38-63) and Shibatani (2002). Further
information was taken from Makino and Tsutsui (2002a, pp. 16-60) and Schneider (1998).
1.2.4.1 Overview
Japanese is the native language of more than 120 million speakers and thus ranks in the
top ten languages of the world. However, it is rarely spoken outside of Japan, and its
expansion in the study as a foreign language, caused by Japan's economic influence, is still
a relatively recent development: Japanese remains very much the language of Japan.
Though scrutinized closely, the origin of the Japanese language, as of the Japanese
people, remains obscure; only the relationship with the languages of the Ryukyu Islands
to the south of Japan is established. They are so similar that these languages are commonly considered dialects of Japanese, rather than separate
languages. Beyond that, its heritage is unknown, though it is established that it is not
related to Chinese, which makes the use of the Chinese script (see below) the more
surprising.
Although there are many striking similarities to Korean in terms of phonetics,
accentuation and grammar, they do not suffice to establish a common heritage between
the two languages. Furthermore, other components of the Japanese language hint at
Austronesian languages. Nowadays, the prevalent assumption is that Japanese belongs to
the group of Altaic languages (Shibatani 2002, p. 732).
Japanese is a polysyllabic, agglutinative language with a strict subject-object-verb (SOV)
sentence structure. Syntactical elements lack independence and are appended as suffixes
and postpositions to independent words which carry meaning. Verbs and adjectives
conjugate with endings, and case distinctions are marked by enclitic particles. Nouns
neither decline nor indicate number or gender. Since modifiers are placed before the
modified, relative clauses and adjectives precede the modified nouns and adverbs come
before verbs. Finally, topic is a key concept: Once a topic has been introduced, or set,
it may be omitted henceforth until the topic changes. This makes very short sentences
possible, which may consist of as little as a single word.
There is a widespread notion of Japanese being a very difficult language.
This
can be attributed to its complicated system of honorifics which is used to establish the
hierarchic relationship between speakers, and its highly complex script (Schneider 1998,
pp. 474-476). Even though this work is concerned with syntactic analysis and thus affected
primarily by the latter, honorifics have considerable influence on any kind of analysis since
they introduce additional prefixes, suffixes, inflections, verbs and nouns.

1.2.4.2 Writing systems
To ease reading and understanding, transcriptions of Japanese scripts according to the
Hepburn system will be shown alongside the original Japanese symbols [7]. Roman letters are called `ローマ字' (`rōmaji') in Japanese. This term is also used when referring to the transcription of Japanese to Roman letters. Except for the long vowels, a full list of transcriptions is given in tables A.1 and A.2. Long vowels such as `aa' or `ee' are indicated by a horizontal line over the Roman letter, i.e. `ā' or `ē'.
What makes the analysis, and thus also the comprehension, of Japanese writing
such a challenging task can be demonstrated quite readily with a short example:
[Japanese original omitted] (`1998 nen-no WHO-no kaigi-de wa, daiokushin-no ichi nichi atari-no kyoyōsesshukijunryō-ga, jūrai-no taijū ichi kiroguramu atari 10 pikoguramu-kara ichi-yon pikoguramu (pikoguramu-wa ichi oku bun-no ichi) ni hikisagerareta.') (Nitsu and Sato 2003, p. 27)
The most striking difference to western languages like English or German is the lack of
whitespace characters such as a blank. Judging from its frequency and distribution in
the sample text above, one might suppose that the character `の' (`no') is the functional
equivalent of a blank. However, this is not the case: It is a particle marking the genitive
case.
Other characters look more familiar to non-Japanese speakers, though: There are Roman letters (`WHO') and Arabic numerals (`1998', `10', etc.). Furthermore, the characters `。' and `、' look very similar to the period `.' and comma `,' punctuation marks in English.
As a matter of fact, the sample text above contains five of the six commonly used
scripts in Japanese writing: hiragana, katakana, kanji, Roman letters and Arabic numerals.
Lacking are the also commonly used Roman numerals. All these scripts are used freely in
combination.
Though the traditional notation of Japanese is top-to-bottom and right-to-left, the use of the left-to-right, top-to-bottom notation has increased, especially in electronic communications or data storage.
kanji are often referred to as `Chinese characters', which is not completely accurate.
Being the oldest Japanese script, it developed from the Chinese script brought to Japan
about 1500 years ago. Since that time it has undergone significant modifications, chiefly
simplifications of their shapes and strokes, and limitation of the number of kanji used in daily writing. As Chinese underwent similar though different modifications, it seems more fitting to consider present-day Chinese characters and Japanese kanji to be cousins.

[7] Sometimes no transcriptions are shown. This is usually the case if no reading exists, i.e. it is not a member of the natural language, or not enough space is available.
Originally, Chinese characters were used to write Chinese, which led to the introduction of vast numbers of Japanized approximations of Chinese words into the Japanese language. Naturally, they were written in their original Chinese characters. For example, the Chinese morpheme for `water' (pronounced `shui') was borrowed as `sui' and written with its regular character `水'. Therefore, from the viewpoint of the kanji, `水' has the Sino-Japanese reading, or on-reading, `sui'. As readings from different epochs and different regions of China were assigned to the kanji, each of them can have multiple on-readings.
At the same time, the characters were extended to represent Japanese morphemes as well. In this case, `水' was also used to represent the native morpheme `mizu' (`(cold) water'). As with the on-readings, there can be multiple Japanese readings, or kun-readings, assigned to a kanji.
Finally, kanji can also be read phonetically: The on or kun sound of the character is used to represent a Japanese syllable, while the meaning is abstracted away. Over time, this led to the development of the kana syllabaries hiragana and katakana. In the modern Japanese writing system, the phonetic use of kanji is restricted largely to certain names.
Typically, one to eight kanji form a lexical unit, potentially augmented by the
other Japanese scripts. Excluding the few punctuation marks, there are no rules on how to
segment text and extract its word forms, so the reader must make use of his knowledge of
vocabulary to deduce the most likely segmentation. Since usually several segmentations
are possible, this means jumping back and forth to rule out unlikely combinations.
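To make this ambiguity concrete, consider a reader who does possess a vocabulary: even then, every way of covering a string with known word forms must be considered. The following sketch enumerates all such analyses over a toy Latin-alphabet vocabulary; both the vocabulary and this dictionary-driven procedure are purely illustrative, and the latter is exactly what the present work must do without:

    def segmentations(text: str, vocab: set) -> list:
        """Enumerate every way to cover `text` with items from `vocab`."""
        if not text:
            return [[]]                 # exactly one way to segment nothing
        results = []
        for i in range(1, len(text) + 1):
            head = text[:i]
            if head in vocab:           # a known word form starts the string
                for rest in segmentations(text[i:], vocab):
                    results.append([head] + rest)
        return results

    # Even a three-symbol string over a five-item vocabulary is ambiguous:
    print(segmentations("abc", {"a", "ab", "b", "c", "bc"}))
    # [['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c']]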
Surveys indicate that about 4000-5000 kanji are in current use. Various official recommendations have been made restricting the number of kanji approved for official use in government publications, education and the mass media. There are currently 1945 sanctioned `general use kanji' (jōyō kanji), and an additional 284 kanji are authorized for use in personal names.
Kanji are most commonly used to write nouns and the stems of verbs and adjectives. Furthermore, Japanese names are usually written in kanji, though recently the use of hiragana and katakana has increased.
hiragana is a sound-based syllabic script. When they adapted the Chinese script for their own language, the Japanese faced certain difficulties based on the different grammatical structures of the two languages: While Chinese is an uninflected language, Japanese has both inflections and a large number of grammatical particles. Since the characters were not suited to represent such elements, they were used purely phonetically. For example, `ru' was conventionally represented phonetically by characters such as `留' (`stop, remain') and `流' (`flow'). Over time, simplified versions which require fewer strokes developed, e.g. `る'.
Modern-day hiragana consists of 46 basic symbols, though variants may inflate this number to 82. The full list is given in table A.1, along with the Hepburn romanizations.
Generally, the primary function of hiragana is to write grammatical elements. This includes prefixes, suffixes, particles, demonstrative words, the copula, grammatical nouns, common verbs and Sino-Japanese items whose kanji have been proscribed from general use.
The role of hiragana, as that of katakana, is slowly changing, though: Just a few decades ago the ratio of kanji to kana was roughly 70:30 (Eschbach-Szabo 2002, p. 312). Now it is about 30:70, and the most frequently used kana are hiragana. The cause of this development is the sheer number and complexity of the kanji: There are at most 82 different hiragana and 83 different katakana symbols (see tables A.1 and A.2), but several thousand kanji. Therefore, more and more commonly used words are written in kana instead of kanji.
Theoretically, Japanese can be written entirely in kana, hiragana as well as katakana, and books for young children are written this way. Paradoxically, Japanese written in this form is often much more difficult to read, since the vocabulary is then the sole source of information about how to segment the text into meaningful units. But even if the full required vocabulary is known, a certain ambiguity remains: Due to its homophonous nature, Japanese words written in kana can often have various meanings, e.g. `kumo' (`くも') can refer to `spider' (`蜘蛛') as well as to `cloud' (`雲'). Therefore, it is unlikely that kana, or any other kind of script, will replace kanji anytime soon.
katakana were often derived from the same kanji as hiragana. They developed from diverse systems of priestly shorthand that aided the reading of Chinese texts and Buddhist scriptures by supplying, in the form of abbreviated kanji strokes, the Japanese particles and endings missing in Chinese. Furthermore, they were also used to denote the phonetics of words.
One of the words written in katakana in the introductory example above is `ダイオキシン' (`daiokishin'), `dioxin'. For a full list see table A.2, where the katakana are shown along with their Hepburn romanizations. Their appearance is stiffer and more angular than that of their hiragana counterparts. katakana are often compared to the use of italics in printed Western languages, i.e. they are used for items which are in some way unusual or for some particular special effect such as emphasis. Thus their primary use is in representing loanwords other than those from classical Chinese, particularly from English and other European languages, in writing names other than Chinese or Korean ones, and in writing onomatopoeic words. Since other languages have more and different sounds, more combinations of katakana than of hiragana are admissible in order to approximate the foreign sounds. Finally, katakana may also be used to avoid complicated kanji or to make long stretches of hiragana and kanji easier to read, as they aid in visually segmenting them.
Besides the Japanese scripts, Japanese writing in general comprises three further scripts: Roman letters, Arabic numerals and Roman numerals.
Roman letters comprise the letters from `A' to `Z', uppercase as well as lowercase, though the latter are rarely used. Roman letters are regularly used for abbreviations and acronyms, which are commonly based on English and thus constitute a special group of loanwords. It is important to note that these are often only based on English vocabulary and are not genuinely English words. Many of them have been coined in Japan (e.g. `skinship') and incorporated into the Japanese language.
Though the kanji script has its own numerals, their use is largely restricted to traditional vertical writing. In horizontal writing, Arabic numerals are the rule, and they are frequently encountered in vertical writing as well. In comparison, Roman numerals have a much more restricted role: They are usually employed to denote order, i.e. to number chapters or sections in books. They could be considered part of the set of Roman letters; but since they are kept distinct in the literature, they are listed separately here as well.
1.2.4.3 Problems and challenges
As noted above, there is no spacing between word forms to aid in the segmentation of a text. Japanese has various punctuation marks, but except for the equivalents of the period (`。', `maru'), the comma (`、', `ten') and the quotation brackets (`「...」', `『...』'), they are used rarely. Comma- and period-type punctuation marks indicate phrase and sentence divisions, but within these units the symbols of the various scripts follow each other without any whitespace or separators.
Hence, the first task when dealing with the Japanese language is to find some way to segment a given text into tokens of a size suitable for further analysis. But this task, quite easy in languages like German or English, is very difficult here. Since there are no whitespace characters, their use as syntactic separators to tokenize the text is not an option. Thus the common approach is to employ a dictionary comprising the vocabulary, plus a set of grammatical rules, to extract word forms. But as this work's purpose is to use as little information as possible, this avenue is barred.[8]
[8] On a side note, morphological parsers such as chasen, which use dictionaries and grammatical rules to parse Japanese text, have not yet shown themselves to be well suited to the task of text segmentation. Though they are useful and achieve acceptable results, I have not yet seen a non-trivial text longer than a page which was segmented without error.
Still, it is possible to make use of the punctuation marks, namely the period and the comma. However, experiments show that the resulting tokens can still be as long as a hundred symbols. Thus this can only be a starting point; afterwards, different techniques, chiefly based on statistical analysis, are required.
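A minimal sketch of this pre-segmentation step, assuming the corpus is available as a single Unicode string and splitting only at the maru and ten introduced above (the function name is illustrative):

    import re

    def presegment(corpus: str) -> list:
        """Split a raw corpus at the period- and comma-type marks only."""
        return [chunk for chunk in re.split("[。、]", corpus) if chunk]

    # The resulting chunks are guaranteed to end at phrase or sentence
    # boundaries, but they may still be up to a hundred symbols long.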
Statistical analysis needs to deal with the very core of Japanese writing: orthographic variation. There is no standardized orthography in Japanese, and the Japanese and non-Japanese scripts are used freely in combination wherever the writer sees fit. As explained above, there is not only one Japanese script but three: kanji, hiragana and katakana. The kana, that is hiragana and katakana, may be used as a replacement for kanji or in their own right. For example, the kanji compound `水圧' (`suiatsu'), `water pressure', can also be written in hiragana as `すいあつ', or in katakana as `スイアツ'. Though it is not customary, one may also mix the scripts and write `すい圧', `水あつ' etc.
This becomes even more complicated with word forms which comprise more than one script. For example, `取る' (`toru'), `to take', or `アメリカ人' (`amerikajin'), `American (person)', could also be represented as `とる' and `あめりかじん', or `トル' and `アメリカじん', respectively. The acceptable but uncustomary kanji variant of the last one, `亜米利加人', could also be encountered. Of course, combining non-Japanese scripts with Japanese scripts is also possible: `AMERIKA人', `AMERIKAJIN', `AmerikaJin', `amerikajin', `AMERIKAじん' etc. are also valid.
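At least the variation between the two kana syllabaries can be neutralized cheaply: in Unicode, each katakana letter lies exactly 0x60 code points above its hiragana counterpart. The following sketch folds katakana onto hiragana; whether and when to apply such a normalization is a design decision rather than part of the method developed here, and it leaves kanji variants such as `水圧' untouched:

    def fold_kana(text: str) -> str:
        """Map katakana letters onto their hiragana counterparts (offset 0x60)."""
        return "".join(
            chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
            for ch in text
        )

    # `スイアツ' and `すいあつ' now coincide, raising the variant's frequency:
    print(fold_kana("スイアツ") == "すいあつ")    # True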
It might be argued that, from a purely syntactic point of view, these objections are invalid, as the various ways of writing can be considered variants or word forms and deemed acceptable in their own right. This perspective, though understandable, ignores the practical problems of a statistical analysis. For instance, having many different variants reduces the frequency of a word form and thus the likelihood of recognizing it as a distinct unit; such a unit in turn, by further segmenting a given text, might be used to discover further units. The following example illustrates the problem:
Example (Japanese original in mixed kanji and kana; transcription): `basu-ni notta onna-ga nokoru. basu-ni notta no wa ureshī kara.'
The sentence above means `Because she is happy that she got on the bus, the woman who got on the bus stays behind.' Here attention is directed towards the two variant writings of `notta' (`got on'): once with the kanji stem, `乗った', and once purely in hiragana, `のった'. A simple statistical analysis based on term frequency would find the recurring strings, such as `basu-ni' and the shared ending `-tta', to have a higher than average occurrence and thus consider them word form candidates. This could then lead to the isolation of the remaining `no' and `tta', making them word form candidates on their own. Since this was already an incorrect segmentation, further incorrect tokens could be created, which in turn would lead to further incorrect segmentations etc. Note that this statistical analysis approach is overly simplistic, but the example shows what difficulties the lack of a standardized orthography can induce.
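The pitfall can be reproduced with the simplest conceivable analysis: counting all substrings up to some length and proposing the most frequent ones as word form candidates. The sketch below is exactly this naive counter; the length bound and any frequency threshold applied to its output are arbitrary illustration parameters, not values used by the method developed in this work:

    from collections import Counter

    def substring_counts(corpus: str, max_len: int = 4) -> Counter:
        """Count every substring of 1..max_len symbols in the corpus."""
        counts = Counter()
        for i in range(len(corpus)):
            for j in range(i + 1, min(i + max_len, len(corpus)) + 1):
                counts[corpus[i:j]] += 1
        return counts

    # With variant spellings, the occurrences of one word form are split
    # across several strings, so none of them may pass a candidate
    # threshold, while fragments shared by the variants pass instead.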
Besides these effects on frequency statistics, particles prove to be a further hindrance to a successful automatic segmentation. As mentioned before, `の' can function as a particle marking the genitive case. But it can also function as a nominalizer, or be part of a name or of an inflection. The same holds true for `から' (`kara'), a sentence-final particle denoting causality: For example, `かかる' (`kakaru'), `to need', is conjugated to `かからない' (`kakaranai'), `to not need', which contains `から'.
As Japanese is an agglutinative language, it makes extensive use of suffixes. Theoretically, these should be helpful for analysis, e.g. for the classification of verbs according to their tenses or their potential inflections. However, the lack of whitespace makes it hard to identify the beginning and the end of a word form. For instance, `ない' (`nai') is a suffix denoting negation. But it can also mean `within' or be the first part of the following word form.
In view of the examples above, it might be thought that concentrating on the kanji would solve most of these problems. However, this procedure discards a significant percentage of Japanese script and effectively limits the analysis to nouns. But even if this were an option for this work, not only does it not solve all problems, it even introduces new ones. For example, the phrase `アメリカ人見た' (`amerikajin mita'), `I saw an American', is reduced to `人見', a combination which does not exist in the Japanese vocabulary. The consequences are the same as for the wrong segmentations above.
Finally, concentrating on the kanji does not solve problems arising from composition and abbreviation in Japanese. For example, the compound `繰り返す' (`kurikaesu'), `to repeat (again and again)', contains the hiragana characters `り' (`ri') and `す' (`su'). Ignoring them results in `繰返', a term which is not in the Japanese dictionary. This is a problem similar to the example of `アメリカ人見た' (`amerikajin mita') above. Likewise, `外国銀行' (`gaikokuginkō'), `foreign bank', may be abbreviated (`gaikō'), a term which will most likely not be encountered outside of the text passage for which it is defined.
1.2.4.4 Summary
Japanese has a highly complex script and differs greatly in style and grammar from the other candidate languages. The characteristics and problems outlined above make this language extraordinarily hard to analyze with an automatic analysis system, and this was the reason why it was selected for this work.