On the Algorithmic Tractability of Single Nucleotide Polymorphism (SNP) Analysis and Related Problems
©2003
Diploma thesis
134 pages
Abstract:
This work brings together two areas of science--biology and informatics--that have only recently been connected in the emerging (and vastly growing) research field of bioinformatics. In order to achieve a common basis for Parts 2 and 3 of this work, Part 1 intends to introduce the computer scientist to the relevant biological background and terminology (Chapter 2), and to familiarize the biologist with the relevant topics from theoretical computer science (Chapter 3).
Chapter 2 first introduces some terminology from the field of genetics, thereby defining SNPs. We then motivate the analysis of SNPs by two applications: the analysis of evolutionary development and the field of pharmacogenetics. The field of pharmacogenetics in particular could have an enormous impact on medicine and the pharmaceutical industry in the near future by using SNP data to predict the efficacy of medication.
Chapter 3 gives a brief introduction to the field of computational complexity. We will see and motivate how algorithms are analyzed in theoretical computer science. This will lead to the definition of complexity classes, introducing the class NP which includes computationally hard problems. Some of the hard problems in the class NP can be solved efficiently using the tool of fixed-parameter tractability, introduced at the end of this chapter.
An important application of SNP data is in the analysis of the evolutionary history of species development (phylogenetic analysis; Part 2, Chapters 4 and 5). As will be made plausible in Chapter 5, using SNP data is--in many ways--superior to previous approaches of phylogenetic analysis. In order to analyze the development of species using SNP data, an underlying model of evolution must be specified. A popular model is the so-called perfect phylogeny, but the construction of this phylogeny is a computationally hard problem when there are inconsistencies (such as read-errors or an imperfect fit to the model of perfect phylogeny) in the underlying data.
Chapter 4 analyzes the problem of forbidden submatrix removal, which is closely connected to constructing perfect phylogenies--we will see in Chapter 5 that its computational complexity is directly related to that of constructing a perfect phylogeny from data which is partially erroneous. In this chapter, we analyze the algorithmic tractability of forbidden submatrix removal, characterizing cases where this problem is NP-complete (being fixed-parameter tractable in general).
Wernicke, Sebastian: On the Algorithmic Tractability of Single Nucleotide Polymorphism
(SNP) Analysis and Related Problems
Hamburg: Diplomica GmbH, 2003
Also: Eberhard-Karls-Universität Tübingen, diploma thesis, 2003
This work is protected by copyright. All rights deriving from this, in particular the rights of translation, reprinting, public presentation, extraction of figures and tables, broadcasting, microfilming or reproduction by other means, and storage in data processing systems, are reserved, even if only extracts are used. Any reproduction of this work or of parts of this work is, even in individual cases, permitted only within the limits of the provisions of the copyright law of the Federal Republic of Germany in its currently valid version and is, as a rule, subject to remuneration. Violations are subject to the penal provisions of copyright law.
The use of common names, trade names, product designations, and the like in this work does not justify the assumption, even without special marking, that such names are to be regarded as free in the sense of trademark and brand protection legislation and may therefore be used by anyone.
The information in this work has been compiled with care. Nevertheless, errors cannot be ruled out completely, and neither the Diplomarbeiten Agentur nor the authors or translators assume any legal responsibility or liability for any remaining incorrect statements and their consequences.
Diplomica GmbH
http://www.diplom.de, Hamburg 2003
Printed in Germany
Contents
1 Introduction .......... 1
1.1 The Human Genome and SNPs .......... 1
1.2 Overview of this Work .......... 2
2 Biological Background and Motivation .......... 4
2.1 Basic Genetic Terminology .......... 4
2.2 An Introduction to SNPs .......... 6
2.3 Importance and Prospects of SNP Mapping .......... 8
2.3.1 SNPs in the Study of Population History .......... 8
2.3.2 SNPs and Pharmacogenetics .......... 10
3 Computer Science Preliminaries and Notation .......... 14
3.1 Notation for Matrices and Graphs .......... 14
3.2 Crash Course in Computational Complexity Theory .......... 15
3.2.1 Machine-Independent Analysis .......... 15
3.2.2 Running Time--Keeping Score .......... 17
3.2.3 Complexity Classes .......... 21
3.3 Fixed-Parameter Tractability (FPT) .......... 23
3.3.1 An Efficient Algorithm for Vertex Cover .......... 23
3.3.2 Formal Definition and Aspects of FPT .......... 26
4 Submatrix Removal Problems .......... 29
4.1 Definitions and Terminology .......... 29
4.2 A Reduction to d-Hitting Set .......... 33
4.2.1 Finding Forbidden Submatrices .......... 33
4.2.2 Approximability and Fixed-Parameter Tractability Results .......... 35
4.3 Hardness Results .......... 37
4.3.1 Overview of Results--Four Theorems .......... 37
4.3.2 Proofs for Theorems 4.13 and 4.14 .......... 38
4.3.3 Proof of Theorem 4.11 .......... 45
4.3.4 Proof of Theorem 4.12 .......... 47
4.4 Discussion and Future Extensions .......... 51
5 Perfect Phylogeny Problems .......... 53
5.1 Phylogenetic Trees .......... 53
5.1.1 Introduction and Motivation .......... 53
5.1.2 Formal Definition .......... 54
5.2 Perfect Phylogeny Problems .......... 56
5.3 Relation to Forbidden Submatrix Problems .......... 58
5.4 Minimum Species Removal .......... 60
5.5 Minimum Character Removal .......... 64
6 Graph Bipartization .......... 67
6.1 Introduction and Known Results .......... 67
6.2 Reducing Edge Bipartization to Vertex Bipartization .......... 70
6.3 A Branch&Bound Approach .......... 73
6.3.1 Initial Heuristics .......... 74
6.3.2 Data Reduction Rules .......... 77
6.4 Implementation and Comparison of the Algorithms .......... 87
6.4.1 Using the Program .......... 87
6.4.2 Some Implementation Details .......... 88
6.4.3 Tests and Test Results .......... 90
7 Using Graph Bipartization in SNP Analysis .......... 96
7.1 Introduction and Overview of Results .......... 96
7.2 SNP Haplotype Assembly .......... 97
7.3 Inferring Haplotypes from Genotypes .......... 100
7.3.1 Minimum Genotype Removal .......... 106
7.3.2 Minimum Site Removal .......... 109
7.4 Testing Branch&Bound on SNP Data .......... 112
8 Conclusion .......... 115
8.1 Summary of Results and Future Extensions .......... 115
8.2 Acknowledgements .......... 117
List of Figures .......... 119
Bibliography .......... 120
Chapter 1
Introduction
1.1 The Human Genome and SNPs
Throughout its life, an individual's hereditary potentials and limits are determined by its
very own genes. Consequently, a lot of effort has been put into the Human Genome Project.
As of April 2003, 95.8% of the human genome has been sequenced at very high quality (see [NCBI03] for up-to-date information), and a goal has been set of obtaining a complete sequence by the end of this year as more and more chromosomes become fully mapped (see, e.g., [Heil03]). However--quoting from [WeHu02]--this achievement is merely the foundation
for far deeper research as
". . . we are ending the era of determining the sequence of the genetic code and
entering the beginning of the age of deciphering the biology of life underlying that
code."
Hearing that 95.8% of the human genome has been sequenced, one is immediately bound to ask which human's genome we are actually referring to--after all, every human has a unique genetic makeup. At the beginning of the sequencing process by the Human Genome Project, geneticists thought they would indeed have to make a choice as to who would be chosen to provide a reference sequence of the four nucleotides A, C, G, and T of his DNA.[1] One individual would surely constitute a "blueprint" of the human species; the study of genetic variation among our species would, however, gain little insight from this [Chak01].
Fortunately, the sequence obtained by the Human Genome Project was not obtained through making that kind of choice: genetic variation among humans can--in almost every case--be traced back to variations that occur within a single nucleotide. Such a site, where two different nucleotides are found in two different DNAs, is commonly referred to as a Single Nucleotide Polymorphism (SNP, pronounced "snip") [HGSC01]. A simple definition is given by [Ston01]:
". . . DNA is a linear combination of four nucleotides; compare two sequences,
position by position, and wherever you come across different nucleotides at the
same position, that's a SNP."
[1] A thorough introduction to genetic terminology, including SNPs, is given in Chapter 2.
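The position-by-position comparison described in the quote above is easy to make concrete. The following Python sketch is an illustration added here, not part of the original thesis; the function name find_snps and the toy sequences are hypothetical, and it assumes two already aligned sequences of equal length.

```python
# Illustrative sketch (not from the thesis): finding SNP sites by comparing two
# aligned DNA sequences position by position. Names are hypothetical examples.

def find_snps(seq_a, seq_b):
    """Return the 0-based positions at which the two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

if __name__ == "__main__":
    # Toy example; real SNP detection additionally requires that a variation
    # occurs in at least 1% of a population before it is called a polymorphism.
    print(find_snps("ACCGATATCGGA", "ACCGTTATCGGT"))  # -> [4, 11]
```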
During the Human Genome Project, 1.4 million sites of genetic variation have been mapped [SNP01] along a reference sequence composed of hundreds of different genomes. The reason why it was possible to combine such a multitude of different individuals' genomes into a coherent map of the human genome lies in the fact that DNA is mostly conserved around SNPs. The importance of SNPs is outlined, e.g., in [Ston01], who refers to them as the "bread and butter of DNA sequence variation" for they are witnesses of unique past mutations in our genetic makeup. Therefore, SNPs can give valuable hints about common evolutionary ancestors. But there is an even greater economic impact: as genes are widely held responsible for the likelihood of acquiring certain diseases and for the responsiveness to various medical treatments, SNPs can either be made directly responsible for such a variation or they may at least aid in the identification of the corresponding gene.[2]
In this work, we shall deal with topics from the field of theoretical bioinformatics that are connected to SNPs. Recent research [LBILS01, EHK03] has shown that all the useful applications and prospects of SNP data come at a price: Many computational problems arising during the acquisition and application of SNP data have been proven to be computationally "hard", meaning that they are widely believed to be impossible to solve in reasonable time.[3]
However, there are techniques such as fixed-parameter tractability and data-reduction (both
to be introduced in more detail throughout this work) that allow even "hard" problems to
be solved efficiently in practical applications. This work explores the possible use of these
techniques for computationally hard problems connected to SNP analysis.
1.2 Overview of this Work
The main part of this work (Chapters 2 to 7) can be divided into three parts:
· Part 1 (Chapters 2 and 3): Introducing the Terminology
This work brings together two areas of science--biology and informatics--that have
only recently been connected in the emerging (and vastly growing) research field of
bioinformatics. In order to achieve a common basis for Parts 2 and 3 of this work,
Part 1 intends to introduce the computer scientist to the relevant biological background
and terminology (Chapter 2), and to familiarize the biologist with the relevant topics
from theoretical computer science (Chapter 3).
Chapter 2 first introduces some terminology from the field of genetics, thereby defining
SNPs. We then motivate the analysis of SNPs by two applications: the analysis of evolutionary development and the field of pharmacogenetics. The field of pharmacogenetics in particular could have an enormous impact on medicine and the pharmaceutical industry in the near future by using SNP data to predict the efficacy of medication.
Chapter 3 gives a brief introduction to the field of computational complexity. We will
see and motivate how algorithms are analyzed in theoretical computer science. This
will lead to the definition of "complexity classes", introducing the class NP which
includes computationally hard problems. Some of the hard problems in the class NP
can be solved efficiently using the tool of fixed-parameter tractability, introduced at
the end of this chapter.
[2] Section 2.3 gives a detailed introduction to the prospects of SNP mapping and analysis.
[3] Chapter 3 introduces the topic of computational hardness in more detail.
· Part 2 (Chapters 4 and 5): Applying SNP Data (Perfect Phylogenies)
An important application of SNP data is in the analysis of the evolutionary history of
species development (phylogenetic analysis). As will be made plausible in Chapter 5,
using SNP data is--in many ways--superior to previous approaches of phylogenetic
analysis. In order to analyze the development of species using SNP data, an under-
lying model of evolution must be specified. A popular model is the so-called perfect
phylogeny, but the construction of this phylogeny is a computationally hard problem
when there are inconsistencies (such as read-errors or an imperfect fit to the model of
perfect phylogeny) in the underlying data.
Chapter 4 analyzes the problem of "forbidden submatrix removal" which is closely
connected to constructing perfect phylogenies--we will see in Chapter 5 that its com-
putational complexity is directly related to that of constructing a perfect phylogeny
from data which is partially erroneous. In this chapter, we analyze the algorithmic
tractability of "forbidden submatrix removal", characterizing cases where this problem
is NP-complete (being fixed-parameter tractable in general).
Chapter 5 introduces the concept, motivation, and some known results for phylogenetic
analysis. We then apply the results from Chapter 4 to perfect phylogeny problems,
i.e., the problem of dealing with data-inconsistencies with respect to the underlying
evolutionary model of perfect phylogeny. It will be shown that these problems are all
fixed-parameter tractable and can be efficiently solved using existing algorithms.
· Part 3 (Chapters 6 and 7): Obtaining SNP Data
Basically, obtaining SNP data requires sequencing two DNA strands and comparing
them to each other. The problems lie in the details: Firstly, current techniques only
allow sequences of at most 500 base pairs in length to be sequenced as a whole, and
secondly, it is--in terms of cost and labor--often only possible to detect the presence
of SNP sites rather than being able to tell which of the two DNAs contained which
base. Part 3 of this work analyzes the computational complexity of these two problems
by relating them to a graph-theoretic problem[4] called Graph Bipartization.
Chapter 6 introduces the computationally hard problem of Graph Bipartization,
stating some known results and showing the relative hardness of the two Graph
Bipartization-problem variants Edge Bipartization and Vertex Bipartization
(the latter one of which is proven to be at least as hard as the former one). Following
this introduction, we develop and test practical algorithms for Graph Bipartization.
These algorithms--although they require a long time even for medium-sized general
graphs--prove to be efficient for the Graph Bipartization problems that arise during
the acquisition of SNPs, even if these graphs contain a few hundred vertices.
Chapter 7 introduces a formal definition of the computational problems of SNP analysis
and proves their close relationship to Graph Bipartization. The last section of this
chapter shows that the algorithms developed in Chapter 6 can be used to efficiently
solve the presented problems by solving their corresponding Graph Bipartization
problem.
This work is concluded by Chapter 8, presenting a summary of results and suggestions for
future research related to this work.
[4] Graphs are introduced in Chapter 3.
Chapter 2
Biological Background and Motivation
In this chapter, we establish some basic terminology from the field of genetics used throughout this work.[1] Afterwards, we introduce SNPs and current techniques used to detect and
map them. The last section of this chapter provides an introduction to pharmacogenetics--
the area that sparked economic and scientific interest in SNPs.
2.1 Basic Genetic Terminology
All living organisms encode their genetic information in the form of deoxyribonucleic acid
(DNA, for short). DNA is a double-helix polymer where each strand is a long chain of
polymerized monomer nucleotides.[2] Basically, there are four different nucleotides: to a deoxyribose sugar bonded with a phosphate, one of four possible bases is attached to form a nucleotide; these possible bases are the two purines (adenine and guanine) and the two pyrimidines (cytosine and thymine).[3] For abbreviation, the nucleotides in a strand of DNA
are denoted by the first letter of their respective base (adenine by A, guanine by G, cytosine
by C, and thymine by T). The nucleotides are joined to form a single strand of DNA by covalently[4] bonding a phosphate of one nucleotide with the sugar of the next; a strand starts with a sugar (this end is called the 3'-end) and ends with a phosphate (called the 5'-end).[5] Two single strands of DNA are held together by hydrogen bonds between the bases of opposing nucleotides in the double strand. These bonds specifically bind adenine with thymine and cytosine with guanine. The structure of DNA is shown in Figure 2.1.
[1] For a more thorough introduction to genetics, see, e.g., [GMS00], [GGL02], or the chapters on genetics in biochemistry books such as [VoVo95] or [BJS02].
[2] In general, polymer designates the class of very large molecules (macromolecules) that are multiples of simpler chemical units called monomers.
[3] Actually, DNA has a much less homogeneous buildup than this because the nucleotides can be further modified by an organism at their deoxyribose sugar. For example, bacteria use such a modification to be able to distinguish their own DNA from foreign DNA coming, e.g., from a phage (bacterial virus). However, such modifications will not have to concern us in this work because even in their presence, the basic principles of DNA replication and translation hold.
[4] A covalent bond is the interatomic linkage that results from two atoms forming a common electron orbital.
[5] The terms 3' and 5' are due to the enumeration of the carbon atoms in the deoxyribose sugar.
[Figure 2.1: Chemical structure of DNA (above) and its abbreviated notation (below). The letters within the molecular structure stand for phosphate (P), deoxyribose (D), adenine (A), guanine (G), cytosine (C), and thymine (T); dashed vertical lines indicate hydrogen bonds. The abbreviated notation shows the complementary strands 3'--ACCGATATCGGA--5' and 5'--TGGCTATAGCCT--3'.]
Although being just two strands of bases attached to a phosphate-sugar backbone, DNA can be extremely long by molecular measures.[6] In order for the DNA to fit into a single cell[7]
whilst still
being accessible for replication and transcription into RNA to make proteins, human DNA
is organized into very dense complexes of proteins and DNA, called chromosomes. Humans
have 22 pairs of autosomes[8] and one pair of sex chromosomes (with females carrying two "X" chromosomes and men one "X" and one "Y" chromosome) within--almost--each cell.
The complete DNA sequence of an organism is called its genome; its genetic constitution as a whole is called its genotype.[9] Genetic areas of interest in a genome are called loci.[10] A gene is a unit of DNA that encodes hereditary information, i.e., the sequence of all proteins expressed by an organism, on a locus of an individual's chromosome.[11] Any one of two
or more genes that may occur alternatively at a given locus on a chromosome is called an
allele. A combination of alleles that is likely to be inherited as a whole and may be found on
one chromosome is called a haplotype. The sequence of DNA within a gene determines the synthesis of proteins; experiments indicate that each gene is responsible for the synthesis of one protein. Each one of the 20 proteinogenic amino acids[12] is encoded by one or more triplets of bases. Mutations, i.e., disruptions altering the genetic information (and therefore in most cases the corresponding protein as well), may be due to deleting, inserting, replacing, or rearranging nucleotides in a gene; they are responsible for the unique individual genetic makeup of organisms. As already mentioned above, a human cell contains two copies of every chromosome (excluding the gender-specific chromosomes X and Y), where one copy is inherited from each parent.
[6] E.g., the diploid DNA of a human being has a total length of approximately 1.8 m [VoVo95].
[7] With a few exceptions, every cell in a living organism contains its whole hereditary information.
[8] Autosomes are those chromosomes that control the inheritance of all characteristics except sex-linked ones.
[9] The genotype of an organism is the basis for its phenotype, where phenotype denotes the two or more distinct forms of a characteristic within a population.
[10] "Loci" is the plural form of "locus".
[11] Note that the majority of DNA is presumed not to contain any genetic information [VoVo95].
[12] Proteins are basically chains of polymerized amino acids.
[Figure 2.2: Mapping SNPs by comparison of two individuals' DNA sequences (first strand: 3'--ACAAAACTGAGCACTATTGGATCTACGACTGT--5'; second strand: 3'--ACTATACTCAGCACTCTAGCATCTACGACTCT--5'); positions at which the two strands differ are marked as SNP sites in the SNP map. Note that, as mentioned in the text, a single nucleotide variation must occur in at least 1% of a population's individuals in order to be called a "polymorphism" instead of a "substitution".]
Since each parent has its own unique genetic makeup, equivalent genes in the two chromosomes may differ. Identical alleles on both chromosomes are referred to as being homozygous; different alleles are denoted heterozygous.
2.2 An Introduction to SNPs
A polymorphism is a region of the genome that varies between different individuals.[13] Consequently, a single nucleotide polymorphism (SNP, pronounced "snip") is a genetic variation caused by the change of one single nucleotide (see Figure 2.2). These variations occur quite frequently among humans--on average, a SNP may be found approximately every 1.91 · 10^3 bases ("1.91 kilobases"), implying that over 90% of sequences longer than 20 kilobases will contain a SNP [Chak01]. SNPs are not evenly distributed across chromosomes; most genes contain just one or two SNPs. Currently, 93% of all genes are known to contain at least one SNP [SNP01]. Depending on whether they are found within genes or not, SNPs are labeled either cSNPs (coding SNPs) or ncSNPs (non-coding SNPs). Generally, ncSNPs appear more frequently than cSNPs [Mu02]. Recall from the last section that more than one triplet of bases may encode a certain amino acid. Often, triplets that encode the same amino acid differ in a single nucleotide from each other. If a cSNP does not introduce an amino acid change in the encoded protein, it is named an sSNP (synonymous SNP), and an nsSNP (non-synonymous SNP) otherwise.[14] In the human genome, the ratio of sSNPs to nsSNPs is approximately one to one [Carg99].
Before outlining some prospects and the scientific as well as economic impact of SNP analysis,
we will now give a brief overview as to how SNPs are identified. In his survey on the usage
of SNPs as a tool in human genetics, Gray [GCS00] names four methods for SNP detection:
[13] More precisely, a polymorphism has been defined as "the least common allele occurring in 1% or greater of a population" [Mare97], thereby distinguishing a polymorphism from a substitution that may occur in less than 1% of a population's individuals.
[14] For example, there is an nsSNP in the gene for the HLA-H protein, where a crucial disulfide bond is disrupted by changing the 282nd amino acid from cysteine to tyrosine, causing a metabolic disorder known as hereditary hemochromatosis [PSS97].
· Identification of single strand conformation polymorphisms (SSCPs): In this technique,
DNA fragments of a locus containing the presumed SNP are amplified (i.e., multiplied into many identical fragments) using PCR amplification.[15] These fragments are then
put on a polyacrylamide gel to which a current of diluting liquid is applied. Due to
different folding of DNA fragments with different sequences, the speed of fragments
will differ if they contain SNPs. The presence of SNPs may afterwards be confirmed
by sequencing the respective patterns. This method is widely deprecated because of
its low throughput and sometimes poor detection rate of about 70%.
· Heteroduplex Analysis: During PCR amplification of an individual that is heterozy-
gous for a SNP, a heteroduplex[16] may be formed between two strands that are com-
plementary to each other with exception of the SNP site. These heteroduplexes can
then be detected either as a gel band (analogously to SSCP detection) or using high-
performance liquid chromatography (HPLC). This SNP detection method combines
reasonable throughput rates of 10 minutes per sample with a high detection rate be-
tween 95% and 100%.
· Direct DNA sequencing: This is the currently favored high-throughput method for
detecting SNPs. According to [Carg99], almost a million base pairs can be analyzed
in 48 hours with detection rates for heterozygotes ranging between 95% (using cheap
"dye-terminator sequencing") and 100% (using a more expensive and laborious method
known as "dye-primer sequencing"). Dye-terminator sequencing has been used by the
SNP Consortium [Hold02] which published over 1.4 million SNPs in human DNA
[SNP01]. Comparing equal loci in different versions of high-quality DNA sequences
has recently led to an increase of in silico detection of SNPs.
· Variant detector arrays (VDAs): In this technique, glass chips with arrays of oligonu-
cleotides are used to bind specific sequences derived in PCR amplification. VDA has
a quality comparable to dye-terminator sequencing and is especially useful in rapidly
scanning large amounts of DNA sequence.[17]
It should also be stressed that for SNP detection, an appropriate set of alleles from which
SNPs are to be inferred needs to be chosen, as the different alleles occur with quite different
frequencies in different populations (such as human ethnic groups).
In Chapter 7, we will be concerned with the algorithmic tractability of two problems that
arise during the identification of SNPs: First, the sequencing of chromosomes in order to
obtain haplotypes has to deal with some errors in the reading and assembling process of the
DNA sequences (this will be discussed in more detail in Chapter 7). Second, haplotypes
are--due to prohibitively high cost and labor--seldom identified by sequencing single
chromosomes. Rather, genotype information (both copies of a chromosome) is obtained,
from which haplotypes can be inferred under certain assumptions. We will see in Chapter 7
that both problems are closely related (in a certain way even equivalent) to a problem called
"graph bipartization" for which we will develop efficient algorithms in Chapter 6.
[15] The polymerase chain reaction (PCR, for short) can quickly and accurately make numerous identical copies of a specific DNA fragment. A PCR machine is capable of producing billions of copies of a DNA fragment in just a few hours. PCR is a widely used technique in diagnosing genetic diseases, detecting low levels of viral infection, and for genetic fingerprinting in forensic medicine.
[16] A heteroduplex may either be a piece of DNA in which the two strands are different, or it is the product of annealing a piece of mRNA and the corresponding DNA strand.
[17] SNP identification through arrays is a rapidly growing market responsible for the recent development of biotechnology companies such as Affymetrix, Applied Biosystems, Marligen Biosciences, and Orchid Biosciences, among many others.
2.3 Importance and Prospects of SNP Mapping
SNPs are mainly useful for two areas of research: the study of population history (e.g.,
see [BBNE03]) and--an area of great economic significance--pharmacogenetics (e.g., see
[Rose00]). In this section, we will introduce both applications. The algorithmic tractability
of some problems in the study of population history using SNPs is dealt with in Chapter 5.
2.3.1 SNPs in the Study of Population History
SNPs often are a basis for various studies of population history (e.g., see [Tish96], [Tish00],
and [Mu02]). Historically, such studies employed gene trees of non-recombining loci inherited
from one parent such as mitochondrial DNA or the Y chromosome [Avis94]. A disadvantage
of this approach is that such loci are subject to many stochastic parameters, which in turn required vast numbers of loci to be analyzed to gain confidence in the results. The preference for SNPs as genetic markers arose from this problem, as SNPs
provide a broad range of unlinked nuclear genetic markers and are thus able to capture "a
genome-wide picture of population history" [Niel00]. Furthermore, SNPs are advantageous
over previous methods such as using microsatellites[18] for population history studies because
they show a very favorable mutation pattern and greatly simplify the task of unbiased
sampling of genetic variation [BBNE03]:
· Mutation pattern: Microsatellites have a mutation rate of 10^-4 per generation as opposed to the rate of 10^-8 displayed by SNPs. This makes multiple mutations for a single SNP unlikely; therefore, only two alleles exist for most SNPs. Such a property
greatly facilitates populational analysis--e.g., we will make use of it in the algorithmic
inference of haplotypes from genotypes in Chapter 7. Furthermore, mutations in SNPs
are more evenly distributed than in microsatellites, where the mutation rate is often
hard to estimate [BBNE03].
· Unbiased Sampling: Due to their uniform mutation rates, SNPs may be selected at
random in populational studies, avoiding previous bias that arose due to the fact that
often, only loci with well known mutation rates would be chosen for analysis. Fur-
thermore, [BBNE03] suggests that cross-species analysis of SNPs can provide greater
insight into the naturally occurring rate of genome-wide variation than biased loci such
as microsatellites.
SNPs have been of great interest in populational studies due to the phenomenon that often
there is a collection of SNP sites where the individual SNPs are not independent of each
other; rather, a phenomenon called linkage disequilibrium is observed.[19] This phenomenon
refers to the fact that often, haplotype combinations of alleles at different loci are correlated
in their appearance frequency, forming a so-called linkage disequilibrium block (LD block)
[DRSHL01]. The size of LD blocks is often debated and ranges from size suggestions of a few
kilobases (in empirical research [Dunn00] and computer simulation experiments [Krug99])
to more than a hundred thousand kilobases [Abec01].
[18] Microsatellites are stretches of DNA consisting of tandem repeats of a small, simple sequence of nucleotides, e.g., GCTGCTGCT. . . GCT. In the literature, microsatellites are also often referred to as simple tandem repeats (STRs).
[19] For the human genome, linkage disequilibrium was studied, e.g., in [Reic01].
"populational bottleneck"
population narrows
e.g., by selection process
States of first, fifth and seventh SNP site correlate due to
the respective common ancestral haplotype
SNP maps for the individuals of a large initial population:
states of different SNP sites show no correlation
population redevelops
from bottleneck haplotypes
Figure 2.3: Development of linkage disequilibria in SNP sites: An initial, genetically diverse
population is drastically reduced in its number of individuals (e.g., by selection processes
in a unique environment). This causes a "populational bottleneck" where only a very few
different haplotypes remain within a population. Redevelopment of a population based on
this non-diverse genetic material causes linkage disequilibrium in the alleles, which decays
over time due to mutations and recombination.
Independent of the discussed size, the presence of linkage disequilibrium is believed to reflect the fact that haplotypes descended from common ancestral chromosomes; linkage disequilibrium may therefore also be an indicator for populational bottlenecks[20] [Mu02]. The phenomenon of linkage disequilibrium relating to SNPs is illustrated in Figure 2.3.
Linkage disequilibrium of individuals' genes within a population "decays" with population
history due to recombination [HaCl97]. It is believed that linkage disequilibrium around
common alleles is a lot less frequent than around rare alleles, which are generally younger
and thus less decayed by recombination [Watt77]. Using these assumptions, Reich et al.
[Reic01] have shown that they can relate some linkage disequilibria to events such as the
last glacial maximum 30 000-15 000 years ago, migration patterns in ancient Europe, or
the dispersal of anatomically modern humans in Africa.
[20] A populational bottleneck is a period in population history where there are very few individuals in the population. These individuals then gave rise to the haplotypes found in a population today--conserved genetic patterns in the haplotypes can therefore be traced back to the respective ancestral individuals that lived during the bottleneck.
The article mentions that for some populations not as large as Europeans ([Reic01] mentions Yorubans as an example), the
resolution of their linkage disequilibrium blocks is too coarse, but nevertheless--referring to
studies such as [Tish96], [Tish00], or [Mate01]--" [. . . ] simultaneous assessment [of linkage
disequilibria] at multiple regions of the genome provides an approach for studying history
with potentially greater sensitivity to certain aspects of history than traditional methods
[. . . ]" [Reic01]
In addition to enabling deep biological insights into species development and population history, the study of linkage disequilibria and the search for common ancestors of species might also have a high economic and health-policy impact. One example of this is the recent study of the malaria parasite Plasmodium falciparum in [Mu02]; this parasite has been of intense interest since it infects hundreds of millions of people each year, being responsible for almost 3 million annual deaths [Brem01].[21]
An effective vaccine against
malaria, for example, must trigger an immune response that is equivalent or superior to
the one gained by contact with natural antigens. By finding some common SNP regions in
different Plasmodium falciparum populations, it is the hope of current research to build an accurate map of the ancestral relationship of various Plasmodium falciparum strains. Such a
map of ancestral relationships could help in identifying common antigens for immunizations
[Gard98].
It was conjectured in [RLHA98] that the human malaria parasite experienced a populational
bottleneck about 5000 years ago, further sparking the hope that it would be possible to find
some common drug targets among malaria parasites. Although an extensive study of SNPs
on the Plasmodium falciparum genome carried out by Mu et al. [Mu02] has shown that this is probably not the case and Plasmodium falciparum is rather a "quite ancient and
diverse" population (with the most recent common ancestor being a few hundred thousand
years old), it is still hoped that some more recent common ancestors of different strains can
be found in order to obtain an assay of promising drug targets and vaccines:
"For the first time, a wealth of information is available [. . . ] that comprise the
life cycle of the malaria parasite, providing abundant opportunities for the study
of [the . . . ] complex interactions that result in disease." [Gard98]
Although the genetic sequence and the insights gained by SNPs alone are no cure for malaria
and other widespread diseases, they seem to be a promising start.
The study of populational history based on traits of individuals (which--among others--may
be the presence of highly correlated SNP sites) and its algorithmic tractability is studied in
Chapter 5 of this work, where a special model of analysis called perfect phylogeny will be
employed. As will be seen in Chapter 5, SNPs provide very good data for this model due to
their very low mutation rate.
2.3.2 SNPs and Pharmacogenetics
The understanding of SNPs is believed to be a key to the research area known as pharmaco-
genetics. Using SNPs in pharmacogenetics is of immense economic interest to pharmaceu-
tical companies. It has led to the founding and funding (hundreds of millions of US dollars)
of the SNP Consortium [Hold02], a joint effort of major pharmaceutical companies such as
Bayer, Bristol-Myers Squibb, Glaxo Wellcome, Aventis, Novartis, Pfizer, Roche, SmithKline
[21] 2001-2010 has been named the "Malaria Rollback Decade" by the WHO [WHO03] to emphasize efforts being made in limiting the spread of malaria.
Beecham, and Zeneca. Interdisciplinary connections of the SNP Consortium include IBM
and Motorola.
The problem with a lot of drug therapies is the possibility of adverse drug reactions by
patients: Research by Lazarou, Pomeranz, and Corey [LPC98] suggests that, in 1994, such
reactions were responsible for millions of hospitalizations and almost a hundred thousand
deaths. This value is not likely to have improved lately and is hindering the introduction of
new medications that are effective in most patients but pose unbearable risks: For example,
the quite effective anticonvulsant drug Lamictal© by Glaxo Wellcome is only reluctantly prescribed because of a potentially fatal skin rash that arises as a side effect in five percent of all patients taking the drug [Maso99]. The problem of the different effects drugs exert on patients has long been known and studied; already over a hundred years ago, Sir William Osler[22] reflected:
"If it were not for the great variability among individuals, medicine might as well
be a science and not an art." (as cited by [Rose00])
Pharmacogenetics is an area of research that studies how genetic variation influences a patient's responsiveness and responses to drugs (a good introduction to pharmacogenetics is, e.g., [Rose00]), thereby trying to give physicians the possibility of using objective data about a patient's likelihood of reacting to prescribed drugs in a predictive way. The basic idea in
pharmacogenetics is to build a profile of an individual's genetic variations in order to predict
effectiveness and side-effects of drugs. As was discussed above--since genetic variation is
mainly due to SNPs--a hope of pharmacogenetics relies on building an accurate map of
SNP haplotypes.
Roughly speaking, the hope is to identify linkage disequilibrium loci around certain genes that are suspected of causing a certain adverse reaction to drugs. The same technique has
already been applied in the study of individuals' susceptibility to certain complex genetic
diseases such as Alzheimer's disease: In an analysis of polymorphisms on the ApoE gene
locus on chromosome 19, Martin et al. [Mart00] reported the detection of those SNPs in linkage disequilibrium that are associated with Alzheimer's disease. A number of other
studies successfully related the susceptibility for complex genetic diseases such as migraine
with aura, psoriasis, and insulin-independent diabetes mellitus to certain SNPs in linkage
disequilibrium [Rose00]. Now, just as linkage disequilibria can be related to the susceptibility
for diseases, they can also be related to certain drug reactions. Two good examples are patients' reactions to nortriptyline and beta-2-agonists:
· Nortriptyline is a medication against depression which is converted to an inactive
compound within the body by drug metabolizing enzymes called the cytochrome P450
enzymes. Specifically, an enzyme labeled CYP2D6 is key to inactivating nortriptyline and removing the inactivated substance from the body--except in some people that
have variations in their CYP2D6 encoding gene. These variations may lead to two
undesired effects [DeVa94]: People referred to as "ultra metabolizers" have a variation
that causes the synthesis of too much CYP2D6 in their body, thus inactivating so
much nortriptyline that these people are likely to receive insufficient antidepressant
effects from nortriptyline. A more dangerous variation is found in people referred to
as "poor metabolizers" who do not synthesize sufficient amounts of CYP2D6--these
[22] Sir William Osler, Baronet (1849-1919), Canadian physician and professor of medicine who played a key role in the transformation of the curriculum of medical education in the late 19th and early 20th century.
[Figure 2.4: Profiling SNPs in pharmacogenetics: SNP genotype profiles of patients in a clinical trial are split into "effective" and "ineffective" groups and compared to obtain SNP genotype predictors for efficacy. If there is a section of the SNP genotype profile that proves to be different in patients where a drug is effective as opposed to patients where a drug shows no efficacy or undesired side-effects, this region can be used to predict the effectiveness and potential risks due to side effects in a patient before a drug is prescribed.]
· Beta-2-agonists such as albuterol are important to the treatment of asthma. In-
teracting with the beta-2-adrenergic receptors in the lung, they cause the freeing of
airways by inducing muscle relaxation in the lung muscles. A SNP in the gene encoding
the beta-2-adrenergic receptor causes the carriers of one SNP variant to express fewer
of these receptors; these carriers therefore receive little relief of asthma symptoms from a standard dose of albuterol [Ligg97]. Testing for the presence of the specified SNP in patients makes it possible to clearly identify those 45% of the North American population[23] who can only poorly control their asthma with beta-2-agonists.
It is clear that the prospect of being able to predict the efficacy of a drug whilst minimizing the risk of side-effects is of great interest to the pharmaceutical industry, which could then--as Roses proposes in [Rose00]--create efficacy profiles for patients (see Figure 2.4) already in phase II[24] clinical trials of medication.
[23] This figure should be similar for Europeans.
[24] Clinical trials are divided into five steps: Preclinical research includes controlled experiments using a new substance in animals and test tubes and may take several years. Phase I trials are first tests of the investigated drug on humans; doses are gradually increased to ensure safety. Phase II trials gather information about the actual efficacy of a drug. Phase III trials study a drug's effects with respect to gender, age, race, etc.; a successful phase III trial leads to the admission of a drug to the public market. Occasionally, phase IV trials are conducted that are--in principle--phase III trials on an even broader variety of patients.
Abbreviated SNP profiles[25] could be used to record adverse reactions in patients and thus make it possible to detect even the rarest adverse events in advance. Furthermore, the development of new, more effective drugs can be facilitated: the parallel development of drugs targeting specific symptoms becomes easier because patients who do not respond to a certain medication can be profiled in early clinical trial stages. Additionally, the development of medications which are highly effective in only a comparatively small part of a population (e.g., a medication with a 30% response rate) becomes profitable, as they may be specifically prescribed to patients for whom the respective substance will be effective.
Pharmacogenetics relying on SNP linkage analysis seems to be a promising start to replacing
trial-and-error prescriptions with specifically targeted medical therapies.
[25] Abbreviated SNP profiles contain only the SNP information of a patient that is relevant for a drug efficacy prediction. The introduction of abbreviated profiles plays an important role in the discussion about the fear of "individual DNA profiling" because they cannot be traced back to a patient.
Chapter 3
Computer Science Preliminaries and Notation
The first section of this chapter introduces the notation used throughout this work, followed
by a brief introduction to those ideas in computational complexity that are important to this
work. Especially, the last section focuses on fixed-parameter tractability, laying a foundation
for the computational complexity analysis in the following chapters.
3.1 Notation for Matrices and Graphs
Matrices. By an n × m matrix A we are referring to a rectangular arrangement of n · m elements into n rows and m columns. By a_ij we designate the element in A that may be found at the jth position of the ith row. We will use the terms A and (a_ij) synonymously.
Graphs. A graph consists of vertices and edges, where a vertex is an object with a name and other properties (such as a color) and an edge is the connection of two vertices. We will denote a graph G with n vertices and m edges by (V, E) where V = {v_1, . . . , v_n} and E = {e_1, . . . , e_m} ⊆ V × V. In this work, only simple, undirected graphs are considered, meaning
· a vertex cannot be connected to itself by an edge,
· no two vertices may be connected by more than one edge, and
· an edge leading from a vertex u to a vertex w also leads from w to u.
By the term subgraph G' = (V', E') we are referring to a graph with V' ⊆ V and E' ⊆ E ∩ (V' × V'). Given a set V' ⊆ V, the graph G' = (V', E') with E' = E ∩ (V' × V') is called the subgraph induced by V' in G.
A vertex v is said to have degree d--denoted by degree(v) = d--if there are exactly d edges
in G that are adjacent to v.
A path p of length ℓ in a graph G = (V, E) is a sequence of ℓ + 1 distinct vertices v_1 v_2 . . . v_{ℓ+1} in G such that for each 1 ≤ i ≤ ℓ, v_i and v_{i+1} are connected by an edge. A cycle of length ℓ in G is a sequence of ℓ vertices v_1 v_2 . . . v_ℓ v_1 in G such that v_1 . . . v_ℓ is a path in G and {v_ℓ, v_1} ∈ E. We call G = (V, E) a tree if it is connected and contains no cycles; a tree containing a specially designated node[1]--called the root of the tree--is called rooted. Nodes of degree 1 in a rooted tree are called leaves. A graph G = (V, E) is called connected if any two vertices u, v ∈ V are connected by a path in G. A subgraph of G that is maximally connected with respect to its number of vertices is called a connected component.
A graph G = (V, E) is called bipartite if we can divide the set V of vertices into two disjoint subsets V_1 and V_2 such that E contains neither edges between vertices in V_1 nor edges between vertices in V_2. The graph G is called planar if it can be embedded into a (Euclidean) plane without any intersecting edges.
In this work--especially in Chapter 6--we will be using the following set of operations on graphs:
· Subgraph removal. Let V' ⊆ V be a subset of the vertices of G = (V, E). By G \ V' we will denote the subgraph that is induced in G by V \ V'.
· Vertex deletion. Let u be a vertex in a graph G = (V, E). By G \ {u}, we denote the graph that is induced in G by V \ {u}.
· Edge deletion. Let e be an edge in a graph G = (V, E). By G \ {e}, we denote the graph G' = (V, E') with E' = E \ {e}.
A vertex separator in a graph G = (V, E) is a set V' ⊆ V of vertices in G such that G \ V' is not connected. If |V'| = k, we call V' a vertex separator of order k. The definition of an edge separator E' ⊆ E of order k is analogous.
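To make the preceding notation concrete, the following Python sketch (added as an illustration; it is not part of the original thesis, and all class and function names are hypothetical) stores a simple undirected graph as adjacency sets and implements induced subgraphs, vertex and edge deletion, and a breadth-first 2-coloring test for bipartiteness, a notion that becomes central in the Graph Bipartization problems of Chapters 6 and 7.

```python
# Minimal sketch of the graph notation from this section (illustrative only).
from collections import deque

class Graph:
    """Simple undirected graph G = (V, E) stored as adjacency sets."""

    def __init__(self, vertices, edges):
        self.adj = {v: set() for v in vertices}
        for u, w in edges:                      # no self-loops, no multi-edges
            if u != w:
                self.adj[u].add(w)
                self.adj[w].add(u)

    def induced_subgraph(self, vertex_subset):
        """Subgraph induced by V' in G: keep exactly the edges inside V'."""
        vs = set(vertex_subset)
        edges = [(u, w) for u in vs for w in self.adj[u] if w in vs and u < w]
        return Graph(vs, edges)

    def delete_vertex(self, u):
        """G \\ {u}: the subgraph induced by V \\ {u}."""
        return self.induced_subgraph(set(self.adj) - {u})

    def delete_edge(self, u, w):
        """G \\ {e}: same vertex set, edge {u, w} removed."""
        edges = [(a, b) for a in self.adj for b in self.adj[a]
                 if a < b and {a, b} != {u, w}]
        return Graph(set(self.adj), edges)

    def is_bipartite(self):
        """BFS 2-coloring: True iff V splits into V_1, V_2 with no internal edges."""
        color = {}
        for start in self.adj:
            if start in color:
                continue
            color[start] = 0
            queue = deque([start])
            while queue:
                v = queue.popleft()
                for w in self.adj[v]:
                    if w not in color:
                        color[w] = 1 - color[v]
                        queue.append(w)
                    elif color[w] == color[v]:
                        return False
        return True

# A 4-cycle is bipartite; a triangle (odd cycle) is not.
g = Graph("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
print(g.is_bipartite())                                               # True
print(g.delete_edge("a", "b").is_bipartite())                         # True
print(Graph("abc", [("a", "b"), ("b", "c"), ("a", "c")]).is_bipartite())  # False
```

The 2-coloring test runs in time linear in the number of vertices and edges; deciding which vertices or edges to delete so that the test succeeds is the much harder bipartization task studied later in this work.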
3.2 Crash Course in Computational Complexity Theory
Generally speaking, "an algorithm is a procedure to accomplish a specific task. It is the
idea behind any computer program" [Skie98]. The first goal for any algorithm is to be
effective, i.e., to provide correct solutions to a given problem; however, an effective algorithm is of little use if it is not efficient, i.e., if it requires more resources (especially time) to solve a problem than can be provided. Computational complexity theory deals with the amount of resources--the two most important of which are time and memory[2]--required to solve a certain computational problem by an algorithm. A brief introduction to analyzing the time complexity of algorithms is given in [Skie98]; a very thorough treatment of complexity
theory may be found, e.g., in [Papa94]. This section will introduce some basic terminology
from computational complexity theory that will be used throughout this work.
3.2.1 Machine-Independent Analysis
Imagine that we are given an algorithm called A (in any programming language) that solves
a certain problem P when given an input I, called an instance of P. We would now like to
analyze the performance--especially concerning speed--of this algorithm. The most obvious
way to do this would be to run A on a lot of instances of P and measure the time it takes for A
[1] In order to distinguish trees from graphs more easily throughout this work, we will use the term "vertex" for general graphs and the synonymous "node" for vertices in trees.
[2] Since only the notion of time complexity is important for this work, we shall omit space complexity in the following introduction.
to complete its task each time. However, with this approach, we quickly run into a multitude
of problems, the most crucial of which are that the absolute time measured is influenced by the actual machine's computer architecture[3], that absolute time values are only useful for one particular type of machine, that we can seldom test the algorithm on all conceivable instances, and that a purely practical analysis provides no indication about an algorithm's maximal (worst-
case) running time.
Complexity theory tries to avoid these problems arising from a direct machine analysis by
analyzing computational problems in a more formal and mathematical way. This analysis
is machine-independent whilst still trying to incorporate the fundamental workings of a
modern computer. Traditionally, complexity theory relies on the Turing Machine as its
model of computation which is, however, a quite abstract model of computation lacking
any close relationship to modern computers.[4] For this work, we do not require the many special features offered by Turing Machines and shall therefore rely on another model of computation that is sufficiently precise for our analysis, far more intuitive and more closely related to a "real" computer than a Turing Machine.[5]
This model is the RAM model of
computation, which Skiena [Skie98] describes quite vividly as a computer
"where each `simple' operation (+,-,*,=,IF,call) takes exactly 1 time step [and]
loops and subroutines are not considered simple operations. Instead, they are
the composition of many single-step operations. . . . Each memory access takes
exactly one time step, and we have as much memory as we need."
Using this model, the computational time of an algorithm is given by simply counting the
number of time steps it takes the RAM machine to execute it on a given problem instance.
The advantage of the RAM model lies in the fact that it captures the essential behavior of
a modern computer without introducing any fickle parameters such as memory bandwidth,
actual processor speed, and memory access time, just to name a few. The next subsection
demonstrates the usage of this model in the analysis of an algorithm's time complexity.
This, however, requires a last step of formalization: Besides the machine model, the term "problem" needs to be specified.[6]
Computational problems may be stated in many ways, the most important of which--at least
in the context of this work--are decision problems (the output can be just either "Yes" or
"No") and optimization problems (the output is a solution which is minimal/maximal in
some respect). Most of computational complexity theory solely deals with decision problems
because almost any "reasonable" way of stating a problem can be transformed into a decision
problem. [7] Although a whole branch of theoretical computer science--computability--has evolved concerning the existence of decision problems that are undecidable (i.e., not algorithmically
[3] A modern computer's performance is, e.g., influenced by its processor, memory bandwidth, techniques such as pipelining and caching, the operating system, the programming language and its compiler, etc. Due to these many factors it is sometimes even difficult to obtain consistent results on a single, defined machine.
[4] There are many reasons why Turing Machines are nevertheless used in computational complexity theory: For example, requirements such as memory and time are very easy to define for a Turing Machine and can be analyzed with great accuracy. Furthermore, Turing Machines can simulate other machine models--one Turing Machine can, in theory, even simulate an infinite number of Turing Machines. The simulation of an algorithm on the RAM model (which will be introduced shortly) by a Turing Machine requires only polynomially more time than its direct execution on the RAM (the term "polynomially more time" will also be defined more precisely later on in this chapter).
[5] A quite thorough analysis of different machine models can be found in Chapter 2 of [Papa94].
[6] An algorithm is by definition already given in a formal fashion.
[7] E.g., instead of asking "What are all prime numbers between 0 and 280581?", we can solve the decision problems "Is 1 a prime?" ("No"), "Is 2 a prime?" ("Yes"), ..., "Is 280580 a prime?" ("No") separately.
solvable) by computers, we can assume for this work that we are always given decidable
decision problems.
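Mirroring the prime-number example from footnote 7, the following short Python sketch (added here for illustration; the function names and the trial-division test are our own choices) answers an enumeration question by solving one decision problem per candidate:

def is_prime(n):
    # Decision problem: "Is n a prime?" -- answered as a boolean.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def primes_between(lo, hi):
    # Answer "What are all primes between lo and hi?" by repeatedly
    # solving the decision problem "Is x a prime?" for every candidate x.
    return [x for x in range(lo, hi + 1) if is_prime(x)]

print(primes_between(0, 30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]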
Analyzing decision problems is closely related to the fact that in complexity theory, a problem is generally formulated as a language L, and solving the problem means asking (i.e., deciding by an algorithm) whether a given instance I is part of that language. The language L is a subset of Σ* and the instance I is a word in Σ* for an alphabet Σ, where Σ is a finite set of symbols and Σ* is the set of all words that may be generated by concatenation of symbols from Σ, including the empty word ε which contains no symbols at all. [8] Expressing a given problem as a language is--in most practical cases--quite straightforward. [9]
For this work, the computational complexity of solving a problem can be seen as equivalent to the complexity of answering the corresponding decidability question. It should be noted that stating a problem in the form of a language presents this particular problem in a very abstract form: neither an algorithm for solving the problem is given, nor any obvious hint about the time complexity of solving this problem. In order to deal with this, we will introduce the model of complexity classes and reductions later on in this chapter.
For the sake of simplifying the discussion in this work, we will refrain from stating problems in the form of a language. Instead, we will simply assume that the given problems may be stated in the form of a language. Furthermore, instead of asking whether an instance I is in L, we shall directly deal with the object x that I represents (such as a word, number, graph, etc.) [10]. We then call x an instance of the problem P = "I ∈ L?".
3.2.2 Running Time--Keeping Score
We discussed at the beginning of the last subsection that, given an algorithm for a problem P,
knowing how this algorithm performs on certain instances x of P is of little use. Rather,
in order to understand the quality of an algorithm, it is vital to know how it performs on
any conceivable instance x of P and express this performance in an intuitive way. This
is done by introducing three new ideas: analyzing how the running time of algorithms scales with the problem size, distinguishing between worst-, best-, and average-case complexity (with an emphasis on worst-case complexity), and analyzing the scaling of the algorithm in its asymptotic behavior.
The first idea is based on the intuitive observation that an algorithm should generally take longer to run as the presented instance becomes larger. For instance, a graph problem on a general [11] graph consisting of ten vertices should be easier to solve than the same problem for a general graph with a thousand vertices. Therefore, if x is an instance of a problem P, the running time t of an algorithm is expressed as a mathematical function f
[8] For example, if Σ = {0, 1}, then Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, ...}.
[9] E.g., the problem of deciding whether a given number is a prime number would have to be expressed as L_prime = {p ∈ {0, 1}* | p is the binary representation of a prime number} (with Σ = {0, 1}) and then asking "given I ∈ Σ*, is I ∈ L_prime?".
[10] We shall assume for this work that such a representation, i.e., the encoding of x as I, is always a valid one, meaning we do not need to worry about any cases where I does not encode a valid object x.
[11] By "general" we mean that the given graph has no special properties that greatly simplify the solution of the respective problem.
[Figure 3.1: A function f(x) and its bounds in O-notation: from left to right, g(x) is an upper bound (f(x) = O(g(x))), a lower bound (f(x) = Ω(g(x))), and a tight bound (f(x) = Θ(g(x))) on f(x). Note how the respective bounding property of g only needs to be true for all x > x₀.]
of n := |x|, the size of x: [12]

t_algorithm(x) = f(|x|) = f(n)

Given a fixed size n for the input instance x and an algorithm A that runs with x as an input, we distinguish between the best-case, average-case, and worst-case running time:

t_best(A, n) = min_{x : |x| = n} t_A(x),
t_avg(A, n) = ( Σ_{x : |x| = n} t_A(x) ) / |{x | |x| = n}|,
and t_worst(A, n) = max_{x : |x| = n} t_A(x).
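As a small computational illustration of these three quantities (added here; the toy algorithm and the function names are our own, not taken from the thesis), one can enumerate all instances of a fixed size n -- here, all words over {0, 1} of length n -- run a toy algorithm on each of them, and take the minimum, average, and maximum of the measured step counts:

from itertools import product

def scan_steps(word):
    # Toy algorithm: scan the word until the first 1 is found;
    # one time step is charged per inspected symbol.
    steps = 0
    for symbol in word:
        steps += 1
        if symbol == 1:
            break
    return steps

def best_avg_worst(n):
    # Compute t_best, t_avg, and t_worst over all 2^n words of length n.
    times = [scan_steps(word) for word in product((0, 1), repeat=n)]
    return min(times), sum(times) / len(times), max(times)

print(best_avg_worst(10))   # approximately (1, 2.0, 10)

Here the best and average case stay small, while the worst case (the all-zero word) grows linearly with n, illustrating how much the three measures can differ.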
Most of the time, only the worst-case complexity of an algorithm is interesting, since average-case and--especially--best-case complexity provide no guarantee whatsoever about the running time that A might have when presented with an arbitrary instance x. For average-case complexity, e.g., the problem lies in the definition of "average": there may be some problems which are rather easy to solve on many instances, but this is of no use if we should--consciously or not--be dealing just with hard instances during the application of the algorithm. [13] Albeit open to criticism for being too pessimistic, the worst-case complexity of an algorithm seems to be the most useful measure of its performance.
The RAM model of computation introduced in the last subsection provided a way to measure the running time of a given algorithm A exactly, down to a single time unit. This degree of accuracy is not useful: for such exact counts, the function f that measures the running time of A will often become very complicated and unintuitive to analyze. [14] Moreover, as we are interested in the performance of an algorithm on a real-world machine and not on the hypothetical RAM, measuring the running time of A down to the last time unit finds no application. Therefore,
[12] Later on, we will analyze algorithms in more detail using various parameters of the input. For example, the running time of a graph algorithm may depend on the number of edges as well as on the number of vertices in the graph. The number of edges in the graph is lower than |V|^2, but explicitly using the number of edges provides a better analysis. In order to simplify the discussion, however, we will for now assume that there is just a single input size parameter given.
[13] Moreover, the corresponding mathematical analysis of average-case complexity even for simple algorithms, as well as a clear definition of the "average case", is often highly involved and complicated.
[14] Moreover, there are often trivial steps depending on the notation of the algorithm (such as the initialization of variables) that require just a small constant amount of time and are thus not interesting for a general performance analysis.
computational complexity theory, instead of directly analyzing f, analyzes upper and lower bounds of f using the O-Notation (pronounced "Big Oh Notation"):
Definition 3.1 (O-Notation):
Given two functions f : ℝ_{>0} → ℝ_{>0} and g : ℝ_{>0} → ℝ_{>0}. We will say that
· f = O(g) if there exist a constant c ∈ ℝ and an x₀ ∈ ℝ such that for all x > x₀, f(x) ≤ c · g(x) (g is an upper bound for f).
· f = Ω(g) if there exist a constant c ∈ ℝ and an x₀ ∈ ℝ such that for all x > x₀, f(x) ≥ c · g(x) (g is a lower bound for f).
· f = Θ(g) if there exist two constants c₁, c₂ ∈ ℝ with c₁ ≤ c₂ and an x₀ ∈ ℝ such that for all x > x₀, c₁ · g(x) ≤ f(x) ≤ c₂ · g(x) (g is a tight bound for f).
This notation is illustrated by Figure 3.1. [15]
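The concrete function used in footnote 15, f(x) = x^4 + x^2 + x - ln x + 1234, can also be spot-checked numerically; the following small Python sketch (added here for illustration only) samples a few values x ≥ 6 and verifies the bounds x^3 ≤ f(x) ≤ 5 · x^4, which witness f(x) = Ω(x^3) and f(x) = O(x^4):

import math

def f(x):
    # The example function from footnote 15.
    return x**4 + x**2 + x - math.log(x) + 1234

# For a few sample values x >= 6, check 1 * x**3 <= f(x) <= 5 * x**4.
for x in (6, 10, 100, 1000):
    assert x**3 <= f(x) <= 5 * x**4
print("bounds hold for all sampled values of x")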
Analogously to preferring the worst-case complexity over the best- and average-case complexities when analyzing the performance of an algorithm, it is common practice to provide an upper bound for an algorithm's running time instead of a lower or tight one. [16]
To provide an example of how to determine the running time of an algorithm and express it in O-Notation, we will now analyze an algorithm for a well-known problem in computational complexity, called Vertex Cover.
Definition 3.2 (Vertex Cover)
Input: A graph G = (V, E) and a parameter k.
Question: Is it possible to choose a set V' ⊆ V with |V'| ≤ k such that every edge in E has at least one endpoint in V'?
In Figure 3.2, a graph and one of its vertex covers is given in order to illustrate this definition. A very trivial algorithm A_VCtrivial for this problem would be to simply try all possible solutions of size k and see whether one of these hypothetical solutions is indeed a vertex cover for the given graph:

Algorithm: Trivial algorithm A_VCtrivial for Vertex Cover
Input: A graph G = (V, E) with V = {v_1, ..., v_n} and a parameter k
Output: "Yes" if G has a vertex cover of size k, "No" otherwise
01 for every k-sized subset V' of V do
02     if V' is a vertex cover for G
03         return "Yes"
04 return "No"
[15] For a concrete example, consider the function f(x) = x^4 + x^2 + x - ln x + 1234. If x ≥ 6, we have
f(x) = x^4 + x^2 + x - ln x + 1234 < x^4 + x^2 + x - ln x + 6^4 < 5 · x^4 =: 5 · g(x),
f(x) = x^4 + x^2 + x - ln x + 1234 > x^4 > 1 · x^3 =: 1 · g(x), and
1 · g(x) := 1 · x^4 < f(x) = x^4 + x^2 + x - ln x + 1234 < 5 · x^4 = 5 · g(x),
and therefore f(x) = O(x^4), f(x) = Ω(x^3), and f(x) = Θ(x^4).
[16] This is mainly due to the fact that tight bounds on the running time of algorithms are often neither intuitive nor easy to grasp mathematically.
[Figure 3.2: A graph G = (V, E) and a vertex cover of size 17 (black vertices) for G. Note how for each edge in G, at least one of its endpoints is in the given vertex cover. The shown vertex cover for G is optimal in the sense that there is no vertex cover for G with fewer than 17 vertices (this was verified using a computer program).]
We will now analyze the running time of A_VCtrivial in terms of the number of vertices (|V|) and the number of edges (|E|) in G. [17] Let us start with lines 03 and 04: since both terminate A_VCtrivial, they are executed at most once and thus do not play a role in the asymptotic running time of A_VCtrivial. Line 02 can be executed by calling the following subroutine: iterate over all edges of G, and check for every edge whether at least one of its endpoints is in V'. If we have been clever enough and marked those vertices that are in V' during the execution of line 01, executing this line only requires O(|E|) running time. For the seemingly most difficult line to analyze, line 01, we make use of the machine-independence of our analysis by using an algorithm for generating subsets from the extensive available literature on algorithms (e.g., [CLRS01], [Knut97]). [18] In [Knut03], we can find an algorithm that generates all k-sized subsets V' of V in

O( (|V| choose k) ) = O( |V|! / (k! · (|V| - k)!) ) = O( (1/k!) · |V| · (|V| - 1) · · · (|V| - k + 1) ) = O(|V|^k)

(the product in the third expression consisting of k factors)
time on a RAM-like machine. For finding the total running time of A_VCtrivial, it is now sufficient to observe that line 01 causes line 02 to be executed once for each of the at most |V|^k subsets generated. Taking into account the time requirements of line 02, the total running
[17] A quick glance at the algorithm demonstrates the advantage of all the conventions we have introduced above. E.g., if we were not to use the O-Notation for the worst-case bound we are about to determine, we would explicitly have to look at the exact number of steps a RAM needs to generate a subset in line 01, to store G, to determine whether V' is a vertex cover of G, and so on.
[18] Note that the RAM models in the literature do not necessarily need to be defined precisely the way we have. E.g., in the RAM model used in [Knut97], some computing steps take more than one unit of time. However, this is not important for the performance of an algorithm in O-notation: assume, for example, that each simple computational operation would consume four time units instead of one on a machine RAM′ as opposed to our RAM model. If there is an algorithm that, e.g., requires O(n^3) time on the RAM, it would require O(4 · n^3) = O(n^3) time on the RAM′.
time for A_VCtrivial is therefore bounded by

O(|V|^k · |E|).

This upper bound is quite unsatisfactory for practical applications, for it implies an enormous worst-case running time even for small graphs and small k. [19]
Note that from the discussion so far, it is not clear whether this is due to a poorly designed algorithm that we have come up with or a result of some "inherent complexity" of Vertex Cover. The next subsection and the following section will demonstrate that actually both are true, that is, Vertex Cover is believed to be "hard to solve" (we will define this more precisely in the next subsection), but there are ways of "taming" this inherent complexity, as will be shown in Section 3.3.
3.2.3 Complexity Classes
In the previous subsection, we have given an algorithm to solve Vertex Cover that was quite impractical for large input graphs. However, it was not clear whether this problem is hard to solve in general or whether we just have not come up with a good algorithm. We would now like to know whether there is a better algorithm for Vertex Cover than the one presented, or--even better--know the fastest possible algorithm for Vertex Cover (i.e., the minimum time complexity of Vertex Cover). The first request is comparatively easy to satisfy: we just have to look for an algorithm with a better worst-case running time than the one presented. The latter, however, is a lot harder to deal with, because finding a lower bound for the time complexity of Vertex Cover requires considering every conceivable algorithm for Vertex Cover--even algorithms that have not yet been found. [20] However, there is another way to approach the problem of complexity bounds, using reductions and complexity classes.
Reductions will allow us to divide problems into different "classes of difficulty". The idea behind this is the following: although we may not know how hard an individual problem is, we can relate problems to each other so that we know they are "equally hard" to solve, meaning that if there is a fast algorithm for one problem, there must be one for the other problem, too. A collection of such related problems is called a complexity class (a more formal definition will follow shortly). Problems are grouped together in complexity classes by finding a computationally "cheap" [21] transformation between instances of one problem and instances of the other. Then, loosely speaking, if one of the two problems turns out to be easy to solve, we also know that the second problem is easy to solve, because we can apply the algorithm for the easy problem to transformed instances of the other one. In a more formal fashion:
Definition 3.3 (Polynomial Time Reduction):
Given two languages L₁ ⊆ Σ₁* and L₂ ⊆ Σ₂*. We call L₁ polynomial-time reducible to L₂ (designated L₁ ≤_poly L₂) if there is a function R from Σ₁* to Σ₂* that can be computed in polynomial time on any x ∈ Σ₁* and for which
x ∈ L₁ ⇔ R(x) ∈ L₂.
[19] For example, finding out if a graph with 75 vertices and 200 edges has a vertex cover of size 10 would require c · 10^21 steps on our RAM, where c is some constant ≥ 1 omitted in the O-notation.
[20] Except for a few very rare cases of problems (such as sorting), the question of lower complexity bounds therefore generally remains unanswered.
[21] In our context, this will imply a polynomial running time with respect to the original instance's size.
In complexity theory, there are many more specialized reductions, some of which we will get to know in Section 3.3, that are more "delicate" in the sense that they impose stricter requirements on R than just being computable in polynomial time. However, we shall work only with polynomial time reductions for the rest of this section.
The concept of polynomial time reduction may be used to build a hierarchy of computational problems. This hierarchy consists of classes C; in each class, we find those problems that are solvable using the resources allowed by the respective class. Furthermore, we introduce the concept of completeness to identify those problems in a class that are computationally as hard to solve as any other problem in that class. In this way, if a problem that is complete for a class C should prove to be "easy" to solve, we know the same to be true for all other problems in C.
Definition 3.4 (Complexity Class Hardness and Completeness):
Let C be a complexity class. A language L is called C-hard if all languages in C can be
reduced in polynomial time to L. We call L C-complete if L is C-hard and L is in C.
There is a vast number of complexity classes known today (see, for example, [Aaro03]), each
of them grouping together problems with various properties. Two of the first classes that
were developed and are of much interest for this work are P and NP.
Definition 3.5 (P and NP):
The complexity class P is the class of computational problems that can be solved in polynomial time on a deterministic Turing Machine.
The complexity class NP is the class of computational problems that can be solved in polynomial time on a nondeterministic Turing Machine.
Although our definition uses the term "polynomial" to describe all problems in NP, it is widely believed that all NP-complete problems are only solvable in exponential time. The reason for this is the computational model underlying the definition: a nondeterministic Turing Machine is a very unrealistic model of computation, being able to--vaguely speaking--correctly "guess" the solution to a problem and then only needing to verify its correctness (the process of checking must then be done in polynomial time). However, all computers known today are deterministic, and therefore they have to "emulate" the guessing steps of the nondeterministic Turing Machine in order to find a solution to an NP-complete problem. This emulation is done by simply checking all candidates for a possible solution, which takes--in terms of worst-case complexity--an exponential amount of time.
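To make the "guess and then verify in polynomial time" intuition concrete, the following sketch (added here for illustration; the function name and graph representation are our own) shows a polynomial-time verifier for Vertex Cover. Given a guessed solution as a certificate, checking it only requires one pass over the edges, whereas a deterministic machine that has to produce the certificate itself may need to try exponentially many candidates:

def verify_vertex_cover_certificate(edges, k, certificate):
    # Polynomial-time verifier: check whether the guessed vertex set
    # (the certificate) really is a vertex cover of size at most k.
    cover = set(certificate)
    if len(cover) > k:
        return False
    # One pass over all edges -- linear, and thus polynomial, in the input size.
    return all(u in cover or v in cover for (u, v) in edges)

# Usage: for the triangle graph, {"a", "b"} is a valid certificate for k = 2.
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(verify_vertex_cover_certificate(triangle, 2, {"a", "b"}))   # True
print(verify_vertex_cover_certificate(triangle, 1, {"a"}))        # False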
It must be stressed that no proof whatsoever has been given that NP-complete problems are at best solvable in exponential time. All we have stated is a plausibility argument: there are thousands of problems known to be NP-complete (even the rather outdated list of [GaJo79] contains hundreds of NP-complete problems); finding a polynomial time algorithm for just one NP-complete problem would show that all NP-complete problems are polynomially solvable, but this has not happened in spite of over 25 years of research so far. There are therefore
two important things to be remembered throughout this work:
· When saying that a problem is harder than another one, we are always referring to
relative complexity bounds, i.e., we are saying that if a "harder" problem should turn
out to be efficiently solvable, so will the easier problem (but not vice-versa).
· Sometimes we will use an argument such as ". . . --since this would imply that P = NP"
in our proofs, which is based on the unlikeliness of P = NP. It would, however, be
more correct--albeit unusual--to write "Unless P = NP, the following holds true:
. . . ".
The Vertex Cover problem posed at the beginning of the previous subsection has been proven to be NP-complete in [GaJo79], where it is shown that Vertex Cover is even NP-complete for planar graphs in which each vertex has a degree of at most 3 [22]. For a long time, an NP-completeness proof for a problem was taken as a synonym for "unsolvable already for moderate input sizes" (coining the term "intractable"). However, this is not true in general, as the next section demonstrates.
3.3 Fixed-Parameter Tractability (FPT)
We have seen in the previous section how algorithms can be analyzed machine-independently by observing how they scale with the size of their respective input. This size we named n. We have also seen the class NP, reasoning that problems complete for this class most probably have a worst-case running time that is exponential, i.e., an NP-complete problem can only be solved in

Ω(a^n)

time for some a > 1. Since this usually implies unreasonably high running times for large n, problems that are NP-hard are also referred to as being intractable. We have also stated in the last section--citing from [GaJo79]--that the problem Vertex Cover (see Definition 3.2) is NP-complete. The NP-completeness of Vertex Cover implies that it is most probably only solvable in O(a^n) time, where n is the size of the input instance and a is some constant. However, this definition provides us with two loopholes:
· We have made no statement about the size of a. A small a could lead to algorithms
that are fast enough even for a fairly large n.
· We have made no good use of the fact that--besides the size of the input graph--any
instance of Vertex Cover contains a parameter k that might be restricted to a small
value. What if we could restrict the exponential complexity of Vertex Cover to the
parameter k, which is given along with the input graph according to Definition 3.2?
It might seem at first that observing these two "loopholes" is just splitting hairs over an imprecise definition, but in this section, we shall use exactly these two observations to develop a more efficient algorithm for Vertex Cover than the trivial O(|V|^k · |E|) algorithm used as an example for complexity analysis in the last section. After that, a short introduction to parameterized complexity theory is given.
3.3.1 An Efficient Algorithm for Vertex Cover
At the end of this section, we will have improved the O(|V|^k · |E|) algorithm for Vertex Cover given in the previous section to an O(2^k · |E|) algorithm. The strategy for this will be quite straightforward.
[22] Recall the definitions of "planar" and "degree" from Section 3.1.
Recall Definition 3.2. If a given graph G has a vertex cover of size k, then we can choose k vertices in G such that every edge in G includes at least one vertex from the cover. This means that if we search for a vertex cover V' for G, we can simply pick any edge e = {v_a, v_b} in G and then, knowing that either v_a or v_b must be in the vertex cover, consider two distinct cases for V': either V' contains v_a, or V' contains v_b. For each of these cases, we would then look at the uncovered edges, pick one, and again consider the two cases for putting a vertex of that edge into V' (the common term for this is to branch into those two cases). This recursive algorithm leads to a tree-like structure searching for vertex covers of size k for G--depicted in Figure 3.3--that is commonly referred to as a search tree. Note that with each level down the search tree, we have one vertex less left to form a vertex cover for G. If we cannot find a vertex cover for G in the kth level of the search tree, then, as we have tried all possibilities for a vertex cover of G, G has no vertex cover of size k.
The described algorithm can be rewritten in a more formal fashion:

Algorithm: search tree algorithm A_tree for Vertex Cover
Input: A graph G = (V, E) with V = {v_1, ..., v_n} and a parameter k
Output: "Yes" if G has a vertex cover of size k, "No" otherwise
01 if G contains no edges then
02     return "Yes"
03 if G contains edges and k = 0 then
04     return "No"
05 pick an edge e = {v_a, v_b} from G
06 V_a ← V \ {v_a}
07 E_a ← E \ {e ∈ E | v_a is an endpoint of e}
08 if A_tree with G_a := (V_a, E_a) and k - 1 as inputs returns "Yes" then
09     return "Yes"
10 V_b ← V \ {v_b}
11 E_b ← E \ {e ∈ E | v_b is an endpoint of e}
12 if A_tree with G_b := (V_b, E_b) and k - 1 as inputs returns "Yes" then
13     return "Yes"
14 return "No"
This algorithm is illustrated in Figure 3.3. So what is the running time of this algorithm? Without the recursion (i.e., calling A_tree as a subprocedure), A_tree would require O(|E|) time to generate G_a and G_b, since each edge must be looked at to see if it is adjacent to v_a or v_b, respectively (we shall assume that deleting a vertex from G takes constant time). The algorithm A_tree calls A_tree (with a different input; in particular, k is decreased by one) at most two times. Each of those calls again calls A_tree at most two times, and so on, until the algorithm is called with k = 0. This means that, in a worst-case analysis, A_tree is called
2 · 2 · 2 · · · 2 · 1 = 2^k

(one factor of 2 for each level of the search tree, from the initial parameter value k down to k - k + 1 = 1, and a final factor of 1 at level k - k = 0)
times. Each call itself takes--as mentioned above--O(|E|) time, which means that in total, A_tree requires at most

O(2^k · |E|)

time to solve a given instance (G, k) of Vertex Cover, a fairly large improvement compared to the trivial algorithm proposed in the last section; moreover, the exponential part in the running time of A_tree is independent of the size of G. This makes Vertex Cover a
[Figure 3.3: The search tree for finding a vertex cover of size k for a given graph: Given the graph G and a parameter k > 3, the figure demonstrates how a search tree algorithm would try to find a vertex cover of size k for G. In each node of the search tree, an edge e = {u, v} that is not yet covered is chosen from G (designated by coloring the adjacent vertices of e grey) and the algorithm branches into two cases: either u is in the vertex cover, or v. The respective vertex from G is chosen into the vertex cover (and can then, along with its adjacent edges, which are now covered, be removed from G), k is decreased, and the algorithm proceeds if no vertex cover for G has been found yet and we have not yet chosen k vertices into the cover.]
fixed-parameter tractable problem because, as long as k is constant, the time required to solve Vertex Cover on (G, k) using A_tree is polynomial (for Vertex Cover, even linear) with respect to the size of the input graph G.
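The same bound can also be read off from a simple recurrence (a brief sketch added here for illustration, not part of the original analysis): if T(k) denotes the worst-case running time of A_tree on a graph with edge set E called with parameter k, and c > 0 is a constant standing in for the O(|E|)-work per call, then

$$T(0) \le c \cdot |E|, \qquad T(k) \le 2\,T(k-1) + c \cdot |E| \quad \text{for } k \ge 1,$$

which unfolds to

$$T(k) \le c \cdot |E| \cdot (2^0 + 2^1 + \cdots + 2^k) = c \cdot |E| \cdot (2^{k+1} - 1) = O(2^k \cdot |E|).$$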
Note that A_tree is not the optimal fixed-parameter algorithm for Vertex Cover known today. In [CKJ01] and [NiRo03b], O(1.29^k)-algorithms for solving Vertex Cover are given.
This is done by optimizing the search tree: instead of branching into two subcases and decreasing k by one each time the algorithm is called recursively, the algorithm may branch into more complex cases, allowing it to decrease k by more than 1 in some branches of the tree. Using the mathematical tool of recurrence analysis, it can be determined how these more complex cases decrease the base of the exponent in the algorithm's running time. It should furthermore be noted that the algorithm uses the technique of problem kernel reduction [23] on Vertex
[23] Kernel reductions are based on the idea that, using the parameter k, we can already decide for some parts of the input instance how they will add to the solution of the problem. A problem kernel reduction causes the input instance to be smaller than f(k) for some function f whilst being computable in polynomial time.