On the Algorithmic Tractability of Single Nucleotide Polymorphism (SNP) Analysis and Related Problems
©2003
Diploma thesis
134 pages
Abstract:
This work brings together two areas of science--biology and informatics--that have only recently been connected in the emerging (and vastly growing) research field of bioinformatics. In order to achieve a common basis for Parts 2 and 3 of this work, Part 1 intends to introduce the computer scientist to the relevant biological background and terminology (Chapter 2), and to familiarize the biologist with the relevant topics from theoretical computer science (Chapter 3).
Chapter 2 first introduces some terminology from the field of genetics, thereby defining SNPs. We then motivate the analysis of SNPs by two applications: the analysis of evolutionary development and the field of pharmacogenetics. The field of pharmacogenetics in particular could have an enormous impact on medicine and the pharmaceutical industry in the near future by using SNP data to predict the efficacy of medication.
Chapter 3 gives a brief introduction to the field of computational complexity. We will see and motivate how algorithms are analyzed in theoretical computer science. This will lead to the definition of complexity classes, introducing the class NP which includes computationally hard problems. Some of the hard problems in the class NP can be solved efficiently using the tool of fixed-parameter tractability, introduced at the end of this chapter.
An important application of SNP data is in the analysis of the evolutionary history of species development (phylogenetic analysis; Part 2, Chapters 4 and 5). As will be made plausible in Chapter 5, using SNP data is--in many ways--superior to previous approaches of phylogenetic analysis. In order to analyze the development of species using SNP data, an underlying model of evolution must be specified. A popular model is the so-called perfect phylogeny, but the construction of this phylogeny is a computationally hard problem when there are inconsistencies (such as read-errors or an imperfect fit to the model of perfect phylogeny) in the underlying data.
Chapter 4 analyzes the problem of forbidden submatrix removal, which is closely connected to constructing perfect phylogenies--we will see in Chapter 5 that its computational complexity is directly related to that of constructing a perfect phylogeny from data which is partially erroneous. In this chapter, we analyze the algorithmic tractability of forbidden submatrix removal, characterizing cases where this problem is NP-complete (being fixed-parameter tractable in general).
Wernicke, Sebastian: On the Algorithmic Tractability of Single Nucleotide Polymorphism
(SNP) Analysis and Related Problems
Hamburg: Diplomica GmbH, 2003
Also: Eberhard-Karls-Universität Tübingen, diploma thesis, 2003
This work is protected by copyright. All rights deriving from this, in particular the rights of translation, reprinting, public presentation, extraction of figures and tables, broadcasting, microfilming or reproduction by other means, and storage in data processing systems, are reserved, even if only extracts are used. Any reproduction of this work or of parts of this work is, even in individual cases, permitted only within the limits of the provisions of the copyright law of the Federal Republic of Germany in its currently valid version and is, as a rule, subject to remuneration. Violations are subject to the penal provisions of copyright law.
The use of common names, trade names, product designations, and the like in this work does not justify the assumption, even without special marking, that such names are to be regarded as free in the sense of trademark and brand protection legislation and may therefore be used by anyone.
The information in this work has been compiled with care. Nevertheless, errors cannot be ruled out completely, and neither the Diplomarbeiten Agentur nor the authors or translators assume any legal responsibility or liability for any remaining incorrect statements and their consequences.
Diplomica GmbH
http://www.diplom.de, Hamburg 2003
Printed in Germany
Contents
1 Introduction .......... 1
1.1 The Human Genome and SNPs .......... 1
1.2 Overview of this Work .......... 2
2 Biological Background and Motivation .......... 4
2.1 Basic Genetic Terminology .......... 4
2.2 An Introduction to SNPs .......... 6
2.3 Importance and Prospects of SNP Mapping .......... 8
2.3.1 SNPs in the Study of Population History .......... 8
2.3.2 SNPs and Pharmacogenetics .......... 10
3 Computer Science Preliminaries and Notation .......... 14
3.1 Notation for Matrices and Graphs .......... 14
3.2 Crash Course in Computational Complexity Theory .......... 15
3.2.1 Machine-Independent Analysis .......... 15
3.2.2 Running Time--Keeping Score .......... 17
3.2.3 Complexity Classes .......... 21
3.3 Fixed-Parameter Tractability (FPT) .......... 23
3.3.1 An Efficient Algorithm for Vertex Cover .......... 23
3.3.2 Formal Definition and Aspects of FPT .......... 26
4 Submatrix Removal Problems .......... 29
4.1 Definitions and Terminology .......... 29
4.2 A Reduction to d-Hitting Set .......... 33
4.2.1 Finding Forbidden Submatrices .......... 33
4.2.2 Approximability and Fixed-Parameter Tractability Results .......... 35
4.3 Hardness Results .......... 37
4.3.1 Overview of Results--Four Theorems .......... 37
4.3.2 Proofs for Theorems 4.13 and 4.14 .......... 38
4.3.3 Proof of Theorem 4.11 .......... 45
4.3.4 Proof of Theorem 4.12 .......... 47
4.4 Discussion and Future Extensions .......... 51
5 Perfect Phylogeny Problems .......... 53
5.1 Phylogenetic Trees .......... 53
5.1.1 Introduction and Motivation .......... 53
5.1.2 Formal Definition .......... 54
5.2 Perfect Phylogeny Problems .......... 56
5.3 Relation to Forbidden Submatrix Problems .......... 58
5.4 Minimum Species Removal .......... 60
5.5 Minimum Character Removal .......... 64
6 Graph Bipartization .......... 67
6.1 Introduction and Known Results .......... 67
6.2 Reducing Edge Bipartization to Vertex Bipartization .......... 70
6.3 A Branch&Bound Approach .......... 73
6.3.1 Initial Heuristics .......... 74
6.3.2 Data Reduction Rules .......... 77
6.4 Implementation and Comparison of the Algorithms .......... 87
6.4.1 Using the Program .......... 87
6.4.2 Some Implementation Details .......... 88
6.4.3 Tests and Test Results .......... 90
7 Using Graph Bipartization in SNP Analysis .......... 96
7.1 Introduction and Overview of Results .......... 96
7.2 SNP Haplotype Assembly .......... 97
7.3 Inferring Haplotypes from Genotypes .......... 100
7.3.1 Minimum Genotype Removal .......... 106
7.3.2 Minimum Site Removal .......... 109
7.4 Testing Branch&Bound on SNP Data .......... 112
8 Conclusion .......... 115
8.1 Summary of Results and Future Extensions .......... 115
8.2 Acknowledgements .......... 117
List of Figures .......... 119
Bibliography .......... 120
Chapter 1
Introduction
1.1 The Human Genome and SNPs
Throughout its life, an individual's hereditary potentials and limits are determined by its
very own genes. Consequently, a lot of effort has been put into the Human Genome Project.
As of April 2003, 95.8% of the human genome has been sequenced at very high quality (see [NCBI03] for up-to-date information), and a goal has been set of obtaining a complete sequence by the end of this year as more and more chromosomes become fully mapped (see, e.g., [Heil03]). However--quoting from [WeHu02]--this achievement is merely the foundation
for far deeper research as
". . . we are ending the era of determining the sequence of the genetic code and
entering the beginning of the age of deciphering the biology of life underlying that
code."
Hearing that 95.8% of the human genome has been sequenced, one is immediately bound to ask which human's genome we are actually referring to--after all, every human has a unique genetic makeup. At the beginning of the sequencing process by the Human Genome Project, geneticists thought they would indeed have to make a choice as to who would be chosen to provide a reference sequence of the four nucleotides A, C, G, and T of his DNA.[1] One individual would surely constitute a "blueprint" of the human species; the study of genetic variation among our species would, however, gain little insight from this [Chak01].
Fortunately, the sequence obtained by the Human Genome Project was not obtained through making that kind of choice: genetic variation among humans can--in almost every case--be traced back to variations that occur within a single nucleotide. Such a site, where two different nucleotides are found in two different DNAs, is commonly referred to as a Single Nucleotide Polymorphism (SNP, pronounced "snip") [HGSC01]. A simple definition is given by [Ston01]:
". . . DNA is a linear combination of four nucleotides; compare two sequences,
position by position, and wherever you come across different nucleotides at the
same position, that's a SNP."
[1] A thorough introduction to genetic terminology, including SNPs, is given in Chapter 2.
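The position-by-position comparison described in the quote above is easy to make concrete. The following Python sketch is an illustration added here, not part of the original thesis; the function name find_snps and the toy sequences are hypothetical, and it assumes two already aligned sequences of equal length.

```python
# Illustrative sketch (not from the thesis): finding SNP sites by comparing two
# aligned DNA sequences position by position. Names are hypothetical examples.

def find_snps(seq_a, seq_b):
    """Return the 0-based positions at which the two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

if __name__ == "__main__":
    # Toy example; real SNP detection additionally requires that a variation
    # occurs in at least 1% of a population before it is called a polymorphism.
    print(find_snps("ACCGATATCGGA", "ACCGTTATCGGT"))  # -> [4, 11]
```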
During the Human Genome Project, 1.4 million sites of genetic variation have been mapped [SNP01] along a reference sequence composed of hundreds of different genomes. The reason why it was possible to combine such a multitude of different individuals' genomes into a coherent map of the human genome lies in the fact that DNA is mostly conserved around SNPs. The importance of SNPs is outlined, e.g., in [Ston01], who refers to them as the "bread and butter of DNA sequence variation" for they are witnesses of unique past mutations in our genetic makeup. Therefore, SNPs can give valuable hints about common evolutionary ancestors. But there is an even greater economic impact: as genes are widely held responsible for the likelihood of acquiring certain diseases and for the responsiveness to various medical treatments, SNPs can either be made directly responsible for such a variation or they may at least aid in the identification of the corresponding gene.[2]
In this work, we shall deal with topics from the field of theoretical bioinformatics that are connected to SNPs. Recent research [LBILS01, EHK03] has shown that all the useful applications and prospects of SNP data come at a price: Many computational problems arising during the acquisition and application of SNP data have been proven to be computationally "hard", meaning that they are widely believed to be impossible to solve in reasonable time.[3]
However, there are techniques such as fixed-parameter tractability and data-reduction (both
to be introduced in more detail throughout this work) that allow even "hard" problems to
be solved efficiently in practical applications. This work explores the possible use of these
techniques for computationally hard problems connected to SNP analysis.
1.2 Overview of this Work
The main part of this work (Chapters 2 to 7) can be divided into three parts:
· Part 1 (Chapters 2 and 3): Introducing the Terminology
This work brings together two areas of science--biology and informatics--that have
only recently been connected in the emerging (and vastly growing) research field of
bioinformatics. In order to achieve a common basis for Parts 2 and 3 of this work,
Part 1 intends to introduce the computer scientist to the relevant biological background
and terminology (Chapter 2), and to familiarize the biologist with the relevant topics
from theoretical computer science (Chapter 3).
Chapter 2 first introduces some terminology from the field of genetics, thereby defining
SNPs. We then motivate the analysis of SNPs by two applications: the analysis of evolutionary development and the field of pharmacogenetics. The field of pharmacogenetics in particular could have an enormous impact on medicine and the pharmaceutical industry in the near future by using SNP data to predict the efficacy of medication.
Chapter 3 gives a brief introduction to the field of computational complexity. We will
see and motivate how algorithms are analyzed in theoretical computer science. This
will lead to the definition of "complexity classes", introducing the class NP which
includes computationally hard problems. Some of the hard problems in the class NP
can be solved efficiently using the tool of fixed-parameter tractability, introduced at
the end of this chapter.
[2] Section 2.3 gives a detailed introduction to the prospects of SNP mapping and analysis.
[3] Chapter 3 introduces the topic of computational hardness in more detail.
· Part 2 (Chapters 4 and 5): Applying SNP Data (Perfect Phylogenies)
An important application of SNP data is in the analysis of the evolutionary history of
species development (phylogenetic analysis). As will be made plausible in Chapter 5,
using SNP data is--in many ways--superior to previous approaches of phylogenetic
analysis. In order to analyze the development of species using SNP data, an under-
lying model of evolution must be specified. A popular model is the so-called perfect
phylogeny, but the construction of this phylogeny is a computationally hard problem
when there are inconsistencies (such as read-errors or an imperfect fit to the model of
perfect phylogeny) in the underlying data.
Chapter 4 analyzes the problem of "forbidden submatrix removal" which is closely
connected to constructing perfect phylogenies--we will see in Chapter 5 that its com-
putational complexity is directly related to that of constructing a perfect phylogeny
from data which is partially erroneous. In this chapter, we analyze the algorithmic
tractability of "forbidden submatrix removal", characterizing cases where this problem
is NP-complete (being fixed-parameter tractable in general).
Chapter 5 introduces the concept, motivation, and some known results for phylogenetic
analysis. We then apply the results from Chapter 4 to perfect phylogeny problems,
i.e., the problem of dealing with data-inconsistencies with respect to the underlying
evolutionary model of perfect phylogeny. It will be shown that these problems are all
fixed-parameter tractable and can be efficiently solved using existing algorithms.
· Part 3 (Chapters 6 and 7): Obtaining SNP Data
Basically, obtaining SNP data requires sequencing two DNA strands and comparing
them to each other. The problems lie in the details: Firstly, current techniques only
allow sequences of at most 500 base pairs in length to be sequenced as a whole, and
secondly, it is--in terms of cost and labor--often only possible to detect the presence
of SNP sites rather than being able to tell which of the two DNAs contained which
base. Part 3 of this work analyzes the computational complexity of these two problems
by relating them to a graph-theoretic problem[4] called Graph Bipartization.
Chapter 6 introduces the computationally hard problem of Graph Bipartization,
stating some known results and showing the relative hardness of the two Graph
Bipartization-problem variants Edge Bipartization and Vertex Bipartization
(the latter one of which is proven to be at least as hard as the former one). Following
this introduction, we develop and test practical algorithms for Graph Bipartization.
These algorithms--although they require a long time even for medium-sized general
graphs--prove to be efficient for the Graph Bipartization problems that arise during
the acquisition of SNPs, even if these graphs contain a few hundred vertices.
Chapter 7 introduces a formal definition of the computational problems of SNP analysis
and proves their close relationship to Graph Bipartization. The last section of this
chapter shows that the algorithms developed in Chapter 6 can be used to efficiently
solve the presented problems by solving their corresponding Graph Bipartization
problem.
This work is concluded by Chapter 8, presenting a summary of results and suggestions for
future research related to this work.
[4] Graphs are introduced in Chapter 3.
Chapter 2
Biological Background and Motivation
In this chapter, we establish some basic terminology from the field of genetics used throughout this work.[1] Afterwards, we introduce SNPs and current techniques used to detect and
map them. The last section of this chapter provides an introduction to pharmacogenetics--
the area that sparked economic and scientific interest in SNPs.
2.1 Basic Genetic Terminology
All living organisms encode their genetic information in the form of deoxyribonucleic acid
(DNA, for short). DNA is a double-helix polymer where each strand is a long chain of
polymerized monomer nucleotides.[2] Basically, there are four different nucleotides: to a deoxyribose sugar bonded with a phosphate, one of four possible bases is attached to form a nucleotide; these possible bases are the two purines (adenine and guanine) and the two pyrimidines (cytosine and thymine).[3] For abbreviation, the nucleotides in a strand of DNA
are denoted by the first letter of their respective base (adenine by A, guanine by G, cytosine
by C, and thymine by T). The nucleotides are joined to form a single strand of DNA by covalently[4] bonding a phosphate of one nucleotide with the sugar of the next; a strand starts with a sugar (this end is called the 3'-end) and ends with a phosphate (called the 5'-end).[5] Two single strands of DNA are held together by hydrogen bonds between the bases of opposing nucleotides in the double strand. These bonds specifically bind adenine with thymine and cytosine with guanine. The structure of DNA is shown in Figure 2.1.
[1] For a more thorough introduction to genetics, see, e.g., [GMS00], [GGL02], or the chapters on genetics in biochemistry books such as [VoVo95] or [BJS02].
[2] In general, polymer designates the class of very large molecules (macromolecules) that are multiples of simpler chemical units called monomers.
[3] Actually, DNA has a much less homogeneous buildup than this because the nucleotides can be further modified by an organism at their deoxyribose sugar. For example, bacteria use such a modification to be able to distinguish their own DNA from foreign DNA coming, e.g., from a phage (bacterial virus). However, such modifications will not have to concern us in this work because even in their presence, the basic principles of DNA replication and translation hold.
[4] A covalent bond is the interatomic linkage that results from two atoms forming a common electron orbital.
[5] The terms 3' and 5' are due to the enumeration of the carbon atoms in the deoxyribose sugar.
[Figure 2.1: Chemical structure of DNA (above) and its abbreviated notation (below). The letters within the molecular structure stand for phosphate (P), deoxyribose (D), adenine (A), guanine (G), cytosine (C), and thymine (T); dashed vertical lines indicate hydrogen bonds. The abbreviated notation shows the complementary strands 3'--ACCGATATCGGA--5' and 5'--TGGCTATAGCCT--3'.]
Although being just two strands of bases attached to a phosphate-sugar backbone, DNA can be extremely long by molecular measures.[6] In order for the DNA to fit into a single cell[7]
whilst still
being accessible for replication and transcription into RNA to make proteins, human DNA
is organized into very dense complexes of proteins and DNA, called chromosomes. Humans
have 22 pairs of autosomes[8] and one pair of sex chromosomes (with females carrying two "X" chromosomes and men one "X" and one "Y" chromosome) within--almost--each cell.
The complete DNA sequence of an organism is called its genome; its genetic constitution as a whole is called its genotype.[9] Genetic areas of interest in a genome are called loci.[10] A gene is a unit of DNA that encodes hereditary information, i.e., the sequence of all proteins expressed by an organism, on a locus of an individual's chromosome.[11] Any one of two
or more genes that may occur alternatively at a given locus on a chromosome is called an
allele. A combination of alleles that is likely to be inherited as a whole and may be found on
one chromosome is called a haplotype. The sequence of DNA within a gene determines the synthesis of proteins; experiments indicate that each gene is responsible for the synthesis of one protein. Each one of the 20 proteinogenic amino acids[12] is encoded by one or more triplets of bases. Mutations, i.e., disruptions altering the genetic information (and therefore in most cases the corresponding protein as well), may be due to deleting, inserting, replacing, or rearranging nucleotides in a gene; they are responsible for the unique individual genetic makeup of organisms. As already mentioned above, a human cell contains two copies of every chromosome (excluding the gender-specific chromosomes X and Y), where one copy is inherited from each parent.
[6] E.g., the diploid DNA of a human being has a total length of approximately 1.8 m [VoVo95].
[7] With a few exceptions, every cell in a living organism contains its whole hereditary information.
[8] Autosomes are those chromosomes that control the inheritance of all characteristics except sex-linked ones.
[9] The genotype of an organism is the basis for its phenotype, where phenotype denotes the two or more distinct forms of a characteristic within a population.
[10] "Loci" is the plural form of "locus".
[11] Note that the majority of DNA is presumed not to contain any genetic information [VoVo95].
[12] Proteins are basically chains of polymerized amino acids.
[Figure 2.2: Mapping SNPs by comparison of two individuals' DNA sequences (first strand: 3'--ACAAAACTGAGCACTATTGGATCTACGACTGT--5'; second strand: 3'--ACTATACTCAGCACTCTAGCATCTACGACTCT--5'); positions at which the two strands differ are marked as SNP sites in the SNP map. Note that, as mentioned in the text, a single nucleotide variation must occur in at least 1% of a population's individuals in order to be called a "polymorphism" instead of a "substitution".]
Since each parent has its own unique genetic makeup, equivalent genes in the two chromosomes may differ. Identical alleles on both chromosomes are referred to as being homozygous; different alleles are denoted heterozygous.
2.2 An Introduction to SNPs
A polymorphism is a region of the genome that varies between different individuals.[13] Consequently, a single nucleotide polymorphism (SNP, pronounced "snip") is a genetic variation caused by the change of one single nucleotide (see Figure 2.2). These variations occur quite frequently among humans--on average, a SNP may be found approximately every 1.91 · 10^3 bases ("1.91 kilobases"), implying that over 90% of sequences longer than 20 kilobases will contain a SNP [Chak01]. SNPs are not evenly distributed across chromosomes; most genes contain just one or two SNPs. Currently, 93% of all genes are known to contain at least one SNP [SNP01]. Depending on whether they are found within genes or not, SNPs are labeled either cSNPs (coding SNPs) or ncSNPs (non-coding SNPs). Generally, ncSNPs appear more frequently than cSNPs [Mu02]. Recall from the last section that more than one triplet of bases may encode a certain amino acid. Often, triplets that encode the same amino acid differ in a single nucleotide from each other. If a cSNP does not introduce an amino acid change in the encoded protein, it is named an sSNP (synonymous SNP), and an nsSNP (non-synonymous SNP) otherwise.[14] In the human genome, the ratio of sSNPs to nsSNPs is approximately one to one [Carg99].
Before outlining some prospects and the scientific as well as economic impact of SNP analysis,
we will now give a brief overview as to how SNPs are identified. In his survey on the usage
of SNPs as a tool in human genetics, Gray [GCS00] names four methods for SNP detection:
[13] More precisely, a polymorphism has been defined as "the least common allele occurring in 1% or greater of a population" [Mare97], thereby distinguishing a polymorphism from a substitution that may occur in less than 1% of a population's individuals.
[14] For example, there is an nsSNP in the gene for the HLA-H protein, where a crucial disulfide bond is disrupted by changing the 282nd amino acid from cysteine to tyrosine, causing a metabolic disorder known as hereditary hemochromatosis [PSS97].
· Identification of single strand conformation polymorphisms (SSCPs): In this technique,
DNA fragments of a locus containing the presumed SNP are amplified (i.e., multiplied into many identical fragments) using PCR amplification.[15] These fragments are then
put on a polyacrylamide gel to which a current of diluting liquid is applied. Due to
different folding of DNA fragments with different sequences, the speed of fragments
will differ if they contain SNPs. The presence of SNPs may afterwards be confirmed
by sequencing the respective patterns. This method is widely deprecated because of
its low throughput and sometimes poor detection rate of about 70%.
· Heteroduplex Analysis: During PCR amplification of an individual that is heterozy-
gous for a SNP, a heteroduplex[16] may be formed between two strands that are com-
plementary to each other with exception of the SNP site. These heteroduplexes can
then be detected either as a gel band (analogously to SSCP detection) or using high-
performance liquid chromatography (HPLC). This SNP detection method combines
reasonable throughput rates of 10 minutes per sample with a high detection rate be-
tween 95% and 100%.
· Direct DNA sequencing: This is the currently favored high-throughput method for
detecting SNPs. According to [Carg99], almost a million base pairs can be analyzed
in 48 hours with detection rates for heterozygotes ranging between 95% (using cheap
"dye-terminator sequencing") and 100% (using a more expensive and laborious method
known as "dye-primer sequencing"). Dye-terminator sequencing has been used by the
SNP Consortium [Hold02] which published over 1.4 million SNPs in human DNA
[SNP01]. Comparing equal loci in different versions of high-quality DNA sequences
has recently led to an increase of in silico detection of SNPs.
· Variant detector arrays (VDAs): In this technique, glass chips with arrays of oligonu-
cleotides are used to bind specific sequences derived in PCR amplification. VDA has
a quality comparable to dye-terminator sequencing and is especially useful in rapidly
scanning large amounts of DNA sequence.[17]
It should also be stressed that for SNP detection, an appropriate set of alleles from which
SNPs are to be inferred needs to be chosen, as the different alleles occur with quite different
frequencies in different populations (such as human ethnic groups).
In Chapter 7, we will be concerned with the algorithmic tractability of two problems that
arise during the identification of SNPs: First, the sequencing of chromosomes in order to
obtain haplotypes has to deal with some errors in the reading and assembling process of the
DNA sequences (this will be discussed in more detail in Chapter 7). Second, haplotypes
are--due to prohibitively high cost and labor--seldom identified by sequencing single
chromosomes. Rather, genotype information (both copies of a chromosome) is obtained,
from which haplotypes can be inferred under certain assumptions. We will see in Chapter 7
that both problems are closely related (in a certain way even equivalent) to a problem called
"graph bipartization" for which we will develop efficient algorithms in Chapter 6.
[15] The polymerase chain reaction (PCR, for short) can quickly and accurately make numerous identical copies of a specific DNA fragment. A PCR machine is capable of producing billions of copies of a DNA fragment in just a few hours. PCR is a widely used technique in diagnosing genetic diseases, detecting low levels of viral infection, and for genetic fingerprinting in forensic medicine.
[16] A heteroduplex may either be a piece of DNA in which the two strands are different, or it is the product of annealing a piece of mRNA and the corresponding DNA strand.
[17] SNP identification through arrays is a rapidly growing market responsible for the recent development of biotechnology companies such as Affymetrix, Applied Biosystems, Marligen Biosciences, and Orchid Biosciences, among many others.
2.3 Importance and Prospects of SNP Mapping
SNPs are mainly useful for two areas of research: the study of population history (e.g.,
see [BBNE03]) and--an area of great economic significance--pharmacogenetics (e.g., see
[Rose00]). In this section, we will introduce both applications. The algorithmic tractability
of some problems in the study of population history using SNPs is dealt with in Chapter 5.
2.3.1 SNPs in the Study of Population History
SNPs often are a basis for various studies of population history (e.g., see [Tish96], [Tish00],
and [Mu02]). Historically, such studies employed gene trees of non-recombining loci inherited
from one parent such as mitochondrial DNA or the Y chromosome [Avis94]. A disadvantage
of this approach is that such loci are subject to many stochastic parameters, which in turn required vast numbers of loci to be analyzed to gain confidence in the results. The preference for SNPs as genetic markers arose from this problem, as SNPs
provide a broad range of unlinked nuclear genetic markers and are thus able to capture "a
genome-wide picture of population history" [Niel00]. Furthermore, SNPs are advantageous
over previous methods such as using microsatellites[18] for population history studies because
they show a very favorable mutation pattern and greatly simplify the task of unbiased
sampling of genetic variation [BBNE03]:
· Mutation pattern: Microsatellites have a mutation rate of 10^-4 per generation as opposed to the rate of 10^-8 displayed by SNPs. This makes multiple mutations for a single SNP unlikely; therefore, only two alleles exist for most SNPs. Such a property
greatly facilitates populational analysis--e.g., we will make use of it in the algorithmic
inference of haplotypes from genotypes in Chapter 7. Furthermore, mutations in SNPs
are more evenly distributed than in microsatellites, where the mutation rate is often
hard to estimate [BBNE03].
· Unbiased Sampling: Due to their uniform mutation rates, SNPs may be selected at
random in populational studies, avoiding previous bias that arose due to the fact that
often, only loci with well known mutation rates would be chosen for analysis. Fur-
thermore, [BBNE03] suggests that cross-species analysis of SNPs can provide greater
insight into the naturally occurring rate of genome-wide variation than biased loci such
as microsatellites.
SNPs have been of great interest in populational studies due to the phenomenon that often
there is a collection of SNP sites where the individual SNPs are not independent of each
other; rather, a phenomenon called linkage disequilibrium is observed.[19] This phenomenon
refers to the fact that often, haplotype combinations of alleles at different loci are correlated
in their appearance frequency, forming a so-called linkage disequilibrium block (LD block)
[DRSHL01]. The size of LD blocks is often debated and ranges from size suggestions of a few
kilobases (in empirical research [Dunn00] and computer simulation experiments [Krug99])
to more than a hundred thousand kilobases [Abec01].
[18] Microsatellites are stretches of DNA consisting of tandem repeats of a small, simple sequence of nucleotides, e.g., GCTGCTGCT. . . GCT. In the literature, microsatellites are also often referred to as simple tandem repeats (STRs).
[19] For the human genome, linkage disequilibrium was studied, e.g., in [Reic01].
"populational bottleneck"
population narrows
e.g., by selection process
States of first, fifth and seventh SNP site correlate due to
the respective common ancestral haplotype
SNP maps for the individuals of a large initial population:
states of different SNP sites show no correlation
population redevelops
from bottleneck haplotypes
Figure 2.3: Development of linkage disequilibria in SNP sites: An initial, genetically diverse
population is drastically reduced in its number of individuals (e.g., by selection processes
in a unique environment). This causes a "populational bottleneck" where only a very few
different haplotypes remain within a population. Redevelopment of a population based on
this non-diverse genetic material causes linkage disequilibrium in the alleles, which decays
over time due to mutations and recombination.
Independent of the discussed size, the presence of linkage disequilibrium is believed to reflect the fact that haplotypes descended from common ancestral chromosomes; linkage disequilibrium may therefore also be an indicator for populational bottlenecks[20] [Mu02]. The phenomenon of linkage disequilibrium relating to SNPs is illustrated in Figure 2.3.
Linkage disequilibrium of individuals' genes within a population "decays" with population
history due to recombination [HaCl97]. It is believed that linkage disequilibrium around
common alleles is a lot less frequent than around rare alleles, which are generally younger
and thus less decayed by recombination [Watt77]. Using these assumptions, Reich et al.
[Reic01] have shown that they can relate some linkage disequilibria to events such as the
last glacial maximum 30 000-15 000 years ago, migration patterns in ancient Europe, or
the dispersal of anatomically modern humans in Africa.
[20] A populational bottleneck is a period in population history where there are very few individuals in the population. These individuals then gave rise to the haplotypes found in a population today--conserved genetic patterns in the haplotypes can therefore be traced back to the respective ancestral individuals that lived during the bottleneck.
The article mentions that for some populations not as large as Europeans ([Reic01] mentions Yorubans as an example), the
resolution of their linkage disequilibrium blocks is too coarse, but nevertheless--referring to
studies such as [Tish96], [Tish00], or [Mate01]--" [. . . ] simultaneous assessment [of linkage
disequilibria] at multiple regions of the genome provides an approach for studying history
with potentially greater sensitivity to certain aspects of history than traditional methods
[. . . ]" [Reic01]
In addition to enabling deep biological insights into species development and population history, the study of linkage disequilibria and the search for common ancestors of species might also have a high economic and health-policy impact. One example of this is the recent study of the malaria parasite Plasmodium falciparum in [Mu02]; this parasite has been of intense interest since it infects hundreds of millions of people each year, being responsible for almost 3 million annual deaths [Brem01].[21]
An effective vaccine against
malaria, for example, must trigger an immune response that is equivalent or superior to
the one gained by contact with natural antigens. By finding some common SNP regions in
different Plasmodium falciparum populations, it is the hope of current research to build an accurate map of the ancestral relationship of various Plasmodium falciparum strains. Such a
map of ancestral relationships could help in identifying common antigens for immunizations
[Gard98].
It was conjectured in [RLHA98] that the human malaria parasite experienced a populational
bottleneck about 5000 years ago, further sparking the hope that it would be possible to find
some common drug targets among malaria parasites. Although an extensive study of SNPs
on the Plasmodium falciparum genome carried out by Mu et al. [Mu02] has shown that this is probably not the case and Plasmodium falciparum is rather a "quite ancient and
diverse" population (with the most recent common ancestor being a few hundred thousand
years old), it is still hoped that some more recent common ancestors of different strains can
be found in order to obtain an assay of promising drug targets and vaccines:
"For the first time, a wealth of information is available [. . . ] that comprise the
life cycle of the malaria parasite, providing abundant opportunities for the study
of [the . . . ] complex interactions that result in disease." [Gard98]
Although the genetic sequence and the insights gained by SNPs alone are no cure for malaria
and other widespread diseases, they seem to be a promising start.
The study of populational history based on traits of individuals (which--among others--may
be the presence of highly correlated SNP sites) and its algorithmic tractability is studied in
Chapter 5 of this work, where a special model of analysis called perfect phylogeny will be
employed. As will be seen in Chapter 5, SNPs provide very good data for this model due to
their very low mutation rate.
2.3.2 SNPs and Pharmacogenetics
The understanding of SNPs is believed to be a key to the research area known as pharmaco-
genetics. Using SNPs in pharmacogenetics is of immense economic interest to pharmaceu-
tical companies. It has led to the founding and funding (hundreds of millions of US dollars)
of the SNP Consortium [Hold02], a joint effort of major pharmaceutical companies such as
Bayer, Bristol-Myers Squibb, Glaxo Wellcome, Aventis, Novartis, Pfizer, Roche, SmithKline
[21] 2001-2010 has been named the "Malaria Rollback Decade" by the WHO [WHO03] to emphasize efforts being made in limiting the spread of malaria.
Beecham, and Zeneca. Interdisciplinary connections of the SNP Consortium include IBM
and Motorola.
The problem with a lot of drug therapies is the possibility of adverse drug reactions by
patients: Research by Lazarou, Pomeranz, and Corey [LPC98] suggests that, in 1994, such
reactions were responsible for millions of hospitalizations and almost a hundred thousand
deaths. This value is not likely to have improved lately and is hindering the introduction of
new medications that are effective in most patients but pose unbearable risks: For example,
the quite effective anticonvulsant drug Lamictal© by Glaxo Wellcome is only reluctantly prescribed because of a potentially fatal skin rash that arises as a side effect in five percent of all patients taking the drug [Maso99]. The problem of the different effects drugs exert on patients has long been known and studied; already over a hundred years ago, Sir William Osler[22] reflected:
"If it were not for the great variability among individuals, medicine might as well
be a science and not an art." (as cited by [Rose00])
Pharmacogenetics is an area of research that studies how genetic variation influences a patient's responsiveness and responses to drugs (a good introduction to pharmacogenetics is, e.g., [Rose00]), thereby trying to give physicians the possibility of using objective data about a patient's likelihood of reacting to prescribed drugs in a predictive way. The basic idea in
pharmacogenetics is to build a profile of an individual's genetic variations in order to predict
effectiveness and side-effects of drugs. As was discussed above--since genetic variation is
mainly due to SNPs--a hope of pharmacogenetics relies on building an accurate map of
SNP haplotypes.
Roughly speaking, the hope is to identify linkage disequilibrium loci around certain genes that are suspected of causing a certain adverse reaction to drugs. The same technique has
already been applied in the study of individuals' susceptibility to certain complex genetic
diseases such as Alzheimer's disease: In an analysis of polymorphisms on the ApoE gene
locus on chromosome 19, Martin et al. [Mart00] reported the detection of those SNPs in linkage disequilibrium that are associated with Alzheimer's disease. A number of other
studies successfully related the susceptibility for complex genetic diseases such as migraine
with aura, psoriasis, and insulin-independent diabetes mellitus to certain SNPs in linkage
disequilibrium [Rose00]. Now, just as linkage disequilibria can be related to the susceptibility
for diseases, they can also be related to certain drug reactions. Two good examples are patients' reactions to nortriptyline and beta-2-agonists:
· Nortriptyline is a medication against depression which is converted to an inactive
compound within the body by drug metabolizing enzymes called the cytochrome P450
enzymes. Specifically, an enzyme labeled CYP2D6 is key to inactivating nortriptyline and removing the inactivated substance from the body--except in some people that
have variations in their CYP2D6 encoding gene. These variations may lead to two
undesired effects [DeVa94]: People referred to as "ultra metabolizers" have a variation
that causes the synthesis of too much CYP2D6 in their body, thus inactivating so
much nortriptyline that these people are likely to receive insufficient antidepressant
effects from nortriptyline. A more dangerous variation is found in people referred to
as "poor metabolizers" who do not synthesize sufficient amounts of CYP2D6--these
[22] Sir William Osler, Baronet (1849-1919), Canadian physician and professor of medicine who played a key role in the transformation of the curriculum of medical education in the late 19th and early 20th century.
[Figure 2.4: Profiling SNPs in pharmacogenetics: SNP genotype profiles of patients in a clinical trial are split into "effective" and "ineffective" groups and compared to obtain SNP genotype predictors for efficacy. If there is a section of the SNP genotype profile that proves to be different in patients where a drug is effective as opposed to patients where a drug shows no efficacy or undesired side-effects, this region can be used to predict the effectiveness and potential risks due to side effects in a patient before a drug is prescribed.]
· Beta-2-agonists such as albuterol are important to the treatment of asthma. In-
teracting with the beta-2-adrenergic receptors in the lung, they cause the freeing of
airways by inducing muscle relaxation in the lung muscles. A SNP in the gene encoding
the beta-2-adrenergic receptor causes the carriers of one SNP variant to express fewer
of these receptors; these carriers therefore receive little relief of asthma symptoms from a standard dose of albuterol [Ligg97]. Testing for the presence of the specified SNP in patients makes it possible to clearly identify those 45% of the North American population[23] who can only poorly control their asthma with beta-2-agonists.
It is clear that the prospect of being able to predict the efficacy of a drug whilst minimizing the risk of side-effects is of great interest to the pharmaceutical industry, which could then--as Roses proposes in [Rose00]--create efficacy profiles for patients (see Figure 2.4) already in phase II[24] clinical trials of medication.
[23] This figure should be similar for Europeans.
[24] Clinical trials are divided into five steps: Preclinical research includes controlled experiments using a new substance in animals and test tubes and may take several years. Phase I trials are first tests of the investigated drug on humans; doses are gradually increased to ensure safety. Phase II trials gather information about the actual efficacy of a drug. Phase III trials study a drug's effects with respect to gender, age, race, etc.; a successful phase III trial leads to the admission of a drug to the public market. Occasionally, phase IV trials are conducted that are--in principle--phase III trials on an even broader variety of patients.
Abbreviated SNP profiles[25] could be used to record adverse reactions in patients and thus make it possible to detect even the rarest adverse events in advance. Furthermore, the development of new, more effective drugs can be facilitated: the parallel development of drugs targeting specific symptoms becomes easier because patients who do not respond to a certain medication can be profiled in early clinical trial stages. Additionally, the development of medications which are highly effective in only a comparatively small part of a population (e.g., a medication with a 30% response rate) becomes profitable, as they may be specifically prescribed to patients for whom the respective substance will be effective.
Pharmacogenetics relying on SNP linkage analysis seems to be a promising start to replacing
trial-and-error prescriptions with specifically targeted medical therapies.
[25] Abbreviated SNP profiles contain only the SNP information of a patient that is relevant for a drug efficacy prediction. The introduction of abbreviated profiles plays an important role in the discussion about the fear of "individual DNA profiling" because they cannot be traced back to a patient.
Chapter 3
Computer Science Preliminaries and Notation
The first section of this chapter introduces the notation used throughout this work, followed
by a brief introduction to those ideas in computational complexity that are important to this
work. Especially, the last section focuses on fixed-parameter tractability, laying a foundation
for the computational complexity analysis in the following chapters.
3.1 Notation for Matrices and Graphs
Matrices. By an n × m matrix A we are referring to a rectangular arrangement of n · m elements into n rows and m columns. By a_ij we designate the element in A that may be found at the jth position of the ith row. We will use the terms A and (a_ij) synonymously.
Graphs. A graph consists of vertices and edges, where a vertex is an object with a name and other properties (such as a color) and an edge is the connection of two vertices. We will denote a graph G with n vertices and m edges by (V, E) where V = {v_1, . . . , v_n} and E = {e_1, . . . , e_m} ⊆ V × V. In this work, only simple, undirected graphs are considered, meaning
· a vertex cannot be connected to itself by an edge,
· no two vertices may be connected by more than one edge, and
· an edge leading from a vertex u to a vertex w also leads from w to u.
By the term subgraph G' = (V', E') we are referring to a graph with V' ⊆ V and E' ⊆ E ∩ (V' × V'). Given a set V' ⊆ V, the graph G' = (V', E') with E' = E ∩ (V' × V') is called the subgraph induced by V' in G.
A vertex v is said to have degree d--denoted by degree(v) = d--if there are exactly d edges
in G that are adjacent to v.
A path p of length ℓ in a graph G = (V, E) is a sequence of ℓ + 1 distinct vertices v_1 v_2 . . . v_{ℓ+1} in G such that for each 1 ≤ i ≤ ℓ, v_i and v_{i+1} are connected by an edge. A cycle of length ℓ in G is a sequence of ℓ vertices v_1 v_2 . . . v_ℓ v_1 in G such that v_1 . . . v_ℓ is a path in G and {v_ℓ, v_1} ∈ E. We call G = (V, E) a tree if it is connected and contains no cycles; a tree containing a specially designated node[1]--called the root of the tree--is called rooted. Nodes of degree 1 in a rooted tree are called leaves. A graph G = (V, E) is called connected if any two vertices u, v ∈ V are connected by a path in G. A subgraph of G that is maximally connected with respect to its number of vertices is called a connected component.
A graph G = (V, E) is called bipartite if we can divide the set V of vertices into two disjoint subsets V_1 and V_2 such that E contains neither edges between vertices in V_1 nor edges between vertices in V_2. The graph G is called planar if it can be embedded into a (Euclidean) plane without any intersecting edges.
In this work--especially in Chapter 6--we will be using the following set of operations on graphs:
· Subgraph removal. Let V' ⊆ V be a subset of the vertices of G = (V, E). By G \ V' we will denote the subgraph that is induced in G by V \ V'.
· Vertex deletion. Let u be a vertex in a graph G = (V, E). By G \ {u}, we denote the graph that is induced in G by V \ {u}.
· Edge deletion. Let e be an edge in a graph G = (V, E). By G \ {e}, we denote the graph G' = (V, E') with E' = E \ {e}.
A vertex separator in a graph G = (V, E) is a set V' ⊆ V of vertices in G such that G \ V' is not connected. If |V'| = k, we call V' a vertex separator of order k. The definition of an edge separator E' ⊆ E of order k is analogous.
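To make the preceding notation concrete, the following Python sketch (added as an illustration; it is not part of the original thesis, and all class and function names are hypothetical) stores a simple undirected graph as adjacency sets and implements induced subgraphs, vertex and edge deletion, and a breadth-first 2-coloring test for bipartiteness, a notion that becomes central in the Graph Bipartization problems of Chapters 6 and 7.

```python
# Minimal sketch of the graph notation from this section (illustrative only).
from collections import deque

class Graph:
    """Simple undirected graph G = (V, E) stored as adjacency sets."""

    def __init__(self, vertices, edges):
        self.adj = {v: set() for v in vertices}
        for u, w in edges:                      # no self-loops, no multi-edges
            if u != w:
                self.adj[u].add(w)
                self.adj[w].add(u)

    def induced_subgraph(self, vertex_subset):
        """Subgraph induced by V' in G: keep exactly the edges inside V'."""
        vs = set(vertex_subset)
        edges = [(u, w) for u in vs for w in self.adj[u] if w in vs and u < w]
        return Graph(vs, edges)

    def delete_vertex(self, u):
        """G \\ {u}: the subgraph induced by V \\ {u}."""
        return self.induced_subgraph(set(self.adj) - {u})

    def delete_edge(self, u, w):
        """G \\ {e}: same vertex set, edge {u, w} removed."""
        edges = [(a, b) for a in self.adj for b in self.adj[a]
                 if a < b and {a, b} != {u, w}]
        return Graph(set(self.adj), edges)

    def is_bipartite(self):
        """BFS 2-coloring: True iff V splits into V_1, V_2 with no internal edges."""
        color = {}
        for start in self.adj:
            if start in color:
                continue
            color[start] = 0
            queue = deque([start])
            while queue:
                v = queue.popleft()
                for w in self.adj[v]:
                    if w not in color:
                        color[w] = 1 - color[v]
                        queue.append(w)
                    elif color[w] == color[v]:
                        return False
        return True

# A 4-cycle is bipartite; a triangle (odd cycle) is not.
g = Graph("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
print(g.is_bipartite())                                               # True
print(g.delete_edge("a", "b").is_bipartite())                         # True
print(Graph("abc", [("a", "b"), ("b", "c"), ("a", "c")]).is_bipartite())  # False
```

The 2-coloring test runs in time linear in the number of vertices and edges; deciding which vertices or edges to delete so that the test succeeds is the much harder bipartization task studied later in this work.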
3.2 Crash Course in Computational Complexity Theory
Generally speaking, "an algorithm is a procedure to accomplish a specific task. It is the
idea behind any computer program" [Skie98]. The first goal for any algorithm is to be
effective, i.e., to provide correct solutions to a given problem; however, an effective algorithm is of little use if it is not efficient, i.e., if it requires more resources (especially time) to solve a problem than can be provided. Computational complexity theory deals with the amount of resources--the two most important of which are time and memory[2]--required to solve a certain computational problem by an algorithm. A brief introduction to analyzing the time complexity of algorithms is given in [Skie98]; a very thorough treatment of complexity
theory may be found, e.g., in [Papa94]. This section will introduce some basic terminology
from computational complexity theory that will be used throughout this work.
3.2.1 Machine-Independent Analysis
Imagine that we are given an algorithm called A (in any programming language) that solves
a certain problem P when given an input I, called an instance of P. We would now like to
analyze the performance--especially concerning speed--of this algorithm. The most obvious
way to do this would be to run A on a lot of instances of P and measure the time it takes for A
[1] In order to distinguish trees from graphs more easily throughout this work, we will use the term "vertex" for general graphs and the synonymous "node" for vertices in trees.
[2] Since only the notion of time complexity is important for this work, we shall omit space complexity in the following introduction.
to complete its task each time. However, with this approach, we quickly run into a multitude
of problems, the most crucial of which are that the absolute time measured is influenced by the actual machine's computer architecture[3], that absolute time values are only useful for one particular type of machine, that we can seldom test the algorithm on all conceivable instances, and that a purely practical analysis provides no indication about an algorithm's maximal (worst-
case) running time.
Complexity theory tries to avoid these problems arising from a direct machine analysis by
analyzing computational problems in a more formal and mathematical way. This analysis
is machine-independent whilst still trying to incorporate the fundamental workings of a
modern computer. Traditionally, complexity theory relies on the Turing Machine as its
model of computation which is, however, a quite abstract model of computation lacking
any close relationship to modern computers.[4] For this work, we do not require the many special features offered by Turing Machines and shall therefore rely on another model of computation that is sufficiently precise for our analysis, far more intuitive and more closely related to a "real" computer than a Turing Machine.[5]
This model is the RAM model of
computation, which Skiena [Skie98] describes quite vividly as a computer
"where each `simple' operation (+,-,*,=,IF,call) takes exactly 1 time step [and]
loops and subroutines are not considered simple operations. Instead, they are
the composition of many single-step operations. . . . Each memory access takes
exactly one time step, and we have as much memory as we need."
Using this model, the computational time of an algorithm is given by simply counting the
number of time steps it takes the RAM machine to execute it on a given problem instance.
The advantage of the RAM model lies in the fact that it captures the essential behavior of
a modern computer without introducing any fickle parameters such as memory bandwidth,
actual processor speed, and memory access time, just to name a few. The next subsection
demonstrates the usage of this model in the analysis of an algorithm's time complexity.
This, however, requires a last step of formalization: Besides the machine model, the term "problem" needs to be specified.[6]
Computational problems may be stated in many ways, the most important of which--at least
in the context of this work--are decision problems (the output can be just either "Yes" or
"No") and optimization problems (the output is a solution which is minimal/maximal in
some respect). Most of computational complexity theory solely deals with decision problems
because almost any "reasonable" way of stating a problem can be transformed into a decision
problem. [7] Although a whole branch of theoretical computer science--computability--has evolved concerning the existence of decision problems that are undecidable (i.e., not algorithmically
[3] A modern computer's performance is, e.g., influenced by its processor, memory bandwidth, techniques such as pipelining and caching, the operating system, the programming language and its compiler, etc. Due to these many factors it is sometimes even difficult to obtain consistent results on a single, defined machine.
[4] There are many reasons why Turing Machines are nevertheless used in computational complexity theory: For example, requirements such as memory and time are very easy to define for a Turing Machine and can be analyzed with great accuracy. Furthermore, Turing Machines can simulate other machine models--one Turing Machine can, in theory, even simulate an infinite number of Turing Machines. The simulation of an algorithm on the RAM model (which will be introduced shortly) by a Turing Machine requires only polynomially more time than its direct execution on the RAM (the term "polynomially more time" will also be defined more precisely later on in this chapter).
[5] A quite thorough analysis of different machine models can be found in Chapter 2 of [Papa94].
[6] An algorithm is by definition already given in a formal fashion.
[7] E.g., instead of asking "What are all prime numbers between 0 and 280581?", we can solve the decision problems "Is 1 a prime?" ("No"), "Is 2 a prime?" ("Yes"), ..., "Is 280580 a prime?" ("No") separately.
solvable) by computers, we can assume for this work that we are always given decidable
decision problems.
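Mirroring the prime-number example from footnote 7, the following short Python sketch (added here for illustration; the function names and the trial-division test are our own choices) answers an enumeration question by solving one decision problem per candidate:

def is_prime(n):
    # Decision problem: "Is n a prime?" -- answered as a boolean.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def primes_between(lo, hi):
    # Answer "What are all primes between lo and hi?" by repeatedly
    # solving the decision problem "Is x a prime?" for every candidate x.
    return [x for x in range(lo, hi + 1) if is_prime(x)]

print(primes_between(0, 30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]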
Analyzing decision problems is closely related to the fact that in complexity theory, a problem is generally formulated as a language L, and solving the problem means asking (i.e., deciding by an algorithm) whether a given instance I is part of that language. The language L is a subset of Σ* and the instance I is a word in Σ* for an alphabet Σ, where Σ is a finite set of symbols and Σ* is the set of all words that may be generated by concatenation of symbols from Σ, including the empty word ε which contains no symbols at all. [8] Expressing a given problem as a language is--in most practical cases--quite straightforward. [9]
For this work, the computational complexity of solving a problem can be seen as equivalent to the complexity of answering the corresponding decidability question. It should be noted that stating a problem in the form of a language presents this particular problem in a very abstract form: neither an algorithm for solving the problem is given, nor any obvious hint about the time complexity of solving this problem. In order to deal with this, we will introduce the model of complexity classes and reductions later on in this chapter.
For the sake of simplifying the discussion in this work, we will refrain from stating problems in the form of a language. Instead, we will simply assume that the given problems may be stated in the form of a language. Furthermore, instead of asking whether an instance I is in L, we shall directly deal with the object x that I represents (such as a word, number, graph, etc.) [10]. We then call x an instance of the problem P = "I ∈ L?".
3.2.2 Running Time--Keeping Score
We discussed at the beginning of the last subsection that, given an algorithm for a problem P,
knowing how this algorithm performs on certain instances x of P is of little use. Rather,
in order to understand the quality of an algorithm, it is vital to know how it performs on
any conceivable instance x of P and express this performance in an intuitive way. This
is done by introducing three new ideas: analyzing how the running time of algorithms scales with the problem size, distinguishing between worst-, best-, and average-case complexity (with an emphasis on worst-case complexity), and analyzing the scaling of the algorithm in its asymptotic behavior.
The first idea is based on the intuitive observation that an algorithm should generally take longer to run as the presented instance becomes larger. For instance, a graph problem on a general [11] graph consisting of ten vertices should be easier to solve than the same problem for a general graph with a thousand vertices. Therefore, if x is an instance of a problem P, the running time t of an algorithm is expressed as a mathematical function f
[8] For example, if Σ = {0, 1}, then Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, ...}.
[9] E.g., the problem of deciding whether a given number is a prime number would have to be expressed as L_prime = {p ∈ {0, 1}* | p is the binary representation of a prime number} (with Σ = {0, 1}) and then asking "given I ∈ Σ*, is I ∈ L_prime?".
[10] We shall assume for this work that such a representation, i.e., the encoding of x as I, is always a valid one, meaning we do not need to worry about any cases where I does not encode a valid object x.
[11] By "general" we mean that the given graph has no special properties that greatly simplify the solution of the respective problem.
[Figure 3.1: A function f(x) and its bounds in O-notation: from left to right, g(x) is an upper bound (f(x) = O(g(x))), a lower bound (f(x) = Ω(g(x))), and a tight bound (f(x) = Θ(g(x))) on f(x). Note how the respective bounding property of g only needs to be true for all x > x₀.]
of n := |x|, the size of x: [12]

t_algorithm(x) = f(|x|) = f(n)

Given a fixed size n for the input instance x and an algorithm A that runs with x as an input, we distinguish between the best-case, average-case, and worst-case running time:

t_best(A, n) = min_{x : |x| = n} t_A(x),
t_avg(A, n) = ( Σ_{x : |x| = n} t_A(x) ) / |{x | |x| = n}|,
and t_worst(A, n) = max_{x : |x| = n} t_A(x).
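As a small computational illustration of these three quantities (added here; the toy algorithm and the function names are our own, not taken from the thesis), one can enumerate all instances of a fixed size n -- here, all words over {0, 1} of length n -- run a toy algorithm on each of them, and take the minimum, average, and maximum of the measured step counts:

from itertools import product

def scan_steps(word):
    # Toy algorithm: scan the word until the first 1 is found;
    # one time step is charged per inspected symbol.
    steps = 0
    for symbol in word:
        steps += 1
        if symbol == 1:
            break
    return steps

def best_avg_worst(n):
    # Compute t_best, t_avg, and t_worst over all 2^n words of length n.
    times = [scan_steps(word) for word in product((0, 1), repeat=n)]
    return min(times), sum(times) / len(times), max(times)

print(best_avg_worst(10))   # approximately (1, 2.0, 10)

Here the best and average case stay small, while the worst case (the all-zero word) grows linearly with n, illustrating how much the three measures can differ.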
Most of the time, only the worst-case complexity of an algorithm is interesting, since average-case and--especially--best-case complexity provide no guarantee whatsoever about the running time that A might have when presented with an arbitrary instance x. For average-case complexity, e.g., the problem lies in the definition of "average": there may be some problems which are rather easy to solve on many instances, but this is of no use if we should--consciously or not--be dealing just with hard instances during the application of the algorithm. [13] Albeit open to criticism for being too pessimistic, the worst-case complexity of an algorithm seems to be the most useful measure of its performance.
The RAM model of computation introduced in the last subsection provided a way to measure the running time of a given algorithm A exactly, down to a single time unit. This degree of accuracy is not useful: for such exact counts, the function f that measures the running time of A will often become very complicated and unintuitive to analyze. [14] Moreover, as we are interested in the performance of an algorithm on a real-world machine and not on the hypothetical RAM, measuring the running time of A down to the last time unit finds no application. Therefore,
[12] Later on, we will analyze algorithms in more detail using various parameters of the input. For example, the running time of a graph algorithm may depend on the number of edges as well as on the number of vertices in the graph. The number of edges in the graph is lower than |V|^2, but explicitly using the number of edges provides a better analysis. In order to simplify the discussion, however, we will for now assume that there is just a single input size parameter given.
[13] Moreover, the corresponding mathematical analysis of average-case complexity even for simple algorithms, as well as a clear definition of the "average case", is often highly involved and complicated.
[14] Moreover, there are often trivial steps depending on the notation of the algorithm (such as the initialization of variables) that require just a small constant amount of time and are thus not interesting for a general performance analysis.
computational complexity theory, instead of directly analyzing f, analyzes upper and lower bounds of f using the O-Notation (pronounced "Big Oh Notation"):
Definition 3.1 (O-Notation):
Given two functions f : ℝ_{>0} → ℝ_{>0} and g : ℝ_{>0} → ℝ_{>0}. We will say that
· f = O(g) if there exist a constant c ∈ ℝ and an x₀ ∈ ℝ such that for all x > x₀, f(x) ≤ c · g(x) (g is an upper bound for f).
· f = Ω(g) if there exist a constant c ∈ ℝ and an x₀ ∈ ℝ such that for all x > x₀, f(x) ≥ c · g(x) (g is a lower bound for f).
· f = Θ(g) if there exist two constants c₁, c₂ ∈ ℝ with c₁ ≤ c₂ and an x₀ ∈ ℝ such that for all x > x₀, c₁ · g(x) ≤ f(x) ≤ c₂ · g(x) (g is a tight bound for f).
This notation is illustrated by Figure 3.1. [15]
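The concrete function used in footnote 15, f(x) = x^4 + x^2 + x - ln x + 1234, can also be spot-checked numerically; the following small Python sketch (added here for illustration only) samples a few values x ≥ 6 and verifies the bounds x^3 ≤ f(x) ≤ 5 · x^4, which witness f(x) = Ω(x^3) and f(x) = O(x^4):

import math

def f(x):
    # The example function from footnote 15.
    return x**4 + x**2 + x - math.log(x) + 1234

# For a few sample values x >= 6, check 1 * x**3 <= f(x) <= 5 * x**4.
for x in (6, 10, 100, 1000):
    assert x**3 <= f(x) <= 5 * x**4
print("bounds hold for all sampled values of x")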
Analogously to preferring the worst-case complexity over the best- and average-case complexities when analyzing the performance of an algorithm, it is common practice to provide an upper bound for an algorithm's running time instead of a lower or tight one. [16]
To provide an example of how to determine the running time of an algorithm and express it in O-Notation, we will now analyze an algorithm for a well-known problem in computational complexity, called Vertex Cover.
Definition 3.2 (Vertex Cover)
Input: A graph G = (V, E) and a parameter k.
Question: Is it possible to choose a set V' ⊆ V with |V'| ≤ k such that every edge in E has at least one endpoint in V'?
In Figure 3.2, a graph and one of its vertex covers is given in order to illustrate this definition. A very trivial algorithm A_VCtrivial for this problem would be to simply try all possible solutions of size k and see whether one of these hypothetical solutions is indeed a vertex cover for the given graph:

Algorithm: Trivial algorithm A_VCtrivial for Vertex Cover
Input: A graph G = (V, E) with V = {v_1, ..., v_n} and a parameter k
Output: "Yes" if G has a vertex cover of size k, "No" otherwise
01 for every k-sized subset V' of V do
02     if V' is a vertex cover for G
03         return "Yes"
04 return "No"
[15] For a concrete example, consider the function f(x) = x^4 + x^2 + x - ln x + 1234. If x ≥ 6, we have
f(x) = x^4 + x^2 + x - ln x + 1234 < x^4 + x^2 + x - ln x + 6^4 < 5 · x^4 =: 5 · g(x),
f(x) = x^4 + x^2 + x - ln x + 1234 > x^4 > 1 · x^3 =: 1 · g(x), and
1 · g(x) := 1 · x^4 < f(x) = x^4 + x^2 + x - ln x + 1234 < 5 · x^4 = 5 · g(x),
and therefore f(x) = O(x^4), f(x) = Ω(x^3), and f(x) = Θ(x^4).
[16] This is mainly due to the fact that tight bounds on the running time of algorithms are often neither intuitive nor easy to grasp mathematically.
[Figure 3.2: A graph G = (V, E) and a vertex cover of size 17 (black vertices) for G. Note how for each edge in G, at least one of its endpoints is in the given vertex cover. The shown vertex cover for G is optimal in the sense that there is no vertex cover for G with fewer than 17 vertices (this was verified using a computer program).]
We will now analyze the running time of A_VCtrivial in terms of the number of vertices (|V|) and the number of edges (|E|) in G. [17] Let us start with lines 03 and 04: since both terminate A_VCtrivial, they are executed at most once and thus do not play a role in the asymptotic running time of A_VCtrivial. Line 02 can be executed by calling the following subroutine: iterate over all edges of G, and check for every edge whether at least one of its endpoints is in V'. If we have been clever enough and marked those vertices that are in V' during the execution of line 01, executing this line only requires O(|E|) running time. For the seemingly most difficult line to analyze, line 01, we make use of the machine-independence of our analysis by using an algorithm for generating subsets from the extensive available literature on algorithms (e.g., [CLRS01], [Knut97]). [18] In [Knut03], we can find an algorithm that generates all k-sized subsets V' of V in

O( (|V| choose k) ) = O( |V|! / (k! · (|V| - k)!) ) = O( (1/k!) · |V| · (|V| - 1) · · · (|V| - k + 1) ) = O(|V|^k)

(the product in the third expression consisting of k factors)
time on a RAM-like machine. For finding the total running time of A_VCtrivial, it is now sufficient to observe that line 01 causes line 02 to be executed once for each of the at most |V|^k subsets generated. Taking into account the time requirements of line 02, the total running
[17] A quick glance at the algorithm demonstrates the advantage of all the conventions we have introduced above. E.g., if we were not to use the O-Notation for the worst-case bound we are about to determine, we would explicitly have to look at the exact number of steps a RAM needs to generate a subset in line 01, to store G, to determine whether V' is a vertex cover of G, and so on.
[18] Note that the RAM models in the literature do not necessarily need to be defined precisely the way we have. E.g., in the RAM model used in [Knut97], some computing steps take more than one unit of time. However, this is not important for the performance of an algorithm in O-notation: assume, for example, that each simple computational operation would consume four time units instead of one on a machine RAM′ as opposed to our RAM model. If there is an algorithm that, e.g., requires O(n^3) time on the RAM, it would require O(4 · n^3) = O(n^3) time on the RAM′.
time for A_VCtrivial is therefore bounded by

O(|V|^k · |E|).

This upper bound is quite unsatisfactory for practical applications, for it implies an enormous worst-case running time even for small graphs and small k. [19]
Note that from the discussion so far, it is not clear whether this is due to a poorly designed algorithm that we have come up with or a result of some "inherent complexity" of Vertex Cover. The next subsection and the following section will demonstrate that actually both are true, that is, Vertex Cover is believed to be "hard to solve" (we will define this more precisely in the next subsection), but there are ways of "taming" this inherent complexity, as will be shown in Section 3.3.
3.2.3 Complexity Classes
In the previous subsection, we have given an algorithm to solve Vertex Cover that was quite impractical for large input graphs. However, it was not clear whether this problem is hard to solve in general or whether we just have not come up with a good algorithm. We would now like to know whether there is a better algorithm for Vertex Cover than the one presented, or--even better--know the fastest possible algorithm for Vertex Cover (i.e., the minimum time complexity of Vertex Cover). The first request is comparatively easy to satisfy: we just have to look for an algorithm with a better worst-case running time than the one presented. The latter, however, is a lot harder to deal with, because finding a lower bound for the time complexity of Vertex Cover requires considering every conceivable algorithm for Vertex Cover--even algorithms that have not yet been found. [20] However, there is another way to approach the problem of complexity bounds, using reductions and complexity classes.
Reductions will allow us to divide problems into different "classes of difficulty". The idea behind this is the following: although we may not know how hard an individual problem is, we can relate problems to each other so that we know they are "equally hard" to solve, meaning that if there is a fast algorithm for one problem, there must be one for the other problem, too. A collection of such related problems is called a complexity class (a more formal definition will follow shortly). Problems are grouped together in complexity classes by finding a computationally "cheap" [21] transformation between instances of one problem and instances of the other. Then, loosely speaking, if one of the two problems turns out to be easy to solve, we also know that the second problem is easy to solve, because we can apply the algorithm for the easy problem to transformed instances of the other one. In a more formal fashion:
Definition 3.3 (Polynomial Time Reduction):
Given two languages L₁ ⊆ Σ₁* and L₂ ⊆ Σ₂*. We call L₁ polynomial-time reducible to L₂ (designated L₁ ≤_poly L₂) if there is a function R from Σ₁* to Σ₂* that can be computed in polynomial time on any x ∈ Σ₁* and for which
x ∈ L₁ ⇔ R(x) ∈ L₂.
[19] For example, finding out if a graph with 75 vertices and 200 edges has a vertex cover of size 10 would require c · 10^21 steps on our RAM, where c is some constant ≥ 1 omitted in the O-notation.
[20] Except for a few very rare cases of problems (such as sorting), the question of lower complexity bounds therefore generally remains unanswered.
[21] In our context, this will imply a polynomial running time with respect to the original instance's size.
In complexity theory, there are many more specialized reductions, some of which we will get to know in Section 3.3, that are more "delicate" in the sense that they impose stricter requirements on R than just being computable in polynomial time. However, we shall work only with polynomial time reductions for the rest of this section.
The concept of polynomial time reduction may be used to build a hierarchy of computational problems. This hierarchy consists of classes C; in each class, we find those problems that are solvable using the resources allowed by the respective class. Furthermore, we introduce the concept of completeness to identify those problems in a class that are computationally as hard to solve as any other problem in that class. In this way, if a problem that is complete for a class C should prove to be "easy" to solve, we know the same to be true for all other problems in C.
Definition 3.4 (Complexity Class Hardness and Completeness):
Let C be a complexity class. A language L is called C-hard if all languages in C can be
reduced in polynomial time to L. We call L C-complete if L is C-hard and L is in C.
There is a vast number of complexity classes known today (see, for example, [Aaro03]), each
of them grouping together problems with various properties. Two of the first classes that
were developed and are of much interest for this work are P and NP.
Definition 3.5 (P and NP):
The complexity class P is the class of computational problems that can be solved in polynomial time on a deterministic Turing Machine.
The complexity class NP is the class of computational problems that can be solved in polynomial time on a nondeterministic Turing Machine.
Although our definition uses the term "polynomial" to describe all problems in NP, it is widely believed that all NP-complete problems are only solvable in exponential time. The reason for this is the computational model underlying the definition: a nondeterministic Turing Machine is a very unrealistic model of computation, being able to--vaguely speaking--correctly "guess" the solution to a problem and then only needing to verify its correctness (the process of checking must then be done in polynomial time). However, all computers known today are deterministic, and therefore they have to "emulate" the guessing steps of the nondeterministic Turing Machine in order to find a solution to an NP-complete problem. This emulation is done by simply checking all candidates for a possible solution, which takes--in terms of worst-case complexity--an exponential amount of time.
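To make the "guess and then verify in polynomial time" intuition concrete, the following sketch (added here for illustration; the function name and graph representation are our own) shows a polynomial-time verifier for Vertex Cover. Given a guessed solution as a certificate, checking it only requires one pass over the edges, whereas a deterministic machine that has to produce the certificate itself may need to try exponentially many candidates:

def verify_vertex_cover_certificate(edges, k, certificate):
    # Polynomial-time verifier: check whether the guessed vertex set
    # (the certificate) really is a vertex cover of size at most k.
    cover = set(certificate)
    if len(cover) > k:
        return False
    # One pass over all edges -- linear, and thus polynomial, in the input size.
    return all(u in cover or v in cover for (u, v) in edges)

# Usage: for the triangle graph, {"a", "b"} is a valid certificate for k = 2.
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(verify_vertex_cover_certificate(triangle, 2, {"a", "b"}))   # True
print(verify_vertex_cover_certificate(triangle, 1, {"a"}))        # False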
It must be stressed that no proof whatsoever has been given that NP-complete problems are at best solvable in exponential time. All we have stated is a plausibility argument: there are thousands of problems known to be NP-complete (even the rather outdated list of [GaJo79] contains hundreds of NP-complete problems); finding a polynomial time algorithm for just one NP-complete problem would show that all NP-complete problems are polynomially solvable, but this has not happened in spite of over 25 years of research so far. There are therefore
two important things to be remembered throughout this work:
· When saying that a problem is harder than another one, we are always referring to
relative complexity bounds, i.e., we are saying that if a "harder" problem should turn
out to be efficiently solvable, so will the easier problem (but not vice-versa).
· Sometimes we will use an argument such as ". . . --since this would imply that P = NP"
in our proofs, which is based on the unlikeliness of P = NP. It would, however, be
more correct--albeit unusual--to write "Unless P = NP, the following holds true:
. . . ".
The Vertex Cover problem posed at the beginning of the previous subsection has been proven to be NP-complete in [GaJo79], where it is shown that Vertex Cover is even NP-complete for planar graphs in which each vertex has a degree of at most 3 [22]. For a long time, an NP-completeness proof for a problem was taken as a synonym for "unsolvable already for moderate input sizes" (coining the term "intractable"). However, this is not true in general, as the next section demonstrates.
3.3 Fixed-Parameter Tractability (FPT)
We have seen in the previous section how algorithms can be analyzed machine-independently by observing how they scale with the size of their respective input. This size we named n. We have also seen the class NP, reasoning that problems complete for this class most probably have a worst-case running time that is exponential, i.e., an NP-complete problem can only be solved in

Ω(a^n)

time for some a > 1. Since this usually implies unreasonably high running times for large n, problems that are NP-hard are also referred to as being intractable. We have also stated in the last section--citing from [GaJo79]--that the problem Vertex Cover (see Definition 3.2) is NP-complete. The NP-completeness of Vertex Cover implies that it is most probably only solvable in O(a^n) time, where n is the size of the input instance and a is some constant. However, this definition provides us with two loopholes:
· We have made no statement about the size of a. A small a could lead to algorithms
that are fast enough even for a fairly large n.
· We have made no good use of the fact that--besides the size of the input graph--any
instance of Vertex Cover contains a parameter k that might be restricted to a small
value. What if we could restrict the exponential complexity of Vertex Cover to the
parameter k, which is given along with the input graph according to Definition 3.2?
It might seem at first that observing these two "loopholes" is just splitting hairs over an imprecise definition, but in this section, we shall use exactly these two observations to develop a more efficient algorithm for Vertex Cover than the trivial O(|V|^k · |E|) algorithm used as an example for complexity analysis in the last section. After that, a short introduction to parameterized complexity theory is given.
3.3.1 An Efficient Algorithm for Vertex Cover
At the end of this section, we will have improved the O(|V|^k · |E|) algorithm for Vertex Cover given in the previous section to an O(2^k · |E|) algorithm. The strategy for this will be quite straightforward.
[22] Recall the definitions of "planar" and "degree" from Section 3.1.
Recall Definition 3.2. If a given graph G has a vertex cover of size k, then we can choose k vertices in G such that every edge in G includes at least one vertex from the cover. This means that if we search for a vertex cover V' for G, we can simply pick any edge e = {v_a, v_b} in G and then, knowing that either v_a or v_b must be in the vertex cover, consider two distinct cases for V': either V' contains v_a, or V' contains v_b. For each of these cases, we would then look at the uncovered edges, pick one, and again consider the two cases for putting a vertex of that edge into V' (the common term for this is to branch into those two cases). This recursive algorithm leads to a tree-like structure searching for vertex covers of size k for G--depicted in Figure 3.3--that is commonly referred to as a search tree. Note that with each level down the search tree, we have one vertex less left to form a vertex cover for G. If we cannot find a vertex cover for G in the kth level of the search tree, then, as we have tried all possibilities for a vertex cover of G, G has no vertex cover of size k.
The described algorithm can be rewritten in a more formal fashion:

Algorithm: search tree algorithm A_tree for Vertex Cover
Input: A graph G = (V, E) with V = {v_1, ..., v_n} and a parameter k
Output: "Yes" if G has a vertex cover of size k, "No" otherwise
01 if G contains no edges then
02     return "Yes"
03 if G contains edges and k = 0 then
04     return "No"
05 pick an edge e = {v_a, v_b} from G
06 V_a ← V \ {v_a}
07 E_a ← E \ {e ∈ E | v_a is an endpoint of e}
08 if A_tree with G_a := (V_a, E_a) and k - 1 as inputs returns "Yes" then
09     return "Yes"
10 V_b ← V \ {v_b}
11 E_b ← E \ {e ∈ E | v_b is an endpoint of e}
12 if A_tree with G_b := (V_b, E_b) and k - 1 as inputs returns "Yes" then
13     return "Yes"
14 return "No"
This algorithm is illustrated in Figure 3.3. So what is the running time of this algorithm? Without the recursion (i.e., calling A_tree as a subprocedure), A_tree would require O(|E|) time to generate G_a and G_b, since each edge must be looked at to see if it is adjacent to v_a or v_b, respectively (we shall assume that deleting a vertex from G takes constant time). The algorithm A_tree calls A_tree (with a different input; in particular, k is decreased by one) at most two times. Each of those calls again calls A_tree at most two times, and so on, until the algorithm is called with k = 0. This means that, in a worst-case analysis, A_tree is called
2 · 2 · 2 · · · 2 · 1 = 2^k

(one factor of 2 for each level of the search tree, from the initial parameter value k down to k - k + 1 = 1, and a final factor of 1 at level k - k = 0)
times. Each call itself takes--as mentioned above--O(|E|) time, which means that in total, A_tree requires at most

O(2^k · |E|)

time to solve a given instance (G, k) of Vertex Cover, a fairly large improvement compared to the trivial algorithm proposed in the last section; moreover, the exponential part in the running time of A_tree is independent of the size of G. This makes Vertex Cover a
[Figure 3.3: The search tree for finding a vertex cover of size k for a given graph: Given the graph G and a parameter k > 3, the figure demonstrates how a search tree algorithm would try to find a vertex cover of size k for G. In each node of the search tree, an edge e = {u, v} that is not yet covered is chosen from G (designated by coloring the adjacent vertices of e grey) and the algorithm branches into two cases: either u is in the vertex cover, or v. The respective vertex from G is chosen into the vertex cover (and can then, along with its adjacent edges, which are now covered, be removed from G), k is decreased, and the algorithm proceeds if no vertex cover for G has been found yet and we have not yet chosen k vertices into the cover.]
fixed-parameter tractable problem because, as long as k is constant, the time required to solve Vertex Cover on (G, k) using A_tree is polynomial (for Vertex Cover, even linear) with respect to the size of the input graph G.
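The same bound can also be read off from a simple recurrence (a brief sketch added here for illustration, not part of the original analysis): if T(k) denotes the worst-case running time of A_tree on a graph with edge set E called with parameter k, and c > 0 is a constant standing in for the O(|E|)-work per call, then

$$T(0) \le c \cdot |E|, \qquad T(k) \le 2\,T(k-1) + c \cdot |E| \quad \text{for } k \ge 1,$$

which unfolds to

$$T(k) \le c \cdot |E| \cdot (2^0 + 2^1 + \cdots + 2^k) = c \cdot |E| \cdot (2^{k+1} - 1) = O(2^k \cdot |E|).$$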
Note that A_tree is not the optimal fixed-parameter algorithm for Vertex Cover known today. In [CKJ01] and [NiRo03b], O(1.29^k)-algorithms for solving Vertex Cover are given.
This is done by optimizing the search tree: instead of branching into two subcases and decreasing k by one each time the algorithm is called recursively, the algorithm may branch into more complex cases, allowing it to decrease k by more than 1 in some branches of the tree. Using the mathematical tool of recurrence analysis, it can be determined how these more complex cases decrease the base of the exponent in the algorithm's running time. It should furthermore be noted that the algorithm uses the technique of problem kernel reduction [23] on Vertex
[23] Kernel reductions are based on the idea that, using the parameter k, we can already decide for some parts of the input instance how they will add to the solution of the problem. A problem kernel reduction causes the input instance to be smaller than f(k) for some function f whilst being computable in polynomial time.