Computational genomics

Central dogma

DNA is the genetic code
DNA is converted, via transcription, to messenger RNA (mRNA)
mRNA is converted to proteins
Proteins are the things that do the work
Proteins are comprised of amino acids
In mRNA, three letters are a codon, and they encode for one amino acid
mRNA is read sequentially, one codon at a time
There is a step called reverse transcription which is done by viruses
mRNA to amino acid code is quite standard
DNA has an alphabet ACGT
mRNA has ACGU
DNA has a direction. Top strand goes left to right, bottom strand goes from right to left. The reverse compliment
DNA transcribed to mRNA
$A \to U$
$C \to G$
$G \to C$
$T \to A$
Coding strand and template strand: read from the template strand, but matches the coding strand (apart from $T$ being replaced by $U$)

What does it mean to sequence a genome?

Human genome is 3.1 billion bp, that is 3.1 Gbp
Bacteria genome is 2 million bp, that is 2 Mbp
Sequence the genome means to understand the order of $\{A, C, G, T\}$
In this course we focus on bacterial genomes because they are smaller
- You can analyse the data on a laptop
- For the human genome you’d need more compute

What is genome sequencing

Bacteria has a circular genome
Maybe a few smaller circles called plasmids
Vibrio cholera (among others) has two chromosomes in its genome
- The difference between a chromosome and a plasmid is that a plasmid can be removed and the cell will still function
The genome is all the chromosomes and plasmids (for a particular cell)
In genome sequencing we want to understand the whole genetic make-up of a cell
Best sequencing technologies allow us to sequence kbp, but most only do hbp (h = “hecto”, 10E2)
Could go around the circle doing fragments
Instead, take random fragments and do it many times
Goal is to take fragments and convert them into one piece of DNA
What coverage? For each position, how many times do we need to see it to think that we’ve seen the whole distribution
You can use the Poisson distribution, and estimate that you need around 10x coverage to recover the whole thing
- Typically aim for 10x - 100x coverage, where the top end is driven by the amount of sequencing you can get off of the machine [TODO: Can I confirm this with a simple simulation?]
- For longer reads, you don’t need as much coverage

Steps in genome sequencing

Steps
1. Grow some bacteria
2. Extract DNA
3. Library prep
4. Sequencer
5. fastq-format DNA sequences
Next
1. Assemble fragments into contigs
2. Identify open reading frames (ORFs)
3. Identify tRNA genes: transfer RNA
4. Identify rRNA: ribosomal RNA
5. Functionally annotate the genome
6. Metabolic modelling

Sanger sequencing

1970s Fred Sanger (two Noble prize winner for two different discoveries) invented dideoxychain termination
Adding a ddNTP would block adding dNTP
1970 - 2002
Advantages
- Long reads 1kbp
- High accuracy
Disadvantages
- How throughput (394 sequences at once)
Sanger sequencing is still in use if you’re just interested in a short piece
Costs about $3

454 sequencing

454 pyrosequencing
DNA molecule, as you add new base, you release some inorganic phosphate
There is an enzyme called luciferase when when it finds phosphate releases some light
They use pico titre plates which have a small well 25$\mu$m across

Introduction to Illumina sequencing

PCR reaction to amplify DNA in local regions
Fluorescent dNTPs (N means any)
MiSeq: 1Gb ($2-5,0000)
HiSeq: 10Gb (?)

Illumina paired end sequencing

300bp sequences
Add adapters to each end…

… skipped a few

Definitions

Open reading frame (ORF): a stretch of DNA that can be translated without a stop codon
Coding sequences (CDS): e.g. protein encoding gene (PEG); regions of the genome which are actively transcribed and or translated
Hypothetical / putative proteins: has not been experimentally proven (trying to move away from this)
Polypeptide: a stretch of 20-100 consecutive amino acids

An overview of DNA sequence assembly

Most genomes are around 2Mb
Sequencing technology: 100 - 50kb
How to go from short base pair fragments to longer fragments
Reads to contigs to scaffolds

Naive DNA sequence assembly

Naive approach, greedy approach, overlap layout consensus, de Bruign graphs
Sometimes we have an error in the base
Fred scores

#' If the distribution of sequences is random...
bases <- 4
possible_kmers <- function(k) bases^k

possible_kmers(3)

## [1] 64

possible_kmers(8)

## [1] 65536

possible_kmers(20)

## [1] 1.099512e+12

The distribution of bases is NOT random!

Greedy assembly of DNA sequences

Choose one sequence at random
Match it to every other sequence, and keep extending until nothing matches any more and you get a contig
Then pick another sequence at random and keep extending that and so on
It’s a standard greedy type of algorithm
Does not work very well for (microbial) genomes

Overlap layout consensus sequence assembly approaches

All read vs all read mapping (with short k-mers)
Layout to overlap graph
Resolve inconsistencies wit a multiple sequence alignment approach
newbler (Roche)
Doesn’t scale well to the size of datasets we have now

de Brujin graph

Most modern approaches focus on this
https://en.wikipedia.org/wiki/De_Bruijn_graph
Directed graph representing overlaps between sequences of symbols
It has $m^n$ vertices where $m$ is the number of symbols and $n$ is the subsequence length

all_kmers <- function(seq, k) {
  substring(seq, 1:(nchar(seq) - k + 1), k:nchar(seq))
}

kmers1 <- all_kmers("AACCGGTTA", 4)
kmers2 <- all_kmers("GGTTATAC", 4)

intersect(kmers1, kmers2)

## [1] "GGTT" "GTTA"

spades uses a combination of kmers from $k = 31 \to 127$

Why does paired end sequencing help assembly?

Challenge with repeat regions, especially if that region is longer than your sequence size
Some people in comments say he’s talking about “mate pairs” not “paired ends”

BLAST

Basic Local Alignment Search Tool
Rapidly search through a database to match query sequence to known sequence
Scales with length of query sequence and number of elements of database
Use heuristics to make the search go faster

string_matches <- function(string1, string2) {
  strsplit(string1, "")[[1]] == strsplit(string2, "")[[1]]
}

string_matches("CGACTAGATC", "GCTCTAGAGG")

##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

string_matches("CGACTAGATC", "CCAGTTGTTA")

##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Both have 5/10 TRUE but in the first case only two changes need to be made to switch the sequence whereas five need to be changed in the second case

… skipped more

k-mers

k-mer is a string of length $k$
Used to make de Bruijn graph
Error correction on HTS
Look at underrepresented k-mers, which probably represent errors
Digital normalisation

kmer_output <- all_kmers("ATTGACATTAGAT", 3)
table(kmer_output)

## kmer_output
## ACA AGA ATT CAT GAC GAT TAG TGA TTA TTG 
##   1   1   2   1   1   1   1   1   1   1

How to count the k-mers in an efficient way?
Use multithreading
Locking doesn’t work, every thread waits for the lock
Compare and swap
- Guaranteed to be atomic
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3051319/

compare_and_swap <- function(location, old_value, new_value) {
 current_value <- read(location)
 if(current_value == old_value) location <- new_value
 return(current_value)
}

Bloom filters

Data structure that guarantee that if you see something it’s reported as being there
Do not guarantee that if report something as being true that it really is
In other words, zero false negatives, but maybe some false positives

hashfn0
hashfn1
hashfn2
hashfn3

hashfni for i = 0:3 independent and return different values
- e.g. kmer1 outputs c(5, 12, 7, 3)

Computational genomics

Adam Howes

Central dogma

What does it mean to sequence a genome?

What is genome sequencing

Steps in genome sequencing

Sanger sequencing

454 sequencing

Introduction to Illumina sequencing

Illumina paired end sequencing

Definitions

An overview of DNA sequence assembly

Naive DNA sequence assembly

Greedy assembly of DNA sequences

Overlap layout consensus sequence assembly approaches

de Brujin graph

Why does paired end sequencing help assembly?

More assembly problems

BLAST

k-mers

Bloom filters