Central dogma

  • DNA is the genetic code
  • DNA is converted, via transcription, to messenger RNA (mRNA)
  • mRNA is converted to proteins
  • Proteins are the things that do the work
  • Proteins are comprised of amino acids
  • In mRNA, three letters are a codon, and they encode for one amino acid
  • mRNA is read sequentially, one codon at a time
  • There is a step called reverse transcription which is done by viruses
  • mRNA to amino acid code is quite standard
  • DNA has an alphabet ACGT
  • mRNA has ACGU
  • DNA has a direction. Top strand goes left to right, bottom strand goes from right to left. The reverse compliment
  • DNA transcribed to mRNA
  • \(A \to U\)
  • \(C \to G\)
  • \(G \to C\)
  • \(T \to A\)
  • Coding strand and template strand: read from the template strand, but matches the coding strand (apart from \(T\) being replaced by \(U\))

What does it mean to sequence a genome?

  • Human genome is 3.1 billion bp, that is 3.1 Gbp
  • Bacteria genome is 2 million bp, that is 2 Mbp
  • Sequence the genome means to understand the order of \(\{A, C, G, T\}\)
  • In this course we focus on bacterial genomes because they are smaller
    • You can analyse the data on a laptop
    • For the human genome you’d need more compute

What is genome sequencing

  • Bacteria has a circular genome
  • Maybe a few smaller circles called plasmids
  • Vibrio cholera (among others) has two chromosomes in its genome
    • The difference between a chromosome and a plasmid is that a plasmid can be removed and the cell will still function
  • The genome is all the chromosomes and plasmids (for a particular cell)
  • In genome sequencing we want to understand the whole genetic make-up of a cell
  • Best sequencing technologies allow us to sequence kbp, but most only do hbp (h = “hecto”, 10E2)
  • Could go around the circle doing fragments
  • Instead, take random fragments and do it many times
  • Goal is to take fragments and convert them into one piece of DNA
  • What coverage? For each position, how many times do we need to see it to think that we’ve seen the whole distribution
  • You can use the Poisson distribution, and estimate that you need around 10x coverage to recover the whole thing
    • Typically aim for 10x - 100x coverage, where the top end is driven by the amount of sequencing you can get off of the machine [TODO: Can I confirm this with a simple simulation?]
    • For longer reads, you don’t need as much coverage

Steps in genome sequencing

  • Steps
    1. Grow some bacteria
    2. Extract DNA
    3. Library prep
    4. Sequencer
    5. fastq-format DNA sequences
  • Next
    1. Assemble fragments into contigs
    2. Identify open reading frames (ORFs)
    3. Identify tRNA genes: transfer RNA
    4. Identify rRNA: ribosomal RNA
    5. Functionally annotate the genome
    6. Metabolic modelling

Sanger sequencing

  • 1970s Fred Sanger (two Noble prize winner for two different discoveries) invented dideoxychain termination
  • Adding a ddNTP would block adding dNTP
  • 1970 - 2002
  • Advantages
    • Long reads 1kbp
    • High accuracy
  • Disadvantages
    • How throughput (394 sequences at once)
  • Sanger sequencing is still in use if you’re just interested in a short piece
  • Costs about $3

454 sequencing

  • 454 pyrosequencing
  • DNA molecule, as you add new base, you release some inorganic phosphate
  • There is an enzyme called luciferase when when it finds phosphate releases some light
  • They use pico titre plates which have a small well 25\(\mu\)m across

Introduction to Illumina sequencing

  • PCR reaction to amplify DNA in local regions
  • Fluorescent dNTPs (N means any)
  • MiSeq: 1Gb ($2-5,0000)
  • HiSeq: 10Gb (?)

Illumina paired end sequencing

  • 300bp sequences
  • Add adapters to each end…

… skipped a few

Definitions

  • Open reading frame (ORF): a stretch of DNA that can be translated without a stop codon
  • Coding sequences (CDS): e.g. protein encoding gene (PEG); regions of the genome which are actively transcribed and or translated
  • Hypothetical / putative proteins: has not been experimentally proven (trying to move away from this)
  • Polypeptide: a stretch of 20-100 consecutive amino acids

An overview of DNA sequence assembly

  • Most genomes are around 2Mb
  • Sequencing technology: 100 - 50kb
  • How to go from short base pair fragments to longer fragments
  • Reads to contigs to scaffolds

Naive DNA sequence assembly

  • Naive approach, greedy approach, overlap layout consensus, de Bruign graphs
  • Sometimes we have an error in the base
  • Fred scores
#' If the distribution of sequences is random...
bases <- 4
possible_kmers <- function(k) bases^k

possible_kmers(3)
## [1] 64
possible_kmers(8)
## [1] 65536
possible_kmers(20)
## [1] 1.099512e+12
  • The distribution of bases is NOT random!

Greedy assembly of DNA sequences

  • Choose one sequence at random
  • Match it to every other sequence, and keep extending until nothing matches any more and you get a contig
  • Then pick another sequence at random and keep extending that and so on
  • It’s a standard greedy type of algorithm
  • Does not work very well for (microbial) genomes

Overlap layout consensus sequence assembly approaches

  • All read vs all read mapping (with short k-mers)
  • Layout to overlap graph
  • Resolve inconsistencies wit a multiple sequence alignment approach
  • newbler (Roche)
  • Doesn’t scale well to the size of datasets we have now

de Brujin graph

  • Most modern approaches focus on this
  • https://en.wikipedia.org/wiki/De_Bruijn_graph
  • Directed graph representing overlaps between sequences of symbols
  • It has \(m^n\) vertices where \(m\) is the number of symbols and \(n\) is the subsequence length
all_kmers <- function(seq, k) {
  substring(seq, 1:(nchar(seq) - k + 1), k:nchar(seq))
}

kmers1 <- all_kmers("AACCGGTTA", 4)
kmers2 <- all_kmers("GGTTATAC", 4)

intersect(kmers1, kmers2)
## [1] "GGTT" "GTTA"
  • spades uses a combination of kmers from \(k = 31 \to 127\)

Why does paired end sequencing help assembly?

  • Challenge with repeat regions, especially if that region is longer than your sequence size
  • Some people in comments say he’s talking about “mate pairs” not “paired ends”

More assembly problems

  • Sequences are not random
  • Genomes are riddled with repeat sequences
  • 16S gene is around 1.6kb, and is repeated often in some organisms
  • The longer read technologies help with this

BLAST

  • Basic Local Alignment Search Tool
  • Rapidly search through a database to match query sequence to known sequence
  • Scales with length of query sequence and number of elements of database
  • Use heuristics to make the search go faster
string_matches <- function(string1, string2) {
  strsplit(string1, "")[[1]] == strsplit(string2, "")[[1]]
}

string_matches("CGACTAGATC", "GCTCTAGAGG")
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
string_matches("CGACTAGATC", "CCAGTTGTTA")
##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
  • Both have 5/10 TRUE but in the first case only two changes need to be made to switch the sequence whereas five need to be changed in the second case

… skipped more

k-mers

  • k-mer is a string of length \(k\)
  • Used to make de Bruijn graph
  • Error correction on HTS
  • Look at underrepresented k-mers, which probably represent errors
  • Digital normalisation
kmer_output <- all_kmers("ATTGACATTAGAT", 3)
table(kmer_output)
## kmer_output
## ACA AGA ATT CAT GAC GAT TAG TGA TTA TTG 
##   1   1   2   1   1   1   1   1   1   1
compare_and_swap <- function(location, old_value, new_value) {
 current_value <- read(location)
 if(current_value == old_value) location <- new_value
 return(current_value)
}

Bloom filters

  • Data structure that guarantee that if you see something it’s reported as being there
  • Do not guarantee that if report something as being true that it really is
  • In other words, zero false negatives, but maybe some false positives
hashfn0
hashfn1
hashfn2
hashfn3
  • hashfni for i = 0:3 independent and return different values
    • e.g. kmer1 outputs c(5, 12, 7, 3)