Central dogma
- DNA is the genetic code
- DNA is converted, via transcription, to messenger RNA (mRNA)
- mRNA is converted to proteins
- Proteins are the things that do the work
- Proteins are comprised of amino acids
- In mRNA, three letters are a codon, and they encode for one amino
acid
- mRNA is read sequentially, one codon at a time
- There is a step called reverse transcription which is done by
viruses
- mRNA to amino acid code is quite standard
- DNA has an alphabet ACGT
- mRNA has ACGU
- DNA has a direction. Top strand goes left to right, bottom strand
goes from right to left. The reverse compliment
- DNA transcribed to mRNA
- \(A \to U\)
- \(C \to G\)
- \(G \to C\)
- \(T \to A\)
- Coding strand and template strand: read from the template strand,
but matches the coding strand (apart from \(T\) being replaced by \(U\))
What does it mean to sequence a genome?
- Human genome is 3.1 billion bp, that is 3.1 Gbp
- Bacteria genome is 2 million bp, that is 2 Mbp
- Sequence the genome means to understand the order of \(\{A, C, G, T\}\)
- In this course we focus on bacterial genomes because they are
smaller
- You can analyse the data on a laptop
- For the human genome you’d need more compute
What is genome sequencing
- Bacteria has a circular genome
- Maybe a few smaller circles called plasmids
- Vibrio cholera (among others) has two chromosomes in its genome
- The difference between a chromosome and a plasmid is that a plasmid
can be removed and the cell will still function
- The genome is all the chromosomes and plasmids (for a particular
cell)
- In genome sequencing we want to understand the whole genetic make-up
of a cell
- Best sequencing technologies allow us to sequence kbp, but most only
do hbp (h = “hecto”, 10E2)
- Could go around the circle doing fragments
- Instead, take random fragments and do it many times
- Goal is to take fragments and convert them into one piece of
DNA
- What coverage? For each position, how many times do we need to see
it to think that we’ve seen the whole distribution
- You can use the Poisson distribution, and estimate that you need
around 10x coverage to recover the whole thing
- Typically aim for 10x - 100x coverage, where the top end is driven
by the amount of sequencing you can get off of the machine [TODO: Can I
confirm this with a simple simulation?]
- For longer reads, you don’t need as much coverage
Steps in genome sequencing
- Steps
- Grow some bacteria
- Extract DNA
- Library prep
- Sequencer
- fastq-format DNA sequences
- Next
- Assemble fragments into contigs
- Identify open reading frames (ORFs)
- Identify tRNA genes: transfer RNA
- Identify rRNA: ribosomal RNA
- Functionally annotate the genome
- Metabolic modelling
Sanger sequencing
- 1970s Fred Sanger (two Noble prize winner for two different
discoveries) invented dideoxychain termination
- Adding a ddNTP would block adding dNTP
- 1970 - 2002
- Advantages
- Long reads 1kbp
- High accuracy
- Disadvantages
- How throughput (394 sequences at once)
- Sanger sequencing is still in use if you’re just interested in a
short piece
- Costs about $3
454 sequencing
- 454 pyrosequencing
- DNA molecule, as you add new base, you release some inorganic
phosphate
- There is an enzyme called luciferase when when it finds phosphate
releases some light
- They use pico titre plates which have a small well 25\(\mu\)m across
Introduction to Illumina sequencing
- PCR reaction to amplify DNA in local regions
- Fluorescent dNTPs (N means any)
- MiSeq: 1Gb ($2-5,0000)
- HiSeq: 10Gb (?)
Illumina paired end sequencing
- 300bp sequences
- Add adapters to each end…
… skipped a few
Definitions
- Open reading frame (ORF): a stretch of DNA that can be translated
without a stop codon
- Coding sequences (CDS): e.g. protein encoding gene (PEG); regions of
the genome which are actively transcribed and or translated
- Hypothetical / putative proteins: has not been experimentally proven
(trying to move away from this)
- Polypeptide: a stretch of 20-100 consecutive amino acids
An overview of DNA sequence assembly
- Most genomes are around 2Mb
- Sequencing technology: 100 - 50kb
- How to go from short base pair fragments to longer fragments
- Reads to contigs to scaffolds
Naive DNA sequence assembly
- Naive approach, greedy approach, overlap layout consensus, de Bruign
graphs
- Sometimes we have an error in the base
- Fred scores
#' If the distribution of sequences is random...
bases <- 4
possible_kmers <- function(k) bases^k
possible_kmers(3)
## [1] 64
possible_kmers(8)
## [1] 65536
possible_kmers(20)
## [1] 1.099512e+12
- The distribution of bases is NOT random!
Greedy assembly of DNA sequences
- Choose one sequence at random
- Match it to every other sequence, and keep extending until nothing
matches any more and you get a contig
- Then pick another sequence at random and keep extending that and so
on
- It’s a standard greedy type of algorithm
- Does not work very well for (microbial) genomes
Overlap layout consensus sequence assembly approaches
- All read vs all read mapping (with short k-mers)
- Layout to overlap graph
- Resolve inconsistencies wit a multiple sequence alignment
approach
- newbler (Roche)
- Doesn’t scale well to the size of datasets we have now
de Brujin graph
- Most modern approaches focus on this
- https://en.wikipedia.org/wiki/De_Bruijn_graph
- Directed graph representing overlaps between sequences of
symbols
- It has \(m^n\) vertices where \(m\) is the number of symbols and \(n\) is the subsequence length
all_kmers <- function(seq, k) {
substring(seq, 1:(nchar(seq) - k + 1), k:nchar(seq))
}
kmers1 <- all_kmers("AACCGGTTA", 4)
kmers2 <- all_kmers("GGTTATAC", 4)
intersect(kmers1, kmers2)
## [1] "GGTT" "GTTA"
- spades uses a combination of kmers from \(k = 31 \to 127\)
Why does paired end sequencing help assembly?
- Challenge with repeat regions, especially if that region is longer
than your sequence size
- Some people in comments say he’s talking about “mate pairs” not
“paired ends”
More assembly problems
- Sequences are not random
- Genomes are riddled with repeat sequences
- 16S gene is around 1.6kb, and is repeated often in some
organisms
- The longer read technologies help with this
BLAST
- Basic Local Alignment Search Tool
- Rapidly search through a database to match query sequence to known
sequence
- Scales with length of query sequence and number of elements of
database
- Use heuristics to make the search go faster
string_matches <- function(string1, string2) {
strsplit(string1, "")[[1]] == strsplit(string2, "")[[1]]
}
string_matches("CGACTAGATC", "GCTCTAGAGG")
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
string_matches("CGACTAGATC", "CCAGTTGTTA")
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
- Both have 5/10
TRUE
but in the first case only two
changes need to be made to switch the sequence whereas five need to be
changed in the second case
… skipped more
k-mers
- k-mer is a string of length \(k\)
- Used to make de Bruijn graph
- Error correction on HTS
- Look at underrepresented k-mers, which probably represent
errors
- Digital normalisation
kmer_output <- all_kmers("ATTGACATTAGAT", 3)
table(kmer_output)
## kmer_output
## ACA AGA ATT CAT GAC GAT TAG TGA TTA TTG
## 1 1 2 1 1 1 1 1 1 1
- How to count the k-mers in an efficient way?
- Use multithreading
- Locking doesn’t work, every thread waits for the lock
- Compare and swap
compare_and_swap <- function(location, old_value, new_value) {
current_value <- read(location)
if(current_value == old_value) location <- new_value
return(current_value)
}
Bloom filters
- Data structure that guarantee that if you see something it’s
reported as being there
- Do not guarantee that if report something as being true that it
really is
- In other words, zero false negatives, but maybe some false
positives
hashfn0
hashfn1
hashfn2
hashfn3
hashfni
for i = 0:3
independent and return
different values
- e.g.
kmer1
outputs c(5, 12, 7, 3)