Tuesday, April 26, 2011

enrich templates - capture method - hybrid selection method

http://www.ncbi.nlm.nih.gov/pubmed/19182786

Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that approximately 60% of target bases in the exonic 'catch', and approximately 80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.




Figure 1. Overview of hybrid selection method. Illustrated are steps involved in the preparation of a complex pool of biotinylated RNA capture probes (“bait”; top left), whole-genome fragment input library (“pond”; top right) and hybrid-selected enriched output library (“catch”; bottom). Two sequencing targets and their respective baits are shown in red and blue. Thin and thick lines represent single and double strands, respectively. Universal adapter sequences are grey. The excess of single-stranded non-self-complementary RNA (wavy lines) drives the hybridization. See main text and Methods for details.

genome sequencing using reversible terminator chemistry (Illumina)

http://www.nature.com/nature/journal/v456/n7218/full/nature07517.html




a, DNA fragments are generated, for example, by random shearing and joined to a pair of oligonucleotides in a forked adaptor configuration. The ligated products are amplified using two oligonucleotide primers, resulting in double-stranded blunt-ended material with a different adaptor sequence on either end.

b, Formation of clonal single-molecule array. DNA fragments prepared as in a are denatured and single strands are annealed to complementary oligonucleotides on the flow-cell surface (hatched). A new strand (dotted) is copied from the original strand in an extension reaction that is primed from the 3' end of the surface-bound oligonucleotide; the original strand is then removed by denaturation. The adaptor sequence at the 3' end of each copied strand is annealed to a new surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand (dotted). Multiple cycles of annealing, extension and denaturation in isothermal conditions result in growth of clusters, each ~1 μm in physical diameter. This follows the basic method outlined in ref. 33.

c, The DNA in each cluster is linearized by cleavage within one adaptor sequence (gap marked by an asterisk) and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1; the sequencing product is dotted). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesized (shown dotted), and the opposite strand is then cleaved (gap marked by an asterisk) to provide the template for the second read (read 2).

d, Long-range paired-end sample preparation. To sequence the ends of a long (for example, >1 kb) DNA fragment, the ends of each fragment are tagged by incorporation of biotinylated (B) nucleotide and then circularized, forming a junction between the two ends. Circularized DNA is randomly fragmented and the biotinylated junction fragments are recovered and used as starting material in the standard sample preparation procedure illustrated in a. The orientation of the sequence reads relative to the DNA fragment is shown (magenta arrows). When aligned to the reference sequence, these reads are oriented with their 5' ends towards each other (in contrast to the short insert paired reads produced as shown in a–c). See Supplementary Fig. 17a for examples of both. Turquoise and blue lines represent oligonucleotides and red lines represent genomic DNA. All surface-bound oligonucleotides are attached to the flow cell by their 5' ends. Dotted lines indicate newly synthesized strands during cluster formation or sequencing. (See Supplementary Methods for details.)

pooled sequencing

http://www.nature.com/ng/journal/v42/n10/full/ng.659.html

> When you did your PCR amplifications, was this one exon per well? 

Yes -- each of the ~900 amplifications was performed separately in order to avoid non-specific amplification. Then each well was separately (robotically) quantified & diluted (3 times) before being combined -- to ensure very even representation from each exon. If there's uneven representation between exons, you'll get very different seq coverage per exon and lose power to detect variants. 
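The point about evenness can be made concrete with a small simulation. This is only an illustrative sketch, not the Broad protocol: amplicon abundances are drawn from a lognormal with a chosen coefficient of variation (a tightly quantified/diluted pool vs a sloppy one), the read total is a made-up lane yield, and reads are assumed to distribute in proportion to abundance.

```python
import math
import random

random.seed(0)

N_EXONS = 900            # roughly the number of separate amplifications above
TOTAL_READS = 2_000_000  # illustrative lane yield, not from the paper

def simulate_coverage(cv):
    """Draw relative amplicon abundances with coefficient of variation `cv`
    (lognormal, mean 1), then distribute reads proportionally."""
    sigma = math.sqrt(math.log(1 + cv**2))
    weights = [random.lognormvariate(-sigma**2 / 2, sigma) for _ in range(N_EXONS)]
    total = sum(weights)
    return [TOTAL_READS * w / total for w in weights]

for cv in (0.05, 0.5):  # careful vs sloppy quantification & dilution
    cov = simulate_coverage(cv)
    mean = sum(cov) / len(cov)
    frac_low = sum(c < 0.5 * mean for c in cov) / len(cov)
    print(f"CV={cv}: {frac_low:.1%} of exons below half the mean coverage")
```

With tight pooling essentially every exon sits near the mean; with a 50% CV a substantial fraction of exons fall below half the mean, which is exactly where variant-detection power is lost.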

> That is, the genomic DNA was pooled but the PCR wasn’t multiplexed, is that correct? Is there a reason not to just design all the amplicons ~ 250 bp rather than concatenating and shearing (I realize that some exons would require multiple amplicons, etc.)? 

The current Pooled Sequencing protocol at Broad no longer does concat-and-shear. This easier protocol seems to work well -- but needs to be carefully designed with the length of reads in mind. The reason concat-and-shear is better (although more laborious) is that you get even coverage across the entire region. Without concat & shear, you get coverage bumps at each end and dips in the middle of each exon (depending on fragment size & read length). However, with concat-and-shear it's annoying on the bioinformatic side to deal with reads that contain the NotI site (these create many false SNP calls). 
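The end-bump/middle-dip pattern without concat-and-shear follows directly from read geometry: reads start only at the amplicon ends, so any base further than a read length from either end gets no coverage. A toy model (my sketch, not from the protocol) with one forward and one reverse read per amplicon copy:

```python
def amplicon_coverage(amplicon_len, read_len):
    """Per-base coverage from one forward read (5' end) and one reverse
    read (3' end) of an un-sheared amplicon; no shearing means no reads
    start in the interior."""
    cov = [0] * amplicon_len
    for i in range(min(read_len, amplicon_len)):
        cov[i] += 1                      # read 1, from the left end
        cov[amplicon_len - 1 - i] += 1   # read 2, from the right end
    return cov

cov = amplicon_coverage(amplicon_len=400, read_len=150)
print(cov[0], cov[200], cov[399])  # 1 0 1 -- ends covered, middle bare
```

A 400 bp amplicon with 150 bp reads leaves a 100 bp hole in the middle; shrink the amplicon below twice the read length (or concat-and-shear) and the dip disappears. This is what "needs to be carefully designed with the length of reads in mind" amounts to.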

> Can I ask why you limited the pool size to ~ 20? We have pooled DNA from ~ 200 – 500 subjects (per phenotype) and were hoping, with that read depth of 3400X or so, that we could get away with bigger pool sizes (i.e., more complex starting material). But maybe not..? 

Most projects at the Broad are pooling 50 individuals, based on pilot data that suggests that there is 80% power to detect a singleton variant in a pool of 100 chromosomes at 1500X coverage. Of course if you have deeper coverage then you'll have more power. But the key is EVENNESS of coverage across all target bp and all samples. It's difficult to combine DNA perfectly equally -- so any variance in the pooling will be reflected in poorer power to detect rare alleles. 

So, it really depends on whether you're trying to find extremely rare alleles (like we were) or more common alleles (eg 1/1000 - 1/100 minor allele frequency). If you're looking for recurrent alleles in your pools, pooling many individuals will be fine. 

We chose 20 individuals per pool because this let us run our entire project in 1 flowcell, and gave us excellent power to detect singletons. 
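An idealized binomial model shows why singleton power falls as pools grow at fixed depth. This is a sketch under strong assumptions -- reads sample chromosomes uniformly, no pooling unevenness or sequencing error, and a hypothetical caller threshold of 8 alternate reads -- so it gives an upper bound, not the ~80% figure above (which reflects real-world noise).

```python
from math import comb

def detection_power(coverage, chromosomes, min_alt_reads):
    """P(at least min_alt_reads reads carry a singleton allele), with the
    number of alt reads modeled as Binomial(coverage, 1/chromosomes)."""
    p = 1.0 / chromosomes
    p_miss = sum(comb(coverage, k) * p**k * (1 - p)**(coverage - k)
                 for k in range(min_alt_reads))
    return 1 - p_miss

# Pool of 100 chromosomes at 1500X: a singleton expects ~15 alt reads,
# comfortably above a hypothetical 8-read calling threshold.
print(f"{detection_power(1500, 100, 8):.3f}")

# Same depth, 1000 chromosomes: a singleton expects only ~1.5 alt reads.
print(f"{detection_power(1500, 1000, 8):.3f}")
```

The crossover is driven by expected alt reads per singleton, coverage / chromosomes -- which is why 3400X over 400-1000 chromosomes is a much harder regime than 1500X over 100.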

> Your PCR primers had the NotI sites incorporated – but was there any other additional sequence information (besides the exon-specific regions, of course)?

I don't think so. 

> You referenced the Gnirke Nature Biotechnology report from 2009, but I assume that no Illumina primer sequence was needed until you actually sheared and made the library. And how far up- or down-stream of the exons did you go with your primers? 

25bp from the exon boundary. This let us capture splice sites.

There are some very good reasons to do pooled sequencing. However, analyzing the pooled data was extremely labor intensive -- and much harder than analyzing single patient-per-lane or barcoded data. In addition, the genotyping step was expensive, labor-intensive, and not as accurate for extremely rare variants as we hoped.

In retrospect, I would have skipped this pooled study, and instead done barcoding & hybrid selection (which is what we're currently doing) for a larger # of gene loci. While this is more expensive on the sequencing side, it is less work (thus less money) on the analysis side -- and computational biology time/costs are not always considered in the cost-benefit equation. 

It really depends on the size of your target region. For very small targets, hybrid selection is very inefficient -- so PCR is still the better choice. For large targets (eg whole exome), hybrid selection is >80% efficient (% of sequenced reads on target).