Tuesday, April 26, 2011

pooled sequencing

http://www.nature.com/ng/journal/v42/n10/full/ng.659.html

> When you did your PCR amplifications, was this one exon per well? 

Yes -- each of the ~900 amplifications was performed separately in order to avoid non-specific amplification. Then each well was separately (robotically) quantified & diluted (3 times) before being combined -- to ensure very even representation of each exon. If there's uneven representation between exons, then you'll get very different sequencing coverage per exon and lose power to detect variants.
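(As a quick sanity check on the pooling, it helps to look at per-amplicon depth after sequencing. Here's a minimal Python sketch -- not the actual production pipeline -- that computes the coefficient of variation of depth and flags amplicons well below the mean; the amplicon names and depths are made up.)

```python
# Minimal sketch: check how even the per-amplicon coverage turned out,
# since uneven representation directly costs power to detect variants.
# Assumes a dict of amplicon -> mean read depth computed elsewhere.

def coverage_evenness(depth_by_amplicon, min_fraction=0.2):
    """Report coefficient of variation and amplicons far below the mean depth."""
    depths = list(depth_by_amplicon.values())
    mean_depth = sum(depths) / len(depths)
    variance = sum((d - mean_depth) ** 2 for d in depths) / len(depths)
    cv = variance ** 0.5 / mean_depth
    underrepresented = [name for name, d in depth_by_amplicon.items()
                        if d < min_fraction * mean_depth]
    return cv, underrepresented

# Example: exon_3 would be flagged as badly underrepresented.
cv, low = coverage_evenness({"exon_1": 3400, "exon_2": 3100, "exon_3": 400})
print(f"CV = {cv:.2f}, underrepresented: {low}")
```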

> That is, the genomic DNA was pooled but the PCR wasn’t multiplexed, is that correct? Is there a reason not to just design all the amplicons ~ 250 bp rather than concatenating and shearing (I realize that some exons would require multiple amplicons, etc.)? 

The current Pooled Sequencing protocol at Broad no longer does concat-and-shear. This easier protocol seems to work well -- but needs to be carefully designed with the length of reads in mind. The reason concat-and-shear is better (although more laborious) is that you get even coverage across the entire region. Without concat & shear, you get coverage bumps at each end and dips in the middle of each exon (depending on fragment size & read length). However, with concat-and-shear it's annoying on the bioinformatic side to deal with reads that contain the NotI site (these create many false SNP calls). 
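(To illustrate the NotI issue: one crude way to keep junction-spanning reads out of the variant caller is simply to drop any read containing the NotI recognition site, GCGGCCGC. This Python sketch is not the pipeline we actually used, and the read names/sequences are made up.)

```python
# Crude filter: discard reads that span a NotI concatenation junction,
# since those reads generate false SNP calls at the junction.

NOTI = "GCGGCCGC"  # NotI is palindromic, so one string covers both strands

def drop_junction_reads(reads):
    """Yield (name, sequence) pairs that do not contain the NotI site."""
    for name, seq in reads:
        if NOTI not in seq.upper():
            yield name, seq

reads = [("read1", "ACGTACGTGCGGCCGCTTTT"),   # spans a junction -> dropped
         ("read2", "ACGTACGTACGTACGTACGT")]
print(list(drop_junction_reads(reads)))
```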

> Can I ask why you limited the pool size to ~ 20? We have pooled DNA from ~ 200 – 500 subjects (per phenotype) and were hoping, with that read depth of 3400X or so, that we could get away with bigger pool sizes (i.e., more complex starting material). But maybe not..? 

Most projects at the Broad are pooling 50 individuals, based on pilot data suggesting there is 80% power to detect a singleton variant in a pool of 100 chromosomes at 1500X coverage. Of course, if you have deeper coverage then you'll have more power. But the key is EVENNESS of coverage across all target bases and all samples. It's difficult to combine DNA perfectly equally -- so any variance in the pooling will be reflected in poorer power to detect rare alleles.

So, it really depends on whether you're trying to find extremely rare alleles (like we were) or more common alleles (eg 1/1000 - 1/100 minor allele frequency). If you're looking for recurrent alleles in your pools, pooling many individuals will be fine. 

We chose 20 individuals per pool because this let us run our entire project in 1 flowcell, and gave us excellent power to detect singletons.
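(To give a sense of the numbers, here's a back-of-the-envelope Python sketch of singleton detection power under a simple binomial model. It ignores sequencing error and pooling variance -- both of which matter a lot in practice -- and the read-count threshold is just an illustrative assumption.)

```python
# A singleton in a pool of 2N chromosomes sits at frequency 1/(2N); at depth D
# the number of reads carrying it is roughly Binomial(D, 1/(2N)).

from math import comb

def singleton_power(depth, chromosomes, min_alt_reads):
    """P(at least min_alt_reads reads carry the singleton allele)."""
    p = 1.0 / chromosomes
    below = sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
                for k in range(min_alt_reads))
    return 1.0 - below

# e.g. 100 chromosomes at 1500X, requiring >= 8 supporting reads to call
print(f"{singleton_power(1500, 100, 8):.2f}")
```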

> Your PCR primers had the NotI sites incorporated – but was there any other additional sequence information (besides the exon-specific regions, of course)?

I don't think so. 

> You referenced the Gnirke Nature Biotechnology report from 2009, but I assume that no Illumina primer sequence was needed until you actually sheared and made the library. And how far up- or down-stream of the exons did you go with your primers? 

25 bp from the exon boundary. This let us capture the splice sites.
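(For example, padding the exon intervals looks something like the sketch below; the chromosome names and coordinates are made up.)

```python
# Extend each exon interval by 25 bp on both sides so the amplicons
# cover the splice sites as well as the exon itself.

def pad_exons(exons, flank=25):
    """Extend each (chrom, start, end) interval by `flank` bp on both sides."""
    return [(chrom, max(0, start - flank), end + flank)
            for chrom, start, end in exons]

print(pad_exons([("chr1", 1000, 1150), ("chr1", 2300, 2410)]))
```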

There are some very good reasons to do pooled sequencing. However, analyzing the pooled data was extremely labor intensive -- and much harder than analyzing single patient-per-lane or barcoded data. In addition, the genotyping step was expensive, labor-intensive, and not as accurate for extremely rare variants as we hoped.

In retrospect, I would have skipped this pooled study, and instead done barcoding & hybrid selection (which is what we're currently doing) for a larger # of gene loci. While this is more expensive on the sequencing side, it is less work (thus less money) on the analysis side -- and computational biology time/costs are not always considered in the cost-benefit equation. 

It really depends on the size of your target region. For very small targets, hybrid selection is very inefficient -- so PCR is still the better choice. For large targets (eg whole exome), hybrid selection is >80% efficient (i.e., >80% of sequenced reads are on target).
