Lee H, O'Connor BD, Merriman B, Funari VA, Homer N, Chen Z, Cohn DH, Nelson SF
BACKGROUND: The emergence of next-generation sequencing technology presents tremendous opportunities to accelerate the discovery of rare variants or mutations that underlie human genetic disorders. Although the complete sequencing of the affected individuals' genomes would be the most powerful approach to finding such variants, the cost of such efforts make it impractical for routine use in disease gene research. In cases where candidate genes or loci can be defined by linkage, association, or phenotypic studies, the practical sequencing target can be made much smaller than the whole genome, and it becomes critical to have capture methods that can be used to purify the desired portion of the genome for shotgun short-read sequencing without biasing allelic representation or coverage. One major approach is array-based capture which relies on the ability to create a custom in-situ synthesized oligonucleotide microarray for use as a collection of hybridization capture probes. This approach is being used by our group and others routinely and we are continuing to improve its performance. RESULTS: Here, we provide a complete protocol optimized for large aggregate sequence intervals and demonstrate its utility with the capture of all predicted amino acid coding sequence from 3,038 human genes using 241,700 60-mer oligonucleotides. Further, we demonstrate two techniques by which the efficiency of the capture can be increased: by introducing a step to block cross hybridization mediated by common adapter sequences used in sequencing library construction, and by repeating the hybridization capture step. These improvements can boost the targeting efficiency to the point where over 85% of the mapped sequence reads fall within 100 bases of the targeted regions. CONCLUSIONS: The complete protocol introduced in this paper enables researchers to perform practical capture experiments, and includes two novel methods for increasing the targeting efficiency. Coupled with the new massively parallel sequencing technologies, this provides a powerful approach to identifying disease-causing genetic variants that can be localized within the genome by traditional methods.