Biohacking Primer: DNA Sequencing
DNA was first discovered in the 1860s as a mildly acidic substance found in the nuclei of pus cells taken from wounded soldiers. Since the substance was acidic and nuclear, it was called “nucleic acid”. In 1944 Oswald Avery provided the first definitive proof that DNA was the hereditary material when he demonstrated that pure DNA preparations were able to transfer genetic information to pneumococcal bacteria in a “predictable, type-specific, and heritable” manner. Initially Avery’s experiments were discounted, as there was no known mechanism by which the four repeating DNA units could code for the twenty different amino acids found in proteins. Additionally, many researchers argued that Avery’s “pure DNA” was probably contaminated with proteins that actually carried the genetic information. In 1953 James Watson and Francis Crick discovered the basic structure of DNA, in part based on possibly misappropriated X-ray crystallographic data produced by Rosalind Franklin. Following this and the discovery that three-nucleotide sequences in DNA (called “codons”) code for specific amino acids, DNA was accepted as the genetic material. New techniques were immediately sought that would allow large amounts of DNA to be sequenced quickly, cheaply, accurately, and efficiently.
Initially, DNA sequencing proved to be very difficult because, unlike proteins, DNA molecules are 1) chemically extremely similar to one another and 2) incredibly long polymers (often billions of base pairs in length), and because 3) no enzymes that could cut DNA at specific base pair sequences were initially available. The first nucleic acids to be sequenced were short RNA sequences. For example, the twenty-four base pair “lac operon” bacterial RNA molecule was sequenced in 1973, a process that took months and 700 grams of bacteria as an RNA source. In 1977 Maxam and Gilbert developed a chemical cleavage method to sequence radioactively labeled DNA in four nucleotide-specific reactions. The cut DNA was then driven through a gel by an electric field, which separated the DNA fragments by length and molecular weight. The DNA sequence was “read” according to where the radioactive DNA fragments migrated in the gel. This sequencing method is slow and uses toxic chemicals and radiation, but it can read up to 400-500 DNA base pairs per reaction. It can also solve sequencing problems that other techniques cannot, such as identifying epigenetic changes and the specific sites where proteins attach to DNA (called “footprinting”). Thirty years later this technique still has occasional research applications.
In 1977 Frederick Sanger developed the “chain termination method” of DNA sequencing, also known as Sanger sequencing. In this technique the enzyme that synthesizes DNA (DNA polymerase) is used to make DNA in the presence of DNA precursor nucleotides, low concentrations of modified nucleotides that block DNA chain lengthening (called “dideoxynucleotides”), and a partially single-stranded DNA molecule to be sequenced. In the modern form of the technique, four different dideoxynucleotides are used, one for each DNA base, and each is labeled with a different fluorescent dye. The result of this sequencing method is a set of fluorescently labeled DNA fragments of different lengths, each ending in one dideoxynucleotide whose dye identifies the specific base type. The fragments can be separated by length and read on a gel or by “capillary electrophoresis”, with laser excitation used to identify the specific fluorescent labels. This technique does not require radiation or toxic chemicals and has been extensively automated, allowing many DNA sequences to be processed and read simultaneously (or run in “parallel”). As a result, thousands of base pairs can be read in a short time for about $0.50 per thousand base pairs. Much of the human genome was sequenced by this method, with accuracy as high as 99.999%. Sanger sequencing is currently often considered the “Gold Standard” for many molecular applications and most DNA sequencing is done by this method, although other sequencing techniques now work well and at a lower cost. Basic Sanger sequencing technology is depicted below.
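The read-out logic of chain termination can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not a model of the chemistry: the function names are hypothetical, every possible terminated fragment is assumed to be produced exactly once, and strand direction is ignored for simplicity.

```python
# Toy model of Sanger read-out: every terminated fragment ends in a labeled
# dideoxynucleotide, so sorting fragments by length spells out the sequence.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def terminated_fragments(template):
    """Return (length, terminal_base) for each chain-terminated product.

    Assumption: the polymerase copies the template one base at a time and a
    dideoxynucleotide stops synthesis exactly once at every position.
    """
    new_strand = [COMPLEMENT[base] for base in template]
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_sequence(fragments):
    """Order fragments shortest first, as electrophoresis does, and read the dyes."""
    return "".join(base for _, base in sorted(fragments))


template = "TACGGATC"
fragments = terminated_fragments(template)
print(read_sequence(fragments))  # sequence of the newly synthesized strand
```

Because each fragment carries both its length and the dye of its final base, the order in which fragments are collected does not matter; sorting by length alone recovers the sequence.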
Pyrosequencing is another common DNA sequencing technology, developed in 1996 by Ronaghi and Nyrén at the Royal Institute of Technology in Stockholm. In this technique the DNA sequence is determined by light emission following the incorporation of a nucleotide into a growing DNA chain fixed to a solid support. Each of the four nucleotide precursors is added sequentially to the sequencing reaction, and when (and only when) the correct nucleotide is incorporated into the growing DNA chain, a “pyrophosphate” is released. The pyrophosphate is converted to ATP, which an enzyme in the reaction (called “luciferase”) uses to oxidize its substrate, luciferin, and produce light. The light is measured by a photomultiplier tube, avalanche photodiode, or charge-coupled device camera. The reaction mix is then washed away and each of the four nucleotides is again added in turn until the sequencing is completed. Pyrosequencing can sequence about 300-500 base pairs in one run. This technique has been automated, and by running multiple simultaneous reactions in automated sequencers, such as the 454 Genome Sequencer (Roche), the average bacterial genome can be sequenced in ten hours.
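The cycle of adding one nucleotide at a time and watching for light can be sketched as a small simulation. The function names, the A-C-G-T dispensing order, and the idea that a run of identical bases gives proportionally more light are assumptions made for this illustration, not details taken from the text above.

```python
from itertools import cycle

FLOW_ORDER = "ACGT"  # assumed dispensing order; real instruments may differ


def flowgram(new_strand, flow_order=FLOW_ORDER):
    """Simulate the light signal seen after each nucleotide is dispensed.

    `new_strand` is the chain being synthesized. A signal of 0 means the
    dispensed nucleotide was not incorporated, so no light was produced.
    """
    if set(new_strand) - set(flow_order):
        raise ValueError("sequence contains a base that is never dispensed")
    signals = []
    pos = 0
    for nucleotide in cycle(flow_order):
        if pos >= len(new_strand):
            break
        run = 0
        # Assumption: identical neighboring bases are all added in one step.
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
    return signals


def decode(signals):
    """Rebuild the sequence from the recorded (nucleotide, light) pairs."""
    return "".join(nucleotide * light for nucleotide, light in signals)


signals = flowgram("GGAT")
print(signals)
print(decode(signals))
```

Most dispensations produce no light at all; the sequence is carried entirely by which additions did produce a signal and how strong that signal was.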
Although initially slow, cumbersome, and expensive, DNA sequencing techniques have vastly improved. In fact, over the past forty years the number of known DNA sequences has doubled every sixteen months, an exponential increase of nine orders of magnitude in database size since 1965. The acceleration of sequencing technology has been impressive. In 1973 it took months to sequence twenty-four RNA base pairs, while in 2003 the entire human genome was sequenced, and by the end of 2011 roughly 30,000 different human genomes and the genomes of over 180 different species had been fully sequenced. In the United Kingdom there are currently plans to sequence the genomes of up to 100,000 individuals with cancer and other rare diseases, to increase our understanding of these individuals’ genetic make-up and help develop new cancer treatments. Additionally, the cost of DNA sequencing has fallen enormously. In 2001 the cost of sequencing one million DNA base pairs was $5,292.39, while in October 2010 it was $0.32, a roughly 16,539-fold reduction. Thus DNA sequencing, a technique that was originally prohibitively expensive, is now routinely performed in basic research and medicine.
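The two growth figures in this paragraph can be checked with a little arithmetic. The short script below is simply a worked calculation using the numbers quoted above; the variable names are illustrative.

```python
import math

# Database growth: one doubling every sixteen months for forty years.
doubling_months = 16
years = 40
doublings = years * 12 / doubling_months   # 30 doublings
growth = 2 ** doublings                    # about 1.07 billion-fold
orders_of_magnitude = math.log10(growth)   # about 9.03

# Cost per million base pairs, 2001 versus October 2010.
cost_2001 = 5292.39
cost_2010 = 0.32
fold_reduction = cost_2001 / cost_2010     # about 16,539-fold

print(f"{doublings:.0f} doublings = {orders_of_magnitude:.2f} orders of magnitude")
print(f"cost reduction: {fold_reduction:,.0f}-fold")
```

Thirty doublings give roughly a billion-fold increase, which is where the nine orders of magnitude come from.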
Although the current DNA sequencing techniques work, they are limited to sequence “reads” of a few thousand base pairs. Currently, if a researcher wished to sequence one hundred typical genes from one hundred samples, the cost would be 0.3 to 1.0 million dollars. This prohibitively high cost makes projects such as the analysis of whole genomes difficult. For this reason “next generation” DNA sequencing technologies are being developed to allow millions and even billions of DNA base pairs to be sequenced accurately and at a low cost. The development of these new sequencing technologies has become possible because new molecular sequencing methods have become available and, most importantly, because of bioinformatics programs that can assemble billions of short DNA sequences (“short reads”) into a coherent genome. This latter technology utilizes both the vastly increased computing power developed over the last twenty years and the pre-existing genomic data generated by Sanger sequencing.
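To give a feel for what assembling short reads involves, here is a minimal sketch of one simple strategy, greedy overlap merging. It is an illustration only: the function names and the minimum overlap of three bases are assumptions, the reads are assumed to be error-free and from the same strand, and real assemblers use far more sophisticated methods.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads that share the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap(left, right, min_len)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

This version compares every read with every other read on each pass, which is workable for a handful of reads but hopeless for billions; that gap is why assembly depends so heavily on modern computing power.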
Next generation DNA sequencing technologies employ very diverse sequencing methods, but they generally share certain features. Typically the DNA to be sequenced is sheared into small fragments and short uniform DNA sequences are attached to each end. The DNA fragments are attached to a solid matrix, such as a glass slide or “chip”, the DNA is converted to single-stranded DNA by “denaturation” (often by heating), and DNA polymerase or another enzyme is used to lengthen the DNA sequence. Each base is detected as it is incorporated, by a fluorescent dye or often by pyrosequencing, and optical microscopy is used to measure each incorporation (Figure 2). Since it is possible to place many DNA fragments on a single chip, several billion DNA fragments can be read simultaneously, a volume over one hundred times greater than Sanger technology. Bioinformatics programs can generate a genomic sequence and estimate the accuracy of each base pair sequenced. Presently next generation DNA sequencing has an accuracy of about one-tenth that of Sanger sequencing. Thus although it can process very large amounts of DNA sequence, its usefulness is limited to studies that do not require an absolutely correct DNA sequence.
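One simple way a program might estimate the accuracy of each base is to compare the overlapping reads that cover it. The sketch below uses a plain majority vote; this is an assumption chosen for illustration, not the method used by any particular sequencer, and the function name and the use of “-” for uncovered positions are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Call each position by majority vote across already-aligned reads.

    Each read is a string of equal length, with '-' where the read does not
    cover the position. Returns a list of (base, agreement) pairs, where
    agreement is the fraction of covering reads that support the call.
    """
    if len({len(read) for read in aligned_reads}) > 1:
        raise ValueError("aligned reads must all have the same length")
    calls = []
    for column in zip(*aligned_reads):
        votes = Counter(base for base in column if base != "-")
        if not votes:
            calls.append(("N", 0.0))  # no read covers this position
            continue
        base, count = votes.most_common(1)[0]
        calls.append((base, count / sum(votes.values())))
    return calls


reads = ["ACGT-", "ACGTA", "-CTTA"]
for position, (base, agreement) in enumerate(consensus(reads), start=1):
    print(position, base, f"{agreement:.2f}")
```

Positions where the reads disagree get a lower agreement score, which flags the bases most likely to be wrong. Reading the same region many times is how a less accurate method can still produce a usable sequence.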
Within the human genome, roughly one in one thousand bases normally differs between individuals. These differences are called “nucleotide polymorphisms” and can be one to several DNA base pairs in length. Many of these slight sequence variations correlate with disease susceptibility, longevity, and even one’s level of compassion and likelihood of becoming depressed. A major goal of DNA sequencing is a “personal genome”, where an individual’s entire genome is sequenced and analyzed for all sequence polymorphisms. A personal genome showing one’s susceptibility or resistance to specific diseases could allow an early intervention in the disease process, treating and stopping a disease before it causes death or even significant discomfort. Presently a major goal in molecular pathology is to make an entire personal human genome part of routine medical diagnosis and treatment. Although next generation DNA sequencing is still too expensive and inaccurate for this application, it should become possible within the next few years. A “thousand dollar genome” is currently a goal in medicine and should be achievable, especially given how far DNA sequencing technology has advanced in forty years. In the more distant, but not very distant, future, genomes as large as the human genome may be sequenced cheaply in a few hours with complete accuracy. This would allow the complete analysis and cataloging of the “world’s genome” in the same way that different species have been identified and analyzed based on morphology. Such a database would have enormous benefits for the human race and would have absolutely profound effects on biotechnology and the way we live.
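As a rough illustration of what analyzing a genome for polymorphisms means, the sketch below lists the single-base differences between two aligned sequences and estimates how many differences to expect genome-wide. The function name is hypothetical, the sequences are made up, and the genome size of about three billion base pairs is an assumption supplied here rather than a figure from the text.

```python
def single_base_differences(reference, sample):
    """List (position, reference_base, sample_base) wherever two sequences differ.

    Assumption: the sequences are already aligned and of equal length, so only
    substitutions are found, not insertions or deletions.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref, alt)
        for position, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


print(single_base_differences("ATGCCGTA", "ATGACGTA"))

# Expected number of differences between two people, at one per thousand bases.
genome_size = 3_000_000_000  # assumed approximate human genome size
expected_differences = genome_size // 1000
print(f"{expected_differences:,} expected differences")
```

Even at one difference per thousand bases, two individuals are expected to differ at millions of positions, which is why a personal genome requires both cheap sequencing and a very low error rate.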