After the discovery of the DNA double helix by Watson and Crick in 1953, researchers sought methods for reading DNA nucleotide sequences (DNA sequencing). In 1977, two groups developed DNA sequencing methods. One, developed by Allan Maxam and Walter Gilbert, is based on chemical modification of DNA that breaks DNA at specific bases. The other, developed by Frederick Sanger and his colleagues, dominated the DNA sequencing field for about 30 years.
In the early 2000s, a new sequencing methodology called next-generation sequencing (NGS) was introduced, and now the era of third-generation sequencing, represented by PacBio sequencing and Nanopore sequencing, has arrived. The biggest change brought by next-generation sequencing is the introduction of the concept of “massively parallel high-throughput” sequencing (Figure 1). Using these technologies, researchers have discovered genetic differences between individuals and populations (population genomics), cancer-driving mutations in the human genome (cancer genomics), and genetic factors related to other human disorders such as autism and diabetes. In early 2021, 20 years after the first “draft” human genome sequence, researchers produced the “complete” human genome sequence using all the technologies developed so far. Researchers now look forward to understanding the human genome and genetics more precisely. Thus, improvements in DNA sequencing technology have driven the advancement of the life sciences.
In addition, advances in sequencing technology are reducing the cost of sequencing at a rate faster than Moore’s law, to the extent that a human genome can be sequenced and analyzed for $1,000 (Figure 2A). As a result, many researchers can produce large amounts of data more easily than before (Figure 2B).
The data produced in this way has become the driving force transforming molecular biology into a 'big data science'. Despite this transformation, the lab work of individual synthetic biology laboratories is still performed in low-throughput ways, such as sequence verification of cloned plasmids one by one, a procedure that is not only time consuming but also labor and cost intensive when the number of samples to be analyzed grows large.
Recently, as the concept of personalized medicine has been introduced into molecular diagnostics, clinicians and medical researchers are sequencing and analyzing patient data. Liquid biopsy, which detects circulating tumor DNA (ctDNA) in a patient’s plasma sample, where the ctDNA originates from various forms of cell death including apoptosis and necrosis, has drawn attention. Because liquid biopsy does not require tumor tissue obtained by surgery or needle biopsy, this setting is favorable for cancer patients compared to traditional tumor-biopsy-based precision medicine. However, previous studies have shown two major challenges in the liquid biopsy setting. The first challenge is the limited amount of input material. A previous mathematical model showed that an early-stage lung cancer patient has a median of 1.5 ctDNA molecules in 15 mL of plasma, a typical blood draw amount [1]. The only way to obtain more tumor DNA from plasma is to draw more blood from the patient, which is undesirable. The second challenge is distinguishing the true tumor DNA signal from background error, which is introduced by several sources. Acquiring sequencing data requires DNA extraction, library preparation, and a sequencing instrument. Library preparation includes a PCR (polymerase chain reaction) step, and because the polymerase has an error rate of 10^-6 to 10^-4 [2], these errors are introduced into the original DNA molecules. Another source of error is the sequencing instrument itself, whose error rate is known to be 0.1% to 1% [3]. These errors accumulate in the sequencing data. Moreover, the typical ctDNA fraction in a patient’s blood is less than 1% [4], which is very close to the error rate. Because of these challenges, distinguishing true signal from background error requires robust bioinformatics methods.
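To make the scale of these two challenges concrete, the following back-of-the-envelope sketch estimates (i) the chance that a single blood draw captures any mutant molecule at all, and (ii) how the expected number of true variant reads compares to error-derived reads at deep targeted sequencing depth. The parameter values are taken from the figures cited above; the Poisson sampling model and the even split of errors among the three non-reference bases are simplifying assumptions of mine, not results from this dissertation.

```python
import math

# Challenge 1: limited input material.
# Median ~1.5 ctDNA molecules carrying a given mutation per 15 mL
# plasma draw in early-stage lung cancer [1]. Under a Poisson sampling
# assumption, some draws contain zero mutant molecules regardless of
# how deeply the library is sequenced.
mean_molecules = 1.5
p_at_least_one = 1 - math.exp(-mean_molecules)
print(f"P(>=1 mutant molecule in the draw) = {p_at_least_one:.2f}")  # ~0.78

# Challenge 2: background error vs. true signal.
# Expected mutant-supporting reads vs. error-supporting reads at a
# single genomic position, assuming errors fall evenly on the three
# non-reference bases (a common simplification).
depth = 10_000        # deep targeted sequencing depth
error_rate = 1e-3     # 0.1%, the lower end of instrument error [3]
for vaf in (0.01, 0.005, 0.001):   # ctDNA fractions at or below 1% [4]
    true_reads = vaf * depth
    error_reads = error_rate * depth / 3
    print(f"VAF {vaf:.1%}: ~{true_reads:.0f} true reads "
          f"vs ~{error_reads:.1f} error reads")
```

As the loop shows, once the variant allele fraction approaches the sequencing error rate, true and error-derived reads become nearly indistinguishable by raw count alone, which is exactly why dedicated error-suppression bioinformatics is needed.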
Although biology has become a data-rich science, bioinformatics pipelines are still complicated and not user friendly. Sequencing data analysis requires many steps (Figure 3). Several different software tools are needed to process sequencing data, and these tools are not suitable for researchers without a background in bioinformatics. This hinders researchers from discovering valuable results such as new cancer drugs or early diagnoses for patients with disease.
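As an illustration of this complexity, even a minimal short-read variant-calling pipeline already chains several independent tools. The sketch below uses standard BWA, samtools, and GATK command-line invocations; the file names are placeholders, and it omits steps a real pipeline needs (read-group tagging, duplicate marking, quality control, annotation).

```python
import subprocess

def run(cmd):
    """Run one pipeline step, failing loudly if the tool errors out."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

ref = "reference.fa"                                  # indexed reference genome
r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"   # paired-end reads

# 1. Align paired-end reads to the reference (BWA-MEM writes SAM to stdout).
with open("sample.sam", "w") as sam:
    subprocess.run(["bwa", "mem", ref, r1, r2], stdout=sam, check=True)

# 2. Coordinate-sort and index the alignments (samtools).
run(["samtools", "sort", "-o", "sample.bam", "sample.sam"])
run(["samtools", "index", "sample.bam"])

# 3. Call variants against the reference (GATK HaplotypeCaller).
run(["gatk", "HaplotypeCaller",
     "-R", ref, "-I", "sample.bam", "-O", "sample.vcf.gz"])
```

Each hand-off in this three-tool chain assumes correctly built indices and compatible file formats, and every added step in Figure 3 multiplies the opportunities for a non-specialist to misconfigure the analysis.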
To tackle these problems, I have devised bioinformatics software and pipelines that offer efficient and simple methods for researchers in the fields of synthetic biology and precision medicine.
In the first part, Chapter 1 of this dissertation, I introduce an analysis platform called TnClone that offers synthetic biologists a paradigm shift in their work, reducing the time, cost, and labor required to analyze various cloned plasmids at an unprecedented scale.
In the second part, Chapter 2 of this dissertation, I introduce an analytical method that distinguishes sequencing error signals from true variant signals in targeted gene sequencing data from liquid biopsy samples of metastatic colorectal cancer patients. After calling variants from the liquid biopsy samples, I investigated the clinical characteristics of the patients in conjunction with the called variants.