After the discovery of the DNA double helix by Watson and Crick in 1953, researchers sought methods for reading DNA nucleotide sequences (DNA sequencing). In 1977, two groups developed DNA sequencing methods. One, developed by Allan Maxam and Walter Gilbert, is based on chemical modification of DNA that breaks DNA at specific bases. The other, developed by Frederick Sanger and his colleagues, dominated the DNA sequencing field for about 30 years.
In the early 2000s, a new sequencing methodology called next-generation sequencing (NGS) was introduced, and now the era of third-generation sequencing, represented by PacBio sequencing and Nanopore sequencing, has arrived. The biggest change brought by next-generation sequencing is the introduction of the concept of “massively parallel high-throughput” sequencing (Figure 1). Using these technologies, researchers have discovered genetic differences between individuals and populations (population genomics), cancer-driving mutations in the human genome (cancer genomics), and genetic factors related to other human disorders such as autism and diabetes. In early 2021, 20 years after the first “draft” human genome sequence, researchers produced the “complete” human genome sequence using all the technologies developed so far. Researchers now look forward to understanding the human genome and genetics more precisely. Thus, improvements in DNA sequencing technology have driven the advancement of the life sciences.
In addition, advances in sequencing technology are reducing the cost of sequencing at a rate faster than Moore’s law, to the extent that a human genome can be sequenced and analyzed for $1,000 (Figure 2A). As a result, many researchers can produce large amounts of data more easily than before (Figure 2B).
The data produced in this way has become the driving force transforming molecular biology into a 'big data science'. Despite this transformation, the lab work of individual synthetic biology laboratories is still performed in low-throughput ways, such as sequence verification of cloned plasmids one by one, a procedure that is not only time consuming but also labor and cost intensive when the number of samples to be analyzed grows large.
Recently, as the concept of personalized medicine has been introduced into molecular diagnostics, clinicians and medical researchers are sequencing and analyzing patient data. Liquid biopsy, which detects circulating tumor DNA (ctDNA) in a patient’s plasma sample, where the ctDNA originates from various forms of cell death including apoptosis and necrosis, has drawn attention. Because liquid biopsy does not require tumor tissue obtained by surgery or needle biopsy, this setting is favorable for cancer patients compared to traditional tumor-biopsy-based precision medicine. However, previous studies have shown two major challenges in the liquid biopsy setting. The first challenge is the limited amount of input material. A previous mathematical model showed that an early-stage lung cancer patient has a median of 1.5 ctDNA molecules in 15 mL of plasma, a typical blood draw amount [1]. The only way to obtain more tumor DNA from plasma is to draw more blood from the patient, which is undesirable. The second challenge is distinguishing the true tumor DNA signal from background error, which is introduced by several sources. Acquiring sequencing data requires DNA extraction, library preparation, and a sequencing instrument. Library preparation includes a PCR (polymerase chain reaction) step, and because the polymerase has an error rate of 10^-6 to 10^-4 [2], these errors are introduced into the original DNA molecules. Another source of error is the sequencing instrument itself, whose error rate is known to be 0.1% to 1% [3]. These errors accumulate in the sequencing data. Moreover, the typical ctDNA fraction in a patient’s blood is less than 1% [4], which is very close to the error rate. Because of these challenges, distinguishing true signal from background error requires robust bioinformatics methods.
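To make the scale of these two challenges concrete, the following back-of-the-envelope sketch estimates (i) the chance that a single blood draw captures any mutant molecule at all, and (ii) how the expected number of true variant reads compares to error-derived reads at deep targeted sequencing depth. The parameter values are taken from the figures cited above; the Poisson sampling model and the even split of errors among the three non-reference bases are simplifying assumptions of mine, not results from this dissertation.

```python
import math

# Challenge 1: limited input material.
# Median ~1.5 ctDNA molecules carrying a given mutation per 15 mL
# plasma draw in early-stage lung cancer [1]. Under a Poisson sampling
# assumption, some draws contain zero mutant molecules regardless of
# how deeply the library is sequenced.
mean_molecules = 1.5
p_at_least_one = 1 - math.exp(-mean_molecules)
print(f"P(>=1 mutant molecule in the draw) = {p_at_least_one:.2f}")  # ~0.78

# Challenge 2: background error vs. true signal.
# Expected mutant-supporting reads vs. error-supporting reads at a
# single genomic position, assuming errors fall evenly on the three
# non-reference bases (a common simplification).
depth = 10_000        # deep targeted sequencing depth
error_rate = 1e-3     # 0.1%, the lower end of instrument error [3]
for vaf in (0.01, 0.005, 0.001):   # ctDNA fractions at or below 1% [4]
    true_reads = vaf * depth
    error_reads = error_rate * depth / 3
    print(f"VAF {vaf:.1%}: ~{true_reads:.0f} true reads "
          f"vs ~{error_reads:.1f} error reads")
```

As the loop shows, once the variant allele fraction approaches the sequencing error rate, true and error-derived reads become nearly indistinguishable by raw count alone, which is exactly why dedicated error-suppression bioinformatics is needed.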
Although biology has become a data-rich science, bioinformatics pipelines are still complicated and not user friendly. Sequencing data analysis requires many steps (Figure 3). Several different software tools are needed to process sequencing data, and these tools are not suitable for researchers without a background in bioinformatics. This hinders researchers from discovering valuable results such as new cancer drugs or early diagnoses for patients with disease.
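As an illustration of this complexity, even a minimal short-read variant-calling pipeline already chains several independent tools. The sketch below uses standard BWA, samtools, and GATK command-line invocations; the file names are placeholders, and it omits steps a real pipeline needs (read-group tagging, duplicate marking, quality control, annotation).

```python
import subprocess

def run(cmd):
    """Run one pipeline step, failing loudly if the tool errors out."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

ref = "reference.fa"                                  # indexed reference genome
r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"   # paired-end reads

# 1. Align paired-end reads to the reference (BWA-MEM writes SAM to stdout).
with open("sample.sam", "w") as sam:
    subprocess.run(["bwa", "mem", ref, r1, r2], stdout=sam, check=True)

# 2. Coordinate-sort and index the alignments (samtools).
run(["samtools", "sort", "-o", "sample.bam", "sample.sam"])
run(["samtools", "index", "sample.bam"])

# 3. Call variants against the reference (GATK HaplotypeCaller).
run(["gatk", "HaplotypeCaller",
     "-R", ref, "-I", "sample.bam", "-O", "sample.vcf.gz"])
```

Each hand-off in this three-tool chain assumes correctly built indices and compatible file formats, and every added step in Figure 3 multiplies the opportunities for a non-specialist to misconfigure the analysis.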
To tackle these problems, I have devised bioinformatics software and pipelines that offer efficient and simple methods for researchers in the fields of synthetic biology and precision medicine.
In the first part, Chapter 1 of this dissertation, I introduce an analysis platform called TnClone that offers synthetic biologists a paradigm shift in their work, reducing the time, cost, and labor required to analyze various cloned plasmids at an unprecedented scale.
In the second part, Chapter 2 of this dissertation, I introduce an analytical method that distinguishes sequencing error signals from true variant signals in targeted gene sequencing data from liquid biopsy samples of metastatic colorectal cancer patients. After calling variants from the liquid biopsy samples, I investigated the clinical characteristics of the patients in conjunction with the called variants.