From sequencing reads to diagnostics

NGS in medical genetics

Sounkou Mahamane Toure

Malian Data Science and Bioinformatics Network (MD-BioNet)

February 21, 2024

Introduction

What is DNA/RNA

Deoxyribonucleic acid, commonly referred to as DNA, is a molecule that holds the information required to build and sustain all living organisms.

Source : https://www.geeksforgeeks.org/difference-between-dna-and-rna/

Introduction

What is DNA/RNA Sequencing

  • The determination of the order of nucleotides within a DNA molecule is called DNA sequencing (DNASeq)

  • RNA sequencing (RNASeq) is the same process applied to RNA molecules, most often mRNA

  • Both terms also cover any process or technology used to achieve this goal

Introduction

History of DNA Sequencing

source: Aimin Yang, Wei Zhang, Jiahao Wang, Ke Yang, Yang Han and Limin Zhang - doi:10.3389/fbioe.2020.01032

Introduction

Features of NGS

  • Next-generation sequencing (NGS), alternatively labeled as high-throughput sequencing, refers to a variety of contemporary sequencing technologies.
  • What distinguishes them from previous techniques:
    • Faster
    • Cheaper, with higher throughput
  • They have revolutionized:
    • Study of biological systems and their history
    • Relationship between organisms
    • Link between heredity and health and disease (Koboldt et al. 2013)

Introduction

Sequencing Generations

Source : Ronholm, Jennifer & Nasheri, Neda & Petronella, Nicholas & Pagotto, Franco. (2016). Navigating Microbiological Food Safety in the Era of Whole-Genome Sequencing. Clinical Microbiology Reviews. 29. 10.1128/CMR.00056-16.

Introduction

Sequencing Generations

source: https://twitter.com/PacBio/status/1233091102800011266/photo/1

Introduction

Cost of sequencing

source: https://www.genome.gov/sequencingcosts/

Introduction

Scale of data

Increasing amount of data generated (Stephens et al. 2015)

source : https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Introduction

Challenges and opportunities

  • Data storage and processing

  • Interpretation

  • Integration between omics and other sources of information (López de Maturana et al. 2019)

    source: https://www.mdpi.com/2073-4425/10/3/238#

Introduction

Clinical application

  • DNASeq
    • Germline
    • Somatic
    • Human or microbial
  • RNASeq
    • Germline
    • Somatic

In these applications, the approach may be targeted or genome/transcriptome-wide.

General NGS workflow

source : 2019 ACE-B NGS Intro Course , Dr. Ghedira Kais

General NGS workflow

Library Preparation

This depends on the type of application (Head et al. 2014), but in general it involves:

  • DNA/cDNA extraction and purification

  • Enrichment: targeted or not

  • Adapter ligation and amplification

    source : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4351865/figure/F1/

General NGS workflow

Sequencing

source: https://wp.unil.ch/gtf/illumina-short-read-sequencing/

General NGS workflow

Sequencing Output

source : https://biocorecrg.github.io/RNAseq_course_2019/fastq.html

General NGS workflow

Secondary analysis and tertiary analysis

Example of the possibilities for secondary analysis depending on the application (Garcia et al. 2020)

source : https://github.com/nf-core/sarek

General NGS workflow

Secondary analysis and tertiary analysis

source : https://uofabioinformaticshub.github.io/Intro-NGS-Sept-2017/notes/variant_calling.html

General NGS workflow

Clinical interpretations

Interpretation of genomic variation is complex (Quintáns et al. 2014)

DNASeq workflow

Quality Control of reads

The goal here is to assess the overall quality of the base calls made by the sequencer and to detect possible anomalies. Several tools can do this; here we used fastp (Chen et al. 2018).
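To make "quality of the base calls" concrete, here is a minimal Python sketch of how per-base quality is read from a FASTQ record. It assumes the common Phred+33 encoding and uses a made-up two-read example; it is not how fastp is implemented, only an illustration of the quantity such tools summarise.

```python
from statistics import mean


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_fastq(text: str):
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        name, seq, _, qual = lines[i:i + 4]
        yield name[1:], seq, qual


# Hypothetical FASTQ content, for illustration only
EXAMPLE_FASTQ = """@read1
ACGT
+
IIII
@read2
ACGT
+
II##
"""

for name, seq, qual in parse_fastq(EXAMPLE_FASTQ):
    scores = phred_scores(qual)
    print(name, scores, "mean Q =", round(mean(scores), 1))
```

In this encoding the character `I` stands for Q40 (about 1 error in 10,000 calls) and `#` for Q2, so the second read shows the typical pattern of quality dropping toward the 3' end.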

DNASeq workflow

Quality Control of reads

After this, a decision can be made to apply additional quality filtering, such as:

  • Further adapter trimming

  • Quality trimming

  • Quality filtering

GC content is a very important parameter here.
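The filtering steps listed above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any particular tool: it trims low-quality bases from the 3' end one at a time (real tools often use sliding windows), and the thresholds `min_q`, `min_mean_q` and `min_len` are arbitrary values chosen for the example.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def trim_3prime(seq: str, quality: str, min_q: int = 20, offset: int = 33):
    """Quality trimming: drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], quality[:end]


def passes_filter(quality: str, min_mean_q: float = 20, min_len: int = 3,
                  offset: int = 33) -> bool:
    """Quality filtering: keep a read only if it is long enough and its mean Phred score is high enough."""
    if len(quality) < min_len:
        return False
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores) >= min_mean_q


# Hypothetical read with two poor base calls at the 3' end
seq, qual = trim_3prime("ACGT", "II##")
print(seq, qual, "GC =", gc_content(seq), "kept =", passes_filter(qual))
```

With these illustrative thresholds the trimmed read is too short and is discarded, which shows why trimming and filtering settings should be chosen together.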

DNASeq workflow

Mapping or de novo assembly

We get individual reads, but with short-read sequencing they usually come from similar regions of the targeted DNA/RNA molecules.

DNASeq workflow

Mapping or de novo assembly

You can either assemble the puzzle from the reads alone (de novo assembly), use a reference, or mix these strategies.

DNASeq workflow

Mapping

Several tools exist for mapping reads to a reference genome. Points to consider in the mapping process:

  • The choice of the reference is important (Which Genome to choose ? , Heng Li)
  • The alignment software and algorithm
    • Whether additional processing is required (duplicate marking, base quality recalibration)
    • Processing speed
    • Compute requirements
    • The variant calling or RNA quantification algorithm used downstream

DNASeq workflow

Mapping

The output of this step is a SAM/BAM file, whose specification can be found at https://github.com/samtools/hts-specs.

source : https://www.samformat.info/sam-format-flag
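The FLAG column of a SAM record packs several yes/no properties into one integer. The Python sketch below decodes it using the bit values defined in the SAM specification; the short labels are paraphrases written for this example, not official names.

```python
# Bit values from the SAM specification; labels are informal paraphrases
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list[str]:
    """Return the label of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


for flag in (99, 147, 4):
    print(flag, decode_flag(flag))
```

Flags 99 and 147 are the values typically seen for the two mates of a properly paired read, while 4 marks an unmapped read.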

DNASeq workflow

Mapping quality check

Here you can detect potential issues with the sequencing reads. One should check:

  • mapping percentage to the reference genome
  • quality of the mapping
  • insert size for paired-end sequencing
  • duplication rates
  • coverage of the targeted regions for target capture

View the MultiQC file for illustration.
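Several of the checks above can be derived from three fields of each alignment record: FLAG, MAPQ and TLEN. The following Python sketch computes them on a hand-made list of records. The `Alignment` class and `mapping_summary` function are hypothetical names for this illustration; in practice these numbers come from dedicated tools whose reports are then aggregated, and coverage of target regions needs the alignment positions, which this sketch leaves out.

```python
from dataclasses import dataclass

# FLAG bits from the SAM specification
PROPER_PAIR, UNMAPPED = 0x2, 0x4
SECONDARY, DUPLICATE, SUPPLEMENTARY = 0x100, 0x400, 0x800


@dataclass
class Alignment:
    flag: int   # SAM FLAG
    mapq: int   # mapping quality
    tlen: int   # observed template length (signed)


def mapping_summary(alignments: list[Alignment]) -> dict:
    """Summarise mapping rate, duplication rate, MAPQ and insert size."""
    # Count each read once: ignore secondary and supplementary records
    primary = [a for a in alignments if not a.flag & (SECONDARY | SUPPLEMENTARY)]
    mapped = [a for a in primary if not a.flag & UNMAPPED]
    duplicates = [a for a in mapped if a.flag & DUPLICATE]
    # Use only the positive TLEN of each proper pair so a pair is counted once
    inserts = [a.tlen for a in mapped if a.flag & PROPER_PAIR and a.tlen > 0]
    return {
        "reads": len(primary),
        "mapped_pct": 100 * len(mapped) / len(primary) if primary else 0.0,
        "duplicate_pct": 100 * len(duplicates) / len(mapped) if mapped else 0.0,
        "mean_mapq": sum(a.mapq for a in mapped) / len(mapped) if mapped else 0.0,
        "mean_insert_size": sum(inserts) / len(inserts) if inserts else 0.0,
    }


# Hypothetical records: a proper pair, a duplicate read, and an unmapped read
EXAMPLE = [
    Alignment(flag=99, mapq=60, tlen=300),
    Alignment(flag=147, mapq=60, tlen=-300),
    Alignment(flag=1123, mapq=40, tlen=250),
    Alignment(flag=77, mapq=0, tlen=0),
]

for key, value in mapping_summary(EXAMPLE).items():
    print(key, round(value, 1))
```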

DNASeq workflow

Variant calling

There are several ways to call variants from BAM files. Here is an illustration of the GATK best practices for exome variant calling (Alganmi and Abusamra 2023).

source : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10399881/

DNASeq workflow

Variant calling

Alternative single nucleotide variant (SNV) workflow for this tutorial. GATK-based workflows are usable on platforms like https://app.terra.bio/ or https://seqera.io/.

DNASeq workflow

Variant calling : Importance of benchmarking

  • Test your workflow against established datasets

  • Compare it to alternatives for precision and sensitivity

  • Accuracy for single nucleotide variants

  • Accuracy for insertions/deletions

    source : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10399881/
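The benchmarking metrics above reduce to set comparisons between a call set and a truth set. Below is a minimal Python sketch on made-up variants, reported separately for SNVs and insertions/deletions. It assumes both sets already use the same variant representation and matches variants exactly; dedicated benchmarking tools handle normalisation and genotype matching, which are omitted here.

```python
def variant_type(variant: tuple) -> str:
    """Classify a (chrom, pos, ref, alt) tuple as 'SNV' or 'indel'."""
    _, _, ref, alt = variant
    return "SNV" if len(ref) == 1 and len(alt) == 1 else "indel"


def benchmark(calls: set, truth: set) -> dict:
    """Compare called variants to a truth set by exact match."""
    tp = len(calls & truth)   # called and true
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # true but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "sensitivity": sensitivity, "f1": f1}


def benchmark_by_type(calls: set, truth: set) -> dict:
    """Run the comparison separately for SNVs and indels."""
    return {
        kind: benchmark({v for v in calls if variant_type(v) == kind},
                        {v for v in truth if variant_type(v) == kind})
        for kind in ("SNV", "indel")
    }


# Hypothetical truth set and call set, for illustration only
TRUTH = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "GA"), ("chr2", 300, "TC", "T")}
CALLS = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "GA"), ("chr3", 10, "A", "C")}

print(benchmark(CALLS, TRUTH))
print(benchmark_by_type(CALLS, TRUTH))
```

Splitting by variant type matters because a workflow can look good overall while missing a large share of indels, as the toy data show.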

DNASeq workflow

Variant effect prediction

Several established methods exist, as well as machine learning models (Cheng et al. 2023).

DNASeq workflow

Variant effect prediction Tools

Considerations here are:

  • Accuracy

  • Usability

  • Cost

  • Access : open source, commercial

For example, services like VEP, OpenCRAVAT and wANNOVAR are free for academic use. Commercial services like VarSome have restrictions.

DNASeq in medical genomics

Challenges

  • Misdiagnosis

  • Phenotypic complexity

  • Limited knowledge of the molecular action of variants predicted to be deleterious

DNASeq in medical genomics

Opportunities

NGS does increase the diagnostic yield for poorly characterized diseases (Yska et al. 2019). Machine learning models can also contribute (O’Brien et al. 2022).

Summary

Considerations for bioinformaticians

  • Select methods and techniques according to:
    • Use case
    • Available resources
    • Sustainability

Summary

Considerations for bioinformaticians

  • Test the methods
  • Look for improvements
  • Stay informed

Thank you!

References

Alganmi, Nofe, and Heba Abusamra. 2023. “Evaluation of an Optimized Germline Exomes Pipeline Using BWA-MEM2 and Dragen-GATK Tools.” Edited by Alvaro Galli. PLOS ONE 18 (8): e0288371. https://doi.org/10.1371/journal.pone.0288371.
Chen, Shifu, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. “Fastp: An Ultra-Fast All-in-One FASTQ Preprocessor.” Bioinformatics 34 (17): i884–90. https://doi.org/10.1093/bioinformatics/bty560.
Cheng, Jun, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, et al. 2023. “Accurate Proteome-Wide Missense Variant Effect Prediction with AlphaMissense.” Science 381 (6664). https://doi.org/10.1126/science.adg7492.
Garcia, Maxime, Szilveszter Juhos, Malin Larsson, Pall I. Olason, Marcel Martin, Jesper Eisfeldt, Sebastian DiLorenzo, et al. 2020. “Sarek: A Portable Workflow for Whole-Genome Sequencing Analysis of Germline and Somatic Variants.” F1000Research 9 (September): 63. https://doi.org/10.12688/f1000research.16665.2.
Head, Steven R., H. Kiyomi Komori, Sarah A. LaMere, Thomas Whisenant, Filip Van Nieuwerburgh, Daniel R. Salomon, and Phillip Ordoukhanian. 2014. “Library Construction for Next-Generation Sequencing: Overviews and Challenges.” BioTechniques 56 (2): 61–77. https://doi.org/10.2144/000114133.
Koboldt, Daniel C., Karyn Meltz Steinberg, David E. Larson, Richard K. Wilson, and Elaine R. Mardis. 2013. “The Next-Generation Sequencing Revolution and Its Impact on Genomics.” Cell 155 (1): 27–38. https://doi.org/10.1016/j.cell.2013.09.006.
López de Maturana, Evangelina, Lola Alonso, Pablo Alarcón, Isabel Adoración Martín-Antoniano, Silvia Pineda, Lucas Piorno, M. Luz Calle, and Núria Malats. 2019. “Challenges in the Integration of Omics and Non-Omics Data.” Genes 10 (3): 238. https://doi.org/10.3390/genes10030238.
O’Brien, Timothy D., N. Eleanor Campbell, Amiee B. Potter, John H. Letaw, Arpita Kulkarni, and C. Sue Richards. 2022. “Artificial Intelligence (AI)-Assisted Exome Reanalysis Greatly Aids in the Identification of New Positive Cases and Reduces Analysis Time in a Clinical Diagnostic Laboratory.” Genetics in Medicine 24 (1): 192–200. https://doi.org/10.1016/j.gim.2021.09.007.
Quintáns, B., A. Ordóñez-Ugalde, P. Cacheiro, A. Carracedo, and M. J. Sobrido. 2014. “Medical Genomics: The Intricate Path from Genetic Variant Identification to Clinical Interpretation.” Applied & Translational Genomics 3 (3): 60–67. https://doi.org/10.1016/j.atg.2014.06.001.
Stephens, Zachary D., Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, and Gene E. Robinson. 2015. “Big Data: Astronomical or Genomical?” PLOS Biology 13 (7): e1002195. https://doi.org/10.1371/journal.pbio.1002195.
Yska, Hemmo A. F., Kim Elsink, Taco W. Kuijpers, Geert W. J. Frederix, Mariëlle E. van Gijn, and Joris M. van Montfrans. 2019. “Diagnostic Yield of Next Generation Sequencing in Genetically Undiagnosed Patients with Primary Immunodeficiencies: A Systematic Review.” Journal of Clinical Immunology 39 (6): 577–91. https://doi.org/10.1007/s10875-019-00656-x.