In the life sciences, massive amounts of omics data are produced, spanning the genome, transcriptome, proteome and metabolome. These data have led to great insights and scientific breakthroughs in fields such as pharmaceuticals and biotech, gene therapy, virology, agriculture and climate change.
But with the volume of omics data expected to reach around 40 exabytes by 2025, generating more data is not without issues, especially when it comes to analysing and integrating that data. There is therefore a need to improve the complete analysis process from start to finish, so that data acquisition, evaluation, comparison, and reporting of results can happen much faster and more easily.
1. DNA and RNA Sequencing
The Human Genome Project is the best-known example: a large-scale genome project, conducted in collaboration between many countries across the globe, that aimed to sequence the whole human genome. This groundbreaking research took 13 years to finish, but it provided a vast amount of valuable information that advanced the field of human genetic research. It also paved the way not only for other scientific areas, including crop improvement, forensics and population genetics, but also for advances in sequencing technologies. In recent years, sequencing technology has become more accessible and affordable, opening up a wide range of possibilities in scientific research.
1.1. What is sequencing?
The fundamental units of DNA are base pairs, formed by nucleotides that sit on opposite strands of the double helix. There are four different nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Long sequences of these pairs form DNA and chromosomes, and the purpose of sequencing is to determine the exact order of these base pairs in our genetic code. In short, sequencing identifies the order of the nucleotides, which determines the functioning of our genes, phenotype, metabolism, genetic risks, diseases etc. Sequencing can help detect a wide range of variants in the DNA, such as insertions, deletions and point mutations, that result in different functions and traits.
For decades, Sanger sequencing, which uses fluorescently labelled chain-terminating nucleotides and electrophoresis, has been the gold standard sequencing technology. In traditional Sanger sequencing, the molecules were first cloned inside bacteria, amplified, and then sequenced by having the enzyme DNA polymerase incorporate fluorescently labelled nucleotides. As technology advanced, first-generation sequencing was largely replaced by next-generation sequencing (NGS) technologies, which are high-throughput, affordable and accessible. NGS methods developed in recent years include:
- Sequencing by synthesis (SBS) by Illumina
- Single-molecule real-time sequencing by PacBio
- Nanopore technology sequencing by Oxford Nanopore
- Sequencing by ligation or SOLiD sequencing by ThermoFisher Scientific
- Combinatorial probe anchor synthesis (cPAS) by BGI/MGI
- Ion semiconductor or Ion Torrent sequencing by ThermoFisher Scientific
Sequencing can be DNA sequencing or RNA sequencing, depending on the downstream application, for example biomarker discovery or support for molecular diagnostics.
1.1.1. Why is sequencing important?
The technology has become robust over the years, and NGS can now effectively support a wide range of genetic analysis research applications, such as:
- Whole-genome sequencing, sequencing and studying entire genomes,
- Genotyping, studying variation in the sequences,
- Transcriptome and gene expression, analyzing the differential expression of the transcripts and genes in each set of conditions,
- Epigenetics, identifying heritable changes in regulating the genes.
NGS technology also allows researchers to study other areas, including microbial diversity and evolution, with increased precision. It has become an indispensable tool in genomic research, yielding many valuable insights into complex biological systems ranging from cancer genomics to diverse microbial communities.
1.2. Evolution of sequencing technologies
Sequencing is a fast-moving field to keep up with, as the platforms and technologies on the market have evolved rapidly.
1.2.1. First generation sequencing
DNA sequencing was first developed in the late 1970s as a gel-based method that combines DNA polymerase with a mixture of standard nucleotides (dNTPs) and chain-terminating nucleotides (ddNTPs). Mixing dNTPs with ddNTPs causes random early termination during strand synthesis; the resulting fragments are sorted by length and visualized with gel electrophoresis. The technique was revolutionary at the time, enabling the sequencing of 500-1000 bp fragments. Later, in the 1980s, Sanger sequencing was automated: the radioactively labelled nucleotides were replaced with dye-labelled nucleotides, and the large gels with acrylic-fiber capillaries. By the mid-2000s, the cost of sequencing had been dramatically reduced; these advances had also made the Human Genome Project feasible.
1.2.2. Second-generation sequencing
Next-generation sequencing began with Solexa, which was later acquired by Illumina. A key innovation of this platform is bridge amplification, which forms densely clustered amplified fragments on a silicon chip. Each molecule is amplified into a large cluster of identical copies so that a fluorescent signal can be detected as labelled dNTPs are incorporated one at a time during synthesis. Illumina became the first commercially available massively parallel sequencing technology, built on the principle of ‘Sequencing by Synthesis’. Over time, other platforms were developed, including Ion Torrent, which reduced sequencing costs; with read lengths of ~50-500 bp, these platforms are an excellent fit for multiple applications (SNP calling, targeted sequencing).
1.2.3. Third-generation sequencing
Third-generation sequencing addresses the limitations of short reads, which are not suitable for every sequencing project. The first approach was Single-Molecule, Real-Time (SMRT) sequencing from PacBio. It uses miniature wells in which a single polymerase incorporates labelled nucleotides while the emitted light is measured in real time. Another approach to long-read sequencing was adopted by Oxford Nanopore Technologies: a different single-molecule approach that uses pore-forming proteins and electrical detection. SMRT sequencing is the most notable and can produce long reads with good precision. Long reads resolve the repetitive regions that are difficult for short-read sequencers because of their short snippets. Further advantages include the ability to sequence extreme GC- and AT-rich regions, which do not amplify well during cluster generation on short-read platforms, and to detect DNA methylation during sequencing, as no amplification is done on the instrument.
1.3. What is single-cell RNA sequencing?
Individual cells coordinate and function together to produce complex biological functions and traits. Conventional methods like bulk RNA-seq provide an overview of differential expression but are unable to reveal the cellular heterogeneity that drives this complexity. Single-cell RNA (ScRNA) sequencing is an NGS method that examines individual cells, providing a high-resolution view of cell-to-cell variation.
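The difference between an averaged bulk signal and per-cell measurements can be illustrated with a small sketch. The expression counts below are made up for illustration, and the threshold used to split the cells is an arbitrary assumption, not a recommended analysis method.

```python
from statistics import mean

# Hypothetical expression counts for one gene in six individual cells.
# Cells 0-2 and cells 3-5 stand for two made-up subpopulations.
single_cell_counts = [0, 1, 0, 98, 102, 99]


def bulk_signal(counts):
    """Bulk sequencing reports, in effect, one averaged value per gene."""
    return mean(counts)


def split_by_threshold(counts, threshold):
    """Per-cell data keeps each cell separate, so groups can be told apart."""
    low = [c for c in counts if c < threshold]
    high = [c for c in counts if c >= threshold]
    return low, high


bulk = bulk_signal(single_cell_counts)
low, high = split_by_threshold(single_cell_counts, threshold=50)

print(f"bulk average: {bulk}")
print(f"low-expressing cells: {low}")
print(f"high-expressing cells: {high}")
```

The bulk average sits halfway between the two groups and matches neither of them, which is the kind of heterogeneity that only becomes visible at single-cell resolution.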
Conventional bulk population sequencing provides the average expression signal of a group of cells. Increasing evidence, however, suggests that expression is heterogeneous even among similar cell types, and that this stochastic expression is reflected in cell composition and cell fate decisions. Sequencing at single-cell resolution builds on work pioneered by James Eberwine et al. and by Iscove and colleagues, which was first used commercially for high-density DNA microarray chips and eventually adapted for ScRNA sequencing. The first ScRNA study was published in 2009, describing the characterization of cells at early developmental stages in mice. This study drew the interest of researchers because of its high-resolution views of single-cell heterogeneity on a global scale.
1.3.1. Why is single cell sequencing important?
ScRNA sequencing is quickly becoming a standard tool in scientific research, providing scale and depth of insight into diverse cell populations. Microfluidic technology has transformed researchers’ ability to isolate single cells and study them further using ScRNA-seq. Consistency and reproducibility are critical for advances in drug development and precision medicine. In conventional bulk sequencing methods, variability and reproducibility issues are often introduced by the users. Automated single-cell gene expression pipelines address these concerns and can free up considerable time and resources.
1.3.2. What are the advantages of single-cell RNA sequencing?
With minimal input and in an unbiased manner, single-cell and ultra-low input RNA-seq allow researchers to extract high-resolution cellular data. With the advent of ScRNA-seq, researchers can perform robust transcriptome analysis at single-cell input level. This high-resolution analysis enables researchers to discover cellular differences that are masked by bulk sampling methods. ScRNA-seq is particularly helpful for analysing rare cell types, where other techniques have many limitations, and can effectively characterize hidden populations and measure their gene expression.
Cancer is one of the most complex diseases to understand, given its heterogeneous nature. To ensure effective diagnosis and treatment of different cancers, it is important to understand the early stages of their development from cancer stem cells. ScRNA sequencing has proved effective for understanding intra-tumoral heterogeneity and for mapping the clones in a tumor. It is increasingly used in research on different types of cancer, including lung, breast, renal and hepatocellular carcinoma. ScRNA sequencing also enables the exploration of complex networks beyond the different cell types, through integration with and application to functional genomics, immunology, oncology and stem cell biology.
Beyond these benefits, ScRNA sequencing technology is constantly evolving and being applied to new applications. Emerging, deeply focused studies strengthen our knowledge, provide novel insights into biological systems and create new opportunities for therapeutic development.
2. An overview of sequencing techniques
2.1. Chain termination method or Sanger sequencing
This method was developed by Frederick Sanger and was a major technological breakthrough. The Human Genome Project, completed in 2003, was based on this technology. Sanger sequencing is based on PCR (polymerase chain reaction) to make multiple copies of a target DNA region. The ingredients needed for the whole process are a DNA polymerase enzyme, a primer that serves as a starter for the PCR process, the 4 DNA nucleotides and, unique to the Sanger method, modified DNA nucleotides called dideoxyribonucleotide triphosphates (ddNTPs), which are chain-terminating and each labelled with a specific fluorescent dye.
Once a ddNTP has been added to the chain, the reaction stops. The PCR process is repeated a number of times to make sure that a ddNTP is incorporated at every position of the target DNA. After that, all fragments pass through capillary gel electrophoresis. The longer the fragments, the slower they move through this tube filled with gel matrix. Each fragment length thus has a typical speed. At the end of the tube a laser illuminates the passing fragment, and the attached dye is detected. From the order of the dye colours, the sequence of the original DNA template can be reconstructed.
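The read-out logic described above can be sketched in a few lines. This is an idealised illustration only: it assumes exactly one terminated fragment per position and error-free dye detection, and the dye colours are arbitrary labels chosen for the example.

```python
# Dye colours are arbitrary labels chosen for this sketch.
DYE_FOR_BASE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}


def terminated_fragments(synthesised_strand):
    """Return one (length, dye) fragment per position of the strand.

    The dye is that of the chain-terminating ddNTP at the fragment's end.
    """
    return [
        (position + 1, DYE_FOR_BASE[base])
        for position, base in enumerate(synthesised_strand)
    ]


def read_sequence(fragments):
    """Shorter fragments reach the detector first: sort by length, read dyes."""
    return "".join(BASE_FOR_DYE[dye] for _, dye in sorted(fragments))


fragments = terminated_fragments("GATTACA")
# The fragments come out of the reaction in no particular order.
unordered = fragments[::-1]
print(read_sequence(unordered))
```

Sorting by fragment length stands in for electrophoresis here; the sequence read this way is that of the newly synthesised strand, which is complementary to the template.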
Sanger sequencing produces high-quality results for DNA lengths of approximately 900 base pairs. Although next-generation sequencing techniques with high throughput volumes are now widely available, Sanger sequencing is still used as a method to confirm sequence variants identified by NGS. Sanger sequencing can also be used to solve some NGS coverage problems, e.g. regions rich in GC content that might be poorly covered by NGS.
2.2. Next-generation sequencing
Characteristic of NGS are the high throughput volumes that can be achieved at relatively low cost. A whole NGS workflow consists of library preparation, sequencing and data analysis.
2.2.1. Short-read sequencing
Short-read sequencing technologies typically produce reads that are 250-800 base pairs long.
---- DNA- and RNA-seq library preparation
DNA-seq can include Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES), epigenome sequencing and targeted sequencing (TS). Two methods for template preparation are mainly used: PCR and hybridisation capture.
As in any PCR, the same ingredients are included (a template, primers, dNTPs, DNA polymerase and buffer). All reagents are mixed together in a tube that goes into a thermal cycler. The PCR reaction consists of 3 distinct steps: denaturation (separating the double-stranded DNA), annealing (primers bind, or anneal, to the template DNA at specific positions) and extension (the DNA polymerase attaches to one end of the primer and synthesizes DNA complementary to the template DNA). Raising the temperature at the end of a cycle denatures all double-stranded DNA molecules into single strands again. In each complete cycle the amount of DNA is doubled.
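The doubling per cycle translates into exponential growth, which a short sketch makes concrete. The `efficiency` parameter is an assumption added for illustration (1.0 means perfect doubling); the text above only describes the ideal case.

```python
def copies_after(cycles, starting_copies=1, efficiency=1.0):
    """Number of copies after a given number of PCR cycles.

    efficiency=1.0 models perfect doubling in every cycle; lower values
    are a hypothetical way to model an imperfect reaction.
    """
    return starting_copies * (1 + efficiency) ** cycles


for cycles in (1, 10, 20, 30):
    print(f"{cycles:>2} cycles: {copies_after(cycles):,.0f} copies")
```

With perfect doubling, a single template molecule yields 2^30 (just over a billion) copies after 30 cycles.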
In hybridisation capture-based preparation of templates, long biotinylated oligonucleotide baits (probes) are used to hybridise to the regions of interest. After that, streptavidin-coated magnetic beads are introduced to separate the bait/target fragment complexes from fragments not bound to baits. This method is used in particular for WES and TS.
DNA-seq library construction further involves fragmentation, end-repair, adaptor ligation and size selection. Fragmentation aims at shearing DNA to the optimal size range for the sequencing platform of choice. Three methods exist: physical (acoustic shearing), enzymatic and chemical (heat). The fragments are then end-repaired and ligated to adaptors. Adaptors have defined lengths and often include a barcode, a unique sequence, to identify samples in the case of multiplex sequencing (multiple samples are pooled and sequenced simultaneously in the same run). These barcodes make it possible, later in data analysis, to assign reads to individual samples. Size selection might then be performed based on gel electrophoresis or using magnetic beads.
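How barcodes let pooled reads be assigned back to samples can be sketched as follows. This is a simplified illustration: it assumes the barcode sits at the very start of each read, that all barcodes have the same length, and that only exact matches count. The barcode sequences and sample names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign each read to a sample by its leading barcode, then strip it.

    Reads whose barcode is not in the table go to 'undetermined'.
    """
    barcode_length = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical barcodes and pooled reads.
barcodes = {"AAGT": "sample_1", "CCTA": "sample_2"}
pooled_reads = ["AAGTTTGACC", "CCTAGGATCA", "AAGTCCCGGG", "GGGGACGTAC"]

for sample, inserts in demultiplex(pooled_reads, barcodes).items():
    print(sample, inserts)
```

Production demultiplexers typically also tolerate a limited number of barcode mismatches; that is left out here to keep the principle visible.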
RNA-seq can include Whole Transcriptome Sequencing (WTS), mRNA sequencing (mRNA-seq) and small RNA sequencing (smRNA-seq). Sample preparation generally includes total RNA isolation, target RNA enrichment and reverse transcription of RNA into complementary DNA (cDNA).
---- Sequencing platforms
The sequencing principle used for short reads is sequencing by synthesis and involves two steps: clonal amplification and sequencing.
Prior to sequencing, the DNA library must be attached to a solid surface. Amplification is necessary to increase the signal coming from each target during sequencing. The solid surfaces to which the unique DNA molecules bind are beads or flow cell surfaces. Depending on the sequencing platform, emulsion PCR (Ion Torrent) or bridge PCR (Illumina) is used to amplify the anchored DNA fragments.
On the Ion Torrent platform, during sequencing, a semiconductor chip (ion sensor) is flooded with unmodified A, C, T or G nucleotides, one after another. Incorporation of a single nucleotide releases a hydrogen ion, resulting in a pH change that is measured by the ion sensor. If the next nucleotide that floods the chip is not a match, then no change is detected, and no base is called.
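The flow-by-flow logic can be sketched as a small simulation. It is an idealised model, not the platform's actual signal processing: the flow order `TACG` is an assumption made for the example, and the sketch assumes that a run of identical bases is incorporated within one flow and yields a proportionally larger, perfectly measured signal.

```python
def flow_signals(synthesised_strand, flow_order="TACG", max_flows=100):
    """Simulate successive nucleotide flows over a growing strand.

    Each flow yields (nucleotide, incorporated_count); a count of 0 means
    the flooded nucleotide did not match, so no pH change is detected.
    """
    signals = []
    position = 0
    for flow in range(max_flows):
        if position >= len(synthesised_strand):
            break
        nucleotide = flow_order[flow % len(flow_order)]
        count = 0
        while (position < len(synthesised_strand)
               and synthesised_strand[position] == nucleotide):
            count += 1
            position += 1
        signals.append((nucleotide, count))
    return signals


def call_bases(signals):
    """Turn per-flow signals back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flow_signals("TTAG")
print(signals)
print(call_bases(signals))
```

In this example the C flow produces no signal and therefore no base call, while the first T flow reports two incorporations for the `TT` run.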
On the Illumina platform, sequencing is based on the optical read-out of fluorescent nucleotides as they are incorporated by a DNA polymerase. Each nucleotide contains a fluorescent tag and a reversible terminator that blocks incorporation of the next base. The fluorescent signal indicates which nucleotide has been added. After each cycle the terminator is cleaved, allowing the next base to bind. In addition, Illumina NGS platforms are capable of paired-end sequencing, i.e. sequencing from both ends of a DNA fragment, which generates high-quality sequence data with in-depth coverage and high numbers of reads.
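What "sequencing from both ends" means for the resulting reads can be shown with a minimal sketch. It assumes an error-free fragment and a fixed read length; the function names and the example fragment are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Read 1 starts at one end of the fragment; read 2 starts at the
    other end and is read from the opposite strand."""
    read_1 = fragment[:read_length]
    read_2 = reverse_complement(fragment)[:read_length]
    return read_1, read_2


read_1, read_2 = paired_end_reads("AACCGGTTAC", read_length=4)
print(read_1, read_2)
```

Because both reads come from the same fragment, the distance between them is approximately known, which is what makes read pairs useful later during alignment.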
2.2.2. Long-read sequencing
Long-read sequencing technologies can produce reads > 10 kb directly from native DNA. These technologies circumvent the need for PCR, sequencing single molecules without prior amplification steps. This is an advantage as PCR can cause errors in the amplification process. Today, for long-read sequencing two main techniques are used.
---- SMRT Sequencing (PacBio)
Single-Molecule Real-Time sequencing is a third-generation sequencing method for DNA and RNA. The DNA to be sequenced is turned into a SMRTbell template. This template is created by ligation of hairpin adapters (SMRTbell adapters) to both ends of the double-stranded DNA. The sequencing reaction takes place in a SMRT Cell chip with many small wells called zero-mode waveguides (ZMWs). Each ZMW contains an individual DNA polymerase, which enables the sequencing of a single SMRTbell template. During replication, four fluorescently labelled nucleotides with unique emission spectra are used. As the anchored polymerase incorporates a labelled base, a signature light pulse is emitted and measured in real time. As the template is circular, the polymerase can continue sequencing through the hairpin adapter to replicate the second DNA strand. Sequencing of one strand is called a ‘pass’. The sequence obtained from each ZMW is called a continuous long read (CLR). The adapter sequences are removed to retain the DNA template sequences in between, resulting in multiple ‘subreads’ that are collapsed into a HiFi read (highly accurate long read).
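Why collapsing several passes over the same molecule improves accuracy can be illustrated with a simple majority vote. This is a deliberately simplified stand-in, not PacBio's consensus algorithm: it assumes the subreads are already aligned, have equal length and contain only substitution errors. The subreads below are made up.

```python
from collections import Counter


def consensus(subreads):
    """Majority vote per position over equal-length, pre-aligned subreads."""
    if len({len(subread) for subread in subreads}) != 1:
        raise ValueError("this sketch assumes equal-length, aligned subreads")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*subreads)
    )


# Five hypothetical passes over the same template, each with at most one
# random error at a different position.
subreads = ["ACGTAC", "ACGAAC", "ACGTAC", "TCGTAC", "ACGTAG"]
print(consensus(subreads))
```

Because errors in individual passes fall at different positions, they are outvoted by the other passes, so the consensus is more accurate than any single subread.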
PacBio technology can also be used for RNA sequencing by a technique termed Iso-Seq. Using the Iso-Seq method, entire transcripts, including any isoforms, can be sequenced. In this method, RNA is converted to cDNA, and HiFi sequencing is used to generate sequencing data.
---- Oxford Nanopore Technology (ONT)
ONT sequencing is based on the passage of single-stranded nucleic acid (DNA or RNA) through a protein nanopore. The DNA templates are loaded onto a flow cell containing a membrane embedded with hundreds to thousands of nanopores. A preloaded motor enzyme, along with an applied ion current, moves the single strand through the pore. The passage of each nucleotide through the pore results in a characteristic disruption of the ion current, which is detected by sensors.
Beyond DNA sequencing, ONT may be used to sequence RNA and detect DNA and RNA modifications. Similar to PacBio, ONT can sequence full-length RNA as cDNA. However, ONT also has the ability to use native RNA.
3. Omics data analysis
3.1. Where are we today?
Although the core questions in genetic research are related to disentangling the associations between DNA, RNA and protein, the current tools and methods of data analysis are not oriented towards integration of knowledge.
Today, data analysis is characterised by fragmentation. Whether you are interested in finding similarities or in detecting variations, all omics data analysis is organised in silos, with different applications for each analytical step. Along the different steps in the analysis process, outputs are generated in different formats. Although automated pipelines are available, the process of analysis remains time-consuming and very complex. Only highly trained professionals are able to perform these analyses.
The integration of biological databases is also lacking. So, it is very difficult to query more than one database at a time. Currently, there is also no way to combine analysis in genomics, transcriptomics and proteomics which has proven to be a blocking factor. It is certainly not very helpful in maintaining oversight and easily detecting novel relationships.
Moreover, the current algorithms are far from flawless and result in an accumulation of errors during the analysis process. Additionally, most algorithms are computationally very intensive which results in slow processing times.
New developments in omics data analysis technologies should be aimed at integration of knowledge and at increasing precision of analysis. This would bring a high level of accessibility, efficiency and accuracy to the field.
Further downstream, advanced analysis methods such as machine learning or graph reasoning can only produce meaningful insights and predictions when the data that serve as input are of high quality. No classification or prediction algorithm can compensate for poor-quality input data. So, in order to make better models, for example of disease mechanisms or for drug target development, we need algorithms for the detection of similarities and variations in DNA, RNA and proteins that produce highly accurate results. Only then will we be able to deliver better insights and better predictions, leading to real advancements in precision medicine and other fields of science.
Integration of data analysis between genomics, transcriptomics and proteomics would not only expand the search field but also bridge the gap between isolated silos. It would facilitate the discovery of novel relationships, such as between species or in gene transcription processes, and of other kinds of knowledge necessary for progress in medicine and other life sciences.
To solve these challenges the BioStrand solution compresses multiple and often disparate stages of traditional omics data analysis into one simple, intuitive and user-friendly interface with the technology doing all the heavy lifting in the background. It eliminates all the usual challenges of building complex pipelines, finding access to multiple databases, and navigating the steep learning curve of a disparate tool environment. The solution actualizes the principle of ‘Data in, Results out’ to streamline and accelerate knowledge extraction and time-to-value.
Search is multi-domain and is as simple as inputting text or pasting bio-sequences, with the results displayed on three levels: DNA, RNA, and AA. Drill down, filter, and extrapolate through the results and combine multiple dimensions, such as taxonomy or ontology, to quickly discover novel functional relationships. Take a microscopic view down to the sequence level or discover other useful visual applications such as ontology maps, frequency tables, or multiple sequence alignment views.
In short, the BioStrand platform is designed to maximize researchers’ view of their data with integrated, comprehensive, and accurate results that accelerate time-to-insight and -value.
3.2. The data analysis steps
The whole data processing is mostly subdivided into 3 steps:
3.2.1. Primary analysis
Generally, primary analysis takes place inside the sequencing platform and consists of converting raw signals into nucleotide base calls. A quality check is also performed: reads are filtered out based on base call quality scores (Phred scores) and read length. In the case of multiplexing, i.e. when multiple samples have been sequenced simultaneously, reads are separated into different files according to the attached barcode. Trimming is also performed: removing the adaptor sequences and poor-quality bases at the ends of reads. The output of primary analysis is a FASTQ file.
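Quality-based trimming and filtering can be sketched as follows. A Phred score Q corresponds to an error probability of 10^(-Q/10), and FASTQ files commonly store each score as a single character with an ASCII offset of 33. The thresholds below are arbitrary assumptions for the example, and for brevity the sketch trims only the 3' end of the read.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(character) - offset for character in quality_string]


def error_probability(phred_score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred_score / 10)


def trim_and_filter(sequence, quality_string, min_quality=20, min_length=5):
    """Trim low-quality bases from the 3' end; drop reads left too short.

    Returns the trimmed sequence, or None if the read is filtered out.
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    trimmed = sequence[:end]
    return trimmed if len(trimmed) >= min_length else None


# 'I' encodes Phred 40 and '#' encodes Phred 2 with an offset of 33.
print(trim_and_filter("ACGTACGTAC", "IIIIIIII##"))
print(trim_and_filter("ACGTAC", "III###"))
print(error_probability(20))
```

The first read keeps its eight high-quality bases, while the second is discarded because only three bases remain after trimming.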
3.2.2. Secondary analysis
In this step the reads are aligned to a reference genome or a de novo assembly is performed to then call all the variants. Typical file formats are produced: SAM (sequence alignment map), BAM (binary alignment map, a compressed version of SAM) and VCF (variant call format).
---- Multiple Sequence Alignment
A multiple sequence alignment (MSA) arranges protein sequences into a rectangular array with the goal that residues in a given column are homologous (derived from a single position in an ancestral sequence), superposable (in a rigid local structural alignment) or play a common functional role. To this extent, there is no right or wrong alignment; rather, there are different models that reflect different biological perspectives.
Two general ways of thinking about alignments involve consideration either of the degree of similarity shared across the full sequence lengths or of the similarity that’s confined to specific regions of the sequences: the former results in a global alignment, the latter produces a local alignment. Many tools exist to perform local or global alignments.
Two distinct computational methods are used. ‘Dynamic programming’ is a formally correct and accurate method, but it lacks scalability and is thus only feasible for short sequences. To address this problem, approximate (heuristic) algorithms were developed. Yet these heuristic methods cannot accommodate the big data volumes that need to be computed. BioStrand has developed a completely new algorithm around a very efficient sorting principle called HYFTs™ that addresses the big data and scalability problem.
[Figure: on the left, the classical MSA method, which is computationally hard; on the right, the BioStrand MSA. The sequences marked in red and orange represent HYFTs that are identified and function as a very efficient sorting mechanism.]
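To make the dynamic programming approach mentioned above concrete, the sketch below computes the score of the best global alignment of two sequences with the classical Needleman-Wunsch recurrence. It covers only the pairwise case, the scoring values are arbitrary assumptions, and it is not related to the BioStrand HYFTs method.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: (mis)match, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            score[i][j] = max(
                diagonal,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a) + 1) × (len(b) + 1) cells, so time and memory grow with the product of the sequence lengths. Extending the same exact approach to many sequences at once adds one table dimension per sequence, which is why it does not scale to large alignments.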
---- De Novo assembly
In de novo assembly no reference is used, and reads are aligned to each other based on their sequence similarity to create a long consensus sequence called a contig. In terms of complexity and time requirements, de novo assemblies are orders of magnitude slower and more memory-intensive than mapping-based assemblies. This is mostly because the de novo assembly algorithm needs to compare every read with every other read, an operation with a naive time complexity of O(n²), where n is the number of reads.
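The all-against-all comparison can be sketched with a naive suffix-prefix overlap search. This illustrates only where the quadratic cost comes from, not how production assemblers work: it assumes error-free reads and exact overlaps, and the reads and minimum overlap length are made up.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_length exists.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_overlaps(reads, min_length=3):
    """Compare every read with every other read: n * (n - 1) ordered pairs."""
    found = {}
    for a, b in permutations(reads, 2):
        length = overlap(a, b, min_length)
        if length:
            found[(a, b)] = length
    return found


reads = ["ATTAGACC", "GACCTGCC", "TGCCGGAA"]
for (a, b), length in all_overlaps(reads).items():
    print(f"{a} -> {b}: overlap of {length}")
```

Three reads already require six ordered comparisons; doubling the number of reads roughly quadruples the number of pairs.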
A typical problem with short reads in de novo assembly is that they can sometimes align equally well to multiple locations in the genome; the longer the read, the easier it is to find its position. Paired-end reads reduce this issue to a certain extent, since the two reads of a pair lie a known distance apart, which is used to validate the alignment position. It is therefore crucial to remove reads that are too short before performing the alignment, as misaligned reads will lead to false-positive variant calls.
---- Variant calling
After reads have been aligned and processed, the next step in the pipeline is to identify differences between the selected reference genome and the newly sequenced reads. In short, the aim of variant calling is to identify polymorphic sites where nucleotides differ from the reference. There are multiple tools for variant calling. The outcome of the variant calling step is a VCF file.
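The principle of variant calling can be sketched with a simple pileup and majority vote. This is a toy illustration, not how established variant callers work: it assumes reads are already aligned without insertions or deletions, lie fully within the reference, and it ignores base qualities and zygosity. The reference, reads and `min_depth` threshold are made up.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_depth=2):
    """Report positions where the most common read base differs from the
    reference.

    aligned_reads is a list of (start, sequence) tuples with a 0-based
    start position on the reference. Returns (position, ref, alt) tuples
    with 1-based positions, the convention used in VCF files.
    """
    pileup = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little evidence to make a call
        most_common_base, _ = counts.most_common(1)[0]
        if most_common_base != reference[position]:
            variants.append(
                (position + 1, reference[position], most_common_base)
            )
    return variants


reference = "ACGTACGTAC"
aligned_reads = [(0, "ACGAAC"), (2, "GAACGT"), (3, "AACGTAC")]
print(call_variants(reference, aligned_reads))
```

All three reads carry an A where the reference has a T at position 4, so a single T-to-A substitution is reported.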
3.2.3. Tertiary analysis
This step addresses the important issue of making sense of the observed data. In the human genetics context, that means finding the fundamental link between variant data and the phenotype observed in a patient. Tertiary analysis begins with variant annotation, which adds additional information to the variants detected in the previous steps. Then variant interpretation is done, which in the context of human genetics is mostly performed by a qualified individual such as a clinical geneticist and/or genetic counsellor. At the end of the interpretation process, a variant will be classified as pathogenic or benign for an individual and their phenotype. Variants may also be classified as a variant of unknown significance (VUS), which means that there is currently not enough evidence available to classify the variant as pathogenic or benign. As more evidence is gathered and further testing is performed, these classifications may change.
3.3. Further downstream analysis
Depending on the biological context many other types of analysis can be performed on sequence data, e.g. gene expression analyses, various types of visualisations and clustering of data.
4. Protein structure prediction
One of the challenging tasks is protein structure prediction. Predicting a protein’s three-dimensional structure from its amino acid sequence remains an unsolved problem after several decades of efforts. Almost all structure prediction relies on the fact that, for two homologous proteins, structure is more conserved than sequence. If two protein sequences are similar, these two proteins are likely to have a very similar structure. If the query protein has a homolog of known structure, the task is relatively easy and high-resolution models can often be built by copying and refining the framework of the solved structure. However, a template-based modeling procedure does not help answer the questions of how and why a protein adopts its specific structure.
In particular, if structural homologs do not exist, or exist but cannot be identified, models have to be constructed from scratch. This procedure, called ab initio modeling, is essential for a complete solution to the protein structure prediction problem; it can also help us understand the physicochemical principle of how proteins fold in nature. Currently, the accuracy of ab initio modeling is low and the success is generally limited to small proteins (<120 residues). In general, less than 20% of the structure prediction is correct in the majority of models; even in the best cases at most 40% of the residues are modelled accurately.