In life sciences, massive amounts of omics data are produced, available as genome, transcriptome, proteome and metabolome. These data have led to great insights and scientific breaktrouhgs in various fields such as developments in pharmaceuticals and biotech, gene therapy treatments, and virology, agriculture and climate change.
But as we expect the volume of omics data to be around 40 exabytes by 2025, the generation of more data is not without issues - especially when it comes to analysing and integrating that data. So, there is a need to improve the complete analysis process from start to finish, so that data acquisition, evaluation, comparison, and results can occur much faster, and more easily.
1. An overview of sequencing techniques
1.1. Chain termination method or Sanger sequencing
This method was developed by Fredrick Sanger and was a major technological breakthrough. Based on this technology the human genome project was completed in 2003. Sanger sequencing is based on PCR (polymerase chain reaction) to make multiple copies of a target DNA region. The ingredients needed for the whole process are a DNA polymerase enzyme, a primer that serves as a starter for the PCR process, the 4 DNA nucleotides and unique to the Sanger method, modified DNA nucleotides, called dideoxyribonucleotide triphosphates (ddNTPs), which are chain terminating and labeled with a specific fluorescence dye.
Once a ddNTP has been added to the chain, the reaction stops. The PCR process is repeated a number of times to make sure that a ddNTP is incorporated in every position of the target DNA. After that, all fragments pass capillary gel electrophoresis. The longer the fragments, the slower they move through this tube filled with gel matrix. Each sequence length thus has a typical speed. At the end of the tube a laser illuminates the passing fragment, and the attached dye is detected. From the colour of the dyes, the original DNA template can be reconstructed.
Sanger sequencing produces high quality results for DNA lengths of appr 900 base pairs. Although next generation sequencing techniques with high throughput volumes are now widely available, Sanger sequencing is still used as a method to confirm sequence variants identified by NGS. Sanger sequencing can also be used to solve some NGS coverage problems e.g., regions rich in GC content that might be poorly covered by NGS.
1.2. Next-generation sequencing
Characteristic for NGS is the high throughput volumes that can be achieved at relatively low costs. A whole NGS workflow consists of library preparation, sequencing and data analysis.
1.2.1. Short-read sequencing
Short-read sequencing technologies typically produce reads of 250-800 base pairs long.
---- DNA- and RNA-seq library preparation
DNA-seq can include Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES), epigenome sequencing and targeted sequencing (TS). Two methods for template preparation are mainly used: PCR and hybridisation capture.
As in any PCR, the same ingredients are included (a template, primer, dNTP’s, DNA polymerase and buffer). All reagents are mixed together in a tube that goes into a thermal cycler. The PCR reaction consists of 3 distinct steps: denaturation (separating the double stranded DNA), annealing (primers bind, anneal to the template DNA at specific positions) and extension (the DNA polymerase attaches to one end of the primer and synthesizes DNA complementary to the template DNA, by raising the temperature at the end of the process all double stranded DNA molecules denature to single strands). In each complete cycle the amount of DNA is doubled.
In hybridisation capture-based preparation of templates, long biotinylated oligonucleotide baits (probes) are used to hybridise the regions of interest. After that, streptavidin-coated magnetic beads are introduced to separate the bait/target fragment complex from fragments not bound to baits. It is in particular used for WS and TS.
DNA-seq library construction further involves fragmentation, end-repair, adaptor ligation and size selection. Fragmentation aims at shearing DNA to the optimal size range for the sequencing platform of choice. Three methods exist; physical by acoustic shearing, enzymatic and chemical (heath). The fragments are then end-repaired and ligated to adaptors. Adaptors have defined lengths and often include a barcode, a unique sequence, to identify samples in the case of multiplex sequencing (multiple samples are pooled and sequenced simultaneously in the same run). These barcodes allow afterwards in data analysis to assign reads to individual samples. Size selection might then be performed based on gel electrophoresis or using magnetic beads.
RNA-seq can include Whole Transcriptome Sequencing (WTS), mRNA sequencing (mRNA-seq) and small RNA sequencing (smRNA-seq). Sample preparation generally includes total RNA isolation, target RNA enrichment and reverse transcription of RNA into complementary DNA (cDNA).
---- Sequencing platforms
The sequencing principle used for short reads is sequencing by synthesis and involves two steps: clonal amplification and sequencing.
Prior to sequencing the DNA library must be attached to a solid surface. Amplification is necessary to increase the signal coming from each target during sequencing. The solid surface to which the unique DNA molecules bind are beads or flow cell surfaces. Depending on the sequencing platform emulsion PCR (Ion Torrent) or bridging PCR (Illumina) is used to amplify the anchored DNA fragments.
On the Ion Torrent Platform, during sequencing, a micro conductor chip (ion sensor) is flooded with unmodified A, C, T or G nucleotides one after another. Incorporation of a single nucleotide releases a hydrogen ion resulting in a pH change, which is measured by the ion sensor. If the next nucleotide that floods the chip is not a match, then no change is detected, and no base is called.
On the Illumina platform sequencing is based on the optical read out of incorporating fluorescent nucleotides by a DNA polymerase. Each nucleotide contains a fluorescent tag and a reversible terminator that blocks incorporation of the next base. The fluorescent signal indicates which nucleotide has been added. After each cycle the terminator is cleaved, allowing a next base to bind. In addition, Illumina NGS platforms are capable of paired- end sequencing, sequencing that occurs from both ends of a DNA fragment, which generates high-quality sequence data with in- depth coverage and high numbers of reads.
1.2.2. Long-read sequencing
Long-read sequencing technologies can produce reads > 10 kb directly from native DNA. These technologies circumvent the need for PCR, sequencing single molecules without prior amplification steps. This is an advantage as PCR can cause errors in the amplification process. Today, for long-read sequencing two main techniques are used.
---- SMRT Sequencing (PacBio)
Single-Molecule Real-Time sequencing is a third-generation sequencing method for DNA and RNA. The DNA to be sequenced is turned into a SMRTbell template. This template is created by ligation of hairpin adapters (SMRTbell adapters) to both ends of the double-stranded DNA. The sequencing reaction takes place in a SMRTcell chip with many small pores called zero-mode waveguides (ZMW). Each ZMW contains an individual DNA polymerase which enables the sequencing of a single SMRTbell template. During replication four fluorescently labeled nucleotides with unique emission spectra are used. As the anchored polymerase incorporates a labeled base, a signature light pulse is emitted measured in real-time. As the template is circular, the polymerase can continue sequencing through the hairpin adapter to replicate the second DNA strand. Sequencing of one strand is called a ‘pass’. The sequence obtained from each ZMW is called a continuous long read (CLR). The adapter sequences are removed to retain the DNA templates in between, resulting in what is called multiple ‘sub reads’ that are collapsed into a HiFi read (highly accurate long read).
PacBio technology can also be used for RNA sequencing by a technique termed Iso-Seq. Using the Iso-Seq method, entire transcripts, including any isoforms, can be sequenced. In this method, RNA is converted to cDNA, and HiFi sequencing is used to generate sequencing data.
---- Oxford Nanopore Technology (ONT)
ONT sequencing is based on the passage of single-stranded nucleic acid (DNA or RNA) through a protein nanopore. The DNA templates are loaded onto a flow cell containing a membrane embedded with hundreds to thousands of nanopores. A preloaded motor enzyme along with an applied ion current, moves the single strand through the pore. The passage of each nucleotide through the pore results in a characteristic disruption in ion current detected by sensors.
Beyond DNA sequencing, ONT may be used to sequence RNA and detect DNA and RNA modifications. Similar to PacBio, ONT can sequence full-length RNA as cDNA. However, ONT also has the ability to use native RNA.
2. Omics data analysis
2.1. Were are we today?
Although the core questions in genetic research are related to disentangling the associations between DNA, RNA and protein, the current tools and methods of data analysis are not oriented towards integration of knowledge.
Today, data analysis is characterised by fragmentation. Whether you are interested in finding similarities or in detecting variations, all omics data analysis is organised in silo’s with different applications for each analytical step. Along the different steps in the analysis process, outputs are generated in different formats. Although automated pipelines are available, the process of analysis remains time consuming and very complex. Only highly trained professionals are able to perform these analyses.
The integration of biological databases is also lacking. So, it is very difficult to query more than one database at a time. Currently, there is also no way to combine analysis in genomics, transcriptomics and proteomics which has proven to be a blocking factor. It is certainly not very helpful in maintaining oversight and easily detecting novel relationships.
Moreover, the current algorithms are far from flawless and result in an accumulation of errors during the analysis process. Additionally, most algorithms are computationally very intensive which results in slow processing times.
New developments in omics data analysis technologies should be aimed at integration of knowledge and at increasing precision of analysis. This would bring a high level of accessibility, efficiency and accuracy to the field.
Further downstream advanced analysis methods such as machine learning or graph reasoning can only produce meaningful insights and predictions when the data that serve as input are of high quality.There exists no classification or prediction algorithm that can compensate for the quality of input data. So, in order to make better models such as in relation to disease mechanisms, or for drug target development, we need algorithms for detection of similarities and variations in DNA, RNA and proteins that produce highly accurate results. Only then, we will be able to deliver better insights and better predictions leading to real advancements in precision medicine and other fields of science.
Integration of data analysis between genomics, transcriptomics and proteomics would not only expand the search field but also bridge the gap between isolated silo’s. It would facilitate the discovery of novel relationships such as between species, in gene transcription processes and other kinds of knowledge necessary for progression in medicine, and other life sciences.
To solve these challenges the BioStrand solution compresses multiple and often disparate stages of traditional omics data analysis into one simple, intuitive and user-friendly interface with the technology doing all the heavy lifting in the background. It eliminates all the usual challenges of building complex pipelines, finding access to multiple databases, and navigating the steep learning curve of a disparate tool environment. The solution actualizes the principle of ‘Data in, Results out’ to streamline and accelerate knowledge extraction and time-to-value.
Search is multi-domain and is as simple as inputting text or pasting bio-sequences with the results displayed on three levels: DNA, RNA, and AA. Drill down, filter, and extrapolate through the results and combine multiple dimensions, such as taxonomy or ontology, to quickly discover novelty functional relationships. Take a microscopic view down to the sequence level or discover other useful visual applications such as ontology maps, frequency tables, or multiple sequence alignment views.
In short, the BioStrand platform is designed to maximize researchers’ view of their data with integrated, comprehensive, and accurate results that accelerate time-to-insight and -value.
2.2. The data analysis steps
The whole data processing is mostly subdivided into 3 steps:
2.2.1. Primary analysis
Generally, primary analysis takes place inside the sequencing platform and consists of converting raw signals into nucleotide base calls. Furthermore, a quality check is performed. Based on base call quality scores (Phred score) and read length reads are filtered out. In case of multiplexing, i.e. multiple samples have being sequenced simultaneously, the separation of reads according to the barcode attached into different files is carried out. Also trimming is performed: removing the adaptor sequences and poor-quality bases at the ends of reads. The output of primary analysis is a FASTQ file.
2.2.2. Secondary analysis
In this step the reads are aligned to a reference genome or a de novo assembly is performed to then call all the variants. Typical file formats are produced: SAM (sequence alignment map), BAM (binary alignment map, a compressed version of SAM) and VCF (variant call format).
---- Multiple Sequence Alignment
A multiple sequence alignment (MSA) arranges protein sequences into a rectangular array with the goal that residues in a given column are homologous (derived from a single position in an ancestral sequence), superposable (in a rigid local structural alignment) or play a common functional role. To this extent, there is no right or wrong alignment; rather, there are different models that reflect different biological perspectives.
Two general ways of thinking about alignments involve consideration either of the degree of similarity shared across the full sequence lengths or of the similarity that’s confined to specific regions of the sequences: the former results in a global alignment, the latter produces a local alignment. Many tools exist to perform local or global alignments.
Two distinct computational methods are used. ‘Dynamic programming’ is a formally very correct and accurate method but lacks scalability and thus is only feasible for small sequences. To address this problem approximate (heuristic) algorithms were developed. Yet these heuristic methods cannot accommodate the big data volumes that need to be computed. BioStrand has developed a completely new algorithm around a very efficient sorting principle called HYFTsTM that addresses the big data and scalability problem.
On the left, the classical MSA method, which is computationally hard
On the right, the BioStrand MSA. The sequences marked as red and orange represent HYFTs that are identified and function as a very efficient sorting mechanism.
---- De Novo assembly
In de novo assembly no reference is used, and reads are aligned to each other based on their sequence similarity to create a long consensus sequence called a contig. In terms of complexity and time requirements, de novo assemblies are orders of magnitude slower and more memory-intensive than mapping-based assemblies. This is mostly due to the fact that the de novo assembly algorithm needs to compare every read with every other read, which is an operation that has a naive time complexity of O(n2) where n = string length.
A typical problem with short reads in de novo assembly is that they can sometimes align equally well to multiple locations in the genome, the longer the read the easier it is to find its position. Paired-end reads reduce this issue to a certain extent since a pair of reads has a known distance in between which is used to validate its alignment position. Therefore, it is crucial to remove reads that are too short prior to performing the alignment as misaligned reads will lead to false-positive variant calls.
---- Variant calling
After reads have been aligned and processed, the next step in the pipeline is to identify differences observed between the selected reference genome and the newly sequenced reads. In short, the aim of variant calling is to identify polymorphic sites where nucleotides are different from the reference. There are multiple tools for variant calling. As outcome of the variant calling step a VCF file is produced.
2.2.3. Tertiary analysis
This step addresses the important issue of making sense of the observed data. In the human genetics context, that is finding the fundamental link between variant data and the phenotype observed in a patient. Tertiary analysis begins with variant annotation, which adds additional information to the variants detected in the previous steps. Then variant interpretation is done, which in the context of human genetics is mostly performed by a a qualified individual such as a clinical geneticist and/or genetic counsellor. At the end of the interpretation process, a variant will be classified as pathogenic or benign for an individual and their phenotype. Variants may also be classified as a variant of unknown significance (VUS) which means that there is currently not enough evidence available to classify the variant as pathogenic or benign. As more evidence is gathered and further testing is performed these classifications may change.
2.3. Further downstream analysis
Depending on the biological context many other types of analysis can be performed on sequence data, e.g. gene expression analyses, various types of visualisations and clustering of data.
3. Protein structure prediction
One of the challenging tasks is protein structure prediction. Predicting a protein’s three-dimensional structure from its amino acid sequence remains an unsolved problem after several decades of efforts. Almost all structure prediction relies on the fact that, for two homologous proteins, structure is more conserved than sequence. If two protein sequences are similar, these two proteins are likely to have a very similar structure. If the query protein has a homolog of known structure, the task is relatively easy and high-resolution models can often be built by copying and refining the framework of the solved structure. However, a template-based modeling procedure does not help answer the questions of how and why a protein adopts its specific structure.
In particular, if structural homologs do not exist, or exist but cannot be identified, models have to be constructed from scratch. This procedure, called ab initio modeling, is essential for a complete solution to the protein structure prediction problem; it can also help us understand the physicochemical principle of how proteins fold in nature. Currently, the accuracy of ab initio modeling is low and the success is generally limited to small proteins (<120 residues). In general, less than 20% of the structure prediction is correct in the majority of models; even in the best cases at most 40% of the residues are modelled accurately.