In the life sciences, massive amounts of omics data are produced, spanning the genome, transcriptome, proteome and metabolome. These data have led to great insights and scientific breakthroughs in fields such as pharmaceuticals, biotech, gene therapy, virology, agriculture and climate change.
With the volume of omics data expected to reach around 40 exabytes by 2025, data generation is not without issues, especially when it comes to analysis and integration. There is a need to improve the complete analysis process from start to finish, so that data acquisition, evaluation, comparison, and the reporting of results can happen faster and more easily.
1. DNA and RNA Sequencing
The Human Genome Project is the best-known large-scale genome project. It was conducted in collaboration among many countries across the globe, with the aim of sequencing the whole human genome. This groundbreaking research took 13 years and provided a vast amount of valuable information that advanced the field of human genetics. It also paved the way for progress in other areas, including crop improvement, forensics, population genetics, and sequencing technologies. In recent years, sequencing technology has become more accessible and affordable, opening up a wide range of possibilities for scientific research.
1.1. What is sequencing?
The fundamental units of DNA are base pairs, formed between nucleotides on opposite strands of the double helix. There are four different nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Long sequences of these base pairs make up the DNA of our chromosomes. The purpose of sequencing is to determine the exact order of the nucleotides in our genetic code, since that order influences the functioning of our genes, phenotype, metabolism, genetic risks, diseases, etc. Sequencing can also reveal a wide range of variants in the DNA, such as insertions, deletions and other mutations, that result in different functions and traits.
For decades, Sanger sequencing, which is based on fluorescently labelled chain-terminating nucleotides and electrophoresis, has been the gold standard sequencing technology. In traditional Sanger sequencing, the molecules were first cloned inside bacteria, amplified, and then sequenced by having the enzyme DNA polymerase incorporate fluorescently labelled nucleotides. With the advancement of technology, first-generation sequencing has largely been replaced by next-generation sequencing (NGS) technologies, which offer high throughput and are more affordable and accessible. NGS methods that have been developed include:
- Sequencing by synthesis (SBS) by Illumina
- Single-molecule real-time sequencing by PacBio
- Nanopore technology sequencing by Oxford Nanopore
- Sequencing by ligation or SOLiD sequencing by ThermoFisher Scientific
- Combinatorial probe anchor synthesis (cPAS) by BGI/MGI
- Ion semiconductor or Ion Torrent sequencing by ThermoFisher Scientific
Either DNA or RNA can be sequenced, depending on the downstream application, for example biomarker discovery or support for molecular diagnostics.
Why is sequencing important?
NGS has matured over the years into a robust technology with a very broad range of uses. It can effectively support a wide range of genetic analysis research applications such as:
- Whole-genome sequencing: sequencing and studying entire genomes,
- Genotyping: studying variation in sequences,
- Transcriptome and gene expression analysis: analyzing the differential expression of transcripts and genes under each set of conditions,
- Epigenetics: identifying heritable changes in gene regulation.
NGS also allows researchers to study other areas, including microbial diversity and evolution, with increased precision. It has become an indispensable tool in genomic research, yielding valuable insights into complex biological systems ranging from cancer genomics to diverse microbial communities.
1.2. Evolution of sequencing technologies
Sequencing is a fast-moving field: the platforms and technologies on the market have evolved rapidly.
1.2.1. First generation sequencing
DNA sequencing was first developed in the 1970s, leading to a gel-based method that combines DNA polymerase with a mixture of standard nucleotides (dNTPs) and chain-terminating nucleotides (ddNTPs). Mixing dNTPs with ddNTPs causes random early termination during DNA synthesis; the resulting fragments are sorted by length and visualized with gel electrophoresis. The technique was revolutionary at the time, enabling the sequencing of 500-1000 bp fragments. Later, in the 1980s, Sanger sequencing was automated: the radioactive labels were replaced with dye-labelled nucleotides, and the large gels with acrylic-fiber capillaries. By the early 2000s, the cost of sequencing had fallen dramatically, which made even the Human Genome Project feasible.
1.2.2. Second-generation sequencing
The advent of next-generation sequencing started with Solexa, which was later acquired by Illumina. A key innovation of this platform is bridge amplification, which forms dense clusters of amplified fragments on the surface of a flow cell. Each molecule is amplified into a cluster of many copies so that the fluorescent signal can be detected as dNTPs are added one at a time during synthesis. Illumina became one of the first commercially available massively parallel sequencing technologies, based on the principle of ‘Sequencing by Synthesis’. Over time, other platforms were developed, including Ion Torrent, which reduced sequencing costs, offers read lengths of ~50-500 bp, and is an excellent fit for multiple applications (SNP calling, targeted sequencing).
1.2.3. Third-generation sequencing
Third-generation sequencing addresses the limitations of short reads, which are not suitable for every sequencing project. One approach is single-molecule, real-time (SMRT) sequencing from PacBio. This technique uses miniature wells in which a single polymerase incorporates labelled nucleotides, and the light emitted during this process is measured in real time. Another approach to long-read sequencing was adopted by Oxford Nanopore Technologies: a different single-molecule method that uses pore-forming proteins and electrical detection. SMRT sequencing is particularly notable for its ability to produce long reads with good precision. Long reads address the difficulty that short-read sequencers have with repetitive regions, which their short snippets cannot span. Further advantages include the ability to sequence extreme-GC and AT-rich regions that amplify poorly during cluster generation on short-read platforms, and the ability to detect DNA methylation directly on the instrument, without amplification.
1.3. What is single-cell RNA sequencing?
Individual cells act together to produce complex biological functions and traits. Conventional methods like bulk RNA-seq provide an overview of differential expression but cannot reveal the cellular heterogeneity that drives this complexity. Single-cell RNA (ScRNA) sequencing is an NGS method that examines individual cells, providing a high-resolution view of cell-to-cell variation.
Conventional bulk sequencing provides the average expression signal of a group of cells. However, there is increasing evidence of heterogeneous expression even among similar cell types, and this stochastic expression is reflected in cell composition and cell fate decisions. Analysis at single-cell resolution was pioneered by James Eberwine et al. and by Iscove and colleagues; their methods were first used commercially with high-density DNA microarray chips and were eventually adapted for ScRNA sequencing. The first ScRNA study, published in 2009, characterized cells from the early developmental stages of mice. This study drew the interest of researchers because of its high-resolution view of single-cell heterogeneity on a global scale.
1.3.1. Why is single cell sequencing important?
ScRNA sequencing is quickly becoming a standard tool in scientific research, providing both scale and depth of insight into diverse cell populations. Microfluidic technology has transformed the ability of researchers to isolate single cells and study them further using ScRNA-seq. Consistency and reproducibility are critical for advancements in drug development and precision medicine. In conventional bulk sequencing workflows, manual handling by users introduces variability and reproducibility issues. Automated single-cell gene expression pipelines address these concerns and can free up considerable time and resources.
1.3.2. What are the advantages of single-cell RNA sequencing?
With single-cell and ultra-low-input RNA-seq, researchers can extract high-resolution cellular data from minimal input in an unbiased manner. ScRNA-seq provides robust transcriptome analysis at the level of a single cell. This high-resolution analysis allows researchers to discover cellular differences that are masked by bulk sampling methods. ScRNA-seq is also very helpful for analyzing rare cell types, where other techniques face many limitations, because it can characterize these hidden populations and measure their gene expression.
Cancer is complex to understand given its heterogeneous nature. To ensure effective diagnosis and treatment of different cancers, it is important to understand the early stages of their development from cancer stem cells. ScRNA sequencing maps the clones within a tumour and has proved to be an effective tool for understanding intra-tumoral heterogeneity. This has been useful in research on different types of cancer, including lung, breast, renal and hepatocellular carcinoma. ScRNA sequencing also enables the exploration of complex networks beyond individual cell types, through its application to functional genomics, immunology, oncology, and stem cell biology.
Apart from its initial benefits, ScRNA sequencing technology has been constantly evolving and applied to diverse projects. These emerging and deeply focused studies strengthen our knowledge to provide novel insights into biological systems and create new opportunities for therapeutic development.
2. An overview of Sequencing Techniques
2.1. Chain termination method or Sanger sequencing
This method was developed by Frederick Sanger and was a major technological breakthrough. The Human Genome Project, completed in 2003, was based on this technology. Sanger sequencing is based on PCR (polymerase chain reaction), which makes multiple copies of a target DNA region. The ingredients needed for the whole process are: a DNA polymerase enzyme; a primer that serves as a starter for the PCR process; the four DNA nucleotides; and, unique to the Sanger method, modified DNA nucleotides called dideoxyribonucleotide triphosphates (ddNTPs), which are chain-terminating and each labeled with a base-specific fluorescent dye.
Once a ddNTP has been added to a chain, extension of that chain stops. The PCR process is repeated a number of times to make sure that a ddNTP is incorporated at every position of the target DNA. After that, all fragments pass through capillary gel electrophoresis. The longer a fragment, the more slowly it moves through this tube filled with gel matrix, so fragments arrive at the end of the tube in order of length. There, a laser illuminates each passing fragment and the attached dye is detected. The sequence of the original DNA template can be reconstructed from the order of the dye colours.
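The read-out step described above can be illustrated with a small Python sketch. It is a toy model rather than instrument software: each fragment is represented as a (length, dye colour) pair, and the dye-to-base mapping and the function name `read_sanger_fragments` are assumptions made only for this example.

```python
# Toy model of the Sanger read-out: fragments are (length, dye colour) pairs.
# The colour-to-base mapping is an illustrative assumption, not an instrument standard.
DYE_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}


def read_sanger_fragments(fragments):
    """Order fragments by length (shortest reach the detector first)
    and read the base encoded by each terminal dye."""
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(DYE_TO_BASE[dye] for _, dye in ordered)


# Fragments in arbitrary order, as they exist in the reaction tube.
fragments = [(3, "yellow"), (1, "green"), (4, "red"), (2, "blue")]
synthesised_strand = read_sanger_fragments(fragments)
```

Note that the sequence read this way is that of the newly synthesised strand, which is complementary to the template.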
Sanger sequencing produces high-quality results for DNA lengths of approximately 900 base pairs. Although next-generation sequencing techniques with high throughput volumes are now widely available, Sanger sequencing is still relied upon to confirm sequence variants identified by NGS. Sanger sequencing can also be used to solve some NGS coverage problems, e.g. regions rich in GC content that might be poorly covered by NGS.
2.2. Next-Generation Sequencing
NGS is characterised by high throughput volumes that can be achieved at relatively low cost. The NGS workflow consists of library preparation, sequencing and data analysis.
2.2.1. Short-Read Sequencing
Short-read sequencing technologies typically produce reads that are 250-800 base pairs long.
---- DNA- and RNA-seq library preparation
DNA-seq can include Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES), epigenome sequencing and targeted sequencing (TS). Two methods for template preparation are mainly used: PCR and hybridisation capture.
As in any PCR, the usual ingredients are included (a template, primers, dNTPs, DNA polymerase and buffer). All reagents are mixed together in a tube that goes into a thermal cycler. The PCR reaction consists of three distinct steps: denaturation (separating the double-stranded DNA), annealing (primers bind to the template DNA at specific positions) and extension (the DNA polymerase attaches to one end of the primer and synthesizes DNA complementary to the template DNA). At the end of the step, raising the temperature denatures all double-stranded DNA molecules into single strands again, and the cycle repeats. In each complete cycle the amount of DNA is doubled.
In hybridisation capture-based template preparation, long biotinylated oligonucleotide baits (probes) are used to hybridise to the regions of interest. Then, streptavidin-coated magnetic beads are introduced to separate the bait/target fragment complexes from fragments not bound to baits. This method is used in particular for WES and TS.
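The doubling per cycle can be made concrete with a short calculation. The sketch below assumes the common idealised model in which the number of copies after n cycles is N0 × (1 + E)^n, where E is the per-cycle efficiency (E = 1 is perfect doubling); the function name and the 90% efficiency figure are hypothetical choices for illustration.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Copies after `cycles` PCR cycles under an idealised model.

    efficiency=1.0 means every molecule is duplicated in every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


ideal = pcr_copies(1, 30)           # perfect doubling: 2**30 copies
realistic = pcr_copies(1, 30, 0.9)  # hypothetical 90% efficiency per cycle
```

Even a modest drop in efficiency compounds over many cycles, which is one reason amplification can distort the relative abundance of fragments.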
DNA-seq library construction further involves fragmentation, end-repair, adaptor ligation and size selection. Fragmentation aims at shearing DNA to the optimal size range for the sequencing platform of choice. Three methods exist: physical (acoustic shearing), enzymatic and chemical (heat). The fragments are then end-repaired and ligated to adaptors. Adaptors have defined lengths and often include a barcode, a unique sequence that identifies each sample in multiplex sequencing (multiple samples are pooled and sequenced simultaneously in the same run). During data analysis, these barcodes allow reads to be assigned to individual samples. Size selection may then be performed using gel electrophoresis or magnetic beads.
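To illustrate how barcodes let reads be assigned back to samples, here is a minimal Python sketch. It assumes, for simplicity, that the barcode sits at the start of each read and must match exactly; the barcode sequences, sample names and the `demultiplex` function are hypothetical, and real platforms may read barcodes separately and tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by the barcode at their start; the barcode is stripped off.

    Assumes inline barcodes and exact matching (a simplification).
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}  # hypothetical barcodes
reads = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTTTTGGG", "CCCCAAAAAA"]
groups = demultiplex(reads, barcodes, barcode_length=4)
```

Reads whose barcode is not recognised are kept in an "undetermined" group instead of being silently dropped.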
RNA-seq can include Whole Transcriptome Sequencing (WTS), mRNA sequencing (mRNA-seq) and small RNA sequencing (smRNA-seq). Sample preparation generally includes total RNA isolation, target RNA enrichment and reverse transcription of RNA into complementary DNA (cDNA).
---- Sequencing platforms
Sequencing by synthesis is the sequencing principle used for short reads and it involves two steps: clonal amplification and sequencing.
Prior to sequencing, the DNA library must be attached to a solid surface. Amplification is necessary to increase the signal coming from each target during sequencing. The solid surfaces to which the unique DNA molecules bind are beads or flow cell surfaces. Depending on the sequencing platform, emulsion PCR (Ion Torrent) or bridge PCR (Illumina) is used to amplify the anchored DNA fragments.
On the Ion Torrent platform, during sequencing, a semiconductor chip (ion sensor) is flooded with unmodified A, C, T or G nucleotides, one after another. Incorporation of a nucleotide releases a hydrogen ion, resulting in a pH change that is measured by the ion sensor. If the nucleotide that floods the chip is not a match, no change is detected and no base is called.
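The flow-by-flow logic can be sketched in Python as follows. This is a simplified simulation, not vendor software: it assumes a fixed flow order "TACG" (a hypothetical choice), an ideal sensor, and a signal equal to the number of bases incorporated in a flow, so a run of identical bases produces a proportionally larger signal.

```python
def simulate_flows(target, flow_order="TACG", max_flows=100):
    """Return a (nucleotide, signal) pair for each flow.

    `target` is the strand being synthesised; signal is the number of
    bases incorporated during that flow (0 means no pH change).
    """
    signals, position = [], 0
    for flow_index in range(max_flows):
        if position >= len(target):
            break
        base = flow_order[flow_index % len(flow_order)]
        count = 0
        while position < len(target) and target[position] == base:
            count += 1
            position += 1
        signals.append((base, count))
    return signals


def call_bases(signals):
    """Turn per-flow signals back into a base sequence."""
    return "".join(base * count for base, count in signals)


flows = simulate_flows("TTACG")
called = call_bases(flows)
```

A flow with signal 0 corresponds to the case described above in which the nucleotide is not a match and no base is called.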
On the Illumina platform, sequencing is based on the optical readout of fluorescent nucleotides as they are incorporated by a DNA polymerase. Each nucleotide carries a fluorescent tag and a reversible terminator that blocks incorporation of the next base. The fluorescent signal indicates which nucleotide has been added. After each cycle the terminator is cleaved, allowing the next base to bind. In addition, Illumina NGS platforms are capable of paired-end sequencing, in which a DNA fragment is sequenced from both ends; this generates high-quality sequence data with in-depth coverage and high numbers of reads.
2.2.2. Long-Read Sequencing
Long-read sequencing technologies can produce reads > 10 kb directly from native DNA. These technologies circumvent the need for PCR, sequencing single molecules without prior amplification steps. This is an advantage as PCR can cause errors in the amplification process. Today, long-read sequencing uses two main techniques: SMRT Sequencing and Nanopore Technology.
---- SMRT Sequencing (PacBio)
Single-Molecule Real-Time sequencing is a third-generation sequencing method for DNA and RNA. The DNA to be sequenced is turned into a SMRTbell template. This template is created by ligation of hairpin adapters (SMRTbell adapters) to both ends of the double-stranded DNA. The sequencing reaction takes place in a SMRTcell chip with many small wells called zero-mode waveguides (ZMWs). Each ZMW contains an individual DNA polymerase, which enables the sequencing of a single SMRTbell template. During replication, four fluorescently labeled nucleotides with unique emission spectra are used. As the anchored polymerase incorporates a labeled base, a signature light pulse is emitted and measured in real time. As the template is circular, the polymerase can continue sequencing through the hairpin adapter to replicate the second DNA strand. Sequencing of one strand is called a ‘pass’. The sequence obtained from each ZMW is called a continuous long read (CLR). The adapter sequences are removed to retain the DNA template sequences in between, resulting in multiple so-called ‘subreads’, which are collapsed into a HiFi read (highly accurate long read).
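The idea of collapsing several passes over the same molecule into one more accurate read can be illustrated with a per-position majority vote. The Python sketch below is a deliberate simplification: it assumes the subreads are already aligned, equally long and in the same orientation, whereas real consensus calling has to handle insertions, deletions and strand orientation. The function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(subreads):
    """Per-position majority vote over equal-length, pre-aligned subreads."""
    if not subreads:
        raise ValueError("need at least one subread")
    length = len(subreads[0])
    if any(len(subread) != length for subread in subreads):
        raise ValueError("this sketch assumes equal-length, pre-aligned subreads")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*subreads)
    )


# Four passes over the same template, two of them with a random error.
subreads = ["ACGTAC", "ACCTAC", "ACGTAC", "ACGAAC"]
hifi_like = consensus(subreads)
```

Because errors in individual passes are largely random, they tend to be outvoted once enough passes are available.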
PacBio technology can also be used for RNA sequencing by a technique termed Iso-Seq. Using the Iso-Seq method, entire transcripts, including any isoforms, can be sequenced. In this method, RNA is converted to cDNA, and HiFi sequencing is used to generate sequencing data.
---- Oxford Nanopore Technology (ONT)
ONT sequencing is based on the passage of single-stranded nucleic acid (DNA or RNA) through a protein nanopore. The DNA templates are loaded onto a flow cell containing a membrane embedded with hundreds to thousands of nanopores. A preloaded motor enzyme, along with an applied ion current, moves the single strand through the pore. The passage of each nucleotide through the pore results in a characteristic disruption in the ion current, which is detected by sensors.
Beyond DNA sequencing, ONT may be used to sequence RNA and detect DNA and RNA modifications. Similar to PacBio, ONT can sequence full-length RNA as cDNA. However, ONT also has the ability to use native RNA.
3. Omics Data Analysis
The technologies used to study complex biological systems are collectively referred to as “omics”. The suffix ‘-omics’ is generally used to describe something big; hence, fields that draw on large amounts of information to study life and its various mechanisms and principles are referred to as “Omics”.
The field of molecular biology in its quest for an understanding of the fundamentals of life and living beings focuses on either genes or gene products i.e., proteins. However, after many years of research and development, this approach still fails to answer many of the fundamental questions. The intricate network of genes and gene products and their interdependencies could be a reason for this failure. Since the genes and their products interact with each other to a large extent, a better way of studying the complex system of the network would be to use two approaches:
- Individual gene and gene product approach and
- A combinatorial study of genes, their interactions with each other, gene products and their interaction with genes and other gene products.
3.1. What is OMICs Data Analysis?
OMICs data analysis examines the complex interaction of genes and molecules and their influence on the phenotype. Different omics technologies allow for the in-depth exploration of different modalities of data, such as genomics, transcriptomics, proteomics, metabolomics, glycomics and lipidomics. Data from each modality can be used to compare different groups for a better understanding of the underlying principles. For example, genomic data from healthy and diseased individuals can be compared to identify the features causing the diseased phenotype. Similarly, comparisons of gene products or proteins could play a role in toxicology assessments.
The first omics technology to be developed was genomics, the study of genes and chromosomes. This resulted in the generation of massive amounts of genomic data, which has applications in multiple fields such as medicine, biotechnology and pharmaceuticals. By nature, there is a wide range of variation between the genomes of organisms. Within a single species, smaller differences can be observed. The identification of these differences is helpful in a range of applications. For example, these regions of difference can be used to develop a personalized profile for a patient. They can also serve as potential therapeutic targets in drug development studies. Genomics has many subfields:
- Structural genomics: Aims to determine the structure of every protein encoded by the genome.
- Functional genomics: Aims to collect and use data from sequencing for describing gene and protein functions.
- Comparative genomics: Aims to compare genomic features between different species.
- Mutation genomics: Studies the genome in terms of mutations that occur in a person's DNA or genome.
The next omics technology to be developed was transcriptomics. This approach is used to measure the gene expression level. Understanding the quality and quantity of gene expression in the native and altered state has potential in therapeutic drug development.
Proteomics is the field concerned with the large-scale analysis of proteins. The generation and analysis of proteomic data involve many techniques, such as protein fractionation methods, mass spectrometry and bioinformatic approaches, to sift through the data and derive meaningful inferences. Metabolomics is yet another field; it covers information about cell metabolites such as carbohydrates, peptides and nucleosides. Nuclear magnetic resonance (NMR) and mass spectrometry are the most common techniques used in metabolomics.
3.2. Where are we today?
Although the core questions in genetic research are related to disentangling the associations between DNA, RNA and protein, the current tools and methods of data analysis are not oriented towards the integration of knowledge.
Today, data analysis is characterised by fragmentation. Whether you are interested in finding similarities or in detecting variations, all omics data analysis is organised in silos with different applications for each analytical step. During the analysis process, outputs are generated in different formats. Although automated pipelines are available, the process of analysis remains time consuming and very complex. Only highly trained professionals are able to perform these analyses.
The integration of biological databases is also lacking. Therefore, it is very difficult to query more than one database at a time. Currently, there is also no way to combine analysis in genomics, transcriptomics and proteomics, which has proven to be an impediment. This makes it difficult to maintain oversight and easily detect novel relationships.
Moreover, the current algorithms are far from flawless and result in an accumulation of errors during the analysis process. Additionally, most algorithms are computationally very intensive which results in slow processing times.
New developments in omics data analysis technologies should be aimed at the integration of knowledge and at increasing the precision of analysis. This would bring a high level of accessibility, efficiency and accuracy to the field.
Further downstream, advanced analysis methods such as machine learning or graph reasoning can produce meaningful insights and predictions only when the data that serve as input are of high quality. Currently, no classification or prediction algorithm exists that can compensate for poor-quality input data. So, in order to build better models when studying disease mechanisms, drug target development, etc., we need algorithms for the detection of similarities and variations in DNA, RNA and proteins that produce highly accurate results. Only then will we be able to deliver better insights and better predictions, leading to real advancements in precision medicine and other fields of science.
Integration of data analysis between genomics, transcriptomics and proteomics would not only expand the search field but also bridge the gap between isolated silos. It would facilitate the discovery of novel relationships such as between species, in gene transcription processes and other discoveries necessary for progression in medicine, and other life sciences.
To solve these challenges, the BioStrand solution compresses multiple and often disparate stages of traditional omics data analysis into one simple, intuitive and user-friendly interface, with the technology doing all the heavy lifting in the background. It eliminates the usual challenges of building complex pipelines, finding access to multiple databases, and navigating the steep learning curve of a disparate set of tools. The solution actualizes the principle of ‘Data in, Results out’ to streamline and accelerate knowledge extraction and time-to-value.
Search is multi-domain and very user-friendly: simply input text or paste bio-sequences and the results will be displayed on three levels: DNA, RNA, and AA. Filter through the results and combine multiple dimensions, such as taxonomy or ontology, to quickly discover novel functional relationships. Dive down to a microscopic view at the sequence level or discover other useful visualisation tools such as ontology maps, frequency tables, or multiple sequence alignment views.
In short, the BioStrand platform is designed to maximize researchers’ understanding of their data with integrated, comprehensive, fast and accurate results.
3.3. What are Different Types of OMICs?
Advancements in technology have made it possible to link genes and their products to each other, and to analyse these links to identify trends and patterns that could not be identified with traditional approaches. Each of these molecules occupies an omics space and there are multiple kinds of omics currently under study. To name a few:
- Genomics: the study of genes and their function
- Proteomics: the study of proteins
- Metabolomics: the study of molecules involved in cellular metabolism
- Transcriptomics: the study of mRNA
- Glycomics: the study of cellular carbohydrates
- Lipidomics: the study of cellular lipids
The use of multiple omics data together gives a holistic and complete view of the cell, tissue or organ being studied. This approach is unique as it allows for a hypothesis to be created, founded on data analysis, and be tested further, as opposed to the traditional approach of hypothesis-driven data analysis. The use of OMICs technology not only provides a better understanding of the mechanism and function of a process at a molecular level but also allows a deeper understanding of the etiology of a disease.
3.4. The Data Analysis Steps
Data processing is usually divided into three steps: primary, secondary, and tertiary analysis.
3.4.1. Primary analysis
Generally, primary analysis takes place inside the sequencing platform and consists of converting raw signals into nucleotide base calls. Next, a quality check is performed: reads are filtered based on base call quality scores (Phred scores) and read length. In the case of multiplexing, i.e. when multiple samples have been sequenced simultaneously, reads are separated into different files according to their attached barcode. Finally, trimming is performed: adaptor sequences and poor-quality bases at the ends of reads are removed. The output of primary analysis is a FASTQ file.
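As an illustration of quality filtering and trimming, the Python sketch below decodes FASTQ quality strings and trims low-quality bases from the end of a read. It relies on the standard Phred definition Q = -10 log10(P) and assumes the common Phred+33 ASCII encoding; the thresholds and function names are hypothetical choices, not settings from any particular platform.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding assumed)."""
    return [ord(character) - offset for character in quality_string]


def error_probability(quality):
    """Probability that a base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


def trim_3prime(sequence, quality_string, min_quality=20):
    """Remove bases below min_quality from the 3' end of a read."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def passes_filter(sequence, quality_string, min_length=5, min_mean_quality=20):
    """Keep reads that are long enough and good enough on average."""
    scores = phred_scores(quality_string)
    if not scores or len(sequence) < min_length:
        return False
    return sum(scores) / len(scores) >= min_mean_quality


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_sequence, trimmed_quality = trim_3prime("ACGTACGT", "IIIIII##")
keep = passes_filter(trimmed_sequence, trimmed_quality)
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.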
3.4.2. Secondary analysis
In the second step, the reads are either aligned to a reference genome or assembled de novo, after which variants are called. This step produces typical file formats: SAM (sequence alignment map), BAM (binary alignment map, a compressed version of SAM) and VCF (variant call format).
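To give a feel for what an alignment record looks like, the following Python sketch parses the eleven mandatory tab-separated fields of a single SAM line and decodes two of the flag bits (0x4 for unmapped, 0x10 for reverse strand). The example line and the `parse_sam_line` helper are made up for illustration; in practice a dedicated library would normally be used, and optional tags are ignored here.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Parse the 11 mandatory fields of one SAM alignment line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line has at least 11 fields")
    record = dict(zip(SAM_FIELDS, parts[:len(SAM_FIELDS)]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


# Hypothetical record: read1 aligned to the reverse strand of chr1 at position 100.
line = "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(line)
```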
---- Multiple Sequence Alignment
A multiple sequence alignment (MSA) arranges protein sequences into a rectangular shape so that residues in a given column are homologous (derived from a single position in an ancestral sequence), superposable (in a rigid local structural alignment), or play a common functional role. To this extent, there is no right or wrong alignment; rather, there are different models that reflect different biological perspectives.
There are two general ways of thinking about alignments: in terms of (1) the degree of similarity shared across the full sequence lengths, or (2) the similarity that is confined to specific regions of the sequences. The former results in a global alignment, while the latter produces a local alignment. Many tools exist to perform local or global alignments.
Two distinct computational methods are used: ‘dynamic programming’ and heuristic algorithms. Dynamic programming is an accurate method but lacks scalability and thus is only feasible for small sequences. To address this problem, approximate (heuristic) algorithms were developed. However, these heuristic methods also cannot handle large data volumes. BioStrand has developed a completely new algorithm around a very efficient sorting principle called HYFTs™ that addresses the big data and scalability problem.
Figure: on the left, the classical MSA method, which is computationally hard; on the right, the BioStrand MSA, in which the sequences marked in red and orange represent HYFTs™ that are identified and function as a very efficient sorting mechanism.
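The dynamic programming approach mentioned above can be sketched for the simplest case, a global alignment of two sequences (the Needleman-Wunsch algorithm). The Python example below computes only the optimal alignment score, with illustrative scoring values (match +1, mismatch -1, gap -1) that are assumptions of this sketch; it is not the BioStrand method.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of a and b.

    Only the previous row of the scoring table is kept in memory.
    """
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


score_identical = global_alignment_score("GATTACA", "GATTACA")
score_with_gap = global_alignment_score("ACGT", "ACT")
```

The table has (len(a) + 1) × (len(b) + 1) cells, so the work grows with the product of the sequence lengths, and extending the same exact approach to many sequences at once grows far faster, which illustrates the scalability limit noted above.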
---- De Novo assembly
In de novo assembly no reference is used and reads are aligned to each other based on their sequence similarity to create a long consensus sequence called a contig. In terms of complexity and time requirements, de novo assemblies are slower and more memory-intensive than mapping-based assemblies. This is mostly due to the fact that the de novo assembly algorithm needs to compare every read with every other read, an operation that has a naive time complexity of O(n^2), where n is the number of reads.
A typical problem with short reads in de novo assembly is that they can sometimes align equally well to multiple locations in the genome; the longer the read, the easier it is to find its position. Paired-end reads reduce this issue to a certain extent, since the two reads of a pair lie a known distance apart, which is used to validate their alignment positions. Therefore, it is crucial to remove reads that are too short prior to performing the alignment, as misaligned reads will lead to false-positive variant calls.
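The all-against-all comparison can be seen in a toy greedy assembler. The Python sketch below repeatedly merges the two reads with the longest exact suffix-prefix overlap; it assumes error-free reads from a single strand and a hypothetical minimum overlap of 3 bases, and it is meant only to show where the quadratic number of comparisons comes from, not how production assemblers work.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):        # every read ...
            for j, b in enumerate(reads):    # ... against every other read
                if i != j:
                    size = overlap(a, b, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [read for k, read in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"])
```

The nested loops compare each read with every other read in each round, which is why real assemblers rely on indexing and graph-based strategies instead.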
---- Variant calling
After reads have been aligned and processed, the next step in the pipeline is to identify differences observed between the selected reference genome and the newly sequenced reads. In short, the aim of variant calling is to identify polymorphic sites where nucleotides are different from the reference. There are multiple tools for variant calling, all of which produce a VCF file.
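A heavily simplified illustration of this idea is shown below in Python. It assumes gap-free reads with known 0-based start positions, builds a per-position count of observed bases (a pileup), and reports positions where the most common base differs from the reference. The depth and fraction thresholds and the `call_variants` name are hypothetical; real variant callers also model base quality, mapping quality, indels and ploidy.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_depth=2, min_fraction=0.5):
    """Report (position, ref_base, alt_base, depth) for simple substitutions.

    aligned_reads is a list of (start, sequence) pairs with 0-based starts
    and no gaps (an assumption of this sketch). Positions are reported
    1-based, as in the POS column of a VCF file.
    """
    pileup = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        base, support = counts.most_common(1)[0]
        if base != reference[position] and support / depth > min_fraction:
            variants.append((position + 1, reference[position], base, depth))
    return variants


reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTATGT"), (3, "TATGTA"), (4, "ATGTAC")]
variants = call_variants(reference, reads)
```

In this made-up example, three of the four reads covering position 6 show a T where the reference has a C, so a single C to T substitution is reported.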
3.4.3. Tertiary Analysis
The third step addresses the important issue of making sense of the observed data. In the context of human genetics, that means finding the fundamental link between variant data and the phenotype observed in a patient. Tertiary analysis begins with variant annotation, which adds information to the variants detected in the previous steps. Then variant interpretation is carried out, which in the context of human genetics is mostly performed by a qualified individual such as a clinical geneticist and/or genetic counsellor. At the end of the interpretation process, a variant will be classified as pathogenic or benign for an individual and their phenotype. Variants may also be classified as a variant of unknown significance (VUS), which means that there is currently not enough evidence available to classify the variant as pathogenic or benign. As more evidence is gathered and further testing is performed, these classifications may change.
3.5. Further Downstream Analysis
Depending on the biological context many other types of analysis can be performed on sequence data, e.g. gene expression analyses, various types of visualisations and the clustering of data.
Read more: Why Omics Data Analysis Needs To Be Remade
4. An Overview of Multi-Omics Data Analysis
4.1. What is Multi-OMICs?
Multi-OMICs is a novel approach of integrating different OMICs datasets during analysis and approaching a research question armed with a combined suite of different OMIC technologies. Currently, multi-OMICs seems to be the best approach to achieve an in-depth understanding of the complex nature of living beings. This is mainly due to the crosstalk between the different modalities. Genes interact with other genes and gene products which influence the metabolites and regulate each other. The advent of Next-Gen Sequencing technologies has made it possible to generate data at all levels. Advancements in computational biology have led to the development of many tools that allow for the integration and analysis of data at multiple different levels. Studying all of these different levels in their entirety is paramount in understanding the mechanism, the difference between the native and altered state and consequently the development of drug targets.
4.2. Why Integrate OMICs Data?
Because there is a high degree of crosstalk between genes, proteins, metabolites, lipids and microbiomes, it has become imperative to study all these different aspects together to understand the organism as a whole. Targeting individual areas no longer provides the answers to the questions we are now asking.
Integrating data on multiple levels sequentially or simultaneously makes it possible to understand the interplay between the different molecules. Doing so provides a greater degree of specificity to a particular aspect of research, which is especially valuable in the clinical world, where it has the potential to improve disease diagnostics and prognostics.
Another important advantage of integrating omics data is being able to formulate hypotheses based on the integrated multi-omics data. For example, if an interactome map provides information on the network of proteins from which we can draw inferences about the existence of a multi-complex protein, we will still need to validate whether this is a genuine complex. Utilizing data from a transcriptome will provide valuable information on the number of transcript clusters. Combining both the transcriptomics and interactomics information will then provide sufficient evidence for the presence of a multi-protein complex. This combined data can also be used to predict the functions of individual proteins. Further analysis of this data could lead to the formulation of a hypothesis. For example, if genes known for a specific trait show different subgroups during transcriptional analysis - one associated with general cellular factors and the other transcriptionally associated with the specific trait - we can hypothesize that both the genes are trait-specific factors.
4.3. Methods for the Integration of Multi-OMICs Data
The integration methods for multi-omics data can be broadly categorized in several ways, depending on the criteria used. To mention a few:
- Sequential integration and parallel integration
- Stringent inputs (specific data modalities) or general inputs
In the sequential integration method, the result of one layer will be passed on to the next layer as input. Parallel integration can be done by models that consider each layer as a separate entity. iPAC and MCD (Multiple Concerted Disruption) are a few examples of sequential stringent input methods where genomic and gene expression data is parsed based on a few selection criteria. Integromics is a less specific approach and can be used to fit general types of omic data. This approach is best suited for studying biological markers and the identification of clusters in samples.
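The difference between the two modes can be sketched in a few lines. The example below is purely illustrative and does not reproduce iPAC, MCD, or Integromics; the gene names, values, and thresholds are hypothetical. In the sequential version the expression layer only sees genes that passed the copy-number layer, whereas the parallel version evaluates each layer independently and combines the evidence afterwards.

```python
# Hypothetical per-gene measurements from two omics layers.
copy_number = {"GENE_A": 4, "GENE_B": 2, "GENE_C": 5, "GENE_D": 2}
expression_fold_change = {"GENE_A": 3.1, "GENE_B": 2.8, "GENE_C": 0.9, "GENE_D": 1.0}


def sequential_integration(cn, expr, cn_min=3, fc_min=2.0):
    """Layer 1 filters genes; layer 2 only considers what layer 1 passed on."""
    amplified = [gene for gene, copies in cn.items() if copies >= cn_min]
    return [gene for gene in amplified if expr.get(gene, 0.0) >= fc_min]


def parallel_integration(cn, expr, cn_min=3, fc_min=2.0):
    """Each layer is scored on its own; the evidence is combined at the end."""
    cn_hits = {gene for gene, copies in cn.items() if copies >= cn_min}
    expr_hits = {gene for gene, fc in expr.items() if fc >= fc_min}
    all_genes = sorted(cn.keys() | expr.keys())
    return {gene: (gene in cn_hits) + (gene in expr_hits) for gene in all_genes}


print(sequential_integration(copy_number, expression_fold_change))
print(parallel_integration(copy_number, expression_fold_change))
```

Here the sequential route reports only the gene supported by both layers, while the parallel route keeps a per-gene evidence count, so a gene flagged by a single layer remains visible.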
Data dimensionality reduction is an important aspect to consider when integrating multi-omics data. The principle behind dimensionality reduction is to weigh the behaviour of the gene at different levels and then combine it to get an integrated picture. A commonly used technique to accomplish this is Partial Least Squares (PLS) and its various flavours, such as sPLS, orthogonal PLS, kernel PLS and O2-PLS.
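To give a feel for what PLS computes, here is a minimal sketch of the first PLS component for two data blocks measured on the same samples, for instance transcript and protein levels. It finds the pair of weight vectors that maximise the covariance between the two projected blocks by power iteration on the cross-covariance matrix. This covers only the first component of plain PLS, without the deflation step or the sparse, orthogonal, or kernel variants mentioned above, and the data values are invented.

```python
import math


def center(matrix):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def first_pls_component(x, y, iterations=100):
    """Weights (w, c) and scores (t, u) of the first PLS component."""
    xc, yc = center(x), center(y)
    # Cross-covariance up to a constant: rows are X features, columns Y features.
    cross = [
        [sum(a * b for a, b in zip(x_col, y_col)) for y_col in zip(*yc)]
        for x_col in zip(*xc)
    ]
    cross_t = [list(col) for col in zip(*cross)]
    c = normalize([1.0] * len(cross_t))  # simple starting guess
    for _ in range(iterations):
        w = normalize(matvec(cross, c))    # weights for the X block
        c = normalize(matvec(cross_t, w))  # weights for the Y block
    t = matvec(xc, w)  # sample scores in the X block
    u = matvec(yc, c)  # sample scores in the Y block
    return w, c, t, u


# Hypothetical data: 5 samples, 3 transcripts (x) and 2 proteins (y).
transcripts = [[1, 5, 2], [2, 3, 2], [3, 6, 1], [4, 2, 2], [5, 4, 3]]
proteins = [[1.1, 7], [2.0, 7], [2.9, 8], [4.2, 7], [5.0, 6]]
w, c, t, u = first_pls_component(transcripts, proteins)
print("transcript weights:", [round(v, 3) for v in w])
print("protein weights:", [round(v, 3) for v in c])
```

The weights indicate how strongly each feature contributes to the shared component; in this toy data the first transcript, which tracks the first protein closely, receives the largest weight.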
There are also multiple platforms such as Galaxy and O-Miner that help to analyse multi-omics data.
4.4. Challenges in the Integration of Omics Data
Despite there being a clear advantage for integrating data, the field still has its own set of challenges that are yet to be addressed. Three of them are (1) the need to reduce false-positive rates in the presence of confounding factors, both characterized and uncharacterized, (2) the need for quantifiable and reproducible methodologies, and (3) the challenges related to insufficient dimensionality. In fact, the reduction of false-positive rates and the dimensionality challenge go hand in hand. Where the addition of more layers creates a dimensionality problem, the same additional data along with a priori information on the relationships of different variables could lead to a decrease in the false-positive rates. Although programming languages such as R and Matlab make it possible to maintain user-friendly and up-to-date implementation documentation, the lack of such documentation presents another major challenge.
Methods for analysing integrated data allow for a deeper and more comprehensive understanding of complex biological systems. Thus, multi-omics data has garnered increasing support over single-layer omics studies. With a fast-growing demand for integrated data analysis, there is an even greater demand for tools and a deeper understanding of the algorithms behind the tools to account for certain challenges like the assumptions of independence and normality.
Read more: Challenges In Multi-Omics Data Integration
5. Genetic Big Data
5.1. Big Data in Genomics
5.1.1. Why is sequencing expensive?
According to tracking data from the NHGRI (National Human Genome Research Institute), the cost of determining a million bases of DNA sequence at a certain quality level as well as the cost of sequencing a human-sized genome has come down substantially over the past quarter-century.
The estimated cost of producing a 'finished' human genome was around $150 million in 2003. By 2006, this had dropped to between $20 million and $25 million due to refinements to fundamental methodologies. Next-generation sequencing technologies have since further reduced costs to around $1,000.
However, there are several variables – like genome size, the process involved (shotgun, whole-genome, whole-exome sequence), data quality, sequence finishing, etc. – that determine the final cost of sequencing a genome. Additionally there are indirect costs pertaining to equipment, labour, consumables, etc.
Despite the widely circulated $1,000 benchmark, the cost of genome sequencing in routine healthcare is still disproportionately expensive. One microcosting study calculated the genome sequencing cost per cancer case at £6,841, and per rare disease case at £7,050. The study found that in routine clinical care, consumables accounted for as much as 72% of the total cost of sequencing.
5.1.2. How Big is This Data?
Over the past decade, the rapid increase in the expansion and sophistication of sequencing technologies has resulted in exponential growth in the rate of generation of sequence data. One projection from 2015, which was based on the observation that sequencing capacities double every seven months, estimated that sequence data generation would approach one zettabase per year by 2025.
SOURCE: PLOS Biology
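The arithmetic behind a doubling-time projection is easy to check. The short sketch below only illustrates what a seven-month doubling time implies over a given period; it is not the model used in the cited projection, and the function names are made up for this example.

```python
import math


def growth_factor(months: float, doubling_time_months: float = 7.0) -> float:
    """How many times larger the capacity is after `months` of steady doubling."""
    return 2 ** (months / doubling_time_months)


def months_to_reach(factor: float, doubling_time_months: float = 7.0) -> float:
    """Months of steady doubling needed to grow by `factor`."""
    return doubling_time_months * math.log2(factor)


print(f"Growth over one year:  {growth_factor(12):.1f}x")
print(f"Growth over ten years: {growth_factor(120):,.0f}x")
print(f"Months for 1000x growth: {months_to_reach(1000):.1f}")
```

Sustained for a decade, a seven-month doubling time corresponds to a growth factor of 2^(120/7), which is on the order of a hundred thousand.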
Today, emerging techniques promise to assemble entire genomes on a standard laptop at a rate that is a hundred times faster than current state-of-the-art approaches.
However, volume and velocity are just two dimensions of the big data challenge currently facing genomics. Data heterogeneity in modern bioinformatics research covers a vast range of diverse data types spanning multiple omics disciplines (chemogenomics, metagenomics, pharmacogenomics, toxicogenomics), diverse datasets (epigenomics, transcriptomics, proteomics, metabolomics), and a variety of sources (electronic health records, medical imaging data, social networks, wearables, etc.).
5.1.3. Interpreting OMICs Big Data
The rapid growth in large-volume, multidimensional, heterogeneous genomics datasets opens up a whole new realm of opportunities in the more comprehensive interpretation of omics big data. As high-throughput technologies enable the generation of more data from different molecular layers of biological systems, conventional single-layer analysis techniques will be wholly inadequate in comprehensive knowledge extraction. Indeed, it has been shown that integrating multi-dimensional datasets can generate better results both from a statistical and biological perspective.
The challenges of genomic big data notwithstanding, there is a huge opportunity for research to undergo a radical shift from a conventional reductionist approach to a more holistic systems biology approach to understanding the complexities of biological systems.
However, the research ecosystem is still predominantly characterised by a sheer diversity of techniques developed in the era of single-layer analysis that presents the challenge of selecting and combining the right tools into one streamlined process – and the need for advanced programming and technical skills to put these composite workflows together. As a result, a key priority is the development of a conceptually unified framework that would enable the integrative analysis of multi-layer datasets.
Read more: Integrated Multi-Omics
5.2. Machine Learning in Bioinformatics
5.2.1. What is Bioinformatics?
Source: North American University
Bioinformatics is a still-evolving subdiscipline of biology, computer science and mathematics, involving the acquisition, storage, analysis, and dissemination of biological data. It encompasses the science of developing and utilising computer databases and algorithms to accelerate and enhance biological research, as well as the analysis of biological information using computers and statistical techniques.
The foundation of modern bioinformatics traces back to the 1960s and the application of computational methods to protein sequence analysis. It subsequently expanded to DNA analysis following advances in molecular biology methods, the rise of powerful computers, and the availability of software suited to bioinformatics tasks.
Today, bioinformatics is essentially a big data discipline that combines sophisticated bioinformatics tools and advanced computational analysis techniques. It has also produced several sub-disciplines, such as synthetic biology, systems biology, and whole-cell modelling.
5.2.2. What is Machine Learning?
Machine learning (ML) is a form of artificial intelligence (AI) that enables software applications to more accurately identify patterns and relationships and predict outcomes by learning from historical data.
There are four subcategories of ML:
- Supervised learning, where models are trained on labelled data sets that allow them to learn and become more accurate over time. A primary disadvantage with this approach is that the data has to be manually labelled, which is a time-consuming and expensive affair for large volume data.
- Unsupervised machine learning, where algorithmic programs look for patterns, trends, and connections in unlabelled data that people are not explicitly looking for. A key disadvantage of this approach is that the application spectrum is limited.
- Semi-supervised learning addresses the shortcomings of the previous models by using training data that combines a small number of labelled data and a large number of unlabelled data.
- And finally, there is reinforcement learning, where models are trained through trial and error for optimal behaviour based on a reward system.
5.2.3. How is Machine Learning Used in Bioinformatics?
The exponential increase of omics data has made machine learning imperative for the effective and efficient analysis of genomic big data. ML techniques are increasingly being applied across the spectrum of biological research, including gene prediction, protein structure analysis, biomarker analysis for disease research, microarray examination, and metabolic pathway determination.
In the genomics domain, a combination of ML and NLP (Natural Language Processing) has enabled the quick and accurate analysis of large amounts of data for relation extraction and named entity recognition. ML applications have also been successfully used to predict and classify gene expression, protein structures, and mutations.
5.2.4. How Important is Machine Learning to Overcoming the Challenges in Sequencing Analysis?
ML algorithms have been used to predict gene essentiality and identify the minimal genes required for the survival of an organism. Even for organisms with limited data for ML models, an approach combining an unsupervised feature selection technique, a dimension reduction algorithm, and a semi-supervised ML algorithm with a very limited labelled dataset has proven to be effective in predicting essential and non-essential genes from genome-scale metabolic networks.
In molecular evolution research, ML methods have helped analyze increasingly massive sets of sequences and other omics data. ML techniques, including CNNs, hierarchical clustering, and k-means clustering, have been used together with conventional proteomic methods to predict and analyze post-translational modifications.
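As an illustration of one of the techniques named above, the following is a minimal k-means sketch in plain Python. It is a teaching example under simplifying assumptions: the centroids are initialised from the first k points so that the result is reproducible, and the two-dimensional feature vectors are invented rather than derived from real proteomic data.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    # Deterministic start for the sketch: use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical two-feature profiles forming two visible groups.
profiles = [(1.0, 1.2), (8.0, 8.5), (0.8, 1.0), (1.1, 0.9), (8.3, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
print(centroids)
```

Because no labels are supplied, this is an example of unsupervised learning: the grouping emerges from the distances between the profiles alone.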
ML is also facilitating a systems biology approach to modelling and analysis by enabling the integration of diverse data types into established biological networks and combining different systems biology approaches to investigate multi-omics datasets.
In certain applications, like skin cancer or breast cancer detection, the diagnostic performance of deep learning models has been demonstrated as being on par with that of healthcare professionals. One deep learning model, in particular, has even outperformed full-time breast-imaging specialists in mammogram classification, with a 14% average increase in sensitivity.
NLP Solution by BioStrand
The BioStrand solution was engineered from the ground up to fully leverage the potential of AI/ML technologies to accelerate bioinformatics innovation and augment data-driven decision-making in multi-omics research. Our HYFTs™ IP provides a universal framework for researchers to quickly and effortlessly integrate a wide range of biological, medical and clinical data and metadata, including sequence data across species, domain & regulation, patient records, clinical trials, ICD codes, lab tests, etc.
With BioStrand NLP Link, a biomedical domain-specific NLP solution, researchers can integrate textual data from various sources such as sequence descriptions, annotations, scientific/medical literature, healthcare data, electronic patient records etc. The BioStrand NLP solution can quickly and accurately mine vast volumes of biomedical texts for research relevant to augment hypothesis generation and broaden the scope of discovery.
The BioStrand solution combines a completely automated approach to the normalization and integration of diverse types of biological, medical and clinical data into a unified repository with best-in-class AI/ML-based analytical tools and techniques to maximize the accuracy and efficiency of integrated multi-omics research.
Check out our NLP Solution here.
5.3. Cloud Computing in Bioinformatics
5.3.1. Cloud Computing Enabled Big Multi-OMICs Data Analysis
The Cancer Genome Atlas (TCGA), one of the world’s largest and richest collections of genomic data, is a comprehensive and coordinated effort to accelerate the understanding of the molecular basis of around 200 types of cancer. The 2.5 petabytes of publicly available data uses 7 different data types to describe 33 different tumour types, including 10 rare cancers, based on paired tumour and normal tissue sets from 11,000 patients. It is, in short, a critical resource for driving cutting-edge cancer studies across the globe.
However, the biggest challenge is making these huge volumes of diverse data accessible to researchers from around the world and providing them with the relevant bioinformatics, mathematical, and collaborative tools, as well as the computing power to analyze the data.
Scalable, cloud-based data science infrastructure like the NCI Cancer Research Data Commons (CRDC) addresses this challenge by bringing together data and computational power to accelerate cancer research and discovery. The cloud model upends the traditional model of downloading and storing large datasets to be analyzed with on-site tools. Instead, these cloud-based platforms give researchers web-based access to all relevant data in the cloud, together with best-practice tools and pipelines, on-demand computational capacity and collaborative functionality.
Cloud-based multi-omics platforms represent the optimal paradigm for multi-omics analysis given the exponential increase in the volume, variety, and velocity of biomedical data and the corresponding shift to a multi-omics systems biology approach to data analysis. These platforms offer researchers easy on-demand access to a shared pool of data storage and computing resources, which will increasingly play a central role in accelerating time-to-knowledge while ensuring enhanced performance, granular cost management and seamless collaboration.
5.3.2. Biomedical and Multi-OMICs Data Sources
A cloud-based approach also takes biological research a step closer to the ideal model of integrated multi-omics analysis, where multiple data sets are normalized and integrated into a common data structure and analyzed in parallel with a unified/universal analytics framework. The biggest advantage of this integrated model is that it will allow researchers to assess the flow of information from one omics level to the other, which will help in bridging the gap from genotype to phenotype.
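What "normalized and integrated into a common data structure" can mean in practice is sketched below. This is a simplified illustration, not a description of any specific platform: each feature is z-score normalised across the samples shared by all layers, and the layers are merged into one table keyed by sample identifier. The layer names, feature names, sample identifiers, and values are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values_by_sample):
    """Scale one feature to zero mean and unit variance across samples."""
    values = list(values_by_sample.values())
    mu, sigma = mean(values), pstdev(values)
    return {s: (v - mu) / sigma if sigma else 0.0 for s, v in values_by_sample.items()}


def integrate(layers):
    """Merge {layer: {feature: {sample_id: value}}} into {sample_id: {"layer:feature": z}}."""
    # Only samples measured in every feature of every layer can be compared.
    shared = set.intersection(*[
        set(samples) for features in layers.values() for samples in features.values()
    ])
    table = {sample: {} for sample in sorted(shared)}
    for layer, features in layers.items():
        for feature, values in features.items():
            scaled = zscore({sample: values[sample] for sample in table})
            for sample in table:
                table[sample][f"{layer}:{feature}"] = scaled[sample]
    return table


# Hypothetical measurements; sample P4 is missing from the RNA layer.
layers = {
    "rna": {"GENE_A": {"P1": 10.0, "P2": 12.0, "P3": 14.0}},
    "protein": {"GENE_A": {"P1": 0.5, "P2": 1.5, "P3": 1.0, "P4": 2.0}},
}
for sample, row in integrate(layers).items():
    print(sample, {key: round(value, 2) for key, value in row.items()})
```

Once every layer sits in the same sample-by-feature table on a comparable scale, the layers can be analysed side by side; real pipelines additionally have to deal with batch effects, missing values, and identifier mapping, which this sketch leaves out.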
It is currently quite difficult, however, to consistently execute the integrated multi-omics model across the board given the predominance of domain-specific data sources, the manual effort required to normalize and integrate diverse datasets and the lack of universal analytics frameworks.
Some examples of such data sources, and the data they provide, include:
- The Cancer Genome Atlas (TCGA): RNA-Seq, DNA-Seq, miRNA-Seq, SNV, CNV, DNA methylation, and RPPA
- Clinical Proteomic Tumor Analysis Consortium (CPTAC): proteomics data corresponding to TCGA cohorts
- International Cancer Genomics Consortium (ICGC): whole genome sequencing, genomic variations data (somatic and germline mutation)
- Cancer Cell Line Encyclopedia (CCLE): cancer cell lines; gene expression, copy number, and sequencing data; pharmacological profiles of 24 anticancer drugs
- Molecular Taxonomy of Breast Cancer International Consortium (METABRIC): clinical traits, gene expression, SNP, and CNV; gene expression, miRNA expression, copy number, and sequencing data
- Omics Discovery Index: consolidated data sets from 11 repositories in a uniform framework; genomics, transcriptomics, proteomics, and metabolomics
5.3.3. Cloud-Based OMICs Challenges
Cloud-based multi-omics is not without its challenges. First, some of the common challenges associated with cloud computing, such as data privacy and security, will be especially amplified given all the sensitive clinical data involved. The second challenge is to create a standardized approach to data normalization and integration across multiple public and proprietary data repositories. And finally, the third challenge is ensuring interoperability and portability across multiple platforms to ensure the seamless and secure flow of biological data.