In the life sciences, massive amounts of omics data are produced, spanning the genome, transcriptome, proteome and metabolome. These data have led to great insights and scientific breakthroughs in fields such as pharmaceuticals and biotech, gene therapy, virology, agriculture and climate change.
But with the volume of omics data expected to reach around 40 exabytes by 2025, generating more data is not without issues, especially when it comes to analysing and integrating that data. There is therefore a need to improve the complete analysis process from start to finish, so that data acquisition, evaluation, comparison, and reporting of results can occur much faster and more easily.
1. DNA and RNA Sequencing
The Human Genome Project is the best-known large-scale genome project. It was conducted as a collaboration between many countries across the globe with the aim of sequencing the whole human genome. This groundbreaking research took 13 years to finish and provided vast, valuable information that advanced the field of human genetic research. It also paved the way for progress in other scientific areas, including crop improvement, forensics and population genetics, as well as for advances in sequencing technologies. In recent years, sequencing technology has become more accessible and affordable, opening a wide range of possibilities in scientific research.
1.1. What is sequencing?
The fundamental units of DNA are base pairs, formed between nucleotides on opposite strands of the double helix. There are four different nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T); A pairs with T and C pairs with G. Long sequences of these pairs make up the DNA of our chromosomes. The purpose of sequencing is to determine the exact order of these base pairs in our genetic code. This order of nucleotides determines the functioning of our genes, phenotype, metabolism, genetic risks, diseases etc. Sequencing can help identify a wide range of variants in the DNA, such as insertions, deletions and substitutions, that result in different functions and traits.
For decades, Sanger sequencing, which relies on fluorescently labelled chain-terminating nucleotides and electrophoresis, was the gold standard sequencing technology. In traditional Sanger sequencing, the DNA molecules were first cloned inside bacteria, amplified, and then sequenced by having the enzyme DNA polymerase incorporate fluorescently labelled nucleotides. With the advancement of technology, first-generation sequencing has largely been replaced by next-generation sequencing (NGS) technologies, which are high-throughput, affordable and accessible. NGS methods that have since been developed include:
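As a small illustration of base pairing and of what a sequence variant looks like once the order of bases is known, here is a minimal Python sketch. The function names are hypothetical and chosen only for this example; the comparison assumes two sequences of equal length and therefore finds substitutions only, not insertions or deletions.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (0-based position, reference base, sample base) where the sequences differ.

    Assumption: both sequences have the same length, so only substitutions
    are detected (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


print(reverse_complement("ATGC"))            # GCAT
print(find_substitutions("ATGCA", "ATGTA"))  # [(3, 'C', 'T')]
```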
- Sequencing by synthesis (SBS) by Illumina
- Single-molecule real-time sequencing by PacBio
- Nanopore technology sequencing by Oxford Nanopore
- Sequencing by ligation or SOLiD sequencing by ThermoFisher Scientific
- Combinatorial probe anchor synthesis (cPAS) by BGI/MGI
- Ion semiconductor or Ion Torrent sequencing by ThermoFisher Scientific
Sequencing can target DNA or RNA, depending on the downstream application, for example biomarker discovery or support for molecular diagnostics.
Why is sequencing important?
The technology has become robust over the years, and the range of uses and benefits of NGS is correspondingly broad. NGS can effectively support a wide range of genetic analysis research applications such as:
- Whole-genome sequencing, sequencing and studying entire genomes,
- Genotyping, studying variation in the sequences,
- Transcriptome and gene expression, analyzing the differential expression of the transcripts and genes in each set of conditions,
- Epigenetics, identifying heritable changes in regulating the genes.
NGS also allows researchers to study other areas, including microbial diversity and evolution, with increased precision. It has become an indispensable tool in genomic research, providing many valuable insights into complex biological systems ranging from cancer genomics to diverse microbial communities.
1.2. Evolution of sequencing technologies
Sequencing is a fast-moving field that is hard to keep up with, as the platforms and technologies on the market have evolved rapidly.
1.2.1. First generation sequencing
DNA sequencing was first developed in the late 1970s as a gel-based method that combines DNA polymerase with a mixture of standard nucleotides (dNTPs) and chain-terminating nucleotides (ddNTPs). Mixing dNTPs with ddNTPs causes random early termination during synthesis; the resulting fragments are sorted by length and visualized with gel electrophoresis. The technique was revolutionary at the time, enabling the sequencing of 500-1000 bp fragments. Later, in the 1980s, Sanger sequencing was automated: the radioactively labelled nucleotides were replaced with dye-labelled nucleotides, and large gels were replaced with acrylic-fiber capillaries. These improvements made projects as large as the Human Genome Project feasible, and by the mid-2000s the cost of sequencing had been dramatically reduced.
1.2.2. Second-generation sequencing
Next-generation sequencing started with Solexa, which was later acquired by Illumina. A key innovation of this platform is bridge amplification, which forms densely clustered amplified fragments on a silicon chip. Each molecule is amplified into a cluster of many copies so that the fluorescent signal can be detected as labelled dNTPs are added one at a time during synthesis. Illumina became one of the first commercially available massively parallel sequencing technologies, based on the principle of ‘Sequencing by Synthesis’. Over time, other platforms were developed, including Ion Torrent, which reduced sequencing costs, offers read lengths of ~50-500 bp and is an excellent fit for multiple applications (SNP calling, targeted sequencing).
1.2.3. Third-generation sequencing
Third-generation sequencing addresses the limitations of short reads, which are not suitable for all sequencing projects. One approach is Single-Molecule, Real-Time (SMRT) sequencing from PacBio. This technique uses miniature wells in which a single polymerase incorporates labelled nucleotides, and the light emission is measured in real time during this process. Another approach to long-read sequencing was adopted by Oxford Nanopore Technologies: a different single-molecule approach that uses pore-forming proteins and electrical detection. SMRT sequencing is the most notable and can produce long reads with good precision. Long reads address the challenge that repetitive regions pose to short-read sequencers, whose short snippets cannot span them. Further advantages include the ability to sequence extreme-GC and AT-rich regions, which amplify poorly during cluster generation on short-read platforms, and the ability to detect DNA methylation during sequencing, as no amplification is done on the instrument.
1.3. What is single-cell RNA sequencing?
Individual cells coordinate and function together to determine complex biological functions or traits. Conventional methods like bulk RNA-seq provide an overview of differential expression but are unable to reveal the cellular heterogeneity that drives this complexity. Single-cell RNA (ScRNA) sequencing is an NGS method that examines individual cells, providing a high-resolution view of cell-to-cell variation.
Conventional bulk population sequencing provides only the average expression signal of a group of cells. Increasing evidence, however, suggests that expression is heterogeneous even among similar cell types, and that this stochastic expression is reflected in cell composition and cell fate decisions. Transcriptome analysis at single-cell resolution was pioneered by James Eberwine et al. and by Iscove and colleagues; the approach was first applied commercially to high-density DNA microarray chips and eventually adapted for ScRNA sequencing. The first ScRNA study was published in 2009 and characterized cells from the early developmental stages of mice. This study drew the interest of researchers because of its high-resolution views of single-cell heterogeneity on a global scale.
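The difference between an averaged bulk signal and per-cell measurements can be made concrete with a toy calculation. The sketch below uses made-up expression counts for a single gene in six cells; the numbers and function names are purely illustrative and do not come from any real dataset.

```python
def bulk_signal(counts_per_cell: dict[str, int]) -> float:
    """Average expression across all cells, roughly what a bulk measurement reports."""
    return sum(counts_per_cell.values()) / len(counts_per_cell)


def expressing_fraction(counts_per_cell: dict[str, int], threshold: int = 0) -> float:
    """Fraction of cells whose count exceeds the threshold (a per-cell view)."""
    expressing = sum(1 for count in counts_per_cell.values() if count > threshold)
    return expressing / len(counts_per_cell)


# Hypothetical counts: half of the cells do not express the gene at all.
cells = {"cell_1": 0, "cell_2": 0, "cell_3": 0,
         "cell_4": 60, "cell_5": 58, "cell_6": 62}

print(bulk_signal(cells))          # 30.0
print(expressing_fraction(cells))  # 0.5
```

The bulk value of 30 describes none of the six cells well: three are silent and three express at about twice that level, which is the kind of heterogeneity single-cell resolution is meant to expose.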
1.3.1. Why is single cell sequencing important?
ScRNA sequencing is quickly becoming a standard tool in scientific research, providing insights of great scale and depth into diverse cell populations. Microfluidic technology has transformed the ability of researchers to isolate single cells and study them further using ScRNA-seq. Consistency and reproducibility are urgently needed and are critical for advances in drug development and precision medicine. In conventional bulk sequencing methods, manual handling by users introduces variability and reproducibility issues. Automated single-cell gene expression pipelines address these critical concerns and can free up considerable time and resources.
1.3.2. What are the advantages of single-cell RNA sequencing?
Single-cell and ultra-low-input RNA-seq allow researchers to extract high-resolution cellular data from minimal input in an unbiased manner. With the advent of ScRNA-seq, researchers can perform robust transcriptome analysis at the level of a single cell. This high-resolution analysis enables them to discover cellular differences that are masked by bulk sampling methods. ScRNA-seq is particularly helpful for analyzing rare cell types, where other techniques are limited: it can characterize otherwise hidden populations and measure their gene expression.
Cancer is one of the most complex diseases to understand, given its heterogeneous nature. To ensure effective diagnosis and treatment of different cancers, it is important to understand the early stages of their development from cancer stem cells. ScRNA sequencing has proved effective for understanding intra-tumoral heterogeneity and for mapping the clones in a tumor. It is increasingly used in research on different types of cancer, including lung, breast, renal and hepatocellular carcinoma. ScRNA sequencing also enables the exploration of complex networks across different cell types, through its integration and application in functional genomics, immunology, oncology and stem cell biology.
Beyond these benefits, ScRNA sequencing technology is constantly evolving and being applied to new areas. Emerging, deeply focused studies strengthen our knowledge, provide novel insights into biological systems and create new opportunities for therapeutic development.
2. An overview of sequencing techniques
2.1. Chain termination method or Sanger sequencing
This method was developed by Frederick Sanger and was a major technological breakthrough. Based on this technology, the Human Genome Project was completed in 2003. Sanger sequencing is based on PCR (polymerase chain reaction) to make multiple copies of a target DNA region. The ingredients needed for the whole process are a DNA polymerase enzyme, a primer that serves as a starter for the PCR process, the 4 DNA nucleotides and, unique to the Sanger method, modified DNA nucleotides called dideoxyribonucleotide triphosphates (ddNTPs), which are chain-terminating and each labelled with a base-specific fluorescent dye.
Once a ddNTP has been added to the chain, the reaction stops. The PCR process is repeated a number of times to make sure that a ddNTP is incorporated at every position of the target DNA. After that, all fragments pass through capillary gel electrophoresis. The longer the fragments, the slower they move through this tube filled with gel matrix, so each fragment length has a characteristic speed. At the end of the tube a laser illuminates each passing fragment, and the attached dye is detected. From the order of the dye colours, the sequence of the original DNA template can be reconstructed.
Sanger sequencing produces high-quality results for DNA lengths of approximately 900 base pairs. Although next-generation sequencing techniques with high throughput volumes are now widely available, Sanger sequencing is still used as a method to confirm sequence variants identified by NGS. Sanger sequencing can also be used to solve some NGS coverage problems, e.g. regions rich in GC content that might be poorly covered by NGS.
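The read-out logic described above (one terminated fragment per position, sorted by length, identified by the dye on its last base) can be sketched in a few lines of Python. This is a deliberately simplified model: it ignores strand orientation, assumes exactly one fragment per position, and the dye colours are hypothetical placeholders.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Hypothetical dye colours; real chemistries define their own assignment.
DYE_FOR_BASE = {"A": "green", "T": "red", "C": "blue", "G": "yellow"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}


def terminated_fragments(template: str) -> list[str]:
    """One chain-terminated copy ending at every position of the template."""
    copied = "".join(COMPLEMENT[base] for base in template)
    fragments = [copied[:end] for end in range(1, len(copied) + 1)]
    random.shuffle(fragments)  # in the tube, fragments are in no particular order
    return fragments


def read_sequence(fragments: list[str]) -> str:
    """Electrophoresis sorts by length; the dye on each last base gives one letter."""
    by_length = sorted(fragments, key=len)
    dyes = [DYE_FOR_BASE[fragment[-1]] for fragment in by_length]
    return "".join(BASE_FOR_DYE[dye] for dye in dyes)


template = "ATGCCGTA"
copied_strand = read_sequence(terminated_fragments(template))
recovered = "".join(COMPLEMENT[base] for base in copied_strand)
print(copied_strand)  # TACGGCAT
print(recovered)      # ATGCCGTA
```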
2.2. Next-generation sequencing
Characteristic for NGS is the high throughput volumes that can be achieved at relatively low costs. A whole NGS workflow consists of library preparation, sequencing and data analysis.
2.2.1. Short-read sequencing
Short-read sequencing technologies typically produce reads that are 250-800 base pairs long.
---- DNA- and RNA-seq library preparation
DNA-seq can include Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES), epigenome sequencing and targeted sequencing (TS). Two methods for template preparation are mainly used: PCR and hybridisation capture.
As in any PCR, the same ingredients are included: a template, primers, dNTPs, DNA polymerase and buffer. All reagents are mixed together in a tube that goes into a thermal cycler. The PCR reaction consists of 3 distinct steps: denaturation (separating the double-stranded DNA), annealing (primers bind to the template DNA at specific positions) and extension (the DNA polymerase attaches to one end of the primer and synthesizes DNA complementary to the template DNA). By raising the temperature again at the end of the cycle, all double-stranded DNA molecules denature to single strands. In each complete cycle the amount of DNA is doubled.
In hybridisation capture-based preparation of templates, long biotinylated oligonucleotide baits (probes) are used to hybridise to the regions of interest. After that, streptavidin-coated magnetic beads are introduced to separate the bait/target fragment complexes from fragments not bound to baits. This method is used in particular for WES and TS.
DNA-seq library construction further involves fragmentation, end-repair, adaptor ligation and size selection. Fragmentation aims at shearing DNA to the optimal size range for the sequencing platform of choice. Three methods exist: physical (acoustic shearing), enzymatic and chemical (heat). The fragments are then end-repaired and ligated to adaptors. Adaptors have defined lengths and often include a barcode, a unique sequence that identifies samples in the case of multiplex sequencing (multiple samples are pooled and sequenced simultaneously in the same run). During data analysis, these barcodes allow reads to be assigned to individual samples. Size selection might then be performed based on gel electrophoresis or using magnetic beads.
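The doubling per cycle implies exponential growth in the number of copies. A minimal sketch of that arithmetic follows; the `efficiency` parameter is an assumption added for illustration (1.0 corresponds to the perfect doubling described above, while real reactions typically fall somewhat short of it).

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Copies after `cycles` rounds: N0 * (1 + efficiency) ** cycles.

    efficiency = 1.0 means every molecule is copied in every cycle (perfect doubling).
    """
    return initial_copies * (1 + efficiency) ** cycles


print(pcr_copies(1, 10))  # 1024.0
print(pcr_copies(1, 30))  # 1073741824.0
```

Under ideal doubling, a single template molecule yields about a billion copies after 30 cycles.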
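To illustrate how barcodes let pooled reads be assigned back to samples, here is a minimal demultiplexing sketch. It assumes, for simplicity, that the barcode sits at the very start of each read and must match exactly; the barcode sequences and sample names are invented for the example, and real pipelines usually handle separate index reads and tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads: list[str], barcodes: dict[str, str]) -> dict[str, list[str]]:
    """Group reads by sample using an exact match on the leading barcode.

    The barcode is stripped from each assigned read; reads without a known
    barcode are collected under the key "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:  # no barcode matched
            by_sample["unassigned"].append(read)
    return dict(by_sample)


barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}  # hypothetical barcodes
reads = ["ACGTGGGCCC", "TTAGAAATTT", "CCCCGGGG"]
print(demultiplex(reads, barcodes))
# {'sample_A': ['GGGCCC'], 'sample_B': ['AAATTT'], 'unassigned': ['CCCCGGGG']}
```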
RNA-seq can include Whole Transcriptome Sequencing (WTS), mRNA sequencing (mRNA-seq) and small RNA sequencing (smRNA-seq). Sample preparation generally includes total RNA isolation, target RNA enrichment and reverse transcription of RNA into complementary DNA (cDNA).
---- Sequencing platforms
The sequencing principle used for short reads is sequencing by synthesis and involves two steps: clonal amplification and sequencing.
Prior to sequencing, the DNA library must be attached to a solid surface. Amplification is necessary to increase the signal coming from each target during sequencing. The solid surfaces to which the unique DNA molecules bind are beads or flow cell surfaces. Depending on the sequencing platform, emulsion PCR (Ion Torrent) or bridge PCR (Illumina) is used to amplify the anchored DNA fragments.
On the Ion Torrent platform, during sequencing, a semiconductor chip (ion sensor) is flooded with unmodified A, C, T or G nucleotides one after another. Incorporation of a single nucleotide releases a hydrogen ion, resulting in a pH change that is measured by the ion sensor. If the next nucleotide that floods the chip is not a match, then no change is detected, and no base is called.
On the Illumina platform, sequencing is based on the optical readout of fluorescent nucleotides as they are incorporated by a DNA polymerase. Each nucleotide contains a fluorescent tag and a reversible terminator that blocks incorporation of the next base. The fluorescent signal indicates which nucleotide has been added. After each cycle the terminator is cleaved, allowing the next base to bind. In addition, Illumina NGS platforms are capable of paired-end sequencing, in which a DNA fragment is sequenced from both ends, generating high-quality sequence data with in-depth coverage and high numbers of reads.
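The flow-based read-out can be mimicked with a small simulation. In the sketch below, the flow order "TACG" is an arbitrary assumption, the signal is idealised as the exact number of bases incorporated in a flow (so a run of identical bases gives a proportionally larger signal), and the input is the strand being synthesised rather than the template.

```python
def flow_signals(synthesized: str, flow_order: str = "TACG",
                 max_flows: int = 100) -> list[tuple[str, int]]:
    """Simulate nucleotide flows over a chip well.

    Each flow yields (nucleotide, number of bases incorporated);
    a count of 0 means no pH change, so no base is called for that flow.
    """
    signals = []
    position = 0
    flow = 0
    while position < len(synthesized) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        while position < len(synthesized) and synthesized[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append((nucleotide, incorporated))
        flow += 1
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Turn per-flow signals back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flow_signals("TTAG")
print(signals)              # [('T', 2), ('A', 1), ('C', 0), ('G', 1)]
print(call_bases(signals))  # TTAG
```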
2.2.2. Long-read sequencing
Long-read sequencing technologies can produce reads > 10 kb directly from native DNA. These technologies circumvent the need for PCR, sequencing single molecules without prior amplification steps. This is an advantage as PCR can cause errors in the amplification process. Today, for long-read sequencing two main techniques are used.
---- SMRT Sequencing (PacBio)
Single-Molecule Real-Time sequencing is a third-generation sequencing method for DNA and RNA. The DNA to be sequenced is turned into a SMRTbell template. This template is created by ligation of hairpin adapters (SMRTbell adapters) to both ends of the double-stranded DNA. The sequencing reaction takes place in a SMRTcell chip with many small pores called zero-mode waveguides (ZMW). Each ZMW contains an individual DNA polymerase, which enables the sequencing of a single SMRTbell template. During replication, four fluorescently labeled nucleotides with unique emission spectra are used. As the anchored polymerase incorporates a labeled base, a signature light pulse is emitted and measured in real time. As the template is circular, the polymerase can continue sequencing through the hairpin adapter to replicate the second DNA strand. Sequencing of one strand is called a ‘pass’. The sequence obtained from each ZMW is called a continuous long read (CLR). The adapter sequences are removed to retain the DNA templates in between, resulting in multiple ‘subreads’ that are collapsed into a HiFi read (highly accurate long read).
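The idea of collapsing several passes over the same molecule into one more accurate read can be illustrated with a simple majority vote. This is only a conceptual sketch, not PacBio's actual consensus algorithm: it assumes the subreads are already aligned and of equal length, and the example sequences are invented.

```python
from collections import Counter


def consensus(subreads: list[str]) -> str:
    """Majority vote per position across equal-length, pre-aligned subreads."""
    if len({len(subread) for subread in subreads}) != 1:
        raise ValueError("need at least one subread, all of the same length")
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in this column
        for column in zip(*subreads)
    )


# Five hypothetical passes over the same molecule, two containing a random error.
subreads = ["ACGTAC", "ACGAAC", "ACGTAC", "TCGTAC", "ACGTAC"]
print(consensus(subreads))  # ACGTAC
```

Because errors in individual passes fall at different positions, they are outvoted by the correct base from the other passes.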
PacBio technology can also be used for RNA sequencing by a technique termed Iso-Seq. Using the Iso-Seq method, entire transcripts, including any isoforms, can be sequenced. In this method, RNA is converted to cDNA, and HiFi sequencing is used to generate sequencing data.
---- Oxford Nanopore Technology (ONT)
ONT sequencing is based on the passage of single-stranded nucleic acid (DNA or RNA) through a protein nanopore. The DNA templates are loaded onto a flow cell containing a membrane embedded with hundreds to thousands of nanopores. A preloaded motor enzyme, along with an applied ion current, moves the single strand through the pore. The passage of each nucleotide through the pore results in a characteristic disruption in the ion current, which is detected by sensors.
Beyond DNA sequencing, ONT may be used to sequence RNA and detect DNA and RNA modifications. Similar to PacBio, ONT can sequence full-length RNA as cDNA. However, ONT also has the ability to use native RNA.
3. Omics data analysis
3.1. Where are we today?
Although the core questions in genetic research are related to disentangling the associations between DNA, RNA and protein, the current tools and methods of data analysis are not oriented towards integration of knowledge.
Today, data analysis is characterised by fragmentation. Whether you are interested in finding similarities or in detecting variations, all omics data analysis is organised in silos, with different applications for each analytical step. Along the different steps in the analysis process, outputs are generated in different formats. Although automated pipelines are available, the process of analysis remains time-consuming and very complex. Only highly trained professionals are able to perform these analyses.
The integration of biological databases is also lacking. So, it is very difficult to query more than one database at a time. Currently, there is also no way to combine analysis in genomics, transcriptomics and proteomics which has proven to be a blocking factor. It is certainly not very helpful in maintaining oversight and easily detecting novel relationships.
Moreover, the current algorithms are far from flawless and result in an accumulation of errors during the analysis process. Additionally, most algorithms are computationally very intensive which results in slow processing times.
New developments in omics data analysis technologies should be aimed at integration of knowledge and at increasing precision of analysis. This would bring a high level of accessibility, efficiency and accuracy to the field.
Further downstream, advanced analysis methods such as machine learning or graph reasoning can only produce meaningful insights and predictions when the data that serve as input are of high quality. No classification or prediction algorithm can compensate for poor-quality input data. So, in order to make better models, such as those relating to disease mechanisms or drug target development, we need algorithms for detecting similarities and variations in DNA, RNA and proteins that produce highly accurate results. Only then will we be able to deliver better insights and better predictions, leading to real advancements in precision medicine and other fields of science.
Integration of data analysis between genomics, transcriptomics and proteomics would not only expand the search field but also bridge the gap between isolated silos. It would facilitate the discovery of novel relationships, such as those between species or in gene transcription processes, and of other kinds of knowledge necessary for progress in medicine and other life sciences.
To solve these challenges the BioStrand solution compresses multiple and often disparate stages of traditional omics data analysis into one simple, intuitive and user-friendly interface with the technology doing all the heavy lifting in the background. It eliminates all the usual challenges of building complex pipelines, finding access to multiple databases, and navigating the steep learning curve of a disparate tool environment. The solution actualizes the principle of ‘Data in, Results out’ to streamline and accelerate knowledge extraction and time-to-value.
Search is multi-domain and is as simple as inputting text or pasting bio-sequences, with the results displayed on three levels: DNA, RNA, and AA. Drill down, filter, and extrapolate through the results and combine multiple dimensions, such as taxonomy or ontology, to quickly discover novel functional relationships. Take a microscopic view down to the sequence level or discover other useful visual applications such as ontology maps, frequency tables, or multiple sequence alignment views.
In short, the BioStrand platform is designed to maximize researchers’ view of their data with integrated, comprehensive, and accurate results that accelerate time-to-insight and -value.
3.2. The data analysis steps
The whole data processing is mostly subdivided into 3 steps:
3.2.1. Primary analysis
Generally, primary analysis takes place inside the sequencing platform and consists of converting raw signals into nucleotide base calls. Furthermore, a quality check is performed: reads are filtered out based on base call quality scores (Phred scores) and read length. In the case of multiplexing, i.e. when multiple samples have been sequenced simultaneously, the reads are separated into different files according to the attached barcode. Trimming is also performed: adaptor sequences and poor-quality bases at the ends of reads are removed. The output of primary analysis is a FASTQ file.
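As a rough illustration of quality scoring and trimming, the sketch below decodes a FASTQ quality string and removes low-quality bases from the end of a read. It assumes the common Phred+33 encoding, where a score Q corresponds to an error probability of 10^(-Q/10); the threshold of 20 and the function names are illustrative choices rather than fixed standards.

```python
def phred_scores(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def trim_low_quality_end(sequence: str, quality_string: str,
                         min_score: int = 20) -> tuple[str, str]:
    """Drop bases from the 3' end while their score is below min_score."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_score:
        end -= 1
    return sequence[:end], quality_string[:end]


print(phred_scores("I5#"))                       # [40, 20, 2]
print(trim_low_quality_end("ACGTAC", "IIII##"))  # ('ACGT', 'IIII')
```

A score of 20 corresponds to a 1% chance of a wrong base call, and a score of 40 to 0.01%.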
3.2.2. Secondary analysis
In this step the reads are aligned to a reference genome or a de novo assembly is performed to then call all the variants. Typical file formats are produced: SAM (sequence alignment map), BAM (binary alignment map, a compressed version of SAM) and VCF (variant call format).
---- Multiple Sequence Alignment
A multiple sequence alignment (MSA) arranges protein sequences into a rectangular array with the goal that residues in a given column are homologous (derived from a single position in an ancestral sequence), superposable (in a rigid local structural alignment) or play a common functional role. To this extent, there is no right or wrong alignment; rather, there are different models that reflect different biological perspectives.
Two general ways of thinking about alignments involve consideration either of the degree of similarity shared across the full sequence lengths or of the similarity that’s confined to specific regions of the sequences: the former results in a global alignment, the latter produces a local alignment. Many tools exist to perform local or global alignments.
Two distinct computational methods are used. ‘Dynamic programming’ is formally exact and accurate but lacks scalability, and is thus only feasible for small sequences. To address this problem, approximate (heuristic) algorithms were developed. Yet these heuristic methods cannot accommodate the big data volumes that need to be computed. BioStrand has developed a completely new algorithm around a very efficient sorting principle called HYFTs™ that addresses the big data and scalability problem.
Figure: on the left, the classical MSA method, which is computationally hard; on the right, the BioStrand MSA. The sequences marked in red and orange represent HYFTs that are identified and function as a very efficient sorting mechanism.
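To make the scalability limit of dynamic programming tangible, here is a minimal sketch of the classical Needleman-Wunsch global alignment score for just two sequences. It is a textbook pairwise method and says nothing about how the BioStrand algorithm works; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch optimal global alignment score for two sequences.

    Fills a (len(a)+1) x (len(b)+1) table row by row, so time grows with
    len(a) * len(b); only the previous row is kept in memory.
    """
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal,               # align a[i-1] with b[j-1]
                               previous[j] + gap,      # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current
    return previous[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "AGT"))         # 1
```

Even for two sequences the work is proportional to the product of their lengths; extending exact dynamic programming to many sequences at once multiplies the table dimensions, which is why it becomes infeasible for larger inputs.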
---- De Novo assembly
In de novo assembly no reference is used, and reads are aligned to each other based on their sequence similarity to create a long consensus sequence called a contig. In terms of complexity and time requirements, de novo assemblies are orders of magnitude slower and more memory-intensive than mapping-based assemblies. This is mostly due to the fact that the de novo assembly algorithm needs to compare every read with every other read, an operation with a naive time complexity of O(n²), where n is the number of reads.
A typical problem with short reads in de novo assembly is that they can sometimes align equally well to multiple locations in the genome; the longer the read, the easier it is to find its position. Paired-end reads reduce this issue to a certain extent, since the two reads of a pair are a known distance apart, which is used to validate their alignment positions. It is therefore crucial to remove reads that are too short before performing the alignment, as misaligned reads will lead to false-positive variant calls.
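The all-against-all comparison behind de novo assembly can be sketched with a naive greedy overlap merger. This is a teaching example under strong assumptions (error-free reads, exact suffix-prefix overlaps, one strand only, an arbitrary minimum overlap of 3 bases); production assemblers use far more elaborate graph-based methods.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_length: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap into a contig."""
    reads = list(reads)
    while len(reads) > 1:
        best_length, best_i, best_j = 0, None, None
        # Every read is compared with every other read: quadratic in the read count.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_length:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGCGT", "GCGTAC", "GTACGA"]))  # ['ATGCGTACGA']
```

With a short minimum overlap, a read that fits equally well in several places can be merged at the wrong one, which mirrors the ambiguity of short reads described above.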
---- Variant calling
After reads have been aligned and processed, the next step in the pipeline is to identify differences between the selected reference genome and the newly sequenced reads. In short, the aim of variant calling is to identify polymorphic sites where nucleotides differ from the reference. There are multiple tools for variant calling. The outcome of the variant calling step is a VCF file.
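A heavily simplified picture of variant calling is a pileup: stack the aligned reads over the reference and report positions where most reads disagree with it. The sketch below assumes gap-free alignments given as (start position, sequence) pairs with 0-based coordinates, and uses arbitrary depth and allele-fraction thresholds; real callers model base qualities, indels and genotypes.

```python
from collections import Counter


def call_variants(reference: str, aligned_reads: list[tuple[int, str]],
                  min_depth: int = 3, min_fraction: float = 0.8):
    """Return (position, reference base, alternative base, depth) for simple substitutions."""
    pileup = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, support = counts.most_common(1)[0]
        if base != reference[position] and support / depth >= min_fraction:
            variants.append((position, reference[position], base, depth))
    return variants


reference = "ACGTACGT"
reads = [(0, "ACGAAC"), (1, "CGAACG"), (2, "GAACGT")]  # hypothetical aligned reads
print(call_variants(reference, reads))  # [(3, 'T', 'A', 3)]
```

Note that VCF files report positions starting from 1, so the variant at 0-based position 3 would be written with position 4.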
3.2.3. Tertiary analysis
This step addresses the important issue of making sense of the observed data. In the human genetics context, that means finding the fundamental link between variant data and the phenotype observed in a patient. Tertiary analysis begins with variant annotation, which adds additional information to the variants detected in the previous steps. Then variant interpretation is done, which in the context of human genetics is mostly performed by a qualified individual such as a clinical geneticist and/or genetic counsellor. At the end of the interpretation process, a variant will be classified as pathogenic or benign for an individual and their phenotype. Variants may also be classified as a variant of unknown significance (VUS), which means that there is currently not enough evidence available to classify the variant as pathogenic or benign. As more evidence is gathered and further testing is performed, these classifications may change.
3.3. Further downstream analysis
Depending on the biological context many other types of analysis can be performed on sequence data, e.g. gene expression analyses, various types of visualisations and clustering of data.
4. Genetic Big Data
4.1. Big Data in Genomics
4.1.1. Why is sequencing expensive?
According to tracking data from the NHGRI (National Human Genome Research Institute), the cost of determining a million bases of DNA sequence at a certain quality level as well as the cost of sequencing a human-sized genome has come down substantially over the past quarter-century.
The estimated cost of producing a 'finished' human genome was around $150 million in 2003. By 2006, this had dropped to between $20-25 million due to refinements to fundamental methodologies. Next-generation sequencing technologies have since further reduced costs to around $1,000.
However, there are several variables – like genome size, the process involved (shotgun, whole-genome, whole-exome sequence), data quality, sequence finishing, etc. – that determine the final cost of sequencing a genome, in addition to indirect costs pertaining to equipment, labour, consumables, etc.
Despite the widely circulated $1,000 benchmark, the cost of genome sequencing in routine healthcare is still disproportionately excessive. One microcosting study calculated the genome sequencing cost per cancer case at £6,841, and per rare disease case at £7,050. In routine clinical care, the study found, consumables accounted for as much as 72% of the total cost of sequencing.
4.1.2. How big is this data?
Over the past decade, the rapid expansion and increasing sophistication of sequencing technologies have resulted in exponential growth in the rate at which sequence data is generated. One projection from 2015, based on the observation that sequencing capacity doubles every seven months, estimated that sequence data generation would approach one zettabase per year by 2025.
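To get a feel for what a seven-month doubling time implies, the growth factor over any period can be computed directly. The sketch below only illustrates the arithmetic of exponential doubling; it does not reproduce the cited projection, and the function name is chosen for this example.

```python
def growth_factor(months: float, doubling_time_months: float = 7.0) -> float:
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2 ** (months / doubling_time_months)


print(growth_factor(7))    # 2.0
print(growth_factor(14))   # 4.0
print(f"{growth_factor(120):,.0f}")  # growth over ten years, on the order of 10^5
```

Sustained over the ten years from 2015 to 2025, a seven-month doubling time corresponds to an increase of roughly five orders of magnitude.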
SOURCE: PLOS Biology
Today, emerging techniques promise to assemble entire genomes on a standard laptop at a rate that is a hundred times faster than current state-of-the-art approaches.
However, volume and velocity are just two dimensions of the big data challenge currently facing genomics. Data heterogeneity in modern bioinformatics research covers a vast range of diverse data types spanning multiple omics disciplines (chemogenomics, metagenomics, pharmacogenomics, toxicogenomics), diverse datasets (epigenomics, transcriptomics, proteomics, metabolomics), and a variety of sources (electronic health records, medical imaging data, social networks, wearables, etc.).
4.1.3. Interpreting omics big data
The rapid growth in large-volume, multidimensional, heterogeneous genomics datasets opens up a whole new realm of opportunities in the more comprehensive interpretation of omics big data. As high-throughput technologies enable the generation of more data from different molecular layers of biological systems, conventional single-layer analysis techniques will be wholly inadequate in comprehensive knowledge extraction. Indeed, it has been proven that integrating multi-dimensional datasets can generate better results both from a statistical as well as a biological perspective.
The challenges of genomic big data notwithstanding, there is a huge opportunity for research to undergo a radical shift from a conventional reductionist approach to a more holistic systems biology approach to understanding the complexities of biological systems.
However, the research ecosystem is still predominantly characterised by a sheer diversity of techniques developed in the era of single-layer analysis that present the challenge of selecting and combining the right tools into one streamlined process – and the need for advanced programming and technical skills to put these composite workflows together. As a result, a key priority is for the development of a conceptually unified framework that would enable the integrative analysis of multi-layer datasets.
4.2. Machine learning in bioinformatics
4.2.1. What is bioinformatics?
Source: North American University
Bioinformatics is a still-evolving subdiscipline of biology, computer science and mathematics, involving the acquisition, storage, analysis, and dissemination of biological data. It encompasses the science of developing and utilising computer databases and algorithms to accelerate and enhance biological research, as well as the analysis of biological information using computers and statistical techniques.
The foundation of modern bioinformatics traces back to the 1960s and the application of computational methods to protein sequence analysis. It subsequently expanded to DNA analysis following advances in molecular biology methods, the rise of powerful computers, and the availability of software suited to bioinformatics tasks.
Today, bioinformatics is essentially a big data discipline that combines sophisticated bioinformatics tools and advanced computational analysis techniques. It has also produced several sub-disciplines, such as synthetic biology, systems biology, and whole-cell modelling.
4.2.2. What is machine learning?
Machine learning (ML) is a form of artificial intelligence (AI) that enables software applications to more accurately identify patterns and relationships and predict outcomes by learning from historical data.
There are four subcategories of ML:
- Supervised learning, where models are trained on labelled datasets that allow them to learn and become more accurate over time. A primary disadvantage of this approach is that the data has to be manually labelled, which is time-consuming and expensive for large-volume data.
- Unsupervised learning, where algorithms look for patterns, trends, and connections in unlabelled data that people aren’t explicitly looking for. A key disadvantage of this approach is that its range of applications is limited.
- Semi-supervised learning, which addresses the shortcomings of the previous two approaches by using training data that combines a small amount of labelled data with a large amount of unlabelled data.
- Reinforcement learning, where models are trained through trial and error towards optimal behaviour based on a reward system.
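To make the difference between the first two categories concrete, here is a minimal sketch in plain Python. It is not taken from any particular library: the toy data points, the labels "low" and "high", and the function names are all hypothetical and chosen only for illustration. The supervised part learns one centroid per label from labelled samples; the unsupervised part (a bare-bones k-means) groups unlabelled samples without ever seeing a label.

```python
from statistics import mean

# Hypothetical toy data: each sample is a 2-D feature vector.
labelled = [
    ((1.0, 1.2), "low"),
    ((0.8, 1.0), "low"),
    ((4.0, 4.2), "high"),
    ((4.2, 3.9), "high"),
]
unlabelled = [(0.9, 1.1), (4.1, 4.0), (1.1, 0.9), (3.9, 4.1)]


def centroid(points):
    """Coordinate-wise mean of a list of 2-D points."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def fit_nearest_centroid(samples):
    """Supervised: learn one centroid per label from labelled samples."""
    by_label = {}
    for point, label in samples:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(model, point):
    """Assign the label whose centroid is closest to the point."""
    return min(model, key=lambda label: sq_dist(model[label], point))


def kmeans(points, k, iterations=10):
    """Unsupervised: group points into k clusters without using labels."""
    centroids = list(points[:k])  # simplistic initialisation: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(centroids[c], p)) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = centroid(members)
    return assignments


model = fit_nearest_centroid(labelled)
clusters = kmeans(unlabelled, k=2)
```

Note that the supervised model returns meaningful labels ("low", "high"), while k-means only returns anonymous cluster indices that still need to be interpreted by a person.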
4.2.3. How is machine learning used in bioinformatics?
The exponential increase of omics data has made machine learning imperative for the effective and efficient analysis of genomic big data. ML techniques are increasingly being applied across the spectrum of biological research, including gene prediction, protein structure analysis, biomarker analysis for disease research, microarray examination and metabolic pathway determination.
In the genomics domain, a combination of ML and NLP (Natural Language Processing) has enabled the quick and accurate analysis of large amounts of data for relation extraction and named entity recognition. ML applications have also been used successfully to predict and classify gene expression, protein structures, and mutations.
4.2.4. How important is machine learning in overcoming the challenges of sequencing analysis?
ML algorithms have been used to predict gene essentiality and identify the minimal genes required for the survival of an organism. Even for organisms with limited data for ML models, an approach combining an unsupervised feature selection technique, a dimension reduction algorithm, and a semi-supervised ML algorithm with a very limited labelled dataset has proven to be effective in predicting essential and non-essential genes from genome-scale metabolic networks.
In molecular evolution research, ML methods have helped analyze increasingly massive sets of sequence and other omics data. ML techniques, including convolutional neural networks (CNNs), hierarchical clustering, and K-means clustering, have been used together with conventional proteomic methods to predict and analyze post-translational modifications.
ML is also facilitating a systems biology approach to modelling and analysis by enabling the integration of diverse data types into established biological networks and combining different systems biology approaches to investigate multi-omics datasets.
In certain applications, like skin cancer or breast cancer detection, the diagnostic performance of deep learning models has been demonstrated to be on par with that of healthcare professionals. One deep learning model, in particular, has even outperformed full-time breast-imaging specialists in mammogram classification, with a 14% average increase in sensitivity.
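For readers less familiar with the metric, sensitivity is the share of actual positive cases that a reader or model correctly flags, i.e. true positives divided by the sum of true positives and false negatives. The short sketch below shows the calculation; the counts are hypothetical and are not taken from the study mentioned above, and the text does not specify whether the reported 14% is an absolute or a relative difference, so both are computed.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual positive cases that were correctly detected."""
    total_positives = true_positives + false_negatives
    if total_positives == 0:
        raise ValueError("sensitivity is undefined without positive cases")
    return true_positives / total_positives


# Hypothetical counts for 100 cancer-positive mammograms (illustration only).
reader_sensitivity = sensitivity(true_positives=70, false_negatives=30)
model_sensitivity = sensitivity(true_positives=84, false_negatives=16)

# An "increase in sensitivity" can be read in two ways:
absolute_gain = model_sensitivity - reader_sensitivity  # percentage points
relative_gain = absolute_gain / reader_sensitivity      # relative to the reader
```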
NLP solution by BioStrand
The BioStrand solution was engineered from the ground up to fully leverage the potential of AI/ML technologies to accelerate bioinformatics innovation and augment data-driven decision-making in multi-omics research. Our HYFTs™ IP provides a universal framework for researchers to quickly and effortlessly integrate a wide range of biological, medical and clinical data and metadata, including sequence data across species, domain & regulation, patient records, clinical trials, ICD codes, lab tests, etc.
With BioStrand NLP Link, a biomedical domain-specific NLP solution, researchers can integrate textual data from various sources such as sequence descriptions, annotations, scientific/medical literature, healthcare data, electronic patient records etc. The BioStrand NLP solution can quickly and accurately mine vast volumes of biomedical texts for research-relevant information to augment hypothesis generation and broaden the scope of discovery.
The BioStrand solution combines a completely automated approach to the normalization and integration of diverse types of biological, medical and clinical data into a unified repository with best-in-class AI/ML-based analytical tools and techniques to maximize the accuracy and efficiency of integrated multi-omics research.
4.3. Cloud Computing in Bioinformatics
4.3.1. Cloud computing enabled big multi-omics data analysis
The Cancer Genome Atlas (TCGA), one of the world’s largest and richest collections of genomic data, is a comprehensive and coordinated effort to accelerate the understanding of the molecular basis of around 200 types of cancer. Its 2.5 petabytes of publicly available data use 7 different data types to describe 33 different tumour types, including 10 rare cancers, based on paired tumour and normal tissue sets from 11,000 patients. It is, in short, a critical resource for driving cutting-edge cancer studies across the globe.
However, the biggest challenge is making these huge volumes of diverse data accessible to researchers from around the world and providing them with the relevant bioinformatics, mathematical, and collaborative tools, as well as the computing power to analyze the data.
Scalable, cloud-based data science infrastructure like the NCI Cancer Research Data Commons (CRDC) addresses this challenge by bringing together data and computational power to accelerate cancer research and discovery. The cloud model upends the traditional model of downloading and storing large datasets to be analyzed with on-site tools. Instead, these cloud-based platforms give researchers web-based access to all relevant data in the cloud, together with best-practice tools and pipelines, on-demand computational capacity and collaborative functionality.
Cloud-based multi-omics platforms represent the optimal paradigm for multi-omics analysis given the exponential increase in the volume, variety, and velocity of biomedical data and the corresponding shift to a multi-omics systems biology approach to data analysis. By offering researchers easy on-demand access to a shared pool of data storage and compute resources, these platforms will increasingly play a central role in accelerating time-to-knowledge while ensuring enhanced performance, granular cost management and seamless collaboration.
4.3.2. Biomedical and multi-omics data sources
A cloud-based approach also takes biological research a step closer to the ideal model of integrated multi-omics analysis, where multiple data sets are normalized and integrated into a common data structure and analyzed in parallel with a unified/universal analytics framework. The big advantage of this integrated model is that it will allow researchers to assess the flow of information from one omics level to the other and thus help in bridging the gap from genotype to phenotype.
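As a rough illustration of what "normalized and integrated into a common data structure" can mean in practice, the sketch below rescales two omics layers to a common scale (z-scores) and merges them into one record per sample. It is a deliberately simplified assumption-laden example: the sample IDs, values, layer names, and function names are hypothetical, each layer holds a single measurement per sample, and real pipelines would use layer-specific normalization methods rather than a plain z-score.

```python
from statistics import mean, pstdev

# Hypothetical measurements keyed by sample ID (illustrative values only).
transcriptomics = {"S1": 120.0, "S2": 80.0, "S3": 100.0}
proteomics = {"S1": 2.0, "S2": 1.0, "S3": 3.0, "S4": 2.5}


def zscore(layer):
    """Rescale one omics layer to mean 0 and standard deviation 1."""
    values = list(layer.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {sample: 0.0 for sample in layer}
    return {sample: (value - mu) / sigma for sample, value in layer.items()}


def integrate(layers):
    """Merge normalized layers into one record per sample present in every layer."""
    normalized = {name: zscore(layer) for name, layer in layers.items()}
    shared = set.intersection(*(set(layer) for layer in normalized.values()))
    return {
        sample: {name: normalized[name][sample] for name in normalized}
        for sample in sorted(shared)
    }


integrated = integrate({"rna": transcriptomics, "protein": proteomics})
```

Here sample S4 is dropped because it has no transcriptomics measurement; whether to drop, impute, or keep such partial samples is a design decision that depends on the downstream analysis.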
It is currently quite difficult, however, to consistently execute the integrated multi-omics model across the board given the predominance of domain-specific data sources, the manual effort required to normalize and integrate diverse datasets and the lack of universal analytics frameworks.
Some examples of these data sources include:
- The Cancer Genome Atlas (TCGA): RNA-Seq, DNA-Seq, miRNA-Seq, SNV, CNV, DNA methylation, and RPPA
- Clinical Proteomic Tumor Analysis Consortium (CPTAC): Proteomics data corresponding to TCGA cohorts
- International Cancer Genomics Consortium (ICGC): Whole genome sequencing, genomic variations data (somatic and germline mutation)
- Cancer Cell Line Encyclopedia (CCLE), cancer cell lines: Gene expression, copy number, and sequencing data; pharmacological profiles of 24 anticancer drugs
- Molecular Taxonomy of Breast Cancer International Consortium (METABRIC): Clinical traits, gene expression, SNP, and CNV
- Gene expression, miRNA expression, copy number, and sequencing data
- Omics Discovery Index, consolidated data sets from 11 repositories in a uniform framework: Genomics, transcriptomics, proteomics, and metabolomics
4.3.3. Cloud-based omics challenges
However, cloud-based multi-omics is not without its challenges. For instance, some of the common challenges associated with cloud computing, such as data privacy and security, will be especially amplified given all the sensitive clinical data involved. The other challenge is to create a standardized approach to data normalization and integration across multiple public and proprietary data repositories. And finally, there is the challenge of ensuring interoperability and portability across multiple platforms to ensure the seamless and secure flow of biological data.
5. Protein structure prediction
Protein structure prediction is another challenging task. Predicting a protein’s three-dimensional structure from its amino acid sequence remains an unsolved problem after several decades of effort. Almost all structure prediction relies on the fact that, for two homologous proteins, structure is more conserved than sequence: if two protein sequences are similar, the two proteins are likely to have very similar structures. If the query protein has a homolog of known structure, the task is relatively easy and high-resolution models can often be built by copying and refining the framework of the solved structure. However, a template-based modeling procedure does not help answer the questions of how and why a protein adopts its specific structure.
In particular, if structural homologs do not exist, or exist but cannot be identified, models have to be constructed from scratch. This procedure, called ab initio modeling, is essential for a complete solution to the protein structure prediction problem; it can also help us understand the physicochemical principles of how proteins fold in nature. Currently, the accuracy of ab initio modeling is low and success is generally limited to small proteins (<120 residues). In the majority of models, less than 20% of the predicted structure is correct; even in the best cases, at most 40% of the residues are modelled accurately.
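For the template-based case described above, the first step is to find a protein of known structure whose sequence is similar to the query. The sketch below illustrates that idea with a basic Needleman-Wunsch global alignment and a sequence-identity score. It is a simplified illustration, not how production tools work: the scoring values (match 1, mismatch -1, gap -1), the 30% identity cut-off, the template names and sequences, and the function names are all assumptions made for this example, and real pipelines typically rely on substitution matrices and profile-based searches instead.

```python
def global_alignment_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Align two sequences globally (Needleman-Wunsch) and return the
    fraction of alignment columns in which both residues are identical."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner, counting identical columns.
    i, j = len(a), len(b)
    identical = columns = 0
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            if a[i - 1] == b[j - 1]:
                identical += 1
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return identical / columns if columns else 0.0


def pick_template(query, templates, min_identity=0.3):
    """Return (name, identity) of the most similar template, or (None, best
    identity) if nothing reaches the assumed identity cut-off."""
    best_name, best_identity = None, 0.0
    for name, sequence in templates.items():
        identity = global_alignment_identity(query, sequence)
        if identity > best_identity:
            best_name, best_identity = name, identity
    if best_identity >= min_identity:
        return best_name, best_identity
    return None, best_identity


# Hypothetical template library: names and sequences are made up.
templates = {"template_A": "MKTAYIAKQR", "template_B": "GGGGGGGGGG"}
hit = pick_template("MKTAYIAKQR", templates)
miss = pick_template("WWWW", templates)
```

When `pick_template` returns no template, the query falls into the ab initio situation, where no usable homolog is available and the model has to be built from scratch.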