It's all about NLP





The future is here.

The roots of modern NLP systems can be traced back to the concept of “Language as a Science” developed in the early 1900s that started with linguistics and later expanded to computers and other fields. NLP came to the fore after World War II based on the need for a machine that could automatically translate from one language to another. Then in 1950, interest in NLP began in earnest with the publishing of Alan Turing’s seminal paper “Computing Machinery and Intelligence.” The paper proposed that a computer that could converse with human beings without them realizing they were talking to a machine could be considered intelligent. 


Today, NLP is everywhere. Though digital voice assistants are the most ubiquitous real-world application of NLP, the concept itself encompasses both speech and text and is used in a variety of applications including search, email spam filtering, online translation, grammar- and spell-checking, and more.  

1. Natural Language Processing



1.1. What is Natural Language Processing?


Natural-language processing (NLP) refers to the automated computational processing technologies that convert natural language text or audio speech into encoded, structured information, based on an appropriate ontology. In the context of biomedical literature, the structured information can then be used to classify a body of textual information, as in “related to laparoscopic cholecystectomy,” or to extract more refined insights such as participants, procedures, findings, etc. 



1.2. How does NLP work?



SOURCE: AI Multiple

In simple terms, NLP works through machine learning (ML) systems that store words and information on the ways they relate to each other. ML engines use grammatical rules and real-world linguistic patterns to process words, phrases and sentences and extract meaning, context and intent. However, NLP derives from a broad range of techniques for interpreting human language, including rules-based and algorithmic approaches, statistical and machine learning methods, and deep learning and neural networks. 



2. NLP Tasks


All NLP tasks can be framed as generation tasks, with three main categories: classification, unconditional generation and conditional generation.[4] [5]


Here’s a brief overview of some of the tasks from within each of those categories. 


2.1. Speech Recognition



Source: Researchgate


Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is the ability of a program to process human speech and convert it into readable text. 


ASR systems can be classified based on utterances (connected words, continuous speech etc.), speakers (speaker independent, speaker adaptive etc.), or even vocabulary (small to very large vocabulary). 


Speech recognition systems typically follow four steps: analyze the audio, break it into parts, digitize it into a computer-readable format, and use an algorithm to match it to the most suitable text representation.


2.2. Part of Speech Tagging



Source: freeCodeCamp


Part of speech (POS) describes the grammatical function of a word. Eight parts of speech are typically cited as relevant to NLP: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article. POS tagging, also known as grammatical tagging, is the process of automatically assigning POS tags to the words in a sentence.


Most POS tagging approaches fall under one of three categories: rule-based, stochastic, or transformation-based tagging.


Rule-based POS tagging relies on a dictionary or lexicon to generate tags with additional hand-written rules used to identify the correct tag from a set of possible tags for a word. 


Stochastic POS tagging refers to any model that uses frequency or probability statistics to determine the most appropriate tag for a word, based on how often that tag occurs with the word in a training corpus.


Transformation-based tagging is based on transformation-based learning and incorporates features from both previous approaches. As in rule-based tagging, it relies on rules that associate tags with words, and as in stochastic tagging, it applies ML techniques to automatically deduce those rules from data.
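To make the stochastic approach more concrete, here is a minimal sketch of a most-frequent-tag baseline in plain Python. It is only one simple instance of frequency-based tagging, not a production tagger. The tiny training corpus, the tag names, and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (hypothetical data, for illustration only).
TRAINING_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("walk", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("dogs", "NOUN"), ("walk", "VERB"), ("fast", "ADV")],
    [("we", "PRON"), ("walk", "VERB"), ("the", "DET"), ("dog", "NOUN")],
]


def train_most_frequent_tag(corpus):
    """Count how often each tag occurs with each word and keep the top tag."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default_tag="NOUN"):
    """Assign each word its most frequent training tag; unknown words get a default."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


MODEL = train_most_frequent_tag(TRAINING_CORPUS)
TAGGED = tag(["The", "dogs", "walk"], MODEL)
print(TAGGED)
```

In this toy corpus "walk" appears twice as a verb and once as a noun, so the baseline always tags it as a verb. Real stochastic taggers also take the surrounding tags into account.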


2.3. Word Sense Disambiguation



Source: Springer


Word sense disambiguation (WSD) is the process of finding the meaning of a word that is most suitable to the context.  WSD seeks to resolve one of the most pervasive linguistic challenges in NLP – polysemous words or words that have multiple related meanings. The primary purpose of WSD, therefore, is to clarify the contextually appropriate meaning for polysemous words. 


There are four main ways to implement WSD: dictionary- and knowledge-based, supervised, semi-supervised, and unsupervised methods.
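As an illustration of the dictionary-based family, the sketch below follows the idea behind the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the surrounding sentence. The two glosses for "bank", the stopword list, and the function names are hypothetical stand-ins for a real lexical resource.

```python
import re

# Toy sense inventory (hypothetical glosses, not taken from a real dictionary).
SENSES = {
    "bank": {
        "financial_institution": "an institution that accepts deposits and lends money",
        "river_side": "the sloping land beside a river or lake",
    }
}

STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "by", "at"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence context."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:  # ties keep the first sense listed
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "She sat on the bank of the river"))
print(simplified_lesk("bank", "The bank lends money to small firms"))
```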


2.4. Named Entity Recognition



Source: Analytics Vidhya


Named Entity Recognition (NER) is an NLP technique for identifying and extracting essential entities from text-based documents. Typical essential entities include names of people, locations, organisations, monetary values, etc. However, named entities can cover a wide range of categories, such as unit, type, quantity, occupation, ethnicity, etc., and depend on the NLP requirement.
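The simplest way to see what NER output looks like is a dictionary-and-pattern sketch like the one below. It is a deliberately naive rule-based stand-in, not a learned model: the gazetteer entries, the money pattern, and the label names are assumptions chosen for the example.

```python
import re

# Tiny hand-made gazetteer (hypothetical entries, for illustration only).
GAZETTEER = {
    "PERSON": ["Alan Turing"],
    "ORG": ["MIT"],
    "LOCATION": ["Alberta"],
}

# Matches dollar amounts such as $5, $1,500 or $2.50.
MONEY_PATTERN = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((match.start(), match.group(), label))
    for match in MONEY_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "MONEY"))
    return [(entity, label) for _, entity, label in sorted(found)]


ENTITIES = find_entities("MIT paid $1,500 for a talk on Alan Turing in Alberta.")
print(ENTITIES)
```

A lookup like this only finds names it already knows, which is one reason statistical and neural approaches are used for open-ended text.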


Biomedical named entity recognition (BioNER) is a specialised field dealing with the extraction of biomedical entities from scientific texts. The automated and accurate identification of entities from a rapidly growing library of literature can streamline and accelerate several downstream tasks in biomedical research. 


Deep learning methods and end-to-end neural networks have been used quite successfully to automatically extract relevant features from biomedical text. The deep learning methods typically used for NER are classified into four categories: single neural network-based, multitask learning-based, transfer learning-based, and hybrid model-based methods. These methods can be used for BioNER applications across multiple domains. 


2.5. Coreference Resolution



Source: KDnuggets


Coreference resolution (CR) is the task of finding all linguistic expressions, or mentions, in a given text that refer to the same real-world entity. Once all the mentions have been identified and grouped, they can be replaced with words that provide context. For instance, a sentence could introduce a person by name and then subsequently use pronouns to refer to the same person. CR determines if two mentions refer to the same discourse entity in the discourse model.


2.6. Sentiment Analysis



Source: MonkeyLearn



Sentiment analysis is an analytical technique that determines the emotional meaning of communications. It identifies whether a message is positive, negative or neutral and interprets what people are feeling via their language.


Some of the most popular methods of sentiment analysis include standard, fine-grained and aspect-based sentiment analysis. The standard approach provides a broad interpretation of the overall tone of communication. The fine-grained model accounts for a more elaborate range of polarity. And the aspect-based approach delivers a more precise interpretation of sentiment based on particular attributes and components. 
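The difference between the standard and fine-grained approaches can be sketched with a small lexicon-based scorer. This is only one simple way to do it, and the word scores, thresholds, and function names below are invented for the example.

```python
import re

# Toy polarity lexicon (hypothetical scores, for illustration only).
LEXICON = {"good": 1, "great": 2, "excellent": 3, "bad": -1, "poor": -2, "terrible": -3}


def sentiment_score(text):
    """Sum the lexicon scores of all words in the text (unknown words count as 0)."""
    return sum(LEXICON.get(word, 0) for word in re.findall(r"[a-z']+", text.lower()))


def standard_label(text):
    """Standard analysis: a three-way positive / negative / neutral decision."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def fine_grained_label(text):
    """Fine-grained analysis: a wider polarity scale built from the same score."""
    score = sentiment_score(text)
    if score >= 3:
        return "very positive"
    if score > 0:
        return "positive"
    if score == 0:
        return "neutral"
    if score > -3:
        return "negative"
    return "very negative"


print(standard_label("The service was great"))
print(fine_grained_label("The food was terrible"))
```

An aspect-based version would apply the same kind of scoring separately to the text around each attribute of interest, for example "service" and "food".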


2.7. Natural Language Generation



Source: TechTarget


Natural language generation (NLG) is the process of transforming data into natural language. The process involves applying statistical techniques to analyse large datasets of structured information to generate natural-sounding sentences. NLG systems can automatically turn numbers in a spreadsheet into data-driven narratives or even generate entire articles or responses. 


NLG is a six-stage process that includes content analysis, data understanding, document structuring, sentence aggregation, grammatical structuring, and language presentation. 
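To show the basic data-to-text idea, here is a minimal template-based sketch that turns spreadsheet-like rows into sentences. Templates are a much simpler technique than the statistical methods described above and cover only the final stages of the process. The column names, figures, and wording are hypothetical.

```python
# Spreadsheet-like rows (hypothetical figures, for illustration only).
ROWS = [
    {"region": "North", "sales_2022": 120, "sales_2023": 150},
    {"region": "South", "sales_2022": 200, "sales_2023": 180},
]


def describe_row(row):
    """Turn one row of numbers into a short narrative sentence."""
    change = row["sales_2023"] - row["sales_2022"]
    if change == 0:
        return f"Sales in the {row['region']} region were unchanged at {row['sales_2023']}."
    verb = "rose" if change > 0 else "fell"
    percent = abs(change) / row["sales_2022"] * 100
    return f"Sales in the {row['region']} region {verb} by {percent:.0f}% to {row['sales_2023']}."


REPORT = [describe_row(row) for row in ROWS]
for sentence in REPORT:
    print(sentence)
```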


2.8. Relation Extraction



Source: Open Data Science


Relation Extraction (RE) is the task of predicting attributes and relations for entities in a sentence. It is a key component for building relation knowledge graphs and is used in several NLP applications including structured search, sentiment analysis, question answering, and summarization. End-to-end relation extraction can help identify named entities and extract relations between them.


For instance, text mining is increasingly being used in the biomedical domain to automatically organise information from large volumes of scientific literature. In this context, relation extraction aims to identify designated relations among biological entities in literature. RE can also facilitate the extraction of semantic relations between different biomedical entities such as protein and protein, gene and protein, drug and drug, and drug and disease. 
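At its simplest, relation extraction can be sketched as pattern matching that turns sentences into (subject, relation, object) triples. The example below is a naive, pattern-based illustration rather than an end-to-end model. The two patterns, the relation names, and the example sentences are assumptions made for the sketch.

```python
import re

# Hand-written surface patterns (hypothetical, for illustration only).
RELATION_PATTERNS = {
    "INHIBITS": re.compile(r"(\w+) inhibits (\w+)"),
    "INTERACTS_WITH": re.compile(r"(\w+) interacts with (\w+)"),
}


def extract_relations(text):
    """Return (subject, relation, object) triples matched by the patterns."""
    triples = []
    for relation, pattern in RELATION_PATTERNS.items():
        for subject, obj in pattern.findall(text):
            triples.append((subject, relation, obj))
    return triples


TRIPLES = extract_relations("Aspirin inhibits COX1. BRCA1 interacts with RAD51.")
print(TRIPLES)
```

Triples in this form are what would typically be loaded into a knowledge graph as edges between entity nodes.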

3. NLP in Real Life


3.1. NLP Use Cases in Life Sciences


In the biomedical and life sciences domain, NLP opens up access to data sources, like scientific journals and medical/clinical data, that were previously incompatible with conventional data analytics frameworks. As a result, it has catalysed a gamut of real-world applications in novel drug target intelligence, biomarker discovery, safety case processing, clinical trial analytics, medical affairs insights and clinical documentation improvement. 


Here’s a brief overview of how NLP is being leveraged in the life sciences.  


Clinical natural language processing

Though clinical records have predominantly moved on from paper to codify valuable information in EHRs, much of the information related to real-world clinical practices still appears as unstructured narrative free text. This has given rise to a specialised research field called clinical natural language processing (cNLP) to explore clinically relevant information contained in EHRs. cNLP systems have transformed the scope and scale of the utilisation of unstructured free-text information in EHRs, thereby providing valuable insights into clinical populations, epidemiology trends, patient management, pharmacovigilance, and optimisation of hospital resources.


Competitive intelligence from patent literature

Pharma R&D can benefit significantly by extracting competitive intelligence from publicly available patent literature. One pharma major leveraged NLP text mining to automatically extract information related to four main entities, from three major patent registries, and update this data every week. 


Disease diagnosis with NLP

Most health systems track patients using codes that are primarily created for billing and are therefore not particularly useful for clinical care or research. This creates a huge challenge in identifying patients with complex conditions and in studying the disease, tracking practice patterns, and managing population health. However, researchers at a large US healthcare provider were able to train an NLP model to automatically sort through over a million EMRs to identify abbreviations, words and phrases associated with aortic stenosis. In a matter of minutes, the NLP algorithms were able to identify nearly 54,000 patients with specific conditions. 


NLP for patient stratification

A leading biopharma company was able to focus the patient stratification process by using NLP on EMR and imaging data. By capturing data on 40 different elements related to a number of variables, including demographics, clinical outcomes, clinical phenotypes, etc., researchers were able to identify four patient groups with substantial differences in one- and two-year mortality and one-year hospitalisations. By using insights from the NLP-based analysis the company was able to improve clinical trial design, identify unmet needs, and develop better therapeutics.


NLP advances precision medicine research

Precision medicine transfers the focus of treatment from the average patient to the individual patient. However, developing a personalised medical approach requires substantial volumes of disparate data that need to be analysed in a multi-scale context. A top medical school in the US has adopted NLP tools to pull key information regarding diagnoses, treatments, and outcomes from EHRs.


3.2. BioNLP Case Studies


NLP-ML system for analyzing clinical notes

During the pandemic, the healthcare authority in the Canadian province of Alberta launched a free telehealth service that allowed patients and caregivers to speak directly to rehabilitation clinicians and professionals about the impact of the pandemic on chronic musculoskeletal, neurological, and other conditions. The service was designed to provide callers with assistance regarding services available in their location, condition-specific exercises, self-management advice, etc. For every call, clinical notes containing detailed patient information were entered into an online charting platform. Apart from patient demographics, these call notes consisted of several layers of unstructured data: the patient’s history of diagnoses, medications, and existing symptoms; details of the ensuing discussion, including causes, the over-the-phone assessment, and the action taken by the advisor; and, finally, details of the advice and service referrals provided to the patient.


An NLP-ML system was designed for the automated pre-processing of these clinical notes and for modelling and analyzing the collected data. Preliminary results have shown that the NLP system was capable of accurately identifying salient keywords within the clinical notes.


NLP for rapid response to emergent diseases

Conventional bioinformatics largely relies on structured data and preexisting knowledge models. But this approach does not work in the context of novel diseases with no preexisting knowledge models. COVID-19 presented an opportunity to test the hypothesis that NLP technologies can enable the conversion of unstructured text to novel knowledge models. Researchers designed a study to evaluate the value that information from clinical text could add to the response to an emergent disease. 


The focus was on COVID-19 infections in high blood pressure patients and the effect of long-term treatment with calcium channel blockers on outcomes. The study used two sources of information: one where data was solely from structured EHRs and the other on data from structured EHRs and text mining.


According to the results of the study, text mining increased statistical power enough to change a negative result to a positive one. When compared to the baseline study with structured data, the NLP study saw a steep increase in the number of patients available for inclusion, the amount of available information on medications and the amount of additional phenotypic information. The conclusion was that supplementing conventional structured data approaches with information from the NLP pipeline would increase the sample size enough to reveal treatment effects that were not previously statistically detectable.


NLP to detect virus mutations

Viral escape is the ability of viruses to mutate, thereby not only evading neutralizing antibodies but also impeding vaccine development. Researchers at MIT have now developed a novel approach to model viral escape based on models originally developed to analyze language.


The core idea is that the immune system interprets a virus in the same way humans interpret a sentence. The team used the linguistic concepts of grammar and semantics to interpret the characteristics and mutations of a virus. In linguistic terms, the grammatical correctness of a virus determines its evolutionary ability to infect a host. Similarly, mutations are analogous to semantics in that a virus that has altered its surface proteins to become invisible to antibodies is said to have altered its meaning. So, in essence, a successful virus is one that can change semantically without compromising grammatical correctness.


The research team trained an NLP model on thousands of genetic sequences taken from three viruses: influenza, HIV, and SARS-CoV-2. This model was then used to predict the likelihood of sequences generating escape mutations.


4. NLP Tools & Techniques


4.1. Named Entity Recognition (NER)


Google BERT


BERT (Bidirectional Encoder Representations from Transformers) is an open-sourced neural network-based technique for NLP pre-training that enables anyone to train their own state-of-the-art question answering system. BERT models are able to understand the intent behind Google search queries by seeing a word in the context of the words preceding and following it. The BERT model can be fine-tuned to facilitate state-of-the-art NER. There are also BERT variations, like SpanBERTa for NER.


4.2. Tokenization



NLTK (Natural Language Toolkit) is the go-to API for NLP with Python. It enables the pre-processing of text data and helps convert text into numbers for further analysis with ML models. Common approaches to word tokenization include white space tokenization, dictionary-based tokenization, rule-based tokenization, regular expression tokenization, Penn Treebank tokenization, spaCy tokenization, Moses tokenization, and subword tokenization, several of which are available in NLTK.



TextBlob is a Python (2 and 3) library for processing textual data. A simple API enables users to easily access its methods and perform basic NLP tasks. 



spaCy is a free, open-source library for advanced NLP in Python. Apart from tokenization, spaCy features a range of capabilities related to linguistic concepts as well as to general machine learning functionality.



Gensim is a library for unsupervised topic modelling that also contains a tokenizer.



Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. Keras’ tokenizer class is used for vectorizing a text corpus. 
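To illustrate two of the tokenization approaches mentioned in this section, white space and regular expression tokenization, here is a minimal sketch that uses only the Python standard library rather than any of the libraries above. The function names and the example sentence are made up for the illustration.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of white space; punctuation stays attached to words."""
    return text.split()


def regex_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


SENTENCE = "Hello, world! NLP is fun."
print(whitespace_tokenize(SENTENCE))
print(regex_tokenize(SENTENCE))
```

The white space version keeps "Hello," as one token, while the regular expression version separates the comma. That difference matters when tokens are later counted or looked up in a vocabulary.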


4.3. Stemming and Lemmatization

Stemming and lemmatization are NLP algorithms used to normalise text and prepare words and documents for further ML processing. Stemming strips suffixes from related words to produce a stem that is common to all of them, which lets NLP models treat those words as related. Lemmatization goes a step further: it groups the different inflected forms of a word together so that they can be analysed as a single unit, the lemma.
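The contrast between the two can be sketched with a naive suffix stripper and a small lookup table. Neither is a real algorithm such as the Porter stemmer. The suffix list, the lemma table, and the function names are assumptions for the example.

```python
# Suffixes are tried in this order; the list is illustrative, not exhaustive.
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny lemma lookup table (hypothetical entries, for illustration only).
LEMMA_TABLE = {"ran": "run", "running": "run", "runs": "run", "mice": "mouse"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for example in ["connected", "connecting", "running", "ran"]:
    print(example, naive_stem(example), naive_lemmatize(example))
```

Here the stemmer maps "connected" and "connecting" to the same stem but turns "running" into "runn" and leaves "ran" alone, while the lookup-based lemmatizer maps both "running" and "ran" to "run".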


4.4. Bag of Words

Bag of Words is a simple vectorisation technique involving three operations. The input text is first tokenized, then unique words from the tokenized list are selected and alphabetised to create the vocabulary, and finally the frequency of vocabulary words is used to create a sparse matrix. 
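The three operations can be sketched in a few lines of plain Python. This is a minimal illustration that uses white space tokenization and ordinary lists instead of a true sparse matrix type. The function name and the two example documents are made up.

```python
def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of text documents."""
    # 1. Tokenize each document (lowercased, split on white space).
    tokenized = [doc.lower().split() for doc in documents]
    # 2. Collect the unique words and alphabetise them to form the vocabulary.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    # 3. Count how often each vocabulary word occurs in each document.
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token in tokens:
            row[index[token]] += 1
        matrix.append(row)
    return vocabulary, matrix


DOCS = ["the cat sat", "the cat saw the dog"]
VOCAB, MATRIX = bag_of_words(DOCS)
print(VOCAB)
print(MATRIX)
```

With a realistic vocabulary, most entries in each row are zero, which is why the result is usually stored as a sparse matrix.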


4.5. Sentence Segmentation

Sentence segmentation, also known as sentence tokenization, is the technique of dividing a string of written language into its component sentences. The key difference between segmentation and tokenization is that the former is a more generic approach to splitting the input text while the latter is performed on the basis of pre-defined criteria.
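As a rough sketch, sentence segmentation can be approximated with a single regular expression that splits after sentence-final punctuation when the next sentence starts with a capital letter. This rule is an assumption made for the example and is far simpler than what NLP libraries do.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by white space and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]


print(split_sentences("NLP is everywhere. Is it useful? Yes, it is!"))
print(split_sentences("Dr. Smith arrived."))
```

The second call shows the limits of the rule: the abbreviation "Dr." is wrongly treated as the end of a sentence, which is the kind of case that dedicated segmenters are built to handle.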


