Updated: Jun 29, 2020
NLP (Natural Language Processing) is a subfield of AI that is concerned with understanding and synthesizing the natural language of humans. It can also be understood as the science of making our machines do what we desire by simply commanding them in our own language.
We all know that computers can only work with machine language, i.e. the language of zeroes and ones, which is very different from our natural language. NLP has therefore proven significant in helping machines decipher and synthesize natural language with something closer to the ease of a human being.
NLP has been an area of focus for a long time. Traditionally, it was based on statistical modeling and rule-based approaches, but since the rise of neural networks and deep learning, NLP-based solutions have started to achieve much better results. A neural network's ability to generalize over the data has been very helpful in creating practical applications.
Applications of NLP
Today, NLP is not only a field of active and accelerated research but is also used to develop solutions that hold major financial significance for firms. Some of its applications are discussed below:
Conversational chatbots have been one of the highlight technologies of this decade. Google Duplex, showcased recently, seemed to come close to passing the Turing test when it made an appointment call for a user without the person on the other end realizing that a virtual assistant had called.
Virtual assistants like Apple’s Siri, Microsoft’s Cortana, Google Assistant, and Amazon’s Alexa help users with day-to-day tasks such as setting reminders and alarms, playing music on demand, and now even controlling a user’s home with the help of devices like Amazon Echo and Google Home.
What makes chatbots so popular is that you can interact with them just as you would interact with any other human. For this, chatbots should be capable of understanding various languages and grammar of different regions. This is a problem that has been solved to a great extent but still is an active research topic to offer much better interaction between the user and the assistant.
Speech recognition is a part of computational linguistics concerned with converting what a speaker is saying into a machine-readable form that can be processed further. A speech recognition system processes the sound signals and tries to figure out the words and phrases that the speaker is actually trying to say. Speech recognition is very important for virtual assistants because it removes the hassle of typing every command to control the assistant. This is a major reason behind the success of devices like Google Home and Amazon Echo.
However, a major barrier to speech recognition is accent, which varies from region to region.
Machine translation is a part of computational linguistics concerned with converting text or speech from one language to another. Machine translation software holds great importance in removing language barriers, making a huge impact on people traveling to other countries.
Though a host of machine translation software is in use worldwide, the quality of translation is often unsatisfactory. In many cases, the translation is close to word for word and ignores grammar. To improve this situation, neural network techniques can be used to achieve translations with proper grammar, as neural networks are well suited to pattern recognition and generalization over datasets.
Text classification means assigning texts to predefined classes. The most familiar example of text classification is the spam detection used by e-mail services. The accuracy is so high that it is rare to find a spam email in the primary inbox.
Text classification is also used in various other applications, such as identifying insincere comments and sentiment analysis. Sentiment analysis is popular on websites like Amazon, where a large volume of feedback is received for each product. Since it’s not feasible to analyze it all manually, the case is treated as a multiclass text classification problem that categorizes each piece of feedback as positive, negative, neutral, or some other class.
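To make the idea of treating sentiment analysis as a classification problem concrete, here is a minimal sketch of a Naive Bayes text classifier written with the Python standard library only. Naive Bayes is a swapped-in classical technique chosen for brevity, not a method named in this article, and the training sentences, labels, and function names (train_naive_bayes, predict) are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count documents per label and word occurrences per label.

    examples: list of (text, label) pairs.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            likelihood = (word_counts[label][token] + 1) / (
                total_words + len(vocabulary)
            )
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for this example.
training_data = [
    ("great product works well", "positive"),
    ("love it great value", "positive"),
    ("terrible quality broke quickly", "negative"),
    ("bad product waste of money", "negative"),
]
model = train_naive_bayes(training_data)
print(predict("great value", model))
print(predict("terrible waste", model))
```

A real system would need far more data and better tokenization, but the structure (learn word statistics per class, then pick the most probable class) is the same.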
Named Entity Recognition (NER)
NER or Named Entity Recognition is the task of identifying named entities in text and classifying them into predefined categories like person, organization, location, time, and date. NER is a method of automatically extracting information from a large amount of unstructured data.
There is a recent trend of digitizing important pieces of paper like receipts, doctor prescriptions, and many more. Most of this exists in the form of unstructured data and requires NER to extract information categorically.
Document summarization is the process of shortening a text document in order to create a summary with the major points of the original document. It is used to create concise versions of reports that contain the gist of the entire original document. This makes it easier for the user to take in more information in less time.
There can be many more applications of NLP, depending on what we want the machine to do. To perform these tasks, however, different types of neural network and deep learning methods are required to train models. Below we highlight the most common and most recent neural networks used by today’s researchers and experts in the field of AI.
Neural Networks Used in NLP
Recurrent Neural Networks (RNN)
RNNs are specially modeled for sequential data. Sequential data means that the inputs will come one after another and the present input will be somehow dependent on the previous data. Time-series data like stock prices and textual data are some of the examples of sequential data. Just as the convolutional networks are able to scale to images with large width and height, and process images of variable size, recurrent networks can scale to much longer sequences than the networks without sequence-based specialization.
RNNs are based on the idea of parameter sharing: the same weights are applied at every time step, which makes it possible to apply the model to examples of different forms (such as length) and to generalize across them. It’s because of this parameter sharing that RNNs are able to generalize effectively over sequential data, where one value or input might depend on the previous value or input. The recurrent connection that carries the hidden state forward acts as a feedback mechanism and can be thought of as a way to preserve experience.
We can understand how RNNs work on the basis of the figure above. As we see, the input x(t-1) at time step t-1 is used to compute y(t-1), and at the same time the hidden state h(t-1) (which was also calculated using x(t-1)) is passed on to calculate h(t), which is then used in calculating y(t). Therefore, the hidden unit h(t) is a function not only of x(t) but also of h(t-1).
Like any other network, RNNs use backpropagation to update the weights during training: a loss function is calculated and its gradients are used to update the weight parameters. Everything is similar to other networks, with one catch. Because each hidden state depends on the previous hidden state, the gradients are propagated backward through the time steps, from the last step to the first (right to left in the unrolled network). This is known as backpropagation through time.
RNNs are mostly used for machine translation, speech recognition, and document summarization. RNNs are also used in some computer vision problems, like activity recognition, where all the previous frames of the video have to be studied.
An RNN’s job is to connect previous information to the present task. It is quite effective when it only has to look back a small number of steps, but when the number of steps it has to look back is large, RNNs run into trouble. This happens due to the problem of vanishing gradients: as the gradient is propagated back through many time steps, it is multiplied by a small factor at each step and shrinks toward zero, so the farthest steps have almost no effect on how the weights are updated.
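The recurrence described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a scalar input, a scalar hidden state, a tanh activation, and hand-picked weights; the names rnn_step, rnn_forward, w_xh, w_hh, and w_hy are hypothetical and not taken from the article.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, w_hy, b_h=0.0, b_y=0.0):
    """One time step of a scalar vanilla RNN.

    h(t) depends on the current input x(t) and on the previous
    hidden state h(t-1); y(t) is computed from h(t).
    """
    h_t = math.tanh(w_xh * x_t + w_hh * h_prev + b_h)
    y_t = w_hy * h_t + b_y
    return h_t, y_t

def rnn_forward(xs, h0=0.0, w_xh=0.5, w_hh=0.8, w_hy=1.0):
    """Run the same step (same weights) over a whole input sequence."""
    h = h0
    ys = []
    for x in xs:
        h, y = rnn_step(x, h, w_xh, w_hh, w_hy)
        ys.append(y)
    return ys, h

# Only the first input is non-zero, yet later outputs are non-zero too,
# because the hidden state carries information forward.
outputs, final_hidden = rnn_forward([1.0, 0.0, 0.0])
print(outputs)
```

Note that the same three weights are reused at every step, which is the parameter sharing discussed earlier, and that the loop works for a sequence of any length.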
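The effect can be seen numerically with a small sketch. It assumes a scalar RNN of the form h(t) = tanh(w_hh * h(t-1) + w_xh * x(t)) with hand-picked weights (all names here are hypothetical), and it tracks how strongly the hidden state at step t still depends on the initial state h(0).

```python
import math

def gradient_through_time(xs, w_hh=0.5, w_xh=1.0):
    """Return |d h(t) / d h(0)| for every step t of a scalar tanh RNN."""
    h = 0.0
    grad = 1.0
    grads = []
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x)
        # Chain rule: d h(t) / d h(t-1) = w_hh * (1 - h(t)^2)
        grad *= w_hh * (1.0 - h * h)
        grads.append(abs(grad))
    return grads

grads = gradient_through_time([0.5] * 30)
print(grads[0], grads[9], grads[29])
```

Each step multiplies the gradient by a factor smaller than one in magnitude (here at most 0.5), so after 30 steps the contribution of the earliest state is negligible. With a recurrent weight well above one the opposite problem, exploding gradients, can appear instead.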
Long Short Term Memory (LSTM)
LSTM networks are specially designed to learn long-term dependencies. They counter the problem of vanishing gradients by remembering information for long periods of time.
All recurrent neural networks have the form of a chain of repeating modules of a neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single ‘tanh’ layer.
LSTMs also have a chain-like structure, but the repeating module has a different structure.
In the above diagram, each arrow shows the transfer of a vector from the output of one node to the inputs of the other nodes. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while lines forking denote content being copied first and then going to different locations.
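As a rough companion to the diagram, the repeating LSTM module can be sketched as follows. This is a minimal scalar version of the commonly used LSTM formulation (forget, input, and output gates plus a candidate value); the parameter dictionary p and the function names are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps a gate name ('f', 'i', 'o', 'g') to a tuple
    (input weight, hidden weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre_activation("f"))    # forget gate: how much old memory to keep
    i = sigmoid(pre_activation("i"))    # input gate: how much new content to write
    o = sigmoid(pre_activation("o"))    # output gate: how much memory to expose
    g = math.tanh(pre_activation("g"))  # candidate value for the cell state

    c_t = f * c_prev + i * g            # pointwise multiply and add
    h_t = o * math.tanh(c_t)            # hidden state passed to the next step
    return h_t, c_t

# Hand-picked parameters: forget gate almost fully open, input gate almost shut.
params = {"f": (0.0, 0.0, 20.0), "i": (0.0, 0.0, -20.0),
          "o": (0.0, 0.0, 0.0), "g": (1.0, 1.0, 0.0)}
h, c = lstm_step(1.0, 0.0, 0.7, params)
print(h, c)
```

With these parameters the cell state is carried over almost unchanged, which illustrates how the additive cell-state update lets an LSTM preserve information across many steps.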
Gated Recurrent Units (GRU)
GRU tries to solve the problem of vanishing gradients too, but differently than LSTMs.
Differences between LSTM and GRU:
LSTM contains three gates, namely input, forget, and output, whereas GRU has only two: update and reset.
LSTM maintains a separate internal memory (cell) state in addition to its hidden state, while GRU keeps only the hidden state.
LSTM applies a nonlinearity (typically tanh) to the cell state before the output gate scales it, while GRU doesn’t.
GRU uses two vectors, an update gate and a reset gate, to decide what information should be passed on to the output. These gates are specialized to retain information from many timesteps behind in the input.
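The role of the two gates can be sketched with a scalar GRU step. This is a minimal illustration under assumptions: the weights are hand-picked, the names gru_step and p are hypothetical, and the final interpolation follows one common convention (some references swap the roles of z and 1 - z).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One step of a scalar GRU cell.

    p maps 'z' (update gate), 'r' (reset gate), and 'h' (candidate)
    to a tuple (input weight, hidden weight, bias).
    """
    wz_x, wz_h, bz = p["z"]
    wr_x, wr_h, br = p["r"]
    wh_x, wh_h, bh = p["h"]

    z = sigmoid(wz_x * x_t + wz_h * h_prev + bz)  # update gate
    r = sigmoid(wr_x * x_t + wr_h * h_prev + br)  # reset gate
    # The reset gate decides how much of the old state feeds the candidate.
    h_candidate = math.tanh(wh_x * x_t + wh_h * (r * h_prev) + bh)
    # The update gate blends the old state with the candidate.
    return (1.0 - z) * h_prev + z * h_candidate

keep_old = {"z": (0.0, 0.0, -20.0), "r": (0.0, 0.0, 0.0), "h": (1.0, 1.0, 0.0)}
take_new = {"z": (0.0, 0.0, 20.0), "r": (0.0, 0.0, 0.0), "h": (1.0, 1.0, 0.0)}
print(gru_step(1.0, 0.3, keep_old))  # stays close to the old state 0.3
print(gru_step(1.0, 0.3, take_new))  # moves to the candidate value
```

Unlike the LSTM, there is no separate cell state here: the hidden state itself is what gets carried forward and blended.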
Quaternion Recurrent Neural Networks — The Latest
A paper presented at the International Conference on Learning Representations (ICLR) 2019 introduced Quaternion Recurrent Neural Networks (QRNN) alongside Quaternion LSTMs (QLSTM). These networks use quaternion algebra to capture both external relations and internal structural dependencies in the input features. The paper shows that the proposed networks work better than real-valued RNNs and LSTMs in realistic applications like automatic speech recognition, while reducing the number of parameters required by vanilla RNNs or LSTMs by up to 3.3 times.
Though RNNs perform well, many tasks are now related to multi-dimensional input features, such as pixels of an image, acoustic features, or orientations of 3D models. Moreover, RNN-based algorithms commonly require a huge number of parameters to represent sequential data in the hidden space.
Preprocessing Techniques for Textual Data
We all know how important it is to preprocess data for data science applications, and there are many tricks and techniques for getting data into a representation that machines can work with easily. In the same way, it is very important to preprocess textual data and convert it into a form that machine learning algorithms can use. Various techniques can be used for this; a few are described below:
Count Vector as Features
A count vector is a matrix representation of a dataset that records how often each word occurs in each document. This helps in identifying the most frequent words in a document, which are often (though not always) the most important ones. In the count vector matrix, every column represents a document from the corpus, every row represents a term in the corpus, and every cell gives the frequency of that particular term in that particular document.
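A count matrix of this kind can be built with the standard library alone. The sketch below assumes simple lowercase whitespace tokenization, and the function name count_vectors and the two sample documents are made up for illustration. It follows the layout described above (rows are terms, columns are documents); some libraries, such as scikit-learn's CountVectorizer, return the transposed layout with one row per document.

```python
from collections import Counter

def count_vectors(documents):
    """Return (vocabulary, matrix) with one row per term, one column per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    counts = [Counter(doc) for doc in tokenized]
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocabulary]
    return vocabulary, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = count_vectors(docs)
for term, row in zip(vocab, matrix):
    print(term, row)
# cat [1, 1]
# dog [0, 1]
# sat [1, 0]
# saw [0, 1]
# the [1, 2]
```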
TF-IDF Vector as features
TF stands for Term Frequency and IDF stands for Inverse Document Frequency. TF-IDF is used to find the relative importance of a word with respect to the entire corpus, i.e. it tells us how much a particular term contributes to the meaning of a document compared with how common that term is across all documents.
This score is generated at three levels:
Word level: Matrix representing TF-IDF scores of every term in different documents.
N-gram level: Matrix representing TF-IDF scores of N-grams. N-grams are combinations of N terms together.
Character level: Matrix representing TF-IDF scores of character-level N-grams in the corpus.
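The three levels can be illustrated with one small TF-IDF function applied to different tokenizations. This is a minimal sketch using one common unsmoothed variant, tf = count / document length and idf = log(N / document frequency); libraries often use smoothed or normalized variants, so their numbers will differ. The function names and sample documents are made up for this example.

```python
import math
from collections import Counter

def ngrams(items, n, sep=" "):
    """Join every run of n consecutive items into a single feature."""
    return [sep.join(items[i:i + n]) for i in range(len(items) - n + 1)]

def tfidf(tokenized_docs):
    """Return one {feature: score} dictionary per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each feature appears.
    doc_freq = Counter(f for doc in tokenized_docs for f in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            feature: (count / len(doc)) * math.log(n_docs / doc_freq[feature])
            for feature, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog sat", "the cat ran"]
word_level = tfidf([d.split() for d in docs])
bigram_level = tfidf([ngrams(d.split(), 2) for d in docs])
char_level = tfidf([ngrams(list(d), 3, sep="") for d in docs])

print(word_level[2])
```

In this toy corpus "the" appears in every document, so its score is zero, while "ran" appears in only one document and receives the highest score in the third document.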
A word embedding is a way of representing words and documents as dense vectors. The position of a word within the vector space is learned from text and is based on the words that surround it when it is used. Word embeddings capture the contexts in which words appear, which also helps in identifying words with similar meanings. Word embeddings can be trained on the input corpus or obtained from pre-trained embeddings such as GloVe, FastText, and Word2Vec.
Topic modelling is a process for automatically identifying the topics present in a text object and deriving hidden patterns exhibited by a text corpus. It is an unsupervised approach for finding groups of words that tend to occur together in a large collection of texts. It is usually done using Latent Dirichlet Allocation (LDA), a statistical model that allows a set of observations to be explained by unobserved groups that explain why some parts of the data are similar.
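Similarity between embeddings is commonly measured with cosine similarity. The sketch below uses tiny three-dimensional vectors that are invented purely for illustration; real embeddings are learned from text and typically have tens to hundreds of dimensions. The names cosine_similarity, most_similar, and toy_embeddings are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings):
    """Return the other word whose vector is closest to the given word's vector."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

# Made-up vectors, NOT taken from any real pre-trained model.
toy_embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

print(most_similar("king", toy_embeddings))
print(cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"]))
```

With pre-trained vectors, the same two functions can be used unchanged once the dictionary is filled with real embeddings.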