Here’s something you don’t hear every day – everything we perceive is just our brain’s best probabilistic prediction, based on our past encounters and on knowledge gained through other mediums. This might sound extremely counterintuitive, because we have always imagined that our brain gives us mostly deterministic answers.
We’ll do a small experiment to showcase this idea. Take a look at the image below:
Q1. Do you see a human?
Q2. Can you identify the person?
Q3. Can you identify the color of the clothes this person is wearing?
Keep these questions in mind. We will get back to them after a quick look at what we will be covering in this article.
Understanding Artificial Neural Networks becomes easy if you draw parallels with the working of the human brain. Today, we will explore the concept of embedding. We will first understand what “embedding” means in the context of the human brain, and then look at its applications and use cases.
Let’s get back to our experiment. Recall the blurred image and our three questions. All of you must have guessed the answer to the first question, most of you must have guessed the second one, and a few of you might have guessed the third one. Give yourself a confidence score (out of 100) for each of the three questions and average them. Why do you think you did not score an absolute 100? The reason is simple: our brain does not have enough data to identify this image with a probability of ~1.0. Now look at the image below:
Try to answer the three questions again. Did you get a 100 this time? You might think that your brain is returning a deterministic answer to the three questions this time. But it turns out your brain is making predictions with a probability very, very close to 1.0, because this time it received enough information.
Let’s make this experiment more interesting. Look at the first image again and compute your score. Did you score higher than before? If yes, the reason is simple: your brain used its past memory to add new information that is not even present in the picture, which increased its confidence. That new information is that the brain has seen an almost identical picture before.
The above experiment demonstrates that our brain tries to predict everything around us and works with its best guess. This is true to such an extent that the brain even has to learn, through experience, to predict the location of parts of our own body. It cannot be certain even about facts like where your hands, legs, or chest are located. Google “body transfer illusion” to read about experiments that demonstrate this.
Coming to the big question – if our brain predicts everything with past experience or data, how do we work so well in situations where we have no prior experience or knowledge? For instance, if you go to a grocery store and get a new fruit called “Alphaberry”, what will you do with this item?
Probably keep it in the refrigerator, wash it, and eat it. How do you know so much about what to do with this new item even though you have no experience with it? It turns out that our brain tries to create a semantic understanding of everything and refines this understanding with experience. This initial semantic understanding gives it a head start.
For instance, the brain knows that “Alphaberry” is a fruit, hence it will share a lot of properties with other fruits, i.e., we can eat it without any danger. Once we have eaten this fruit, our brain adds the new information to refine the semantic representation of “Alphaberry”.
These semantic representations can be used by the brain to find similarities between concepts/objects, draw analogies, or reason. The name “Alphaberry” is just the address of this semantic representation in our memory. The moment I ask you – “Have you tried alphaberry?” – your brain looks up the semantic representation of “alphaberry” and retrieves all the experiences/information about the fruit. It then evaluates probabilistic answers to the question and gives you the appropriate response – “Yes. It was very sweet.”
Let’s take another example. What is 20 * 10? Obviously 200. Numbers, unlike words and pictures, have their semantics encoded in themselves. Hence, our brain does not see 10 and 20 as addresses of semantics but as the semantics itself. Our brain can directly operate on these numbers, unlike “alphaberry”, where it had to retrieve the semantic representation before it could answer anything about the subject. This is an important concept, and we will refer to it later in this article.
The above description, as you can imagine, is an oversimplification of what really goes on in our complex brain. In the next section, we will talk about a concept used in Artificial Neural Networks (ANNs) that parallels semantic representation in humans.
In the last decade, we have made computers highly efficient with numbers (even better than humans). Computers can only work with numbers, as they are the only entities whose semantics are encoded in themselves. How can we make computers understand concepts like words, images, audio, or video? The answer lies in the previous section.
We need a “semantic representation” of all these concepts as numbers or arrays of numbers, since numbers are all that computers can understand. It is important that the array of numbers does a good job of denoting the semantics of the entity, or else we will have a “garbage in, garbage out” kind of model.
We will contrast two models of denoting semantics in order to get a deeper understanding of this concept. Here we will talk about the representation of 6 concepts – “Lion”, “Cub”, “Cat”, “Kitten”, “Apple”, “Alphaberry”. You have 6 pictures, one of each concept, and need to represent them in numeric form.
Our first model is a one-hot encoded vector representation of the words. The representation looks as follows:
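As a minimal sketch in code (the concepts and their order are just our toy example), such a one-hot table could be built like this:

```python
import numpy as np

# Our six toy concepts
concepts = ["Lion", "Cub", "Cat", "Kitten", "Apple", "Alphaberry"]

# One-hot encoding: each concept gets a vector with a single 1 at its
# own index and 0 everywhere else.
one_hot = {c: np.eye(len(concepts))[i] for i, c in enumerate(concepts)}

print(one_hot["Lion"])        # [1. 0. 0. 0. 0. 0.]
print(one_hot["Alphaberry"])  # [0. 0. 0. 0. 0. 1.]

# Every pair of distinct one-hot vectors is equally far apart, so this
# representation carries no notion of similarity between concepts.
print(np.dot(one_hot["Lion"], one_hot["Cub"]))  # 0.0 -- no shared semantics
```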
Our second model is based on the attributes of each picture. This is much closer to the semantics/meaning of the picture. Following is an example:
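Here is a small sketch of such an attribute-based representation. The attribute names (is_feline, is_adult, is_wild, is_fruit) and their values are invented purely for illustration:

```python
import numpy as np

# Hypothetical hand-crafted attributes: [is_feline, is_adult, is_wild, is_fruit]
semantic = {
    "Lion":       np.array([1.0, 1.0, 1.0, 0.0]),
    "Cub":        np.array([1.0, 0.0, 1.0, 0.0]),
    "Cat":        np.array([1.0, 1.0, 0.0, 0.0]),
    "Kitten":     np.array([1.0, 0.0, 0.0, 0.0]),
    "Apple":      np.array([0.0, 0.0, 0.0, 1.0]),
    "Alphaberry": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    # cosine similarity: 1.0 means identical direction, 0.0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar concepts now land close together ...
print(cosine(semantic["Lion"], semantic["Cat"]))          # high similarity
# ... and a brand-new concept like "Alphaberry" immediately looks like
# the other fruit we already know.
print(cosine(semantic["Alphaberry"], semantic["Apple"]))  # 1.0
```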
Now let’s contrast the two. Try answering the following questions with each representation individually.
I am sure you already noticed that the second representation did well on all 3 questions above. Not only does the semantic representation give us a head start on new concepts like “Alphaberry”, it also helps us infer logically. For instance, if you do the following computation, you will find the answer to our second question:
Vector(Pic1-Lion) + Vector(Pic4-Kitten) - Vector(Pic3-Cat) = Vector(Pic2-Cub)
How awesome is that! Now we can do mathematical operations on abstract concepts like words/pictures/audio. We also know that semantic representations in the form of numeric vectors can be very helpful for our ANN models. But how can we create semantic representations for all the words/images/audio in the world? The answer is simple – by learning from lots and lots of data. The features from such learning will be far more abstract than the ones we used in our demonstrative example above.
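To see the analogy computation concretely, here is a short sketch using the toy attribute vectors from above (again, the attributes are illustrative assumptions, not learned features):

```python
import numpy as np

# Reusing the toy attribute vectors [is_feline, is_adult, is_wild, is_fruit]
semantic = {
    "Lion":   np.array([1.0, 1.0, 1.0, 0.0]),
    "Cub":    np.array([1.0, 0.0, 1.0, 0.0]),
    "Cat":    np.array([1.0, 1.0, 0.0, 0.0]),
    "Kitten": np.array([1.0, 0.0, 0.0, 0.0]),
}

# Lion + Kitten - Cat should land closest to Cub
result = semantic["Lion"] + semantic["Kitten"] - semantic["Cat"]
closest = min(semantic, key=lambda c: np.linalg.norm(semantic[c] - result))
print(result, "->", closest)   # [1. 0. 1. 0.] -> Cub
```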
Just as our brain uses semantics in all cognitive tasks, Artificial Neural Networks use semantic embeddings for numerous tasks. We will categorize these applications under the 3 main types of embedding they use.
Let us now try to understand the generic process of learning an embedding.
The underlying methodology for developing any type of embedding is almost the same. We prepare training data with one-hot encoded entities (these can be words, images, or audio clips), define some kind of target function, and train a neural network. We then throw away the last layer of the network and use the weights of the intermediate layer as the embedding. The generic process flow is as follows:
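As a minimal, illustrative sketch of this flow in PyTorch (the vocabulary size, embedding dimension, and layer choices are assumptions, not a prescription):

```python
import torch
import torch.nn as nn

# One-hot input -> small intermediate layer -> task-specific output head.
vocab_size, embed_dim, num_targets = 10_000, 300, 10_000

model = nn.Sequential(
    nn.Linear(vocab_size, embed_dim, bias=False),  # weights of this layer become the embedding matrix
    nn.Linear(embed_dim, num_targets),             # task head, thrown away after training
)

# ... train `model` on (one_hot_entity, target) pairs with a suitable loss ...

# After training, keep only the first layer's weights as the embedding lookup.
embedding_matrix = model[0].weight.detach().T      # shape: (vocab_size, embed_dim)
```

Once training is done, looking up a row of `embedding_matrix` gives the learned semantic vector for the corresponding entity.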
Most of the algorithms used to train text embeddings work on a very simple framework. Let’s understand this with an example. We will use the following sentence to illustrate the logic:
I have been playing Cricket since my childhood.
Word2vec – Word2vec is the most popular embedding algorithm. It is a simple instance of the generic framework. We randomly choose a target word and then choose one word from its surrounding context window as the context. For instance, we can choose “playing” as the target and “Cricket” as the context. Now we run the generic model that we discussed in the last section.
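As a rough illustration, here is how (target, context) pairs could be generated from our example sentence; the window size of 2 is an assumption chosen purely for illustration:

```python
# Hypothetical skip-gram pair generation for our example sentence.
sentence = "I have been playing Cricket since my childhood".lower().split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # every word within +/- `window` positions of the target is a valid context
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# ('playing', 'cricket') is one of the generated (target, context) pairs
print(("playing", "cricket") in pairs)  # True
```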
The only limitation of this methodology is that the softmax computation at the end of the framework is extremely expensive. This is because the number of output nodes equals the vocabulary size (which is generally more than 10k). You can always use a pre-trained word2vec matrix for your business case to avoid this computational cost.
Negative Sampling – This is another powerful concept, which gets rid of this challenge in word2vec. Instead of multiple output nodes, we convert the problem into a binary classification. The target word is chosen at random, as before. For the context word, we first choose one of the genuine nearby words, and then choose several random words from the dictionary. Each of these pairs is treated as a separate observation.
In the above table, we use the field “correct” as the output node and treat the model as a binary classifier. Hence, we avoid the expensive softmax computation of the word2vec algorithm.
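The following sketch shows one way such a training table could be constructed; the vocabulary, the number of negative samples, and the word choices are illustrative assumptions:

```python
import random

# Toy vocabulary; real systems sample from the full dictionary.
vocabulary = ["i", "have", "been", "playing", "cricket", "since",
              "my", "childhood", "apple", "lion"]

target, true_context = "playing", "cricket"
k = 4  # number of negative samples per positive pair

rows = [(target, true_context, 1)]                       # genuine pair -> label 1
for _ in range(k):
    rows.append((target, random.choice(vocabulary), 0))  # random word -> label 0

for r in rows:
    print(r)
# The third column plays the role of the "correct" field: the model is now a
# binary classifier, so no softmax over the full vocabulary is needed.
```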
Other algorithms – Many other algorithms, such as GloVe word vectors, are used in the industry. All of them work on the same generic architecture with minor changes. If you have a small dataset, it is generally recommended to use pre-trained embeddings. These embeddings have been trained on millions of documents and hence carry very accurate semantic information.
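If you do want to use pre-trained vectors, one common route (assuming the gensim package is installed; the first call downloads the vectors) looks roughly like this:

```python
import gensim.downloader as api

# Load pre-trained GloVe vectors (100-dimensional, trained on Wikipedia + Gigaword)
vectors = api.load("glove-wiki-gigaword-100")

print(vectors["cricket"].shape)                 # (100,) -- the embedding for "cricket"
print(vectors.most_similar("cricket", topn=3))  # semantically close words
```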
Even though word embedding is the most popular application of embedding, image and speech embeddings are no less important when it comes to practical applications. The main use of both image and speech embeddings is authentication. We authenticate customers in every industry before sharing any private information. There is a good chance you have encountered embeddings without even knowing it. Consider the following examples:
Hopefully you can relate to some of the above use cases. Each of these is primarily based on image or speech embedding. Both speech and images are analyzed with the same objective, i.e., to find the similarity between multiple voices/images, using almost the same architecture. The only difference is that speech is first converted into an image-like representation using filter banks/MFCC, which approximates how humans perceive sound (a small sketch of this conversion follows below). From there, it follows the same process as the formation of image embeddings. In both speech and image, we see two broad use cases – verification and recognition.
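Picking up the speech-to-image conversion mentioned above, a typical sketch, assuming the librosa library is available (“voice_sample.wav” is a hypothetical file), looks roughly like this:

```python
import librosa

# Load the raw waveform of a (hypothetical) voice recording
y, sr = librosa.load("voice_sample.wav", sr=16000)

# Convert the waveform into MFCCs: a 2-D, image-like representation
# (coefficients x time frames) that approximates human hearing
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mfcc.shape)  # e.g. (13, number_of_frames) -- can now be fed to a CNN like any image
```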
Why do we need embeddings for these verification/recognition tasks? Why can’t we train a separate model for each face/voice? We already know that neural networks need a lot of data to be precise and accurate. However, most of our verification/recognition tasks are one-shot learning, i.e., learning from one or very few examples. For instance, Baidu’s system will probably have only 1 or 2 pictures of every employee. How can we build a model when we have just a few data points per class? This is why we create an embedding for each image and then look for similarities between the embeddings. This concept will become clearer once we walk through the neural network architecture for training image/voice embeddings.
The neural network architecture for training image embeddings is commonly known as a Siamese network. I have included only one of the many algorithms used to create image embeddings. In this method, we randomly choose two images from our population and send both through the same shared stack of CNN layers. The vector we get as output is the image embedding. We then take the distance/difference between the two embeddings and pass this difference through an activation function to check whether the two images are of the same person.
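Here is a minimal, illustrative sketch of such a Siamese setup in PyTorch. The layer sizes, the 64x64 input, and the use of an absolute difference followed by a sigmoid are assumptions for demonstration, not the exact architecture of any production system:

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Shared CNN: both images pass through the SAME weights
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, embed_dim),        # the image embedding
        )
        self.classifier = nn.Linear(embed_dim, 1)      # scores the embedding difference

    def forward(self, img_a, img_b):
        emb_a, emb_b = self.cnn(img_a), self.cnn(img_b)
        diff = torch.abs(emb_a - emb_b)                # element-wise distance between embeddings
        return torch.sigmoid(self.classifier(diff))    # probability the two faces match

model = SiameseNet()
a, b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)  # two dummy 64x64 RGB images
print(model(a, b))  # value in (0, 1): predicted probability of "same person"
```

During training, such a network would typically see pairs labelled “same person” or “different person”, so the shared CNN learns an embedding in which matching faces land close together.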
Note that the embedding we get from this process is not specific to any one person; it captures features that tell us “how similar two faces look”.
I hope this article has given you a strong foundation in the concept of embedding and helped you understand how important it is when analyzing unstructured data. In simple terms, we are creating structured data out of unstructured underlying data using these embeddings. This structured data has the meaning of the underlying data embedded in the form of a vector – hence the name “embedding”.
If you have any ideas or suggestions regarding the topic, do let me know in the comments below!