There’s text everywhere around us, from digital sources like social media to physical objects like books and print media. The amount of text data being generated every day is mind-boggling, and yet we’re not even close to harnessing the full power of natural language processing.
I see a ton of aspiring data scientists interested in this field, but they often turn away, daunted by the challenges NLP presents. It’s such a niche line of work, and we at Analytics Vidhya would love to see more of our community actively participate in ground-breaking work in this field.
So we thought, what better way to do this than to get an NLP expert on our DataHack Radio podcast? Yes, we’ve got none other than Sebastian Ruder in Episode 12! This podcast is a knowledge goldmine for NLP enthusiasts, so make sure you tune in.
It has catapulted to the top of my list of recommended machine learning podcasts. I strongly feel every aspiring and established data science professional should take the time to hear Sebastian talk about the diverse and often complex NLP domain.
If you’re looking for a place to get started with NLP, you’ve landed at the right place! Check out Analytics Vidhya’s Natural Language Processing using Python course and enrol yourself TODAY!
Subscribe to DataHack Radio today and listen to this, as well as all previous episodes, on any of the below platforms:
Sebastian’s background is in computational linguistics, which is essentially a combination of computer science and linguistics. His interest in mathematics and languages was piqued in high school, and he carved a career out of that.
He completed his degree in Germany in this field as well, and has been immersed in the research side of things ever since. This is a very niche field, so there were a lot of things Sebastian had to pick up from scratch and learn on his own. Quite an inspiring story for folks looking to transition into machine learning: a healthy dose of passion added to tons of dedication and an inquisitive mind.
He is currently working at Aylien as a research scientist and also pursuing a Ph.D. at the National University of Ireland in, you guessed it, Natural Language Processing. Relationship extraction, named entity recognition and sentiment analysis are some of the areas Sebastian worked on during his initial Ph.D. years.
There’s a certain bias in the machine learning world when it comes to NLP. When someone mentions text, the first language that pops into our mind is English. How difficult is it to transfer an NLP model between languages? Next to impossible, as it turns out. Sebastian explained this by comparing his native tongue, German, with English.
German is morphologically richer than English; in English you can go a long way with simple tokenization and word-level models, while in German you need to be more careful about how words are composed. This matters in computational linguistics because the internal structure of words carries meaning.
And of course, languages also differ in how words are used. Because of this, each language needs its own set of rules, which is what makes working with text so challenging.
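As a quick illustration of the tokenization point (a hand-rolled sketch of my own, not an example from the episode), naive whitespace tokenization serves English reasonably well but leaves German compounds as single opaque tokens:

```python
# Whitespace tokenization treats a German compound noun as one opaque token,
# while the equivalent English concept is already split into word-level units.

english = "computational linguistics research institute"
german = "Forschungsinstitut für Computerlinguistik"

print(english.split())
# ['computational', 'linguistics', 'research', 'institute']

print(german.split())
# ['Forschungsinstitut', 'für', 'Computerlinguistik']
# 'Forschungsinstitut' fuses 'Forschung' (research) and 'Institut' (institute)
# with a linking -s-. A word-level model sees one rare token where English has
# two common ones, which is why subword segmentation helps for morphologically
# rich languages.
```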
For sentiment analysis, Sebastian mentioned that the primary example is looking at different categories of product reviews. These categories are usually well-defined, and getting training data for them is comparatively easy. Apart from this, social media data (especially from Twitter) is popularly mined to extract sentiment.
Other commonly used sources include newspaper articles, magazines and blogs, in both print and digital form. If you’re applying deep learning techniques, scanned images of text can also be used to train your model.
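To make the product-review setting concrete, here is a minimal sentiment-classification sketch (my own toy illustration, not code from the episode): a bag-of-words baseline with scikit-learn, trained on a handful of made-up reviews.

```python
# A minimal sentiment classifier for product reviews: TF-IDF features plus a
# linear model, a common baseline when labeled review data is easy to obtain.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Great battery life, totally worth the price",
    "Stopped working after two days, very disappointed",
    "Excellent build quality and fast delivery",
    "Terrible customer service and a broken screen",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (made-up toy labels)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["The screen broke within a week"]))  # likely [0]
```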
Sebastian recently co-authored a fascinating research paper with the great Jeremy Howard called ‘Universal Language Model Fine-tuning for Text Classification’ (ULMFiT). The paper made waves in the NLP community as soon as it was released, and the techniques are available in the fastai library.
ULMFiT is an effective transfer learning method that can be applied to almost any NLP task. At the time of its release, it outperformed the state of the art on six text classification tasks. Sebastian and Jeremy have done all the hard work for you: they have released pretrained models that you can plug into your existing project to generate incredibly accurate text classification results.
You can read the paper in full here.
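If you want to try the method yourself, here is a sketch of the ULMFiT recipe using fastai’s v2 text API; the dataset, epoch counts and learning rates below are my own illustrative choices, though the two-stage structure, gradual unfreezing and discriminative learning rates follow the paper.

```python
# Sketch of the ULMFiT recipe with fastai (v2 API assumed).
from fastai.text.all import *

path = untar_data(URLs.IMDB_SAMPLE)              # small labeled movie-review sample
df = pd.read_csv(path/'texts.csv')

# Stage 1: fine-tune a pretrained AWD-LSTM language model on the target corpus.
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
lm_learn.fine_tune(1)                            # one epoch, just for illustration
lm_learn.save_encoder('finetuned_encoder')

# Stage 2: reuse the fine-tuned encoder in a classifier, then train it with
# gradual unfreezing and discriminative learning rates, as in the paper.
dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                  text_vocab=dls_lm.vocab)
clf_learn = text_classifier_learner(dls_clf, AWD_LSTM, metrics=accuracy)
clf_learn.load_encoder('finetuned_encoder')
clf_learn.fit_one_cycle(1, 2e-2)                 # train only the classifier head
clf_learn.freeze_to(-2)                          # unfreeze one more layer group
clf_learn.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2))
clf_learn.unfreeze()                             # finally fine-tune everything
clf_learn.fit_one_cycle(1, slice(1e-3 / (2.6**4), 1e-3))
```

The `2.6**4` factor comes from the paper’s discriminative fine-tuning heuristic, where each earlier layer group gets a learning rate roughly 2.6 times smaller than the one above it.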
The most prominent challenge in most machine learning applications is first getting the data you need to train the model, and then finding enough computational resources to actually do the training. This has often proved to be a step too far for a lot of projects.
So Sebastian introduced us to the idea of improving sample efficiency, wherein you can train models without having to collect millions of text data points. In addition, the trained model should not overfit on this relatively small sample and should generalize well.
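One simple way to see what sample efficiency means in practice (my own illustration, not something from the episode) is to train the same model on growing subsets of a dataset and watch how quickly held-out accuracy saturates:

```python
# Measure sample efficiency with a learning curve: a more sample-efficient
# method reaches high held-out accuracy with a smaller training subset.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset='train', categories=['sci.med', 'sci.space'])
test = fetch_20newsgroups(subset='test', categories=['sci.med', 'sci.space'])

for n in (50, 200, 1000):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train.data[:n], train.target[:n])
    acc = accuracy_score(test.target, model.predict(test.data))
    print(f"{n:>5} training examples -> test accuracy {acc:.3f}")
```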
Another challenge, which we touched on earlier, is the lack of datasets in languages other than English. The majority of data, and consequently of algorithms, is English-centric. We should seriously think about democratizing data in other languages to reduce this gap and mitigate the current bias.
I’ve always had a deep fascination with the field of NLP, given my interest in literature. So it was a pleasure to hear Sebastian dive deep into the nuts and bolts of different text challenges. As I mentioned at the start of the article, this is definitely one of my favorite podcasts we’ve hosted on DataHack Radio, and I hope you find it as useful as I did.