In the latest episode of Leading with Data, we had the pleasure of hosting Matthew Honnibal, the founder of Explosion AI and creator of the widely used spaCy NLP library. Matthew’s mission is to democratize the development of language technologies, making them accessible beyond those with advanced degrees in the field. With a strong background in both the theoretical and practical sides of natural language processing (NLP), Matthew has contributed significantly to the advancement of the field. His work includes over 20 peer-reviewed publications, breakthrough contributions in parsing conversational speech, and impactful projects that bridge the gap between research and real-world applications.
You can listen to this episode of Leading with Data on popular platforms like Spotify, Google Podcasts, and Apple Podcasts. Pick your favorite to enjoy the insightful content!
Let’s look into the details of our conversation with Dr. Matthew Honnibal!
In the last few years, NLP has seen significant advancements, particularly with the advent of deep learning and pre-trained transformers like BERT. These models have revolutionized the field by making effective use of unlabeled data, so task-specific models can be trained with far fewer labeled examples. This shift has been a game-changer: models now start with some knowledge of language before being applied to a task, rather than learning everything from scratch.
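To make the pretrain-then-fine-tune idea concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries (not something covered in the episode); the model name, dataset, subset sizes, and hyperparameters are illustrative assumptions:

```python
# Minimal sketch: start from a pre-trained transformer, then fine-tune it
# on a comparatively small labeled dataset for a specific task.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# The model already "knows" language from pre-training on unlabeled text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A small labeled dataset is enough to adapt it to a classification task.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1),
    train_dataset=encoded["train"].shuffle(seed=0).select(range(2000)),
    eval_dataset=encoded["test"].select(range(500)),
)
trainer.train()
```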
spaCy has continued to serve the needs it was designed for, despite the emergence of new technologies like large language models (LLMs). The library has remained relevant, and its use cases have only grown as more people get into NLP. We’ve stayed true to our roots, focusing on solving real NLP problems and ensuring that spaCy evolves alongside the field without deviating from its original purpose.
Initially, I was skeptical about the potential of LLMs, but their success has been undeniable. However, it’s still unclear what direction things will take. While in-context learning has its advantages, especially as a prototyping tool, there’s still a significant technical benefit to training models for classification problems. The more niche the domain, the better a trained model tends to do compared with in-context learning. And it’s not just about the domain but also the task: in-context learning tends to be less effective for tasks with many labels or with label definitions that are somewhat arbitrary rather than intuitive.
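As a concrete illustration of the trained-model route for classification, here is a minimal sketch using spaCy’s built-in textcat component; the labels and the tiny training set are invented for illustration and do not come from the episode:

```python
# Minimal sketch: train a small, task-specific text classifier with spaCy.
import random

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("BILLING")
textcat.add_label("TECH_SUPPORT")

# A handful of labeled examples standing in for a real annotated dataset.
train_data = [
    ("I was charged twice this month",
     {"cats": {"BILLING": 1.0, "TECH_SUPPORT": 0.0}}),
    ("The app crashes when I log in",
     {"cats": {"BILLING": 0.0, "TECH_SUPPORT": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]

optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

# The trained component returns calibrated scores per category.
print(nlp("Why did my invoice go up?").cats)
```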
Explosion has seen a lot of change, including the pandemic and the growth of AI technologies. We’ve maintained our commitment to using the tools we develop and to solving real NLP problems. Consulting has been an integral part of our business, allowing us to stay in touch with real-world applications and test new methods. spacy-llm, our latest initiative, encapsulates the process of prompting an LLM and annotating a spaCy Doc object with the result, and it lets users replace the LLM-powered component with a trained model if desired. It’s particularly useful for prototyping and for working alongside rule-based classifiers.
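Here is a minimal sketch of that workflow with the spacy-llm package. The registry names (e.g. "spacy.NER.v2", "spacy.GPT-3-5.v1") follow the package’s documented config format but may differ between versions, and the labels are illustrative assumptions:

```python
# Minimal sketch: an LLM-backed pipeline component writes its predictions
# onto a regular spaCy Doc, so it can later be swapped for a trained component.
# Requires the spacy-llm package to be installed and an OpenAI API key set.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",  # factory provided by the spacy-llm package
    config={
        "task": {
            "@llm_tasks": "spacy.NER.v2",
            "labels": ["PERSON", "ORG", "PRODUCT"],
        },
        "model": {"@llm_models": "spacy.GPT-3-5.v1"},
    },
)

doc = nlp("Matthew Honnibal founded Explosion, the company behind spaCy.")
# The LLM's annotations land on the standard Doc object, just like the
# output of a trained NER component would.
print([(ent.text, ent.label_) for ent in doc.ents])
```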
The belief that developers need custom models and transparent tools stems from the idea that ease of starting isn’t the only factor that matters in AI development. What’s crucial is the ability to invest more time and effort into a project to consistently improve it. Open source software has been successful because it offers predictability and the ability to build a mental model of what you’re developing against, as opposed to vendor solutions that may hit walls as you progress.
I believe that smaller, task-specific models will continue to be important, especially for machine-facing tasks. The feasibility of running all classifiers at the scale of GPT-4 is doubtful due to resource constraints. However, LLMs will play a significant role in improving the efficiency of creating classifiers, especially in data annotation and understanding training issues. We’ll also see more applications that connect machine-facing outputs to human-facing outputs in rich and interesting ways.
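One way LLMs can speed up classifier creation is by pre-annotating data for human review. Below is a hedged sketch using the OpenAI Python SDK (v1+); the model name, labels, and prompt are illustrative assumptions rather than anything described in the episode:

```python
# Hedged sketch: an LLM proposes a label for each text, and a human reviewer
# keeps or corrects it before the data is used to train a small classifier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["BILLING", "TECH_SUPPORT", "OTHER"]

def propose_label(text: str) -> str:
    """Ask the LLM for a candidate label; a human reviews it afterwards."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {
                "role": "user",
                "content": (
                    f"Classify the text into one of {LABELS}. "
                    f"Reply with the label only.\n\nText: {text}"
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip()

# The (text, proposed_label) pairs would then go into an annotation tool
# for human review before being used as training data.
candidates = [
    (text, propose_label(text))
    for text in ["I was charged twice this month", "The app crashes when I log in"]
]
print(candidates)
```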
Multimodal tasks are becoming increasingly feasible with larger-scale models. While truly multimodal tasks combining text and image are rarer in business, understanding formatted documents, including tables and figures, is a significant part of the business need for NLP. Better capabilities in this area are crucial, and I expect continued improvement in handling formatted text and numbers.
Matthew Honnibal’s insights in this episode underscore the dynamic evolution of NLP, highlighting the profound impact of deep learning and pre-trained transformers. His balanced view on the coexistence of large language models and task-specific models emphasizes the nuanced approach needed for different NLP applications. Explosion AI’s continued innovation, particularly with the introduction of spaCy LLM, showcases their commitment to practical solutions and real-world impact. As we look to the future, Matthew’s belief in the importance of custom models and transparent tools serves as a guiding principle for sustainable AI development, ensuring adaptability and continuous improvement in the field of NLP.
For more engaging sessions on AI, data science, and GenAI, stay tuned with us on Leading with Data.