In this episode of Leading with Data, we dive into the fascinating world of data science with Rohan Rao, a Quadruple Kaggle Grandmaster and expert in machine learning solutions. Rohan shares insights on strategic partnerships, the evolution of data tools, and the future of large language models, shedding light on the challenges and innovations shaping the industry.
You can listen to this episode of Leading with Data on popular platforms like Spotify, Google Podcasts, and Apple Podcasts. Pick your favorite and enjoy the conversation!
Let’s look into the details of our conversation with Rohan Rao!
Thank you, Kunal, for having me on Leading with Data. My journey in data science began nearly a decade ago, filled with coding, hackathons, and competitions. It’s hard to pick a single standout competition, but one memorable moment was achieving a hat trick of wins on Analytics Vidhya’s hackathons by strategically teaming up with a strong competitor. That move paid off and remains a fond memory from my competitive days.
The field of data science has seen phases of gradual progress and sudden leaps. Tools like XGBoost revolutionized predictive modeling, while BERT transformed NLP. Recently, the release of ChatGPT marked a significant milestone, showcasing the capabilities of LLMs. These advancements have required data scientists to continuously adapt and upgrade their skills.
The trajectory of LLMs tends to show a steep initial improvement followed by a plateau, and incremental gains become harder to achieve over time. While LLMs have already learned from vast amounts of internet data, future improvements may hinge on new, large datasets or innovations in synthetic data generation. The computational resources available today are unprecedented, making innovation more accessible than ever.
Businesses across various industries are eager to integrate LLMs into their operations. The challenge lies in connecting these models to proprietary business data, which is often far less extensive than the data LLMs are trained on. At H2O.ai, a significant portion of our work is focused on enabling businesses to leverage the power of LLMs with their unique datasets.
The most common question from businesses is how to make an LLM learn from their specific data. The goal is to apply the general capabilities of LLMs to address unique business challenges. This involves understanding the models’ strengths and limitations and integrating them with existing systems and data formats.
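One common pattern for applying a general-purpose LLM to specific business data is retrieval augmentation: fetch the most relevant internal documents for a query and supply them to the model as context. Here is a minimal, self-contained sketch of the retrieval step using a toy bag-of-words similarity; all function names and documents are illustrative, and a production system would use a learned embedding model and a vector store instead.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a real system would use a learned embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Hypothetical proprietary business documents
docs = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping: orders ship within 2 business days.",
    "Warranty: hardware is covered for one year.",
]

context = retrieve("How long do customers have to return a product?", docs)
# The retrieved context would then be prepended to the LLM prompt.
```

The key design choice is that the business data never leaves the retrieval layer until it is needed, so the LLM itself stays general-purpose.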
Certainly. The framework I presented at the Data Hack Summit includes 12 points to consider when selecting an LLM for your business. These range from the model’s capabilities and accuracy to scalability, cost, and legal considerations like compliance and privacy. It’s crucial to evaluate these factors to determine which LLM aligns best with your business objectives and constraints.
The key is to experiment and iterate. While traditional algorithms like XGBoost have been the go-to for many problems, LLMs offer new possibilities. By comparing their performance on specific tasks, businesses can determine which approach yields better results and is more feasible to deploy and manage.
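The experiment-and-compare approach described above can be sketched as a simple evaluation harness: score each candidate on the same held-out examples and keep the winner. The two predictors below are hypothetical stand-ins; in practice they would wrap a trained XGBoost model and an LLM call.

```python
from typing import Callable, Sequence

def accuracy(predict: Callable[[str], str], texts: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of held-out examples the predictor classifies correctly."""
    return sum(predict(t) == y for t, y in zip(texts, labels)) / len(texts)

# Hypothetical stand-ins for the real candidates
def tabular_model(text: str) -> str:
    return "positive" if "good" in text else "negative"

def llm_model(text: str) -> str:
    return "positive" if any(w in text for w in ("good", "great")) else "negative"

# Illustrative held-out evaluation set
texts = ["good product", "great service", "bad experience", "good value"]
labels = ["positive", "positive", "negative", "positive"]

candidates = {"xgboost": tabular_model, "llm": llm_model}
scores = {name: accuracy(fn, texts, labels) for name, fn in candidates.items()}
best = max(scores, key=scores.get)
```

Beyond raw accuracy, the same loop can compare latency and cost per prediction, which often decide deployability.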
Choosing between proprietary LLMs with APIs and hosting open-source LLMs on-premises is a significant decision. While open-source models may seem cost-effective, they come with hidden complexities like infrastructure management and scalability. Often, businesses gravitate towards API services for their convenience, despite higher costs.
Responsible AI is a complex issue that extends beyond technological solutions. While guardrails and frameworks are in place to prevent misuse, the unpredictable nature of LLMs makes it difficult to fully control. The solution may involve a combination of technological safeguards, government policies, and AI regulations to balance innovation with ethical use.
I’m extremely bullish on the potential of AI agents. Specialized agents can perform specific tasks with high accuracy, and the challenge lies in integrating these microtasks into broader solutions. While some products may simply wrap existing LLMs with custom prompts, truly specialized agents have the potential to revolutionize how we approach problem-solving in various domains.
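The idea of composing specialized agents into a broader solution can be sketched as a simple task router: each agent handles one narrow microtask, and a dispatcher directs work to the right specialist. This is a minimal illustration with toy agents; real agents would call models or tools.

```python
from typing import Callable

# Hypothetical specialized "agents": each handles one narrow microtask.
def summarize(text: str) -> str:
    """Naive first-sentence summary, standing in for a summarization agent."""
    return text.split(".")[0] + "."

def word_count(text: str) -> str:
    """Standing in for an analysis agent."""
    return str(len(text.split()))

AGENTS: dict[str, Callable[[str], str]] = {
    "summarize": summarize,
    "count": word_count,
}

def route(task: str, payload: str) -> str:
    """A broader solution dispatches each microtask to its specialist agent."""
    if task not in AGENTS:
        raise ValueError(f"no agent registered for task {task!r}")
    return AGENTS[task](payload)
```

The integration challenge Rohan describes lives in this routing layer: deciding which agent to invoke, in what order, and how to combine their outputs.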
As Rohan emphasizes, navigating the landscape of data science and generative AI requires continuous learning and experimentation. By embracing innovative frameworks and responsible AI practices, businesses can harness the power of data to drive meaningful solutions, ultimately transforming the way they operate and compete in a rapidly evolving market.
For more engaging sessions on AI, data science, and GenAI, stay tuned with us on Leading with Data.