Large Language Models (LLMs) have changed the entire world. For the AI community in particular, they represent a giant leap forward: building a system that can understand and respond to almost any text was unthinkable a few years ago. However, these capabilities come at the cost of depth. Generalist LLMs are jacks of all trades but masters of none. For domains that require depth and precision, flaws like hallucinations can be costly. Does that mean domains like medicine, finance, engineering, and law will never reap the benefits of LLMs? Not at all: experts have already started building dedicated domain-specific LLMs for such areas, leveraging the same underlying techniques such as self-supervised learning and RLHF. This article explores domain-specific LLMs and their capability to yield better results.
Before we dive into the technical details, let us outline what this article covers: what large language models are, why generalist LLMs fall short in specialized domains, and the main approaches for building domain-specific LLMs.
A large language model, or LLM, is an artificial intelligence system that contains hundreds of millions to billions of parameters and is built to understand and generate text. Training involves exposing the model to enormous amounts of internet text, including books, articles, websites, and other written material, and teaching it to predict masked words or the next word in a sentence. By doing so, the model learns the statistical patterns and linguistic relationships in the text it has been trained on. LLMs can be used for various tasks, including language translation, text summarization, question answering, content generation, and more. Since the invention of transformers, countless LLMs have been built and published. Some examples of recently popular LLMs are ChatGPT, GPT-4, LLaMA, and Stanford Alpaca, which have achieved groundbreaking performance.
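To make the training objective concrete, here is a minimal sketch of masked-word prediction using the Hugging Face transformers library. The bert-base-uncased checkpoint and the example sentence are illustrative assumptions, not artifacts from this article:

```python
# Minimal sketch: ask a pre-trained masked language model to fill in a blank,
# mirroring the "predict the masked word" objective described above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model scores candidate tokens for the [MASK] position based on the
# statistical patterns it learned during pre-training.
for prediction in fill_mask("The doctor prescribed a [MASK] for the infection."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```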
LLMs have become the go-to solution for language understanding, entity recognition, language generation problems, and more. Stellar performances on standardized evaluation benchmarks like GLUE, SuperGLUE, SQuAD, and BIG-bench reflect this achievement. When released, BERT, T5, GPT-3, PaLM, and GPT-4 all delivered state-of-the-art results on these standardized tests. GPT-4 scored higher on the bar exam and the SAT than the average human test-taker. The chart (Figure 1) below shows the significant improvement on the GLUE benchmark since the advent of large language models.
Another major advantage of large language models is their improved multilingual capability. For example, the multilingual BERT model, trained on 104 languages, has shown strong zero-shot and few-shot results across different languages. Moreover, the cost of leveraging LLMs has become relatively low. Low-cost methods like prompt design and prompt tuning have emerged, allowing engineers to easily leverage existing LLMs at a meager cost. Hence, large language models have become the default option for language-based tasks, including language understanding, entity recognition, translation, and more.
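As an illustration of how cheap prompt design can be, here is a hedged sketch in which the task is specified entirely in the prompt, with no parameter updates. The google/flan-t5-base checkpoint is an assumed example of an instruction-tuned model; any hosted or local LLM can be used the same way:

```python
# Prompt design sketch: steer a general-purpose, instruction-tuned model toward
# a task purely through the wording of the prompt.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

prompt = (
    "Translate the following English sentence to German:\n"
    "The quarterly report is due on Friday."
)
# No training involved: the same frozen model handles any task the prompt describes.
print(generator(prompt, max_new_tokens=40)[0]["generated_text"])
```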
Most popular LLMs, like the ones mentioned above, are trained on a wide variety of text sources from the web, books, Wikipedia, and more, and are called generalist LLMs. These LLMs have found multiple applications, ranging from search assistants (Bing Chat using GPT-4, Bard using PaLM), to content generation tasks like writing marketing emails, marketing content, and sales pitches, to question-answering tasks like personal chatbots and customer service chatbots.
Although generalist AI models have shown great skill in understanding and generating text across a wide range of topics, they often lack the depth and nuance required for specialized areas. For example, a “bond” is a form of borrowing in the finance industry. However, a general language model may not grasp this specialized meaning and may confuse it with a chemical bond or the bond between two people. On the other hand, domain-specific LLMs have a specialized understanding of terminology related to specific use cases and can interpret industry-specific ideas properly.
Moreover, generalist LLMs pose multiple privacy challenges. In the case of medical LLMs, for example, patient data is highly sensitive, and sending such confidential data to a generic LLM provider could violate privacy agreements, since user inputs may be retained and reused for techniques like RLHF. Domain-specific LLMs, on the other hand, can be trained and served within a closed environment, preventing any data from leaking.
Similarly, generalist LLMs are prone to significant hallucinations, as they are often tuned heavily toward creative writing. Domain-specific LLMs are more precise and perform significantly better on their field-specific benchmarks, as seen in the use cases below.
LLMs that are trained on domain-specific data are called domain-specific LLMs. The term domain covers anything from a specific field, like medicine or finance, to a specific product, like YouTube comments. A domain-specific LLM aims to perform best on domain-specific benchmarks; generic benchmarks are no longer critical. There are multiple ways to build such dedicated language models. The most popular approach is fine-tuning an existing LLM on domain-specific data. However, pre-training is the way to go for use cases striving to achieve state-of-the-art performance in a niche domain.
Fine-tuning existing LLMs for a particular domain can greatly shorten the process of developing a language model for that domain. In fine-tuning, the model starts from the knowledge encoded during pre-training and tweaks those parameters based on domain-specific data. Fine-tuning requires less training time and less labeled data. Because of its low cost, this has been the popular approach for domain-specific LLMs. However, fine-tuning can have severe performance limitations, especially for niche domains. Let us understand this with a simple example of BERT models built for legal language understanding (paper). Two pre-trained models are compared: BERT-base and Custom Legal-BERT. As shown in the image below, the Custom Legal-BERT model fine-tuned on legal tasks significantly outperforms the BERT-base model fine-tuned on the same tasks.
The above example clearly exhibits the power of domain-specific pre-training over fine-tuning alone in niche areas like law. Fine-tuning generic language models is helpful for more general language problems, but niche problem areas do much better with domain-specific pre-trained LLMs. The following sections explain the different pre-training approaches, along with an example of each approach and its success.
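Before moving on to pre-training, here is a minimal sketch of the fine-tuning path just described, using the Hugging Face transformers Trainer. The checkpoint name, the tiny two-sentence dataset, and the clause / non-clause labels are toy assumptions for illustration, not data from the cited paper; in practice you would plug in a real labeled legal dataset and, ideally, a domain-pre-trained checkpoint such as Custom Legal-BERT:

```python
# Fine-tuning sketch: start from a pre-trained encoder and adjust its weights
# on a small amount of labeled, domain-specific data.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # or a legal checkpoint such as Custom Legal-BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy labeled data: 1 = contractual clause, 0 = not a clause (placeholder labels).
data = Dataset.from_dict({
    "text": [
        "The lessee shall indemnify the lessor against all claims.",
        "The weather was pleasant throughout the afternoon.",
    ],
    "label": [1, 0],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-finetune", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to="none"),
    train_dataset=data,
)
trainer.train()  # updates the pre-trained weights using the labeled examples
```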
Pre-training a language model on a large dataset carefully selected or curated to align with a specific field is called domain-specific pre-training. By training on domain-specific data, models can learn domain-specific knowledge, such as the terminology, concepts, and subtleties unique to that field. It helps models learn a field's unique requirements, language, and context, producing predictions or replies that are more accurate and contextually appropriate. This enhances the model's understanding of the target field and improves the precision of its generative capabilities. There are multiple ways to use domain-specific data for pre-training LLMs. Here are a few of them:
Use only domain-specific data instead of general data for pre-training the model on self-supervised language modeling tasks. This way, the model learns domain-specific knowledge. The domain-specific LLM can then be fine-tuned for the required task to build the task-specific model. This is the simplest way to pre-train a domain-specific LLM. The figure below shows the flow for using only domain-specific data for self-supervised learning to build the domain-specific LLM.
Example: StarCoderBase
StarCoderBase is a Large Language Model for Code (Code LLM) trained on permissively licensed data from GitHub, covering 80+ programming languages, Git commits, and Jupyter notebooks. It is a 15B-parameter model trained on 1 trillion tokens. StarCoderBase outperformed much larger models, including PaLM, LaMDA, and LLaMA, while being substantially smaller, illustrating the usefulness of domain-specific LLMs. (Image from the StarCoder paper)
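As a rough sketch of this approach (not the actual StarCoderBase training setup), the snippet below pre-trains a small, randomly initialized decoder-only model on a domain-only corpus with a causal language modeling objective. The file name code_corpus.txt and the scaled-down GPT-2 configuration are assumptions for illustration:

```python
# Approach 1 sketch: pre-train from scratch using only domain-specific text.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments)

# Domain-only corpus, e.g. permissively licensed source code (assumed local file).
corpus = load_dataset("text", data_files={"train": "code_corpus.txt"})["train"]

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # reusing a tokenizer for brevity
tokenizer.pad_token = tokenizer.eos_token

# Randomly initialized (not pre-trained) small decoder-only model.
model = GPT2LMHeadModel(GPT2Config(n_layer=6, n_head=8, n_embd=256,
                                   vocab_size=tokenizer.vocab_size))

corpus = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=256),
                    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=8, report_to="none"),
    train_dataset=corpus,
    # mlm=False gives the next-token (causal) language modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```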
Combine domain-specific data with general data for pre-training the model on self-supervised language modeling tasks. This way, the model learns domain-specific knowledge while utilizing general language pre-training to improve language understanding. Here is a figure showing the flow for using both domain-specific data and general corpora for self-supervised learning to build the domain-specific LLM, which can then be fine-tuned for domain-specific tasks.
Example: BloombergGPT
BloombergGPT is a finance-domain LLM trained on an extensive archive of financial data, including a 363 billion token dataset of English financial documents. This data was supplemented with a public dataset of 345 billion tokens to create a massive training corpus of over 700 billion tokens. The researchers built a 50-billion-parameter decoder-only causal language model using a subset of this training corpus. Notably, the BloombergGPT model surpassed existing open models of a similar scale by a large margin on finance-specific NLP benchmarks. The chart below shows BloombergGPT's performance comparison on finance-specific NLP tasks. Source: Bloomberg.
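Here is a small sketch of how the two corpora might be combined before running the same self-supervised training loop shown in the previous sketch. The file names and the 50/50 sampling ratio are illustrative assumptions; BloombergGPT itself used a far larger, carefully weighted mixture:

```python
# Approach 2 sketch: mix general and domain-specific corpora for pre-training.
from datasets import interleave_datasets, load_dataset

general = load_dataset("text", data_files={"train": "general_corpus.txt"})["train"]
finance = load_dataset("text", data_files={"train": "finance_corpus.txt"})["train"]

# Draw examples from each corpus with the chosen probabilities; the mixed
# stream is then tokenized and fed to the same Trainer setup as before.
mixed_corpus = interleave_datasets([general, finance],
                                   probabilities=[0.5, 0.5], seed=42)
print(mixed_corpus[0])
```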
Build or use a pre-trained generic LLM and warm-start from (i.e., initialize with) its parameters. Then run the self-supervised language modeling tasks on domain-specific data on top of the warm-started generic LLM to build the domain-specific LLM, which can then be fine-tuned for the required task to build the task-specific model. This leverages transfer learning from the generic LLM. Here is a figure showing the flow for step-by-step self-supervised learning, first on general and then on domain-specific corpora, to build the domain-specific LLM.
Example: BioBERT
BioBERT (Lee et al., 2019) is built on the BERT-base model (Devlin et al., 2019), with additional biomedical domain pre-training. The model was trained for 200K steps on PubMed and 270K steps on PMC, followed by 1M steps on the PubMed dataset. When pre-trained on biomedical corpora, BioBERT beats BERT and earlier state-of-the-art models on biomedical text-based tasks while keeping almost the same architecture across tasks. BioBERT outperforms BERT on three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement), and biomedical question answering (12.24% MRR improvement).
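A minimal sketch of this warm-start recipe is shown below: load a generic pre-trained checkpoint and continue its masked-language-modeling objective on domain text, roughly in the spirit of BioBERT. The file name pubmed_abstracts.txt and the training hyperparameters are illustrative assumptions:

```python
# Approach 3 sketch: continue self-supervised pre-training from generic weights.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)  # warm start: reuse generic weights

# Domain corpus, e.g. biomedical abstracts (assumed local file).
corpus = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})["train"]
corpus = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=256),
                    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biomed-continued-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=8, report_to="none"),
    train_dataset=corpus,
    # Masked-language-modeling objective: randomly mask 15% of the tokens.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                                  mlm_probability=0.15),
)
trainer.train()
```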
The examples above illustrate the power of pre-training a language model for a specific domain. The techniques listed can significantly improve performance on tasks in that domain, and there are several advantages beyond raw performance as well. Domain-specific LLMs ultimately result in better user experiences. Another important advantage is reduced hallucination: a big problem with large models is the possibility of hallucination, i.e., generating inaccurate information. By restricting the spectrum of use cases, domain-specific LLMs can prioritize precision in their replies and hallucinate less. Another major benefit of domain-specific LLMs is the protection of sensitive or private information, a major concern for today's businesses.
As more use cases adopt LLMs for better performance and multilingual capabilities, it is worthwhile to start approaching new problems through the lens of LLMs. Moreover, the performance data listed in the sections above suggests that moving existing solutions to LLMs is a worthwhile investment. Running experiments with the approaches mentioned in this article will improve your chances of hitting your targets using domain-specific pre-training.
Q. What is a large language model (LLM)?
A. A large language model (LLM) is characterized by its size. LLMs are built with artificial neural networks based on the transformer architecture, which can contain tens of millions to billions of weights, and are pre-trained on vast amounts of text data, mostly scraped from the Internet, using self-supervised and semi-supervised learning. AI accelerators make training at this scale feasible.
Q. What are domain-specific LLMs?
A. Domain-specific LLMs are customized for a field of interest, such as law, medicine, or finance. They outperform generic LLMs on field-specific benchmarks but may perform poorly on general language tasks.
Q. How do you build a domain-specific LLM?
A. One can build a domain-specific LLM from scratch by pre-training it on self-supervised tasks using domain-specific corpora, either alone, combined with generic corpora, or sequentially after generic pre-training. Alternatively, you can enhance the performance of a generalist LLM in a specific domain by fine-tuning it on domain-specific data. Despite its convenience, fine-tuning can have severe performance limitations, and domain-specific pre-training significantly outperforms fine-tuning for most niche use cases.
Q. What are the benefits of domain-specific LLMs?
A. Key benefits of domain-specific LLMs are better performance on the target domain, fewer hallucinations, and better privacy protection.
Q. What are some examples of domain-specific LLMs?
A. Some example applications of domain-specific LLMs covered in this article are BioBERT for biomedicine, Custom Legal-BERT for law, BloombergGPT for finance, and StarCoder for code completion.