The Beginner's Guide to Small Language Models

June 19, 2024
shelly

Small language models explained: Use cases, applications, advantages, technologies, implementation and development

With this procedure in hand, Eldan and Li were finally ready to compare different models and find out which were the star students. First, Eldan used GPT-4 to generate a list of 1,500 nouns, verbs and adjectives that a 4-year-old might know — short enough that he could easily check it himself. LLMs, by contrast, are more versatile and can be adapted, improved and engineered for better downstream tasks such as programming.

To tokenize our text sequences, we trained a single SentencePiece model (SPM)55 for all languages. To ensure low-resource languages are well-represented in the vocabulary, we downsampled high-resource and upsampled low-resource languages with a sampling temperature of five (ref. 10). Notably, vocabulary size is an important hyperparameter in multilingual translation models involving low-resource languages56,57,58. Such a large vocabulary ensures adequate representation across the wide spectrum of languages we support. The quality of NMT outputs is typically evaluated by automatic metrics such as BLEU44 or spBLEU41.
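The temperature-based sampling described above can be sketched directly. In this illustration (the corpus sizes are hypothetical), each language's sampling probability is proportional to its share of the data raised to the power 1/T; with T = 5 the distribution flattens, upsampling low-resource languages:

```python
# Temperature sampling for multilingual tokenizer training data (a sketch).
# Each language's probability is proportional to (n_i / N) ** (1 / T);
# T = 5 upsamples low-resource and downsamples high-resource languages.
def sampling_probs(counts, temperature=5.0):
    total = sum(counts.values())
    weights = {lang: (n / total) ** (1.0 / temperature)
               for lang, n in counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical corpus sizes: English is 1,000x larger than Wolof.
counts = {"eng": 1_000_000, "fra": 200_000, "wol": 1_000}
probs = sampling_probs(counts)
```

With these toy counts, English's raw share of about 83% falls to roughly half after tempering, while Wolof's rises from under 0.1% to over 10%.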

This allows people to communicate with machines as they do with each other, to a limited extent. A review of modelling languages is essential to be able to assign which languages are appropriate for different modelling settings. By settings we mean the stakeholders, the domain and the knowledge involved. Object modeling languages are modeling languages based on a standardized set of symbols and ways of arranging them to model (part of) an object-oriented software design or system design. A framework-specific modeling language (FSML) is a kind of domain-specific modeling language designed for an object-oriented application framework. FSMLs define framework-provided abstractions as FSML concepts and decompose the abstractions into features.

Fox-1 was trained from scratch with a 3-stage data curriculum on 3 trillion tokens of text and code data at an 8K sequence length. In various benchmarks, such as MMLU, ARC Challenge, TruthfulQA, and GSM8k, Fox-1 performs better than or on par with other SLMs in its class, including Gemma-2B, Qwen1.5-1.8B, and OpenELM-1.1B. GPT-4 Turbo, OpenAI’s previous “most advanced” model, was trained on a combination of images and text and could analyze both to accomplish tasks like extracting text from images or describing their content.

There is also a concern about highly agglutinative languages, for which BLEU fails to assign any credit to morphological variants. chrF++ overcomes these weaknesses by basing the overlap calculation on a character-level n-gram F-score (n ranging from 1 to 6), complemented with word unigrams and bigrams. In this work, we primarily evaluated with chrF++, using the settings from sacreBLEU. However, when comparing with other published work, we used BLEU and spBLEU where appropriate. MoE transformer models differ from dense transformer models in that some of the feed-forward network layers are replaced with MoE layers in both the encoder and the decoder. An MoE layer consists of E experts (each a feed-forward network) and a gating network that decides how to route input tokens to experts.
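The MoE routing just described can be sketched with a softmax gating network and top-2 expert selection. The scalar "experts" and gate logits below are toy stand-ins, not the actual implementation:

```python
import math

# Toy Mixture-of-Experts layer: a gating network scores E experts per token
# and routes the token to the top-k experts, weighting their outputs by the
# renormalized gate probabilities.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_logits, k=2):
    probs = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return sum(probs[i] / norm * experts[i](token) for i in top)

# Four toy "experts", each a scalar feed-forward function.
experts = [lambda x, a=a: a * x for a in (1.0, 2.0, 3.0, 4.0)]
out = moe_forward(1.0, experts, gate_logits=[0.1, 2.0, 0.2, 1.5], k=2)
```

Here the gate routes the token to experts 2 and 4 (the two highest logits) and blends their outputs; in a real MoE transformer the experts are full feed-forward networks and routing happens per token position.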

Eldan immediately set out to create a library of synthetic children’s stories generated by large language models. But he soon discovered that even state-of-the-art models aren’t naturally very creative. The difference in results between the two architectures suggests that the impact of instruction-tuning might be architecture-dependent.

During the design phase, however, logical design notation is used to depict the relationship between software entities. In addition, discipline-specific modeling language best practices do not preclude practitioners from combining the various notations in a single diagram. Since an SLM trains on relatively smaller domain-specific data sets, the risk of bias is naturally lower than with LLMs.

This approach offers cost efficiency, enhanced privacy, and personalized user experiences, all within a unified ecosystem that facilitates seamless collaboration between cloud and edge environments. For the first time, our latest survey explored the value created by gen AI use by business function. The function in which the largest share of respondents report seeing cost decreases is human resources.

The robot will need to combine your instructions with its visual observations to determine the steps it should take to complete this task. LinkedIn is launching new AI tools to help you look for jobs, write cover letters and job applications, personalize learning, and a new search experience. In the initial release of the Toxicity-200 lists, the average number of items in a toxicity detection list was 271 entries, whereas the median number of entries was 143. The latter may be a better measure of central tendency than the mean, given that languages with a rich inflectional morphology constitute extreme outliers (for example, the Czech list had 2,534 entries and the Polish list 2,004). First, we used a combination of multiple binary classifiers in which the final decision was obtained by selecting the language with the highest score after applying a threshold.

  • The Trustworthy Language Model takes the same basic idea—that disagreements between models can be used to measure the trustworthiness of the overall system—and applies it to chatbots.
  • The two researchers showed that language models thousands of times smaller than today’s state-of-the-art systems rapidly learned to tell consistent and grammatical stories when trained in this way.
  • In addition, he spoke about how long it will take before such models are ready for use and what barriers — particularly data quality — organizations need to overcome to get them into production.

GPT-4o is more multilingual as well, OpenAI claims, with enhanced performance in around 50 languages. And in OpenAI’s API and Microsoft’s Azure OpenAI Service, GPT-4o is twice as fast as GPT-4 Turbo, half the price, and has higher rate limits, the company says. While today GPT-4o can look at a picture of a menu in a different language and translate it, in the future the model could allow ChatGPT to, for instance, “watch” a live sports game and explain the rules to you. For example, users can ask the GPT-4o-powered ChatGPT a question and interrupt ChatGPT while it’s answering.

Data Preparation

The impact of instruction fine-tuning is also evident, but its efficacy is dependent on the architecture. Notably, the choice of scoring function doesn’t seem to make a marked difference in performance. In the dynamic landscape of NLP, small language models serve as catalysts for innovation, democratizing access to advanced language processing tools and fostering inclusivity within the field. Their potential to empower diverse communities and streamline development processes holds promise for driving impactful advancements across numerous sectors, from education to healthcare and beyond.

Of course, although it can be downloaded and used by everyone, that is very different from being “open source” or some variety of that term, as we discussed last week at Disrupt. Though the license is highly permissive, the model itself was developed privately, using private money, and the datasets and weights are likewise private. The chrF++ score38 overcomes the limitation of the BLEU score, which requires that a sentence can be broken up into word tokens. However, some languages, such as Chinese or Thai, do not use spaces to separate words, and word segmentation tools may not be readily available.

The first is the probability of the label given the prompt; it is the most straightforward method, giving the probability of the continuation. The second and third methods are the ratio between this probability and the probability of the label given a “task-specific premise” (called DCPMI) and an “unconditional/non-task-specific premise”, respectively. These methods reweight each label option according to its a priori likelihood in or out of the context of the task.
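The three scoring functions can be illustrated with toy log-probabilities. The helper name and the numbers below are hypothetical, chosen only to show how DCPMI can flip the decision that raw probability would make:

```python
# Three ways to score a candidate label from model log-probabilities:
#   raw:    log P(label | prompt)
#   dcpmi:  log P(label | prompt) - log P(label | task-specific premise)
#   uncond: log P(label | prompt) - log P(label | unconditional premise)
def score_labels(logp_prompt, logp_task_premise, logp_uncond):
    scores = {}
    for label in logp_prompt:
        scores[label] = {
            "raw": logp_prompt[label],
            "dcpmi": logp_prompt[label] - logp_task_premise[label],
            "uncond": logp_prompt[label] - logp_uncond[label],
        }
    return scores

# Toy numbers: "positive" is a priori more likely, so raw scoring favours it,
# but DCPMI corrects for that prior and prefers "negative".
logp_prompt = {"positive": -1.0, "negative": -1.2}
logp_task = {"positive": -0.5, "negative": -1.5}
logp_uncond = {"positive": -0.8, "negative": -1.4}
scores = score_labels(logp_prompt, logp_task, logp_uncond)
```

Raw probability picks "positive" here, while both reweighted scores pick "negative", which is exactly the kind of prior-correction DCPMI is designed to provide.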

A discipline-specific modeling (DspM) language is focused on deliverables affiliated with a specific software development life cycle stage. Therefore, such a language offers a distinct vocabulary, syntax, and notation for each stage, such as discovery, analysis, design, architecture, construction, etc. For example, for the analysis phase of a project, the modeler employs specific analysis notation to deliver an analysis proposition diagram.

To break this pattern, here we introduce No Language Left Behind—a single massively multilingual model that leverages transfer learning across languages. We developed a conditional computational model based on the Sparsely Gated Mixture of Experts architecture2,3,4,5,6,7, which we trained on data obtained with new mining techniques tailored for low-resource languages. Furthermore, we devised multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.

We noticed differences in classification performance under different scoring functions, but none produced a clear winner, so it was hard to judge how well the models performed. We therefore took the mean of these scores for a more robust evaluation of model performance. For both encoder-decoder and decoder-only models, the corresponding values are above the standard 0.05 threshold by a large margin. As with architecture, we quantified the impact of instruction-tuning on performance while controlling for the number of parameters. Figure 1 presents the relationship between the number of parameters and the performance in terms of Acc/F1 scores across various datasets.

Another important use case of engineering language models is to eliminate bias against unwanted language outcomes such as hate speech and discrimination. To learn the complex relationships between words and sequential phrases, modern language models such as ChatGPT and BERT rely on so-called transformer-based deep learning architectures. The general idea of transformers is to convert text into numerical representations weighed in terms of importance when making sequence predictions. We did not mention external factors such as pre-training time, data quality, or potential biases in the datasets. These external factors might impact the results or the generalizability of the conclusions.

Uses and examples of language modeling

Both the graphical analysis and the ANCOVA show an effect of instruction-tuning on the encoder-decoder architecture. For the causal architecture, there is no significant impact of instruction-tuning on Acc/F1 scores: the p-value for the decoder-only architecture is 0.6693, much greater than 0.05. For the seq2seq architecture, there is a significant impact of instruction-tuning on Acc/F1 scores: the p-value for the encoder-decoder architecture is 0.0086, less than 0.05. In our analysis, we shift our attention to which features among model size, instruction-tuning, and scoring function have an impact on performance.

Empirically, we find zero-shot performance to be negatively affected when conditioning the encoder on the target language. When the source is conditioned on only the source language, the encoder generalizes better to pairs of source and target languages not encountered during training1. Faced with these difficulties, some researchers have opted to train smaller models on smaller data sets and then study their behavior. “It’s like sequencing the Drosophila genome versus sequencing the human genome,” said Ellie Pavlick, a language model researcher at Brown University. It is worth noting that the behavior of our downstream models is subject to biases inherited from the dataset it was trained on, as no alignment nor specific filtering was done. We envision the same research progress in reducing anti-social behaviors in LLMs can also be applied to improve smaller language models.

Then you study a lot of text to understand how those different words relate to each other in context. Recently, he discussed the rising interest in small, domain-specific language models, including how they differ from LLMs, what types of organizations are developing them and how they can be applied. In addition, he spoke about how long it will take before such models are ready for use and what barriers — particularly data quality — organizations need to overcome to get them into production. While many are using large language models to write content and improve search results, enterprises are developing domain-specific models trained on their own data to address specific business problems.

An information model in Gellish can express facts or make statements, queries and answers. Apple’s new AI models, collectively named OpenELM for “Open-source Efficient Language Models,” are currently available on Hugging Face under an Apple Sample Code License. Since there are some restrictions in the license, it may not fit the commonly accepted definition of “open source,” but the source code for OpenELM is available. Training an SLM in-house with this knowledge, fine-tuned for internal use, can serve as an intelligent agent for domain-specific use cases in highly regulated and specialized industries. Recent iterations, including but not limited to ChatGPT, have been trained and engineered on programming scripts. Developers use ChatGPT to write complete program functions, assuming they can adequately specify the requirements and limitations via the text prompt.

With our proficiency in integrating SLMs into diverse enterprise systems, we prioritize a seamless integration process to minimize disruptions. This guarantees uninterrupted business operations while leveraging the benefits of AI. This integration paves the way for advanced personal assistants capable of understanding complex tasks and providing personalized interactions based on user habits and preferences.

The push to produce a robotic intelligence that can fully leverage the wide breadth of movements opened up by bipedal humanoid design has been a key topic for researchers. The Mistral 7B model is available today for download by various means, including a 13.4-gigabyte torrent (with a few hundred seeders already). The company has also started a GitHub repository and Discord channel for collaboration and troubleshooting. To obtain aggregated calibrated XSTS scores on the language direction level, we explored several different calibration methodologies. None of the calibration methods we investigated showed a marked difference in correlation with automated scores, and all calibration methodologies we explored provided superior correlation compared with uncalibrated XSTS scores.

Respondents most commonly report meaningful revenue increases (of more than 5 percent) in supply chain and inventory management (Exhibit 6). For analytical AI, respondents most often report seeing cost benefits in service operations—in line with what we found last year—as well as meaningful revenue increases from AI use in marketing and sales. In a range of tests across different large language models, Cleanlab shows that its trustworthiness scores correlate well with the accuracy of those models’ responses. In other words, scores close to 1 line up with correct responses, and scores close to 0 line up with incorrect ones.

An ANCOVA is made to quantify the impact of instruction-tuning on each architecture (encoder-decoder/decoder-only) while statistically controlling for the effect of the model size feature. Instruction-tuning refers to the strategy for fine-tuning a language model on instruction datasets (Longpre et al., 2023). Going beyond mere model construction, we harness the capabilities of SLM to develop potent AI solutions that transform your business. Our suite of solutions encompasses chatbots, virtual assistants, sentiment analysis tools, OCR systems, and more – all tailored to your specific needs. We aim to unlock the full potential of SLMs to automate tasks, enhance communication, and uncover profound insights.

Gen AI high performers are also much more likely to say their organizations follow a set of risk-related best practices (Exhibit 11). At the model’s release, some speculated that GPT-4 came close to artificial general intelligence (AGI), which means it is as smart or smarter than a human. GPT-4 powers Microsoft Bing search, is available in ChatGPT Plus and will eventually be integrated into Microsoft Office products. However, the researchers were surprised to see that combining language-based representations with vision-based methods improves an agent’s ability to navigate.

First, changing the threshold for one language did not affect the performance of the other (which is not true in the first setting). Second, this approach generalizes better to out-of-domain data, which is our primary use case (Wikipedia → web data). Finally, a single classifier has the added benefit of being computationally simpler, thus streamlining the language identification process. In many ways, the composition of the NLLB-200 effort speaks to the centrality of interdisciplinarity in shaping our vision. Machine translation and AI advancements lie at the intersection of technological, cultural and societal development, and thus require scholars with diverse training and standpoints to fully comprehend every angle49,50.

But despite their considerable capabilities, LLMs can nevertheless present some significant disadvantages. Their sheer size often means that they require hefty computational resources and energy to run, which can preclude them from being used by smaller organizations that might not have the deep pockets to bankroll such operations. With larger models there is also the risk of algorithmic bias being introduced via datasets that are not sufficiently diverse, leading to faulty or inaccurate outputs — or the dreaded “hallucination” as it’s called in the industry. Despite these advantages, it’s essential to remember that the effectiveness of an SLM largely depends on its training and fine-tuning process, as well as the specific task it’s designed to handle.

What are the typical hardware requirements for deploying and running Small Language Models? One of the key benefits of Small Language Models is their reduced hardware requirements compared to Large Language Models. Typically, SLMs can be run on standard laptop or desktop computers, often requiring only a few gigabytes of RAM and basic GPU acceleration. This makes them much more accessible for deployment in resource-constrained environments, edge devices, or personal computing setups, where the computational and memory demands of large models would be prohibitive.

Compared with 2023, respondents are much more likely to be using gen AI at work and even more likely to be using gen AI both at work and in their personal lives (Exhibit 4). The survey finds upticks in gen AI use across all regions, with the largest increases in Asia–Pacific and Greater China. Respondents at the highest seniority levels, meanwhile, show larger jumps in the use of gen Al tools for work and outside of work compared with their midlevel-management peers.

The Trustworthy Language Model draws on multiple techniques to calculate its scores. First, each query submitted to the tool is sent to one or more large language models. The tech will work with any model, says Northcutt, including closed-source models like OpenAI’s GPT series, the models behind ChatGPT, and open-source models like DBRX, developed by San Francisco-based AI firm Databricks. If the responses from each of these models are the same or similar, it will contribute to a higher score. Eliza, running a certain script, could parody the interaction between a patient and therapist by applying weights to certain keywords and responding to the user accordingly.
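A heavily simplified sketch of that agreement idea follows. This is not Cleanlab's actual algorithm, and the responses are stand-ins: several models answer the same prompt, and trust is scored by the pairwise similarity of their answers.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Agreement-based trust score (illustrative only): the more similar the
# answers from independent models, the higher the trust in the response.
def trust_score(responses):
    if len(responses) < 2:
        return 1.0
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(responses, 2)]
    return sum(sims) / len(sims)

agree = trust_score(["Paris is the capital of France."] * 3)
disagree = trust_score(["Paris.", "Lyon is the capital.", "Marseille."])
```

When all models answer identically the score is 1.0; conflicting answers pull it toward 0, mirroring the correlation with correctness described above.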

Only a few companies can muster the requisite resources, let alone train and compare different models. Microsoft, a frontrunner in this evolving landscape, is actively pursuing advancements in small language models. Their researchers have developed a groundbreaking method to train these models, exemplified by the Phi-2, the latest iteration in the Small Language Model (SLM) series.

SLMs are well-suited for the limited hardware of smartphones, supporting on-device processing that quickens response times, enhances privacy and security, and aligns with the trend of edge computing in mobile technology. These requirements can render LLMs impractical for certain applications, especially those with limited processing power or in environments where energy efficiency is a priority. But LLMs sometimes suffer “hallucinations” — including inaccurate and misleading responses — and they’re subject to security risks.

Mistral 7B is a further refinement of other “small” large language models like Llama 2, offering similar capabilities (according to some standard benchmarks) at a considerably smaller compute cost. Foundation models like GPT-4 can do much more, but are far more expensive and difficult to run, leading them to be made available solely through APIs or remote access. The most popular language models out there may be accessed via API, but open models — as far as that term can be taken seriously — are gaining ground. Mistral, a French AI startup that raised a huge seed round in June, has just taken the wraps off its first model, which it claims outperforms others of its size — and it’s totally free to use without restrictions. Mistral Small, developed by Mistral AI, is a highly efficient large language model (LLM) optimized for high-volume, low-latency language-based tasks. Mistral Small is perfectly suited for straightforward tasks that can be performed in bulk, such as classification, customer support, or text generation.

The impressive power of large language models (LLMs) has evolved substantially during the last couple of years. In conclusion, small language models represent a compelling frontier in natural language processing (NLP), offering versatile solutions with significantly reduced computational demands. Their compact size makes them accessible to a broader audience, including researchers, developers, and enthusiasts, but also opens up new avenues for innovation and exploration in NLP applications. However, the efficacy of these models depends not only on their size but also on their ability to maintain performance metrics comparable to larger counterparts. Therefore, as we continue to delve into the capabilities of small language models, it becomes imperative to prioritize their refinement, ensuring they uphold efficiency while delivering robust performance across various tasks and domains. Previous work35 notes that translation quality generally increases with the amount of high-quality training data, which is difficult to procure when working with low-resource languages.

  • Apart from automatic metrics, we also created Cross-lingual Semantic Text Similarity (XSTS) and Evaluation of Toxicity (ETOX).
  • Knowledge distillation transfers knowledge from a pre-trained LLM to a smaller model, capturing its core capabilities without the full complexity.
  • The models were trained on the publicly available datasets RefinedWeb, a version of PILE with duplications removed, a subset of RedPajama, and a subset of Dolma v1.6, which Apple says totals around 1.8 trillion tokens of data.
  • Language model fine-tuning is a process of providing additional training to a pre-trained language model making it more domain or task specific.
  • Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and even perform zero-shot translation, delivering promising results in terms of language coverage and quality.
  • These rules are specifically mentioned in section 5.1.3 of ref. 34 and include linguistic filters to mitigate the learning of spurious correlations due to noisy training samples while modelling hundreds of languages.

This aims to identify annotators who have a systematic tendency to be more harsh or generous in their scoring and correct for this effect. The calibration set consists of the machine translation output paired with the reference translation only in English. Based on how evaluators used the XSTS scale on this calibration set, we adjusted their raw scores on the actual evaluation task to ensure consistency across evaluators. We then compared NLLB-200 with a few other state-of-the-art models, such as Deepnet42 and M2M-100 (ref. 1), to report scores for 87 languages against FLORES-101. Overall, the results show that NLLB-200 improves on state-of-the-art systems by a notable margin despite supporting 200 languages, or twice as many languages (and more than 30,000 additional directions) compared with any previous work. We also show in additional experiments that NLLB-200 is a general-purpose NMT model, transferable to other domains by fine-tuning on small quantities of high-quality bitexts (see Supplementary Information E.3).
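One plausible version of that adjustment is a per-evaluator mean shift (a sketch only; the study compared several calibration methodologies): each evaluator's raw scores are shifted by the difference between their mean on the shared calibration set and the pooled mean.

```python
# Per-evaluator calibration sketch: shift each evaluator's scores so their
# mean on a shared calibration set matches the pooled mean. Harsh scorers
# are adjusted up, generous scorers down.
def calibrate(calibration_scores, raw_scores):
    pooled = [s for scores in calibration_scores.values() for s in scores]
    grand_mean = sum(pooled) / len(pooled)
    adjusted = {}
    for evaluator, scores in raw_scores.items():
        cal = calibration_scores[evaluator]
        bias = sum(cal) / len(cal) - grand_mean
        adjusted[evaluator] = [s - bias for s in scores]
    return adjusted

# Hypothetical 1-5 XSTS-style scores on the calibration set and the task.
calibration = {"harsh": [2, 2, 3], "generous": [4, 5, 4]}
raw = {"harsh": [3, 2], "generous": [5, 4]}
adjusted = calibrate(calibration, raw)
```

After calibration, the harsh and generous evaluators' scores for comparable outputs land on the same scale, which is the consistency property described above.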

Orca aims to improve on advancements made by other open source models by imitating the reasoning procedures achieved by LLMs. It achieves the same performance as GPT-4 with significantly fewer parameters and is on par with GPT-3.5 for many tasks. Unlike the others, GPT-4’s parameter count has not been released to the public, though there are rumors that the model has more than 170 trillion. OpenAI describes GPT-4 as a multimodal model, meaning it can process and generate both language and images as opposed to being limited to only language.

Another advantage of formalizing is the ability to discover errors at an early stage. It is not always the case that the language best fitted for the technical actors is the same as for the social actors. An example of a graphical modeling language and a corresponding textual modeling language is EXPRESS. Depending on the number of concurrent users accessing an LLM, the model inference tends to slow down.

The voyage of language models highlights a fundamental message in AI: small can be impressive, assuming there is constant advancement and modernization. In addition, there is an understanding that efficiency, versatility, environmental friendliness, and optimized training approaches unlock the potential of SLMs. GPT-3 is OpenAI’s large language model with more than 175 billion parameters, released in 2020. In September 2022, Microsoft announced it had exclusive use of GPT-3’s underlying model. GPT-3’s training data includes Common Crawl, WebText2, Books1, Books2 and Wikipedia. Such models perform natural language processing and influence the architecture of future models.

The idea is to develop a mathematical model with parameters that can represent true predictions with the highest probability. Indeed, ChatGPT is the first consumer-facing use case of LLMs, which previously were limited to OpenAI’s GPT and Google’s BERT technology. To sum up, no matter which model architecture we look at, no scoring function consistently outperforms the others. AI for enterprises strategically deploys AI technologies and methodologies within large-scale organizations to enhance various operational aspects. Our comprehensive support and maintenance services are designed to uphold the peak performance of your SLM. This includes ongoing monitoring, adaptation to evolving data and use cases, prompt bug fixes, and regular software updates.

A model only truly comes to life during training, when it repeatedly compares its own output to the text in its training data set and adjusts its parameters to increase the resemblance. An untrained network with random parameters is trivially easy to assemble from a few lines of code, but it will just produce gibberish. Larger models often undergo further fine-tuning that teaches them to answer questions and follow instructions, but the bulk of the training is mastering word prediction. The models were trained on the publicly available datasets RefinedWeb, a version of PILE with duplications removed, a subset of RedPajama, and a subset of Dolma v1.6, which Apple says totals around 1.8 trillion tokens of data. Tokens are fragmented representations of data used by AI language models for processing.

Choosing the most suitable language model is a critical step that requires considering various factors such as computational power, speed, and customization options. Models like DistilBERT, GPT-2, BERT, or LSTM-based models are recommended for a local CPU setup. A wide array of pre-trained language models are available, each with unique characteristics. Selecting a model that aligns well with your specific task requirements and hardware capabilities is important.

Microsoft brings out a small language model that can look at pictures – The Verge, 21 May 2024 [source]

A good language model should also be able to process long-term dependencies, handling words that might derive their meaning from other words that occur in far-away, disparate parts of the text. A language model should be able to understand when a word is referencing another word from a long distance, as opposed to always relying on proximal words within a certain fixed history. Linked data and ontology engineering require ‘host languages’ to represent entities and the relations between them, constraints between the properties of entities and relations, and metadata attributes. Success at word prediction requires a language model to master many different skills.

The Trustworthy Language Model takes the same basic idea—that disagreements between models can be used to measure the trustworthiness of the overall system—and applies it to chatbots. Some of the most well-known language models today are based on the transformer model, including the generative pre-trained transformer series of LLMs and bidirectional encoder representations from transformers (BERT). First, because text requires fewer computational resources to synthesize than complex image data, their method can be used to rapidly generate synthetic training data. In one test, they generated 10,000 synthetic trajectories based on 10 real-world, visual trajectories. Note that we prefixed the source sequence with the source language, as opposed to the target language, as done in previous work10,60. We did so because we prioritized optimizing the zero-shot performance of our model on any pair of 200 languages at a minor cost to supervised performance.

Gen AI is a new technology, and organizations are still early in the journey of pursuing its opportunities and scaling it across functions. So it’s little surprise that only a small subset of respondents (46 out of 876) report that a meaningful share of their organizations’ EBIT can be attributed to their deployment of gen AI. These, after all, are the early movers, who already attribute more than 10 percent of their organizations’ EBIT to their use of gen AI. The AI-related practices at these organizations can offer guidance to those looking to create value from gen AI adoption at their own organizations.

We applied threshold optimization so that when the confidence of a classifier is low, the corresponding language is not considered for the final decision. A sentence was filtered out if none of the classifiers surpassed its threshold. Second, we built a multiclass classifier using softmax over all possible languages. To evaluate participant appropriateness, we try to identify how well the language expresses the knowledge held by the stakeholders.
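The per-language threshold rule described above can be sketched as follows (the classifier scores and thresholds are hypothetical):

```python
# Language-ID decision rule (sketch): each binary classifier must clear its
# own threshold; among those that do, pick the language with the top score.
# If none clears its threshold, the sentence is filtered out (None).
def identify(scores, thresholds):
    passing = {lang: s for lang, s in scores.items() if s >= thresholds[lang]}
    if not passing:
        return None
    return max(passing, key=passing.get)

# A lower bar for the low-resource language illustrates why per-language
# thresholds are useful: tuning one does not affect the others.
thresholds = {"eng": 0.5, "fra": 0.5, "wol": 0.3}
confident = identify({"eng": 0.9, "fra": 0.4, "wol": 0.2}, thresholds)
filtered = identify({"eng": 0.2, "fra": 0.1, "wol": 0.1}, thresholds)
```

Here `confident` resolves to "eng" while `filtered` is None, i.e. the sentence is dropped, matching the filtering behaviour described above.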

The smaller model size of the SLM means that users can run the model on their local machines and still generate data within acceptable time. They may lack holistic contextual information from all multiple knowledge domains but are likely to excel in their chosen domain. Language models are AI computational models that can generate natural human language.

This paper had a large impact on the telecommunications industry and laid the groundwork for information theory and language modeling. The Markov model is still used today, and n-grams are tied closely to the concept. Language modeling is used in a variety of industries including information technology, finance, healthcare, transportation, legal, military and government. In addition, it’s likely that most people have interacted with a language model in some way at some point in the day, whether through Google search, an autocomplete text function or engaging with a voice assistant.
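The n-gram idea traced back to Markov can be shown with a tiny bigram model. The corpus here is a toy; real systems smooth the counts and train on vastly more data:

```python
from collections import Counter, defaultdict

# A tiny bigram (first-order Markov) language model: estimate
# P(next | current) from counts and predict the most likely next word.
def train_bigrams(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for cur, nxt in zip(words, words[1:]):
            counts[cur][nxt] += 1
    return counts

def predict_next(counts, word):
    following = counts[word.lower()]
    return following.most_common(1)[0][0] if following else None

corpus = ["the cat sat on the mat", "the cat ate the fish", "a dog sat down"]
model = train_bigrams(corpus)
```

Autocomplete works on the same principle: after "the", the model predicts "cat" simply because that pair occurred most often in the training text.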

Pruning removes less useful parts of the model, and quantization reduces the precision of its weights, both of which further reduce its size and resource requirements. In conclusion, compact language models stand not just as a testament to human ingenuity in AI development but also as a beacon guiding us toward a more efficient, specialized, and sustainable future in artificial intelligence. I think companies will get more long-term productivity gains from the small language models and developing domain-specific applications of language models. We did a survey at Eckerson Group that showed about 30% of companies say they’re building their own language models. We’ll have to see what success they have and how much data quality interferes with that success over the next year or two. You can have companies building their own language models to engage more proactively with customers.
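Quantization, mentioned above, can be illustrated with a minimal symmetric int8 scheme (a sketch; production quantizers are considerably more sophisticated):

```python
# Symmetric int8 quantization sketch: map float weights to integers in
# [-127, 127] with a single scale factor, then dequantize to approximate
# the originals. Storage drops from 32 bits to 8 bits per weight.
def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.07, -0.91]
q, scale = quantize(weights)
restored = dequantize(q, scale)
```

The round trip loses at most half a quantization step per weight, which is why quantized SLMs keep most of their accuracy while shrinking to a quarter of their original size.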

The lightweight nature of SLMs opens up a wider range of real-world applications and democratizes access to advanced language AI capabilities. What small language models might lack in size, they more than make up for in potential. In a world where AI has not always been equally available to everyone, they represent its democratization and a future where AI is accessible and tailored to diverse needs. This smaller size and efficiency is achieved via a few different techniques, including knowledge distillation, pruning, and quantization.
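The distillation step can be sketched as matching the student's output distribution to the teacher's temperature-softened distribution. The logits below are toy values, and real setups combine this KL term with the ordinary task loss:

```python
import math

# Knowledge-distillation loss sketch: KL divergence between the teacher's
# and the student's softmax distributions, both softened by a temperature T.
def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]
perfect = distill_loss(teacher, [3.0, 1.0, 0.2])  # student matches teacher
poor = distill_loss(teacher, [0.2, 1.0, 3.0])     # student disagrees
```

Minimizing this loss pushes the smaller student toward the teacher's full output distribution, which carries more signal than the hard labels alone.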