The Most Popular Large Language Models (LLMs)

Illustration created by Midjourney with the prompt “a massive data center filled with GPUs and TPUs specifically designed for AI. Simply huge. --ar 16:9 --v 5.2”

Large Language Models (LLMs) are transforming the way we interact with technology. These models, developed or hosted by leading tech companies such as OpenAI, Replicate, Cohere, Hugging Face, and Anthropic (to name a few), are pushing the boundaries of what’s possible in natural language processing. Here’s a short overview of some of the most popular LLMs, exploring their unique capabilities and best use cases.

OpenAI

GPT-4

Best Used For: chatbots, AI system conversations, and virtual assistants

OpenAI has introduced GPT-4, a large multimodal model that accepts both image and text inputs and produces text outputs. OpenAI spent six months aligning GPT-4 using lessons from adversarial testing and ChatGPT, improving performance in factuality, steerability, and adherence to safety measures. GPT-4 is available for text input via ChatGPT and the API, while image input capability is being developed in collaboration with a partner. OpenAI has open-sourced OpenAI Evals for model evaluation and welcomes feedback for further improvements. The model’s behavior can be customized through system messages, although there are still limitations and risks associated with its outputs, including hallucinations and biases. GPT-4 lacks knowledge of events occurring after September 2021 and may make reasoning errors. While improvements have been made, OpenAI acknowledges the need for public input on defining AI’s behavior. The model’s base training focuses on predicting the next word in a document, and reinforcement learning with human feedback is used to fine-tune its behavior.

GPT-3.5 Turbo

Best Used For: creating human-like text and answering questions in a conversational manner

GPT-3.5 Turbo allows developers to describe functions to the model, which can then output a JSON object containing arguments to call those functions. This provides a new way for GPT models to connect with external tools and APIs and to generate structured data output. It enables interactive and detailed conversations with the model, making it valuable for various applications. OpenAI now offers continual updates for some models while also providing static versions for at least three months after an update. Users can contribute evaluations to improve the model for different use cases through the OpenAI Evals repository. Deprecation dates for temporary snapshot models will be announced once updated versions are available.
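To make the function-calling flow concrete, here is a minimal sketch of the application side. It makes no API call: the model reply is simulated, and the names get_weather, FUNCTIONS, REGISTRY, and dispatch are hypothetical, chosen only for illustration. The function description follows the JSON Schema style, and the arguments are assumed to arrive as a JSON-encoded string.

```python
import json

# Hypothetical local function that the model may ask the application to call.
def get_weather(city: str, unit: str = "celsius") -> dict:
    # Stubbed result; a real implementation would query a weather service.
    return {"city": city, "unit": unit, "forecast": "sunny"}

# Description of the function, written as a JSON Schema object.
# A list like this would be sent to the model alongside the chat messages.
FUNCTIONS = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]

# Registry mapping function names to Python callables.
REGISTRY = {"get_weather": get_weather}

def dispatch(function_call: dict) -> dict:
    """Run the function named in a model reply, using its JSON-encoded arguments."""
    name = function_call["name"]
    if name not in REGISTRY:
        raise ValueError(f"model requested unknown function: {name}")
    # The arguments are model-generated text, so parse them defensively.
    try:
        args = json.loads(function_call["arguments"])
    except json.JSONDecodeError as exc:
        raise ValueError("model returned invalid JSON arguments") from exc
    return REGISTRY[name](**args)

# Simulated model output (an assumption for illustration only).
simulated_reply = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
result = dispatch(simulated_reply)
print(result)
```

The registry lookup and the JSON parsing guard matter because the model's output is generated text: it can name a function that does not exist or produce arguments that are not valid JSON, so the application should validate both before executing anything.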

Codex

Best Used For: programming, writing, and data analysis

OpenAI Codex is a descendant of GPT-3 and launched in partnership with GitHub for GitHub Copilot. Proficient in more than a dozen programming languages, Codex can interpret simple commands in natural language and execute them on the user’s behalf, making it possible to build a natural language interface to existing applications. It is most capable in Python and extends its capabilities to other languages like JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, and Shell. With an expanded memory of 14KB for Python code, Codex can take into account over three times as much contextual information as GPT-3 during task execution.

Text-ada-001

Best Used For: parsing text, simple classification, address correction, keywords

Ada, also known as text-ada-001, is the fastest and cheapest model in the GPT-3 series and is designed primarily for simple tasks. In the spectrum of capabilities, Ada falls on the simpler end, while Babbage (text-babbage-001) and Curie (text-curie-001) offer intermediate capabilities. There are also specialized Ada variants, such as text-similarity-ada-001, text-search-ada-doc-001, and code-search-ada-text-001, each with its own strengths and limitations in terms of quality, speed, and availability. Text-ada-001 itself is suitable for tasks like text parsing, address correction, and basic classification.

Text-babbage-001

Best Used For: moderate classification, semantic search

Text-babbage-001 is a GPT-3 language model capable of handling straightforward tasks. It is known for its fast response time and lower cost than other models.

Text-curie-001

Best Used For: language translation, complex classification, text sentiment, summarization

Text-curie-001 is a highly capable language model in the GPT-3 series, which OpenAI first made available in June 2020. It is faster and more cost-effective than Davinci. With a reported 6.7 billion parameters, Text-curie-001 is designed to be efficient while offering powerful capabilities. It handles tasks such as translation, complex classification, sentiment analysis, and summarization, making it a versatile option for processing text-based data.

Text-davinci-003

Best Used For: complex intent, cause and effect, summarization for audience

Text-davinci-003 is the most capable language model in the GPT-3 series. Its predecessor, text-davinci-002, has similar capabilities but was trained using supervised fine-tuning instead of reinforcement learning. Text-davinci-003 surpasses the curie, babbage, and ada models in quality, output length, and consistency in following instructions. It also offers extra features, such as the ability to insert text.

Anthropic

Anthropic, an AI safety and research company, announced the arrival of a new ChatGPT rival, Claude. Claude is an AI assistant built with conversational and text-processing capabilities.

Claude-V1

Best Used For: research, creative writing, collaborative writing, Q&A, coding, summarization

Claude-V1 can handle sophisticated dialogue, allowing it to hold extended conversations with users. It excels at following detailed instructions and communicating clearly and concisely. Its strengths extend to complex reasoning, allowing it to tackle intricate problems. Its creative content generation abilities also enable it to produce innovative and captivating outputs. Moreover, Claude-V1 is useful in coding tasks, assisting developers with its broad knowledge and problem-solving skills. Overall, Claude-V1 is versatile and is best used for detailed content creation.

Claude-instant-v1

Best Used For: lighter, less expensive, and much faster option with the same capabilities as Claude-V1

Claude-Instant-V1 is a high-performance model prioritizing speed and cost efficiency. With its streamlined architecture, it excels in handling casual dialogues, providing quick and effective responses. Text analysis and summarization tasks are also well-suited to Claude-Instant-V1, as it efficiently extracts key information and generates concise summaries. Document question-answering becomes a breeze with this model, offering accurate answers and insights. It is also adept at moderation, enabling efficient content filtering and maintaining a safe environment. Lastly, it proves valuable in classification tasks, ensuring accurate data categorization. Claude-Instant-V1’s strengths lie in its ability to deliver excellent performance at a low cost, reducing latency and providing a lightweight dialogue experience.

Stanford University-LLaMA 7B model

Alpaca-7b

Best Used For: conversing, writing and analyzing code, generating text and content, querying specific information

Stanford Alpaca and LLaMA models offer a solution to ChatGPT’s limitations by enabling the creation of custom AI chatbots that run locally and remain available offline. These models allow users to build AI chatbots tailored to their specific needs, free from the constraints of external servers or connectivity issues. Alpaca demonstrates similar behaviors to text-davinci-003 while being smaller, cost-effective, and easy to reproduce. The model’s training recipe starts from a strong pre-trained language model (LLaMA 7B) and fine-tunes it on high-quality instruction data generated with OpenAI’s text-davinci-003. The release is intended for academic research only and emphasizes the need for further evaluation and reporting of concerning behaviors.

Replicate

Stablelm-tuned-alpha-7b

Best Used For: conversational tasks such as chatbots, question-answering systems, and dialogue generation

StableLM-Tuned-Alpha-7B, developed by Stability AI, is a decoder-only language model with 7 billion parameters. It is built on top of the StableLM-Base-Alpha models and is further fine-tuned on chat and instruction-following datasets. The base models are trained on a new dataset that builds on The Pile but is three times larger, containing approximately 1.5 trillion tokens. A forthcoming technical report will detail the model specifications and training settings. The tuned models are a proof of concept, fine-tuned using datasets from Stanford Alpaca, Nomic-AI, RyokoAI, Databricks labs, and Anthropic, and are released as StableLM-Tuned-Alpha.

Hugging Face

BLOOM

Best Used For: text generation, exploring characteristics of language generated by a language model

BLOOM is a BigScience Large Open-science Open-access Multilingual Language Model funded by the French government. It is an autoregressive Large Language Model trained on vast amounts of text data, capable of generating coherent text in 46 natural languages and 13 programming languages. BLOOM can also perform text tasks not explicitly trained for by framing them as text generation tasks. BLOOM aims to enable public research on large language models and can be utilized by researchers, students, educators, engineers/developers, and non-commercial entities. However, potential risks and limitations of the model include biased viewpoints, stereotypes, personal information, errors, irrelevant outputs, and the potential for users to attribute human-like traits to the model.

BLOOMZ

Best Used For: performing tasks expressed in natural language

BLOOMZ and mT0 are models developed by BigScience that can follow human instructions in multiple languages zero-shot, that is, without task-specific training. These models are fine-tuned on a cross-lingual task mixture called xP3, enabling them to generalize across different tasks and languages. However, performance may vary depending on the prompt provided. To get accurate results, clearly indicate where the input ends so the model does not try to continue it. Providing sufficient context is also advised, such as specifying the desired language for the answer. These measures help improve the accuracy and effectiveness of the models in generating appropriate responses to user instructions.
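The two prompting tips above can be sketched as a small helper. This is only an illustration under stated assumptions: build_prompt is a hypothetical function, not part of any BLOOMZ tooling, and using terminal punctuation as the end-of-input marker is one possible convention.

```python
def build_prompt(instruction: str, text: str = "", answer_language: str = "") -> str:
    """Assemble a prompt that ends the input clearly and names the answer language."""
    prompt = instruction.strip()
    body = text.strip()
    if body:
        # Close the input with terminal punctuation so the model
        # is less likely to treat it as text to be continued.
        if not body.endswith((".", "?", "!")):
            body += "."
        prompt = f"{prompt} {body}"
    if answer_language:
        # State the desired output language explicitly.
        prompt += f" Answer in {answer_language}."
    return prompt

print(build_prompt("Translate to English:", "Je t'aime"))
print(build_prompt("Explain in one sentence what backpropagation is.",
                   answer_language="Telugu"))
```

The first call appends the missing period to the French input; the second leaves the instruction unchanged and adds the language hint at the end.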

FLAN-t5-xxl

Best Used For: research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

FLAN-T5 is an instruction-fine-tuned version of T5. With the same number of parameters, it has been fine-tuned on over 1,000 additional tasks that also cover more languages. FLAN-T5 is primarily designed for research purposes, such as exploring zero-shot and few-shot learning in natural language processing (NLP), reasoning, and question-answering. It also aims to contribute to fairness and safety research and to help researchers understand the limitations of current large language models. However, it is important to note that language models, including FLAN-T5, have the potential to be used in harmful ways. Therefore, safety and fairness concerns specific to an application should be assessed before FLAN-T5 is used directly.

FLAN-ul2

Best Used For: intended to offer a reliable and scalable method for pre-training models that can excel on a variety of tasks and datasets

FLAN-UL2 is an encoder-decoder model based on the T5 architecture. FLAN-UL2 is part of the FLAN project, which focuses on training large language models using diverse instruction-based datasets. It is a fine-tuned version of the UL2 model with notable improvements. The FLAN-UL2 checkpoint has an increased receptive field of 2048, making it more suitable for few-shot in-context learning. Unlike the original UL2 model, FLAN-UL2 does not require mode switch tokens, simplifying inference and fine-tuning processes. The FLAN datasets and methods have been open-sourced, promoting effective instruction tuning.

GPT-NeoX-20b

Best Used For: powerful language model that can be used for a wide range of natural language processing tasks

GPT-NeoX-20B is a dense autoregressive language model with 20 billion parameters, trained on the Pile dataset. At the time of its release, it was the largest dense autoregressive model with publicly accessible weights. It has been evaluated on a range of language-understanding, mathematics, and knowledge-based tasks. The model uses a different tokenizer than GPT-J-6B and GPT-Neo, allocating extra tokens for whitespace characters, which makes it better suited to tasks like code generation. In the Hugging Face Transformers library, the GPTNeoXConfig configuration class is used to define the model architecture and control its outputs.

Open-Assistant SFT-4 12B

Best Used For: as an assistant, such that it responds to user queries with helpful answers

The fourth iteration of the Open-Assistant project introduces an English supervised-fine-tuning (SFT) model. It is derived from a Pythia 12B model, fine-tuned on human demonstrations of assistant conversations collected through the Open-Assistant human feedback web app before March 25, 2023. OASST1, an open-source chatbot alternative to ChatGPT, is now available for free on Paperspace, running on Graphcore IPUs.

SantaCoder

Best Used For: multilingual large language model for code generation

The SantaCoder models are 1.1B parameter models trained on subsets of Python, Java, and JavaScript code from The Stack. The main model employs Multi Query Attention with a context window of 2048 tokens and was trained using filtering criteria based on near-deduplication and comment-to-code ratio. Additional models were trained with different filter parameters, architecture, and objectives. These models are designed to generate code snippets from context, and work best when prompted with phrases that resemble source code comments or function signatures. However, the generated code is not guaranteed to work correctly and could contain inefficiencies, bugs, or vulnerabilities.

Cohere

Command-medium-nightly

Best Used For: developers who require fast response, like those building chatbots

Cohere offers the Command generative model, available in two sizes: command-light and command. Command is the higher-performing model, while command-light is ideal for developers seeking faster response times, such as chatbot builders. Cohere also provides nightly versions of the command model, ensuring continuous improvement in performance. These nightly versions, designated as command-nightly-*, are regularly updated, allowing users to anticipate weekly enhancements and optimizations. Note: Command-xlarge-nightly was discontinued on 1/1/23 and has been incorporated into command-nightly.

McAuley lab (UC San Diego), Sun Yat-Sen University, and Microsoft Research

Baize

Best Used For: large corpus of multi-turn chat data

If you want to start building your own chat models, a large multi-turn chat corpus is hard to come by. Baize aims to make generating such a corpus easier by using ChatGPT, and then uses the corpus to fine-tune a LLaMA model. This helps you build better chatbots with reduced training time.

Google

PaLM 2 (Bison-001)

Best Used For: commonsense reasoning, formal logic, mathematics, and advanced coding in 20+ languages

Google announced four models based on PaLM 2 in different sizes (Gecko, Otter, Bison, and Unicorn). Of these, Bison is currently available. It’s a multilingual model and can understand idioms, riddles, and nuanced texts from different languages, something that other LLMs struggle with. Another advantage of PaLM 2 is that it’s very quick to respond and offers three responses at once. You can test the PaLM 2 (Bison-001) model on Google’s Vertex AI platform. Consumers can use Google Bard, which runs on PaLM 2.

Gopher – DeepMind

Best Used For: reading comprehension, fact-checking, understanding toxic language, and logical and common sense tasks

DeepMind researchers and developers have created a series of language models, including Gopher. The 280 billion parameter model, Gopher, demonstrates strong language understanding and generation capabilities, surpassing earlier models and moving closer to human expert performance on benchmarks like Massive Multitask Language Understanding (MMLU). Gopher performs well in various domains, including math, science, technology, humanities, and medicine, and is particularly effective in dialogue-based interactions, providing simplified explanations on complex subjects.

Technology Innovation Institute (TII)

Falcon

Best Used For: commercial uses, chatting

A key advantage of Falcon is that it has been open-sourced under the Apache 2.0 license, which means you can use the model for commercial purposes. There are no royalties or restrictions. So far, the TII has released two Falcon models, with 40B and 7B parameters. The developer notes that these are raw models, so if you want to use them for chatting, you should go for the Falcon-40B-Instruct model, which is fine-tuned for most use cases.

LMSYS

Vicuna 33B

Best Used For: chatbots, research, hobby use

Vicuna 33B is derived from LLaMA, like many other open-source models. It has been fine-tuned with supervised instruction tuning, and the training data was collected from sharegpt.com, a portal where users share their ChatGPT conversations. It’s an auto-regressive large language model with 33 billion parameters.

AI21 Labs

Jurassic-2

Best Used For: reading and writing related use cases

The Jurassic-2 family includes base language models in three different sizes: Large, Grande, and Jumbo, alongside instruction-tuned language models for Jumbo and Grande. Jurassic-2 is already making waves on Stanford’s Holistic Evaluation of Language Models (HELM), a leading benchmark for language models. J2 now offers zero-shot instruction capabilities, allowing the model to be steered with natural language without the use of examples.

Baichuan Intelligent Technology

Baichuan-13B

Best Used For: pre-training model is a “base” suitable for developers, while the aligned model with dialogue functions is more suitable for general users

Baichuan is a Chinese language model based on the Transformer architecture, similar to GPT. It is trained on Chinese and English data. The model is open source, optimized for commercial applications, and is said to be comparable to OpenAI’s GPT-3.5. The foundational model, Baichuan-13B, is now available for free to approved academics and developers and offers variations that can run on consumer-grade hardware.

Merlyn Mind

Merlyn

Best Used For: educators and education

Merlyn is designed for education. It operates on a school’s own content and runs on a domain-specific AI platform built on large language models. The platform aims to provide a trustworthy generative AI experience for teachers and students by focusing on curriculum-aligned, hallucination-resistant, and age-appropriate content. Merlyn brings safe and effective AI into classrooms to automate teacher workflows, generate curriculum-aligned content, and engage students.

This List is Incomplete

There are dozens, if not hundreds, more LLMs in common use and new models emerge daily. If you’d like your model included in this list, simply send an email to info@shellypalmer.com with your details in the format above.

Author’s note: This is not a sponsored post. I am the author of this article and it expresses my own opinions. I am not, nor is my company, receiving compensation for it.