Key Issue: What are foundational models?


What are foundational models?


  • Foundational models are large-scale machine learning models trained on broad, diverse datasets that can be adapted for a wide range of downstream tasks and applications.

  • They serve as a base layer for numerous AI tasks, rather than being built from scratch for specific tasks.

  • Examples include large language models like GPT-3 and BERT, as well as models for computer vision, robotics, and other domains.


What is driving adoption?


  • Ability to adapt these models to many different tasks with minimal fine-tuning, increasing efficiency and reducing costs.

  • Improved performance across a variety of AI applications compared to traditional task-specific models.

  • Emerging capabilities like in-context learning that allow more flexible use.

  • Rapid progress in capabilities, driving interest from businesses and researchers.


Key Trends:


  1. Increasing scale - Models are growing larger, with hundreds of billions of parameters.

  2. Multimodal models - Expanding beyond just text to incorporate images, audio, video, etc.

  3. Improved reasoning and task-generalization abilities.

  4. Focus on making models more reliable, safe, and aligned with human values.

  5. Growing adoption across industries like healthcare, finance, education.

  6. Emergence of specialized hardware and infrastructure to support these large models.


Leading Players:


  • Word Embeddings (Word2Vec, GloVe, FastText)

    Date: 2013 (Word2Vec), 2014 (GloVe), 2016 (FastText)

    Founders: Tomas Mikolov et al. (Word2Vec), Jeffrey Pennington et al. (GloVe), Piotr Bojanowski et al. (FastText)

    Location: Google (Word2Vec), Stanford University (GloVe), Facebook AI Research (FastText)

    Unique capability: Represent words as dense vectors, capturing semantic relationships.

    Word embeddings revolutionized natural language processing by representing words as dense vectors in a continuous space, capturing semantic relationships between words. Developed between 2013 and 2016, these models enabled more efficient and meaningful representations of text data. Word2Vec, created by Tomas Mikolov and colleagues at Google, introduced the skip-gram and continuous bag-of-words architectures. GloVe, developed by Jeffrey Pennington and team at Stanford University, combined global matrix factorization with local context window methods. FastText, from Piotr Bojanowski and researchers at Facebook AI Research, extended Word2Vec by incorporating subword information, allowing it to generate vectors for out-of-vocabulary words.

    Mathematically, Word2Vec uses neural networks to maximize the probability of observing context words given a target word (skip-gram) or vice versa (CBOW). GloVe utilizes a weighted least squares model that trains on global word-word co-occurrence counts. FastText represents words as bags of character n-grams, allowing it to compute representations for unknown words by summing the vector representations of its n-grams.
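
    To make the subword idea concrete, the sketch below (Python, with an invented toy n-gram table rather than trained FastText vectors) shows how a word can be decomposed into character n-grams and embedded as the average of their vectors, so even an out-of-vocabulary word receives a representation.

      import numpy as np

      def char_ngrams(word, n_min=3, n_max=6):
          # FastText-style decomposition with boundary markers around the word
          w = f"<{word}>"
          return [w[i:i + n] for n in range(n_min, n_max + 1)
                  for i in range(len(w) - n + 1)]

      rng = np.random.default_rng(0)
      ngram_vectors = {}   # hypothetical n-gram table; real FastText learns these vectors

      def word_vector(word, dim=100):
          grams = char_ngrams(word)
          vecs = [ngram_vectors.setdefault(g, rng.normal(size=dim)) for g in grams]
          return np.mean(vecs, axis=0)   # out-of-vocabulary words still get a vector

      print(word_vector("unseenword").shape)   # (100,)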

  • ELMo (Embeddings from Language Models)

    Date: February 2018

    Founders: Matthew Peters et al.

    Location: Allen Institute for AI, Seattle, USA

    Unique capability: First to use deep, contextualized word representations.

    ELMo, introduced in February 2018 by Matthew Peters and colleagues at the Allen Institute for AI in Seattle, marked a significant advancement in contextual word representations. Unlike previous static embeddings, ELMo generates dynamic word vectors that change based on the context in which a word appears. This breakthrough allowed for more nuanced understanding of word meanings in different contexts. ELMo uses a deep, bi-directional LSTM model trained on a large text corpus to create these context-sensitive embeddings. Its ability to capture complex characteristics of word use, including syntax and semantics, led to substantial improvements across a wide range of NLP tasks.

    The mathematical foundation of ELMo lies in its use of a bidirectional language model. It computes the probability of a sequence of words in both forward and backward directions, then combines these probabilities to form the final representation. The model uses a weighted sum of the internal states of a multi-layer bidirectional LSTM, allowing it to capture different levels of syntactic and semantic information.
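
    As an illustration of that weighted combination, here is a minimal NumPy sketch in which random activations stand in for a trained biLM: the task-specific ELMo vector is a softmax-weighted sum of the per-layer hidden states, scaled by a single scalar.

      import numpy as np

      def elmo_embedding(layer_states, s_logits, gamma=1.0):
          # layer_states: one (seq_len, dim) array per biLM layer, including the token layer
          w = np.exp(s_logits) / np.exp(s_logits).sum()   # softmax-normalized layer weights
          return gamma * sum(wj * hj for wj, hj in zip(w, layer_states))

      layers = [np.random.randn(5, 1024) for _ in range(3)]   # token layer + 2 LSTM layers
      print(elmo_embedding(layers, s_logits=np.zeros(3)).shape)   # (5, 1024)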

  • ULM-FiT (Universal Language Model Fine-tuning)

    Date: May 2018

    Founders: Jeremy Howard and Sebastian Ruder

    Location: fast.ai and National University of Ireland, Galway

    Unique capability: Introduced effective transfer learning for various NLP tasks.

    ULM-FiT, developed by Jeremy Howard and Sebastian Ruder in May 2018, introduced a groundbreaking approach to transfer learning in natural language processing. This technique, created at fast.ai and the National University of Ireland, Galway, allows for effective fine-tuning of pre-trained language models for various NLP tasks. ULM-FiT employs a three-stage process: general-domain pre-training, target task fine-tuning, and target task classifier fine-tuning. This method significantly reduced the amount of task-specific data and computation time required to achieve state-of-the-art results on text classification tasks. ULM-FiT's innovation lies in its ability to transfer knowledge from a general language model to specific NLP tasks, much like transfer learning had been successfully used in computer vision.

    The mathematical approach in ULM-FiT involves techniques like discriminative fine-tuning, where different layers of the model are fine-tuned at different learning rates, and slanted triangular learning rates, which allow the model to quickly converge to a suitable region of the parameter space and then refine its parameters. It also uses gradual unfreezing, where layers are progressively unfrozen during fine-tuning, starting from the last layer, to avoid catastrophic forgetting of the pre-trained knowledge.
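
    The sketch below (Python, with illustrative hyperparameter values) shows the two schedules described above: the slanted triangular learning rate from the ULM-FiT paper and per-layer learning rates decayed by a constant factor for discriminative fine-tuning.

      def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
          # short linear warm-up to lr_max, then a long linear decay
          cut = int(total_steps * cut_frac)
          if t < cut:
              p = t / cut
          else:
              p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
          return lr_max * (1 + p * (ratio - 1)) / ratio

      def discriminative_lrs(base_lr, n_layers, decay=2.6):
          # earlier layers are fine-tuned with smaller learning rates
          return [base_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]

      print([round(slanted_triangular_lr(t, 100), 4) for t in (0, 5, 10, 50, 99)])
      print([round(lr, 5) for lr in discriminative_lrs(0.01, 4)])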

  • BERT (Bidirectional Encoder Representations from Transformers)

    Date: October 2018

    Founders: Jacob Devlin et al.

    Location: Google AI Language, USA

    Unique capability: Bidirectional context understanding using masked language modeling.

    BERT, introduced by Jacob Devlin and colleagues at Google AI Language in October 2018, represented a major leap forward in natural language understanding. It uses bidirectional training of the Transformer encoder to build deep, context-sensitive language representations from unlabeled text, and can be fine-tuned with just one additional output layer for a wide range of downstream tasks; on release it significantly outperformed previous methods on eleven natural language processing benchmarks, including question answering, named entity recognition, and sentiment analysis.

    Mathematically, BERT's key innovation is its Masked Language Model (MLM) pre-training objective. In this approach, the model randomly masks some percentage of input tokens and predicts only those masked tokens, allowing it to build a deep bidirectional representation. BERT also introduces a Next Sentence Prediction (NSP) task during pre-training, where it learns to predict if two sentences follow each other in the original text. These techniques allow BERT to capture complex, bidirectional contexts and relationships between sentences.
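
    A minimal sketch of that masking scheme (Python, with a toy whitespace tokenizer; real BERT uses WordPiece tokens) illustrates the 80/10/10 mask/random/keep split applied to roughly 15% of input tokens.

      import random

      def mask_tokens(tokens, vocab, mask_prob=0.15):
          inputs, labels = [], []
          for tok in tokens:
              if random.random() < mask_prob:
                  labels.append(tok)                       # the model predicts this token
                  r = random.random()
                  if r < 0.8:
                      inputs.append("[MASK]")              # 80%: replace with [MASK]
                  elif r < 0.9:
                      inputs.append(random.choice(vocab))  # 10%: replace with a random token
                  else:
                      inputs.append(tok)                   # 10%: keep the original token
              else:
                  inputs.append(tok)
                  labels.append(None)                      # not part of the MLM loss
          return inputs, labels

      print(mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran", "hat"]))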

  • XLNet

    Date: June 2019

    Founders: Zhilin Yang et al.

    Location: Carnegie Mellon University and Google AI, USA

    Unique capability: Permutation-based training to capture bidirectional context without masks.

    XLNet, developed in June 2019 by Zhilin Yang and team at Carnegie Mellon University and Google AI, builds upon the successes of BERT while addressing some of its limitations. XLNet is an autoregressive language model that leverages the best of both autoregressive language modeling and autoencoding while avoiding their limitations. It achieves this by using a permutation language modeling objective, which allows the model to learn bidirectional contexts without the drawbacks of BERT's masked language modeling approach.

    The key mathematical innovation in XLNet is its permutation language modeling objective. Instead of predicting masked tokens like BERT, XLNet predicts tokens auto-regressively in all possible orders. This allows it to capture bidirectional context while maintaining the benefits of autoregressive models. XLNet also introduces two-stream self-attention for target-aware representations, enabling it to consider the predicted position without seeing the content.
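
    The sketch below (Python, toy example; it omits the two-stream attention machinery) shows the core idea of permutation language modeling: sample a factorization order and predict each position given only the positions that precede it in that order.

      import random

      def permutation_targets(tokens, seed=0):
          random.seed(seed)
          order = list(range(len(tokens)))
          random.shuffle(order)                        # sampled factorization order z
          targets = []
          for t, pos in enumerate(order):
              visible = sorted(order[:t])              # positions available as context
              targets.append((pos, visible))
          return targets

      for pos, ctx in permutation_targets("new york is a city".split()):
          print(f"predict position {pos} from positions {ctx}")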

  • GPT (Generative Pre-trained Transformer)

    Date: June 2018 (GPT-1), February 2019 (GPT-2), June 2020 (GPT-3)

    Founders: OpenAI team

    Location: OpenAI, San Francisco, USA

    Unique capability: Increasingly large-scale autoregressive language modeling.

    The GPT series, developed by OpenAI in San Francisco, scaled autoregressive Transformer language models from GPT-1 (June 2018) through GPT-2 (February 2019, 1.5 billion parameters) to GPT-3 (June 2020, 175 billion parameters). Each generation was pre-trained to predict the next token on progressively larger text corpora, and GPT-3 showed that sufficient scale produces strong few-shot, in-context learning: the model can perform new tasks from a handful of examples supplied in the prompt, without any gradient updates. The latest iteration, GPT-4, released in March 2023, demonstrates significant improvements in reasoning, factual accuracy, and the ability to follow complex instructions, and is multimodal, capable of processing both text and image inputs.

    Mathematically, the GPT models are decoder-only Transformers trained with a standard autoregressive objective: maximize the sum of log-probabilities of each token given all preceding tokens. While the full details of GPT-4's architecture are not public, it likely builds upon the scaling laws and architectural choices of the previous GPT models. Its foundations likely include advanced attention mechanisms, possibly sparse attention or other techniques to efficiently handle long-range dependencies, and the multimodal aspect suggests the integration of vision transformers or similar architectures to process image data alongside text.
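
    As a minimal illustration of the autoregressive objective shared by the whole series (Python, with a placeholder uniform conditional distribution rather than a real Transformer), the training loss is the negative of the summed log-probabilities of each token given its prefix.

      import math

      def sequence_log_likelihood(tokens, cond_prob):
          # sum over positions t of log p(x_t | x_<t)
          return sum(math.log(cond_prob(tokens[:t], tokens[t]))
                     for t in range(len(tokens)))

      # placeholder conditional distribution, used only to make the sketch runnable
      uniform = lambda prefix, token: 1.0 / 50_000

      print(sequence_log_likelihood(["the", "model", "predicts"], uniform))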

  • RoBERTa (Robustly Optimized BERT Approach)

    Date: July 2019

    Founders: Yinhan Liu et al.

    Location: Facebook AI, USA

    Unique capability: Improved BERT training with larger batches and more data.

    RoBERTa, introduced in July 2019 by Yinhan Liu and colleagues at Facebook AI, is a replication study and optimization of BERT's pre-training recipe rather than a new architecture. The authors showed that BERT was significantly undertrained: by training longer, on far more data (roughly 160 GB of text versus BERT's 16 GB), with larger batches and longer sequences, and by dropping the Next Sentence Prediction objective, RoBERTa matched or exceeded the strongest BERT-derived models on benchmarks such as GLUE, RACE, and SQuAD.

    Mathematically, RoBERTa keeps BERT's Masked Language Model objective but applies it with dynamic masking: instead of fixing the masked positions once during preprocessing, a new masking pattern is sampled every time a sequence is fed to the model, so the model sees many different corruptions of the same text. Combined with the removal of Next Sentence Prediction and much larger training batches, these changes demonstrate that careful tuning of the pre-training procedure, rather than architectural novelty, accounts for a large share of the gains.
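
    A small sketch of dynamic masking (Python; mask_fn below is an invented stand-in for a full MLM masking routine) shows the difference from static masking: a fresh mask pattern is drawn every time a sequence is visited rather than fixed once at preprocessing.

      import random

      def dynamic_masking(sequences, mask_fn, epochs=2):
          # re-sample the mask on every pass over the data, as in RoBERTa
          for _ in range(epochs):
              for seq in sequences:
                  yield mask_fn(seq)

      mask_fn = lambda seq: ["[MASK]" if random.random() < 0.15 else tok for tok in seq]
      for masked in dynamic_masking(["the cat sat on the mat".split()], mask_fn):
          print(masked)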

  • ALBERT (A Lite BERT)

    Date: September 2019

    Founders: Zhenzhong Lan et al.

    Location: Google Research, USA

    Unique capability: Parameter-efficient version of BERT with comparable performance.

    ALBERT, developed by Zhenzhong Lan and team at Google Research in September 2019, addresses the issue of model size and training time in large language models. It introduces parameter-reduction techniques to lower memory consumption and increase training speed. ALBERT uses factorized embedding parameterization and cross-layer parameter sharing, allowing it to scale to much larger models while keeping the parameter count low.

    Mathematically, ALBERT's key innovation is in its architecture modifications. The factorized embedding parameterization separates the size of the hidden layers from the size of vocabulary embeddings, reducing parameters without significant performance loss. The cross-layer parameter sharing drastically reduces the number of parameters, forcing the model to find more general representations. ALBERT also replaces BERT's Next Sentence Prediction with a Sentence Order Prediction task, which proves to be more challenging and beneficial for downstream tasks.
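
    The factorization is easy to quantify with a back-of-the-envelope sketch (Python, with illustrative vocabulary and layer sizes): a V x H embedding matrix is replaced by V x E plus E x H, with E much smaller than H.

      def embedding_params(vocab=30_000, hidden=4096, e=128):
          bert_style = vocab * hidden               # single V x H embedding matrix
          albert_style = vocab * e + e * hidden     # factorized: V x E followed by E x H
          return bert_style, albert_style

      bert_p, albert_p = embedding_params()
      print(f"unfactorized: {bert_p:,} params, factorized: {albert_p:,} params")
      # roughly 122.9M vs 4.4M embedding parameters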

  • T5 (Text-to-Text Transfer Transformer)

    Date: October 2019

    Founders: Colin Raffel et al.

    Location: Google Research, USA

    Unique capability: Unified framework for multiple NLP tasks as text-to-text problems.

    T5, introduced in October 2019 by Colin Raffel and colleagues at Google Research, presents a unified framework for many NLP tasks. It treats every text processing problem as a "text-to-text" problem, i.e., taking text as input and producing new text as output. This approach allows for a single model architecture and training procedure to be used for a diverse set of NLP tasks including translation, summarization, and question answering.

    The mathematical approach of T5 lies in its formulation of all NLP tasks as text-to-text problems. It uses a standard encoder-decoder transformer architecture but introduces task-specific prefixes to differentiate between tasks. The model is pre-trained using a "span-corruption" objective, where spans of input text are replaced with unique sentinel tokens, and the model must reconstruct these spans. This approach proves to be more effective than BERT-style masked language modeling for transfer learning.
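
    A minimal word-level sketch of span corruption (Python; real T5 operates on SentencePiece tokens and samples span positions and lengths) shows how dropped spans are replaced by sentinel tokens in the input and reproduced after their sentinels in the target, with a task prefix leading the input text.

      def span_corrupt(tokens, spans):
          # spans: sorted, non-overlapping (start, end) index pairs to drop from the input
          inp, tgt, cursor = [], [], 0
          for i, (s, e) in enumerate(spans):
              sentinel = f"<extra_id_{i}>"
              inp += tokens[cursor:s] + [sentinel]
              tgt += [sentinel] + tokens[s:e]
              cursor = e
          return inp + tokens[cursor:], tgt + [f"<extra_id_{len(spans)}>"]

      tokens = "summarize : the quick brown fox jumps over the lazy dog".split()
      model_input, target = span_corrupt(tokens, [(3, 5), (8, 9)])
      print(model_input)
      print(target)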

  • BART (Bidirectional and Auto-Regressive Transformers)

    Date: October 2019

    Founders: Mike Lewis et al.

    Location: Facebook AI, USA

    Unique capability: Combined bidirectional encoding with autoregressive decoding.

    BART, developed in October 2019 by Mike Lewis and colleagues at Facebook AI, combines the bidirectional encoder of BERT with the autoregressive decoder of GPT. This architecture makes BART particularly effective for text generation tasks, while also performing well on comprehension tasks. BART is pre-trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text.

    Mathematically, BART's innovation lies in its flexible pre-training scheme. It can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other pre-training schemes. The noising approaches include token masking, token deletion, text infilling, sentence permutation, and document rotation. This variety allows BART to learn a diverse set of pretraining tasks in a single model.
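
    The sketch below (Python, word-level toy versions of three of those noising functions; BART applies them to subword tokens and full documents) illustrates token deletion, sentence permutation, and text infilling.

      import random

      def token_deletion(tokens, p=0.3):
          return [t for t in tokens if random.random() > p]

      def sentence_permutation(sentences):
          shuffled = list(sentences)
          random.shuffle(shuffled)
          return shuffled

      def text_infilling(tokens, start, length):
          # replace a whole span with a single mask token; the decoder must regenerate it
          return tokens[:start] + ["[MASK]"] + tokens[start + length:]

      tokens = "the quick brown fox jumps".split()
      print(token_deletion(tokens))
      print(sentence_permutation(["A first sentence.", "A second one.", "A third."]))
      print(text_infilling(tokens, 1, 2))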

  • Switch Transformer

    Date: January 2021

    Founders: William Fedus, Barret Zoph, Noam Shazeer

    Location: Google Research, Brain Team

    Unique capability: Sparsely-activated expert models for efficient scaling.

    The Switch Transformer, introduced in January 2021 by William Fedus, Barret Zoph, and Noam Shazeer at Google Research, presents a sparsely-activated expert model. It scales up language models by drastically increasing model size while maintaining a constant computational cost. This is achieved by routing tokens to different "experts" (specialized sub-networks) depending on the input.

    The key mathematical concept in the Switch Transformer is the sparse routing mechanism. For each token, the model computes routing probabilities to different experts and selects the top-k experts (often just k=1). This allows for conditional computation, where only a small portion of the model is activated for each input, enabling extremely large models with manageable computation. The loss function includes an additional load balancing term to ensure even utilization of experts.
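
    A compact NumPy sketch of that routing (random router logits stand in for a learned router; real Switch layers also enforce expert capacity limits) shows top-1 expert selection and a load-balancing term computed from per-expert token fractions and mean router probabilities.

      import numpy as np

      def switch_route(router_logits):
          # router_logits: (n_tokens, n_experts)
          probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
          probs /= probs.sum(axis=-1, keepdims=True)              # softmax over experts
          choice = probs.argmax(axis=-1)                          # top-1 expert per token
          n_tokens, n_experts = probs.shape
          frac_tokens = np.bincount(choice, minlength=n_experts) / n_tokens
          frac_probs = probs.mean(axis=0)
          aux_loss = n_experts * float(frac_tokens @ frac_probs)  # load-balancing term
          return choice, aux_loss

      choice, aux = switch_route(np.random.randn(16, 4))
      print(choice, round(aux, 3))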

  • mT5 (Multilingual T5)

    Date: October 2020

    Founders: Linting Xue et al.

    Location: Google Research

    Unique capability: Multilingual version of T5, supporting 101 languages.

    mT5, created by Linting Xue and team at Google Research in October 2020, extends the T5 model to support 101 languages. It aims to create a single pre-trained model capable of performing well on a variety of cross-lingual and multilingual tasks without the need for translation or parallel data.

    The mathematical approach of mT5 involves training on a massively multilingual corpus, with a vocabulary shared across all languages. It uses a sentencepiece tokenizer to handle the diverse scripts and morphologies of different languages. The model employs language-agnostic pre-training, where all languages are mixed together without any special language identifiers or tokens. This forces the model to learn language-universal representations that can be effectively fine-tuned for cross-lingual tasks.
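
    One concrete piece of that recipe is how the training mixture is weighted across languages; the sketch below (Python, toy corpus sizes, illustrative exponent) follows the kind of exponential smoothing described for mT5, where each language is sampled in proportion to its corpus size raised to a power below one, boosting low-resource languages.

      def language_mixture(corpus_sizes, alpha=0.3):
          # sample languages proportionally to |corpus|^alpha rather than raw size
          weights = {lang: size ** alpha for lang, size in corpus_sizes.items()}
          total = sum(weights.values())
          return {lang: round(w / total, 3) for lang, w in weights.items()}

      print(language_mixture({"en": 1_000_000, "hi": 50_000, "sw": 10_000}))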

  • Instruction Tuning (e.g., T0, FLAN models)

    Date: October 2021 (T0), October 2022 (FLAN-T5)

    Founders: Victor Sanh et al. (T0), Jason Wei et al. (FLAN)

    Location: Hugging Face (T0), Google Research (FLAN)

    Unique capability: Improved zero-shot task performance through instruction-based fine-tuning.

    T0, introduced by Victor Sanh and colleagues in October 2021, and the FLAN models, introduced by Jason Wei and colleagues at Google Research in 2021 and extended with FLAN-T5 in October 2022, represent a significant advancement in instruction tuning. These models are fine-tuned on a diverse set of tasks described via natural-language instructions, improving their zero-shot performance across a wide range of unseen tasks.

    The key innovation in FLAN lies in its training methodology rather than its architecture. It uses a large set of tasks, each reformulated as a text-to-text problem with natural language instructions. This approach allows the model to generalize to new tasks it hasn't explicitly seen during training. Mathematically, this can be viewed as optimizing the model's parameters to perform well across a distribution of tasks, each defined by its instruction, rather than for any single specific task.
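
    A minimal sketch of that reformulation (Python, with invented templates; FLAN and T0 each use many human-written templates per task) turns a labeled example into an instruction-plus-answer text pair, so one model can be fine-tuned across a mixture of tasks.

      TEMPLATES = {
          "sentiment": ("Is the sentiment of this review positive or negative?\n{text}",
                        "{label}"),
          "nli": ("Premise: {premise}\nHypothesis: {hypothesis}\n"
                  "Does the premise entail the hypothesis? yes or no?", "{label}"),
      }

      def to_instruction_example(task, example):
          prompt_tpl, target_tpl = TEMPLATES[task]
          return prompt_tpl.format(**example), target_tpl.format(**example)

      prompt, target = to_instruction_example(
          "sentiment", {"text": "A wonderful film.", "label": "positive"})
      print(prompt)
      print(target)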

  • ChatGPT

    Date: November 2022

    Founders: OpenAI team

    Location: OpenAI, San Francisco, USA

    Unique capability: Conversational AI with strong dialogue and instruction-following abilities.

    Launched in November 2022 by OpenAI, ChatGPT represents a significant leap in conversational AI. Built upon the GPT (Generative Pre-trained Transformer) architecture, ChatGPT is fine-tuned with Reinforcement Learning from Human Feedback (RLHF). This model demonstrates remarkable abilities in understanding context, generating human-like text, and following complex instructions across a wide range of domains.

    Mathematically, ChatGPT's innovation lies in its training process. A reward model is first trained on human rankings of model outputs; the language model is then fine-tuned with Proximal Policy Optimization (PPO), a reinforcement learning algorithm that iteratively updates the model to maximize the expected reward. The objective balances maximizing the reward against staying close to the initial pre-trained model, preventing the policy from over-optimizing the reward function.
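
    A minimal sketch of that balance (Python, toy scalar values; the real objective is optimized over a policy network with PPO) shows the per-sample quantity being maximized: the reward-model score minus a KL-style penalty for drifting away from the pre-trained reference model.

      def rlhf_objective(reward, logprob_policy, logprob_reference, beta=0.02):
          # penalize divergence from the reference (pre-trained) model
          kl_estimate = logprob_policy - logprob_reference
          return reward - beta * kl_estimate

      # hypothetical numbers: a well-rated response whose policy log-prob drifted upward
      print(rlhf_objective(reward=1.3, logprob_policy=-42.0, logprob_reference=-45.0))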

  • LLaMA (Large Language Model Meta AI)

    Date: February 2023

    Founders: Hugo Touvron et al.

    Location: Meta AI, USA

    Unique capability: Efficient, open-source large language model for research purposes.

    Introduced in February 2023 by Hugo Touvron and colleagues at Meta AI, LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. LLaMA is designed to be more efficient and accessible for research purposes, achieving strong performance on various benchmarks while being smaller and more computationally efficient than many counterparts.

    The mathematical approach in LLaMA focuses on efficient scaling laws and architectural choices. It uses a similar architecture to GPT-3 but with some modifications like the use of RMSNorm for layer normalization, SwiGLU activation functions, and rotary positional embeddings. LLaMA's training process emphasizes the importance of high-quality data, using a carefully curated dataset that allows for strong performance with fewer parameters.
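
    As one concrete example of those architectural choices, the sketch below (NumPy, with a unit weight vector standing in for the learned scale) implements RMSNorm, which rescales activations by their root mean square with no mean subtraction and no bias.

      import numpy as np

      def rms_norm(x, weight, eps=1e-6):
          # normalize by the root mean square over the last dimension, then rescale
          rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
          return x / rms * weight

      x = np.random.randn(2, 8)
      print(rms_norm(x, weight=np.ones(8)).shape)   # (2, 8)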

