Forget 32K of GPT4: LongNet Has a Billion Token Context

Tired of the 2048-, 4096-, or 32768-token context limits of GPT-3 and GPT-4? Microsoft may have an answer for you (a positive take)

Dr. Mandar Karhade, MD. PhD.

Published in Towards AI · 12 min read · Jul 26

On 19th July, Microsoft published a paper that is considered a major step forward in the development of architectures for large language models with a practically unlimited context length. Microsoft proposed and developed a transformer model that can scale, theoretically, to a billion tokens. This removes a major obstacle to the practical use of large language models, also known as the “context length restriction”.

In this article, we will walk through —

Large Language Models (LLMs)

Remember Me! Context Matters

How to Achieve a Larger Context

Current Networks For LLMs

Difficulty of Scaling

Microsoft’s Solution: LongNet

Distributed Trainer

Results and Verification of Scaling to 1B Tokens

Closing Thoughts

So, let's get started.

Large Language Models (LLMs)

Large Language Models are deep learning models with millions, if not billions, of parameters. These models are generally trained on a “general text” corpus from the internet. Such a corpus may contain up to a trillion tokens (i.e., if the text exists on the internet, it was likely used to train the large language model).
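Since the article measures training data in tokens, here is a quick, hedged illustration of what a token count looks like in practice. It uses the open-source tiktoken library (not mentioned in the article), which implements the byte-pair-encoding tokenizers used by GPT-style models:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")           # a GPT-4-era byte-pair-encoding tokenizer
text = "Context length restriction is the major obstacle."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])             # some words may split into word pieces

A rough rule of thumb is that one token is about three-quarters of an English word, so a trillion-token corpus corresponds to hundreds of billions of words.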

Imagine a big matrix in which each word in a given string is connected to every other word. Put simply, this is self-attention. We care about the words, and the placement of words, that have a stronger relationship, because they predict the next word better than words with a weaker relationship. Whether a relationship goes 3 layers deep or 30 layers deep doesn’t matter in the grand scheme; what is important is that self-attention determines (at least in part) the next token. A token is a word or a part of a word, and it is often used as a synonym for a functional unit of a sentence.
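To make that “big matrix” concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It uses random toy embeddings and is purely illustrative; it is not the article’s or the LongNet paper’s code:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every token attends to every token in the string."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project each token into query/key/value space
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # n x n matrix of pairwise relationship strengths
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                      # per-token mix of values, plus the attention map

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8                             # a 5-token toy "sentence" with 8-dim embeddings
X = rng.normal(size=(n_tokens, d_model))             # stand-in for learned token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(attn.shape)                                    # (5, 5): one attention weight for every word pair

Notice that the attention map has one entry for every pair of tokens, so its cost grows quadratically with the sequence length. That is exactly what makes long contexts expensive, as we will see in the “Difficulty of Scaling” section below.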

Large language models, therefore, create a map of the language: given an input text, an output is generated based on that map. The map is extremely complex. This map is generally represented by the…
