Forget 32K of GPT4: LongNet Has a Billion Token Context
Tired of the 2,048-, 4,096-, or 32,768-token context limits of GPT-3 and GPT-4? Microsoft may have an answer for you (a positive take)
Dr. Mandar Karhade, MD. PhD. · Follow
Published in Towards AI · 12 min read · Jul 26
On 19th July, Microsoft published a paper that is being considered a major step forward in the development of architectures for large language models with practically unlimited context length. Microsoft proposed and developed a transformer model that can theoretically scale to a billion tokens. This removes a major obstacle to the practical use of large language models, also known as the “context length restriction”.
In this article, we will walk through —
Large Language Models (LLMs)
Remember me! context matters
How to Achieve a Larger Context
Current Networks For LLMs
Difficulty of Scaling
Microsoft’s solution LongNet
Distributed Trainer
Results and Verification of Scaling to 1B Tokens
Closing Thoughts
So, let's get started.
Large Language Models (LLMs)
Large Language Models are deep learning models with millions, if not billions, of parameters. These models are generally trained on a general-text corpus from the internet. Such a corpus may contain up to a trillion tokens (i.e., if the text exists on the internet, it was likely used to train the large language model).
Imagine a big matrix in which each word is connected to every other word in a given string. To put it simply, this is self-attention. We care about words, and the placement of words, that have a stronger relationship, because they predict the next word better than words with a weaker relationship. Whether a relationship goes 3 layers deep or 30 layers deep won't matter in the grand scheme. What matters is that self-attention determines (at least in part) the next token. A token is a word or part of a word, and it is often used as a synonym for a functional unit of a sentence.
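To make the "big matrix" idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a toy sentence. The random projection matrices stand in for learned weights and are purely illustrative; the point is that the score matrix has one entry per pair of tokens, which is why the cost of plain self-attention grows quadratically with the context length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 4, 8  # a toy sentence of 4 tokens, each embedded in 8 dims

X = rng.normal(size=(n_tokens, d))

# In a real transformer, Q, K, V come from learned projections;
# here random matrices stand in, purely for illustration.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# The n x n score matrix connects every token to every other token --
# this is the "big matrix" described above, and the reason plain
# self-attention costs O(n^2) in the sequence length n.
scores = softmax(Q @ K.T / np.sqrt(d))
out = scores @ V

print(scores.shape)  # (4, 4): one attention weight per token pair
```

At 32,768 tokens that matrix already holds over a billion pairwise scores per attention head, which is the scaling wall the rest of this article is about.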
Large language models, therefore, create a map of the language: given the input text, an output is generated based on that map. The map is extremely complex. This map is generally represented by the…