A New Approach Trains Large Language Models in Half the Time



A Stanford team has developed Sophia, a new way to optimize the pretraining of large language models that’s twice as fast as current approaches.


ChatGPT and other applications that rely on large language models (LLMs) are gaining widespread use and drawing abundant media attention. But a handful of large, well-funded tech companies dominate the LLM space because pretraining these models is extremely expensive, with cost estimates starting at $10 million and potentially reaching tens or hundreds of times that.

“Large language models are not very accessible to smaller organizations or academic groups,” says Hong Liu, a graduate student in computer science at Stanford University.

To change that, Liu and his colleagues set out to improve on current LLM optimization methods. The result: an approach called Sophia that cuts the pretraining time in half.

Optimizing Optimization

To better optimize LLM pretraining, Liu and his colleagues, including Stanford postdoctoral fellow Zhiyuan Li, Stanford research engineer David Hall, Computer Science Assistant Professor Tengyu Ma, and Associate Professor Percy Liang, used two tricks. The first, known as curvature estimation, isn’t new, but the Stanford team found a way to make it more efficient.

To understand their approach, consider a factory assembly line. To function efficiently, the factory manager needs to optimize the number of steps it takes to turn raw materials into a final product and needs to understand and appropriately staff the workload at each step along the line.

The same is true for pretraining an LLM. These models have millions or even billions of parameters, which Liu likens to factory workers striving toward the same goal. One property of these parameters is their curvature, which Liu thinks of as the maximum achievable speed at which they can progress toward the final goal of a pretrained LLM. In the factory metaphor, curvature is akin to a factory worker’s workload.

If an optimization program can estimate that curvature (workload), it can make LLM pretraining more efficient. The problem is this: Estimating curvature with existing methods is remarkably difficult and expensive. “In fact, it’s more expensive than doing the actual work without making curvature predictions,” Liu says. That’s partially why the current state-of-the-art approaches to optimizing LLM pretraining (Adam and its variants) forgo the curvature estimation step.
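
To see the contrast concretely, below is a minimal, illustrative sketch of an Adam-style update on a toy quadratic loss. The toy loss, dimensions, and hyperparameters are assumptions chosen for illustration, not the configuration used in the study; the point to notice is that the step is built entirely from running averages of the gradient and its square, with no curvature estimation anywhere.

```python
# Illustrative Adam-style update on a toy quadratic loss L(theta) = 0.5 * theta^T A theta.
# The update uses only gradient moments; no curvature (Hessian) information is estimated.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 10.0, size=5))   # toy loss curvature (unknown to the optimizer)
theta = rng.normal(size=5)

m = np.zeros_like(theta)                      # first moment: running mean of gradients
v = np.zeros_like(theta)                      # second moment: running mean of squared gradients
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    grad = A @ theta                          # gradient of the toy loss
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("final loss:", 0.5 * theta @ A @ theta)
```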

Still, Liu and his colleagues noticed a possible inefficiency in the prior methods that used parametric curvature estimation: Prior researchers updated their curvature estimates at every step of the optimization. The Stanford team wondered if they could make the process more efficient by decreasing the number of updates.

To test that idea, the Stanford team designed Sophia to estimate parameters’ curvature only about every 10 steps. “That turned out to be a huge win,” Liu says.
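
The idea of refreshing the curvature estimate only occasionally can be sketched in a few lines. In the sketch below, the diagonal-curvature estimator (a Hutchinson-style probe on a toy quadratic loss), the learning rate, and the other constants are illustrative assumptions; the article does not spell out Sophia’s exact estimator, so treat this as a sketch of the general idea rather than the published algorithm. The refresh interval of 10 steps comes from the description above.

```python
# Sketch of "estimate curvature only every k steps" on a toy quadratic loss.
# The Hutchinson-style diagonal estimator is an illustrative stand-in, not
# necessarily the estimator Sophia itself uses.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 10.0, size=5))        # toy loss: L(theta) = 0.5 * theta^T A theta
theta = rng.normal(size=5)

def estimate_diag_curvature():
    """Hutchinson-style estimate of the Hessian diagonal: E[v * (H v)]."""
    v = rng.choice([-1.0, 1.0], size=theta.shape)  # random Rademacher probe vector
    return v * (A @ v)                             # Hessian-vector product for the toy loss

k = 10                                             # refresh curvature only every k steps
h = np.ones_like(theta)                            # cached curvature estimate, reused in between
lr = 0.05

for t in range(200):
    if t % k == 0:
        h = np.abs(estimate_diag_curvature())      # occasional (and therefore cheap) refresh
    grad = A @ theta
    theta -= lr * grad / np.maximum(h, 1e-8)       # curvature-preconditioned step

print("final loss:", 0.5 * theta @ A @ theta)
```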

The team’s second optimization trick, called clipping, addresses a related issue: inaccurate curvature estimation. “If the estimation is wrong, it’s like giving people with hard jobs even more work to do. It makes things worse than if there were no estimation at all,” Liu says.

Clipping prevents that by setting a threshold, or maximum, on the curvature estimate. “In our factory metaphor, it’s like setting a workload limitation for all employees,” Liu says. Another metaphor often applied to optimization is a landscape of hills and valleys where the goal is to end up in the lowest valley. Without clipping, Liu says, it is possible to land at a saddle between two mountains. “In optimization, that’s not where you want to be,” he says.
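
One way to implement such a limit, sketched below, is to clip each coordinate of the curvature-preconditioned step at a threshold, so an underestimated curvature (an overloaded “worker”) cannot produce an outsized update. The threshold value, the element-wise clipping rule, and the example numbers are illustrative assumptions, not Sophia’s published constants, and the exact place Sophia applies its clip may differ from this sketch.

```python
# Sketch of update clipping: cap each coordinate of the preconditioned step
# at a threshold rho so a badly underestimated curvature cannot blow it up.
import numpy as np

def clipped_update(grad, curvature, lr=0.05, rho=1.0, eps=1e-8):
    """One preconditioned step whose per-coordinate size is capped at lr * rho."""
    raw = grad / np.maximum(curvature, eps)   # divide by estimated curvature (the "workload")
    return lr * np.clip(raw, -rho, rho)       # never step further than lr * rho per coordinate

# Example: the middle coordinate's curvature is wildly underestimated.
grad = np.array([0.4, 2.0, -3.0])
curvature = np.array([1.0, 1e-4, 2.0])
print(clipped_update(grad, curvature))        # ~[0.02, 0.05, -0.05] rather than [0.02, 1000.0, -0.075]
```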

Testing Sophia and Scaling Up

Liu and his colleagues used Sophia to pretrain a relatively small LLM using the same model size and configuration that were used to create OpenAI’s GPT-2.

Sophia’s combination of curvature estimation and clipping allowed the LLM pretraining optimization to smoothly proceed to the lowest valley in half the number of steps and half the time required by Adam.

“Sophia’s adaptivity sets it apart from Adam,” Liu says. “It’s harder for Adam to handle parameters with heterogeneous curvatures because it can’t predict them in advance.”

It’s also the first time in nine years that anyone has shown any substantial improvement over Adam on language model pretraining, Liu says. “This could mean a huge reduction in the cost of training real-world large models.” And as models scale, Sophia’s advantages should only increase, he says.

Next, Liu and his colleagues hope to develop a larger LLM using Sophia. He’s also hoping to see Sophia applied to other areas of machine learning such as computer vision models or multi-modal models. “It would take some time and resources to move Sophia to a new domain, but because it is open source, the community could certainly do it.”

