After a year of entrepreneurship in Silicon Valley, Jia Yangqing shared his observations on the AI industry: costs, market increment, and business models

Jia Yangqing is one of the most prominent global AI scientists. The deep learning framework he created, Caffe, has been adopted by companies such as Microsoft, Yahoo, and NVIDIA. After leaving Alibaba in 2023, he founded Lepton.AI.

After a year of entrepreneurship, Jia Yangqing chose the direction of AI Infra.

Jia Yangqing is one of the most prominent global AI scientists. During his doctoral studies, he founded and open-sourced the well-known deep learning framework Caffe, which was adopted by companies such as Microsoft, Yahoo, and NVIDIA.

In March 2023, he left Alibaba to start his own business. In a later podcast, he said that the popularity of ChatGPT was not what prompted him to start the company. The project he went on to launch bore this out: he did not go directly into large models. a16z, a well-known Silicon Valley venture capital firm, had previously written in an article about AIGC: "Currently, infrastructure providers are the biggest winners in this market."

In an article last year, Jia Yangqing also mentioned, "However, to be this winner, you have to design Infra more intelligently." On the official website of his company Lepton.AI, there is a striking slogan "Build AI The Simple Way."

Recently, at the "High Mountain Night Talk" event at the Silicon Valley station of High Mountain Academy, Jia Yangqing held an in-depth closed-door session with visiting Chinese entrepreneurs. The session went straight to industry pain points. Starting from his expertise in AI Infra, he analyzed in detail the new characteristics of AI Infra in the AI era. He then walked companies through a detailed economic account based on the characteristics of large AI models: how to strike a good balance within the impossible triangle of cost, efficiency, and effectiveness.

Finally, he also discussed the incremental opportunities in the entire AI industry chain and the current dilemma of the business model of large models:

"Every time a basic large model is trained, it has to start from scratch. To put it vividly, this training session 'invests 1 billion, and the next time you have to add another 1 billion,' and the model iteration speed is fast, the window of opportunity to make money may only be about a year. So everyone is thinking about this ultimate question, 'How can the business model of large models be truly effective?'"

Most of Jia Yangqing's past experience is in TOB (business-facing) products. He has also said candidly, more than once in his sharing sessions, "I don't see TOC very clearly, but I have a clearer view of TOB."

"AI, from coming out of the laboratory or ivory tower to the application process, will experience all the pitfalls that need to be crossed." No matter how amazing large language models are to people, their development is not pie in the sky, and past experiences and paradigms have both changed and remained unchanged.

For easier reading, we have summarized a few key points at the beginning of the text, but we strongly recommend reading the entire content to understand Jia Yangqing's complete thought process:

While the effect of a universal large model is indeed very good, in actual enterprise applications, small and medium-sized models combined with their own data may achieve a better cost-effectiveness.

As for the cost issue, we have also calculated an economic account: a GPU server can support 7B and 13B models through fine-tuning, and the cost-effectiveness may be more than 10 times higher than directly using closed-source large models.

I personally believe that NVIDIA will continue to be the absolute leader among all AI hardware providers in the next 3 to 5 years, and I believe its market share will not be less than 80%. However, as AI models gradually become standardized, we also see another opportunity at the hardware level.

Currently, we see two major types of applications in AI that have crossed the "valley of death" and are starting to have sustained traffic: one is efficiency improvement, and the other is entertainment.

A large number of traditional industry applications are actually the deep waters worth exploring in the AI industry.

My personal opinion on the Super App may be slightly conservative, possibly because many of my own experiences involve providing TOB services. I believe that Super Apps will exist, but they will be rare.

The following is a summary of the shared content:

With the rise of large-scale language models, a new concept has emerged - Scaling Law. According to Scaling Law, the performance of large language models is power-law related to the number of parameters, the size of training data, and computational resources. In simple terms, using a general method to provide huge amounts of data to the model enables the model to have the ability to produce the desired results.
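
To make "power-law related" concrete, here is a minimal sketch of what such a relationship implies for parameter count alone. The functional form and both constants (`ALPHA`, `N_C`) are illustrative assumptions chosen for this example, not values given in the talk or fitted to any real model.

```python
# Illustrative power-law curve: loss = (n_c / n_params) ** alpha.
# ALPHA and N_C are made-up constants for illustration only (assumptions).
ALPHA = 0.08
N_C = 1e14


def power_law_loss(n_params: float, alpha: float = ALPHA, n_c: float = N_C) -> float:
    """Loss predicted by a pure power law in the number of parameters."""
    return (n_c / n_params) ** alpha


def loss_factor_per_10x(alpha: float = ALPHA) -> float:
    """Factor by which the loss is multiplied each time the model grows 10x."""
    return 10.0 ** (-alpha)


if __name__ == "__main__":
    for n in (7e9, 13e9, 70e9, 700e9):
        print(f"{n:.0e} parameters -> loss {power_law_loss(n):.3f}")
    print(f"each 10x in size multiplies the loss by {loss_factor_per_10x():.3f}")
```

The point of a power law is visible in `loss_factor_per_10x`: every tenfold increase in scale buys the same relative improvement, which is why steady gains keep demanding much larger amounts of compute.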

This makes AI computing different from "cloud computing". Cloud computing mainly serves the needs of the Internet era, focusing on resource pooling and virtualization:

How to turn computing, storage, and networking from physical resources into virtual concepts, "wholesale to retail";
How to increase utilization in this virtual environment, or in other words, overselling;
How to deploy software more easily, achieve maintenance-free operation of complex software (such as disaster recovery, high availability), and so on.

In more common terms, the main demand of the Internet is to process various web pages, images, videos, etc., distribute them to users, and make "data flow" happen. Cloud services focus on the elasticity and convenience of data processing.

However, AI computing focuses more on the following points:

Does not require particularly strong virtualization. Generally, training will "monopolize" physical machines, with no strong virtualization requirements other than simple tasks such as establishing virtual networks and packet forwarding.
Requires high-performance and high-bandwidth storage and networking. For example, networks often require RDMA bandwidth connections of several hundred gigabits or more, rather than the common bandwidth of several gigabits to tens of gigabits on cloud servers.
Does not have a strong requirement for high availability because many offline computing tasks do not involve issues such as disaster recovery.
Does not involve overly complex scheduling and machine-level fault tolerance. Because the failure rate of machines themselves is not very high (otherwise the GPU operations team would have to investigate), and training often involves checkpointing at the minute level, so the entire task can be restarted from the previous checkpoint in case of a failure.
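
The last point, restarting from the previous checkpoint instead of building machine-level fault tolerance, can be sketched as follows. This is a toy loop under stated assumptions: the function names, the JSON checkpoint file, and the step-based checkpoint interval are all hypothetical (real training jobs typically checkpoint on a time interval and save model and optimizer state).

```python
import json
import os
import tempfile
from typing import Optional


def train(total_steps: int, ckpt_path: str, ckpt_every: int,
          fail_at: Optional[int] = None) -> int:
    """Toy training loop that resumes from ckpt_path if it exists.

    fail_at simulates a machine failure when that step count is reached.
    Returns the final step count.
    """
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # resume from the last saved checkpoint

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated machine failure")
        state["weight"] += 0.1  # stand-in for one optimizer step
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            # Write to a temporary file, then rename, so a crash in the
            # middle of a write cannot corrupt the previous checkpoint.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return state["step"]


def run_with_one_failure(total_steps: int = 100, ckpt_every: int = 10,
                         fail_at: int = 37):
    """Crash once, then restart; returns (step resumed from, final step)."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.json")
        try:
            train(total_steps, path, ckpt_every, fail_at=fail_at)
        except RuntimeError:
            pass  # the whole job is simply restarted
        with open(path) as f:
            resumed_from = json.load(f)["step"]
        final_step = train(total_steps, path, ckpt_every)
        return resumed_from, final_step


if __name__ == "__main__":
    print(run_with_one_failure())
```

With the defaults, the simulated failure at step 37 loses only the work done since the checkpoint at step 30, which is why frequent checkpoints can stand in for more elaborate fault tolerance.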

Today's AI computing prioritizes performance and scale, while the capabilities involved in traditional cloud services come second.

This is actually very similar to the demand in the traditional high-performance computing field. Back in the 1970s and 1980s, we already had supercomputers, which were large in size and could provide a large amount of computing power, capable of performing services like weather simulations.

We once made a simple estimate: in the past, training a typical image recognition model required about 1 exaFLOP of computation, that is, 10^18 floating-point operations. To picture this amount of computation, imagine everyone in Beijing performing one addition, subtraction, multiplication, or division every second. Even so, it would take thousands of years to complete the training of a model.

So, if a single GPU is not sufficient to meet the demand, how should we respond? The answer is to connect multiple GPUs to build a Super POD similar to NVIDIA's. This architecture is very similar to the earliest high-performance computers.

This means that we have shifted from the demand for "data flow" back to the demand for "massive computation", except that "massive computation" now comes with two advances: higher-performance GPUs for computation and more user-friendly software. As AI develops, this process will gradually accelerate. The new DGX cabinet introduced by NVIDIA this year delivers almost 1 exaFLOP per second, which means that, in theory, one second of its computing power is enough to complete that training.
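
The two back-of-the-envelope estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch: the population figure for Beijing (about 22 million) is an assumption not stated in the talk, and the machine is idealized as sustaining exactly 1 exaFLOP per second.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

TRAINING_OPS = 1e18           # about 1 exaFLOP of total work, from the text
BEIJING_POPULATION = 2.2e7    # assumption: roughly 22 million people
MACHINE_FLOPS = 1e18          # idealized: 1 exaFLOP per second, sustained


def years_by_hand(total_ops: float, people: float,
                  ops_per_person_per_sec: float = 1.0) -> float:
    """Years needed if every person performs a fixed number of operations per second."""
    seconds = total_ops / (people * ops_per_person_per_sec)
    return seconds / SECONDS_PER_YEAR


def seconds_on_machine(total_ops: float, flops: float) -> float:
    """Seconds needed on a machine sustaining `flops` operations per second."""
    return total_ops / flops


if __name__ == "__main__":
    print(f"by hand: {years_by_hand(TRAINING_OPS, BEIJING_POPULATION):,.0f} years")
    print(f"on the machine: {seconds_on_machine(TRAINING_OPS, MACHINE_FLOPS):.1f} s")
```

Under these assumptions the by-hand estimate comes out at more than a thousand years, while the idealized machine needs about one second; a smaller population or slower arithmetic pushes the first figure higher.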

Last year, I co-founded Lepton AI with several colleagues. The name "Lepton" comes from physics, where a lepton is a type of elementary particle. We all have experience in the cloud computing industry and believe that the current development of AI presents a transformative opportunity for the "cloud". So today, I want to focus on how we should rethink the infrastructure of the cloud in the era of AI.

For enterprises using large models, calculate the "economic account" first

As the scale of models continues to expand, we face a core issue: the high cost of computing resources required for large models. From the perspective of practical applications, we need to consider how to efficiently utilize these models.

Taking an application scenario as an example, we can vividly see the difference between a general large language model and a model fine-tuned for a specific domain.

We once tried to "train a conversational AI model in the financial domain".

Using a general model, we directly asked: "How is Apple Inc.'s recent financial report? What do you think of Apple Inc.'s investment in the AI field?" The response from the general large model was: "Sorry, I can't answer that question."

After fine-tuning for a specific domain, we used a 7B open-source model, let it "learn" the financial reports of all publicly listed companies in North America, and then asked the same questions. Its response was: "No problem, thank you for your question." The tone was very much like a CFO of a publicly listed company.

This example clearly shows that while large-scale general models have excellent performance, in practical applications, using small to medium-sized open-source models and fine-tuning them with specific data may ultimately achieve better results.

As for cost considerations, we have also done some economic calculations: a GPU server can support fine-tuning of 7B and 13B models, with a cost-effectiveness potentially more than 10 times higher than directly using closed-source large models.

As shown in the above figure, taking the Llama2 7B open-source model as an example, the cost for 1 million tokens is approximately $0.1-$0.3. Using an NVIDIA A10 GPU server for training, with a peak speed of 2500 tokens per second, the cost for one hour is about $0.6. Running this server for a full year therefore costs approximately $5256, which is not high.

If closed-source models are used instead, the cost of consuming 1 million tokens is much higher than this.
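
The figures above can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the text ($0.6 per hour, a peak of 2500 tokens per second); the `utilization` parameter is an assumption introduced here to show how the $0.1-$0.3 range could arise when the server is not running at peak all the time.

```python
HOURS_PER_YEAR = 24 * 365

HOURLY_COST_USD = 0.6         # from the text
PEAK_TOKENS_PER_SEC = 2500    # from the text


def annual_cost(hourly_usd: float) -> float:
    """Cost of running one server around the clock for a year."""
    return hourly_usd * HOURS_PER_YEAR


def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float,
                            utilization: float = 1.0) -> float:
    """Cost per 1M tokens; utilization is the assumed fraction of peak throughput used."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1e6


if __name__ == "__main__":
    print(f"annual cost: ${annual_cost(HOURLY_COST_USD):,.0f}")
    for u in (1.0, 2 / 3, 2 / 9):
        cost = cost_per_million_tokens(HOURLY_COST_USD, PEAK_TOKENS_PER_SEC, u)
        print(f"utilization {u:.0%}: ${cost:.3f} per 1M tokens")
```

At full peak throughput the cost works out to under $0.07 per million tokens; the quoted $0.1-$0.3 range corresponds to roughly two thirds down to about a fifth of peak, which is one plausible reading of the numbers rather than something the talk states.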

However, cost considerations also need to take into account the type of application and the output speed of the model. The faster the model output speed, the higher the cost. If requests are run together in mini-batches, overall throughput will be better, but the output speed of each individual request may be slightly lower.

This leads to another question: what is the appropriate output speed for large models?

Taking a chatbot as an example, the average human speaking speed is about 120 words per minute, while the reading speed is around 350 words per minute. Converted to tokens, an output speed of about 20 tokens per second is enough for a good user experience. If the application traffic is sufficient, the running cost is not high.
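
The conversion from words per minute to tokens per second can be sketched as below. The ratio of about 4/3 tokens per English word is a common rule of thumb and is an assumption here; the real ratio depends on the tokenizer and the language.

```python
TOKENS_PER_WORD = 4 / 3       # assumption: rough rule of thumb for English text

SPEAKING_WPM = 120            # from the text
READING_WPM = 350             # from the text
TARGET_TOKENS_PER_SEC = 20    # from the text


def wpm_to_tokens_per_sec(words_per_minute: float,
                          tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a human pace in words per minute to tokens per second."""
    return words_per_minute * tokens_per_word / 60


if __name__ == "__main__":
    speaking = wpm_to_tokens_per_sec(SPEAKING_WPM)
    reading = wpm_to_tokens_per_sec(READING_WPM)
    print(f"speaking pace: {speaking:.1f} tokens/s")
    print(f"reading pace:  {reading:.1f} tokens/s")
    print(f"headroom at 20 tokens/s: {TARGET_TOKENS_PER_SEC / reading:.1f}x reading pace")
```

Under this assumption, reading pace is below 10 tokens per second, so 20 tokens per second stays comfortably ahead of the reader, and generating much faster than that mostly adds cost.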

However, whether the traffic can reach a "sufficient" level becomes a "chicken and egg" problem. We have discovered a practical pattern to solve this problem.

In North America, many companies first conduct experiments using closed-source large models (such as OpenAI's models). The experiment scale is typically in the hundreds of millions of tokens, with a cost of a few thousand dollars. Once the data flywheel starts turning, the existing data is saved, and smaller open-source models are used to fine-tune their own models. This has now become a relatively standard practice.

When considering AI models, companies are actually balancing various trade-offs. In North America, there is often talk of an impossible triangle, where you cannot have a car that is fast, cheap, and of high quality at the same time.

The standard pattern mentioned earlier is actually about prioritizing quality first, then considering cost. It is basically impossible to simultaneously satisfy all three aspects.

Half a year ago, I strongly believed that open-source models could quickly catch up with closed-source models. However, half a year later, I think there will continue to be a very reasonable gap between open-source and closed-source models. To illustrate this gap with a specific example, when closed-source models reach the level of GPT-4, open-source models may be around GPT-3.5.

New Opportunities in the Hardware Industry

As early as the early 2000s, NVIDIA saw the potential of high-performance computing. In 2004, they developed CUDA, and now it has been 20 years. Today, CUDA has become the standard language at the lowest level for AI frameworks and AI software.

Initially, the industry believed that high-performance computing was inconvenient to write. NVIDIA introduced CUDA, convincing everyone that it was simple and easy to use. After trying it out, everyone found it easy to use and capable of achieving fast high-performance computing speeds. Subsequently, almost all major companies' researchers rewrote their AI frameworks based on CUDA.

CUDA established a good relationship with the AI community early on. Other companies also saw the huge market opportunity, but from the user's perspective, the motivation to switch to other products was not strong.

Therefore, there will still be a focus in the market on whether anyone can challenge NVIDIA's position. Besides NVIDIA, which new hardware providers might have a chance?

Firstly, my views do not constitute investment advice. Personally, I believe that in the next 3 to 5 years, NVIDIA will still be the absolute leader among AI hardware providers, with a market share of no less than 80%.

However, as AI models gradually standardize, we also see another opportunity at the hardware level. In the past decade, there was a problem that troubled the AI field. Although many companies could provide compatibility with CUDA, this layer was "fragile." "Fragile" means that there are various models, making the adaptation layer prone to issues, which could disrupt the entire workflow.

Today, fewer people need to write the lowest-level models, and there is an increasing demand for fine-tuning open-source models. Being able to run Llama and Mistral can meet about 80% of the demand. The need for adaptation for each corner case is gradually decreasing, and covering a few major use cases is sufficient.

Other hardware providers are working hard to be compatible with CUDA. Although it is still challenging, seizing a certain market share today is no longer impossible. Additionally, cloud service providers also want to diversify their investments. Therefore, this is an interesting opportunity we see, and it is also part of the continuous evolution of cloud infrastructure.

Generative AI Wave: What are the Incremental Opportunities?

Let's take a look at the situation of AI applications. Today, we can see that the supply of AI applications is continuously increasing. Looking at Hugging Face, in August 2022, there were only about 60,000 models. By September 2023, the number had already increased fivefold, showing a very rapid growth rate.

At present, we see two major categories of AI applications that have crossed the "valley of death" and are starting to have more sustained traffic:

The first category is productivity. In the e-commerce industry, for example, AIGC is used to quickly generate product display images. Take Flair AI as an application scenario: if I want an advertising picture of bottled water, I just need to place the water in a convenient location, take a photo, and send the photo to a large model, telling it that I want the water placed on a snowy mountain against a background of blue sky and white clouds. It can then generate an image that can be uploaded directly to an e-commerce platform as a product display picture.

There are many other types as well, such as improving search and interaction over an enterprise's massive knowledge base; Glean is one example.

The second category is entertainment, such as Soul, which uses AI for role-playing and interaction.

We have also noticed a trend where the number of "shell apps" is decreasing. In fact, it has been found that products directly using universal large models have a common problem of being particularly "robotic" in terms of interaction effects.

By contrast, slightly smaller models such as 7B and 13B offer very good cost-effectiveness and adjustability. To give an intuitive analogy: large models are like PhDs, while smaller models are like undergraduates, who have stronger practical skills.

At the application layer, there are two main paths. The first is to train your own foundation model, or to fine-tune a model yourself.

The second is to build applications in a very vertical field with deep scenarios, where using prompts directly is not feasible.

For example, in the medical field, when a user asks, "How were my test results from yesterday?" This actually requires a large model behind the scenes to not only provide professional analysis of test indicators but also give users advice on diet and other aspects.

This involves multiple segmented scenarios in the chain of industries such as testing, healthcare, and insurance, requiring deep experience in the medical industry chain. It is necessary to add an AI capability on top of existing experience to enhance user experience, which is a sustainable AI application model that we have discovered today.

As for the future, predicting the future is the most difficult. My experience has always been in the B-end, mainly focusing on supply and demand logic. The incremental demand brought by AI is primarily high-performance computing power. The second is high-quality models, as well as software layers that meet the high-performance, high-quality, and high-stability computing needs of the upper layers.

Therefore, from the perspective of high-performance computing power, NVIDIA has obviously become a winner. In addition, this market may accommodate 2-3 relatively good chip providers.

In terms of models, OpenAI is clearly a fairly certain winner. The market is large enough that it should be able to accommodate 3-5 different model providers, and they are likely to be distributed along regional lines.

AI Deepwater Areas in Traditional Industries

I also want to talk about the large number of applications in traditional industries, which is actually a deepwater area worth exploring in the AI industry.

The emergence of large language models once made everyone think that OpenAI had created a very powerful model that could handle anything with just a prompt.

However, Google wrote an article at the beginning of the century, and to this day, I still think this viewpoint is correct. The article mentioned that machine learning models are just a very small part of the entire AI process, with a lot of work outside, which is becoming increasingly important today. For example, how to collect data, ensure that the data is consistent with our application needs, how to do adaptation, and so on.

After the model goes online, there are three things to consider: first, it needs to run stably; second, it needs to continuously control the quality of results; and a very important point is to collect the data obtained in the application in a feedback manner, to train better models in the next wave.

This methodology still applies today. In industry competition, those who have data, and who are better at turning user feedback into data that can be used effectively in the next round of training, have a competitive advantage.

Today, everyone has a feeling that the structure of large models is not much different, but the details of data and engineering capabilities are what determine the differences between models. OpenAI is continuously proving this to us.

Today, looking at the architecture of the entire technology stack, a16z has provided us with a very good summary (as shown in the figure below):

At the IaaS layer, NVIDIA is basically the leader, with other companies competing in hardware and cloud platforms, forming the solid foundation at the bottom.

Cloud platforms are constantly changing today. Recently, you may have heard a term in technology trends called "down cloud," which contrasts with the traditional concept of "full-stack cloud."

Why is there a trend of "down cloud"? It is because computing power itself is a huge cost, and it is a cost that can be "self-contained," so the industry is starting to separate the traditional cloud costs from today's AI computing costs.

Today, more and more PaaS are evolving into Foundation Models, some are closed source, some are open source, and then another layer of APP is built on top. Competition is fierce at every layer today. But personally, I feel that the most active layers are the model layer and the upper application layer.

The model layer is mainly about the battle between open source and closed source.

There are two trends in the application layer: one is models striving to move up to applications; the other is applications desperately trying to understand what capabilities the models have and then adding AI to make their applications more powerful.

I personally think it's a bit difficult for models to move up to applications, while applications adding their AI capabilities have more hope.

In China, there is a term called Super App, and a key point of a Super App is the need to "solve problems end-to-end." a16z's chart also indicates that some end-to-end apps will emerge, which essentially requires the model's reasoning and planning capabilities to be very good.

ChatGPT is all about end-to-end integration, with its own models and applications, embodying the state of a Super App.

However, my personal view on Super Apps may be slightly conservative, possibly due to my own experience mostly involving TOB services. I feel that while Super Apps will exist, they will be rare.

My personal feeling is that in the B-end applications, the trend will continue to be more like building with building blocks, using open-source models combined with a company's own data to construct their own applications.

Business Model of Large Models: Two Dilemmas and a Market Phenomenon

In the process of commercializing large models, I have observed two dilemmas in the market:

The first dilemma is the flow of revenue, which differs from the traditional model. In a normal business model, the flow should be: charge users, then pass part of that revenue on as costs to hardware service providers such as NVIDIA. But today the flow is lateral: companies obtain financing from VCs and pay that money directly to hardware manufacturers. However, VC money is essentially an investment, and entrepreneurs may have to return 10 times the amount to VCs in the end, which makes this flow of funds the first dilemma.

The second dilemma is that large models today, compared to traditional software, have a much shorter time in which to generate revenue.

Software, once developed, has a relatively long time in which to recoup its costs. Take Windows as an example: although it goes through a new iteration every few years, much of its underlying code does not need to be rewritten. So after a piece of software is developed, it may keep iterating over the next 5-10 years, which provides a long time window. And most of the investment is in the cost of programmers.

However, the characteristic of large models is that after a model has been trained once, the next one has to be trained from scratch again. To put it more vividly, "you invest 1 billion today, and when you iterate again, you have to add another 1 billion."

But the iteration speed of models is very fast, how long is the time window in which money can be made? Today, it seems to be about a year, or even shorter.

So, everyone starts to question, how can the business model of large models truly be effective?

I have also observed a market phenomenon. Last year, the entire market was in great pain, with a sudden surge in hardware demand, and the entire supply chain did not respond in time, resulting in long waiting times, possibly over 6 months.

A recent phenomenon we have observed is that the supply chain is not as tight as before. Firstly, the global supply chain is starting to recover; secondly, I personally judge that some suppliers who hoarded goods early out of anxiety now feel that it is time to recoup their costs. The previously tense imbalance between supply and demand will gradually ease, but it will not suddenly turn into a situation where everyone is worried about being unable to sell.

The above is my personal observation based on the outbreak of generative AI in this wave, and the impact it has had on the entire AI industry. It is also in this wave that Lepton is continuously helping enterprises and teams find the best balance of cost, effectiveness, and efficiency in the process of implementing generative AI.

Finally, it can all be summarized with a 2019 quote from Richard S. Sutton, a pioneer in the field of reinforcement learning: "In the entire 70 years of AI research, the most important experience is to use a general method (today it is deep learning) to leverage a large number of computational models (today based on high-performance computing with heterogeneous GPUs represented by NVIDIA). This is the most effective and simplest way in the 70-year development of AI."

Author of this article: Guo Xiaojing, Source: Tencent Technology, Original title: "One year of Silicon Valley entrepreneurship, Jia Yangqing talked about his observations in the AI industry: cost, market increment, and business model"