
Calculating AI Computing Power Demand from the Token Perspective

This report analyzes the growth trend of AI inference computing power demand and argues that demand for inference computing power is growing faster than unit computing power costs are declining. As the number of AI application users increases, inference-side computing power demand continues to rise. Token call volumes at Google and Microsoft have increased significantly, and we expect spending on inference-side computing power to accelerate.
This report aims to provide an analytical framework for the demand for inference computing power, from user penetration to Token calls, and then to hardware expenditures. Through calculations of the future Token call volume, total computing power demand, and future hardware expenditure rhythm for Google and Microsoft (OpenAI), we conclude that the growth rate of inference computing power demand exceeds the decline in unit computing power costs. We remain optimistic about the accelerated growth of inference computing power demand.
Core Viewpoints
The growth of inference computing power demand may outpace the decline in unit computing power costs
Total computing power expenditure is determined by both computing power demand and unit costs. With the number of AI application users climbing and Agent penetration rising, demand for inference computing power keeps increasing; at the same time, hardware iteration and continuous improvements in infrastructure algorithms keep pushing the unit cost of model inference down, with inference prices having already fallen below 1/100 of early-2023 levels. The market is currently divided on the future rhythm of inference computing power expenditure. Working through our framework, from user penetration to Token calls to hardware expenditure, we conclude that the growth of inference computing power demand outpaces the decline in unit computing power costs, and we remain optimistic about accelerated growth in inference computing power spending.
Major companies at home and abroad are seeing rapid growth in Token call volume, driving a swift increase in inference computing power demand
For overseas CSP vendors: according to data from the May 2025 Google I/O conference, Google's monthly Token call volume increased from 9.7 trillion in April 2024 to 480 trillion in April 2025, a roughly 50-fold increase. According to Microsoft's FY25Q3 earnings call, Azure AI infrastructure processed over 100 trillion Tokens in the first calendar quarter of 2025, a fivefold increase year-on-year, with the single month of March reaching 50 trillion. For domestic internet giants: in May 2025, ByteDance's Volcano Engine averaged 16.4 trillion Token calls per day (a monthly total of roughly 508T), 137 times the May 2024 level. We believe Token call volumes at major companies at home and abroad have passed a clear acceleration inflection point, driving rapid growth in inference computing power demand.
Considering the decline in unit computing power costs, Google’s computing power expenditure is still expected to grow significantly
The penetration of AI search is the main driver of growth in Google's Token call volume. Modeling AI Overview, AI Mode, Gemini 2C applications, and other inference demand, we forecast Google's total Token volume to reach 2,009 trillion in Q2 2025, quarter-on-quarter growth of 223% and nearly a 30-fold increase over the 71 trillion Tokens of Q2 2024. Unit computing power cost = unit price / computing power; it is trending down on software algorithm optimization and the rollout of new-generation chips. By our calculations, unit computing costs fell 14%/13%/13% month-on-month in April/May/June, smaller than the corresponding month-on-month growth in computing demand of 56%/37%/32%. Even so, we estimate Google's inference computing expenditure will still grow by over 100% quarter-on-quarter in Q2 2025.
User usage and the penetration of Deep Research are expected to drive high growth in Microsoft's computing demand
The increase in Microsoft's Token call volume stems mainly from rising traffic to OpenAI's ChatGPT site and the penetration of the Deep Research feature. Accounting for both factors, our framework puts Microsoft's total Token call volume at 205 trillion in Q2 2025, quarter-on-quarter growth of about 100%. On the cost side, algorithm optimization has raised models' floating-point utilization, so the same number of chips delivers more effective computing power. Even after this cost decline, we calculate that Microsoft's demand for inference computing cards at year-end will still be more than twice the March level, and we expect Microsoft's computing hardware demand to keep growing rapidly.
Main Text
Major companies at home and abroad see rapid growth in Token call volume, accelerating demand for inference computing power
Major companies are experiencing rapid growth in Token call volume, accelerating demand for inference computing power. According to the Google I/O conference in May 2025, Google's products and APIs processed 9.7 trillion Tokens per month in April 2024; by April 2025 the figure exceeded 480 trillion, a full 50-fold increase. According to Microsoft's FY25Q3 earnings call, Azure AI infrastructure processed over 100 trillion Tokens in the quarter (January-March 2025), a fivefold increase year-on-year, with March alone reaching 50 trillion.
Domestically, internet giants represented by ByteDance are also seeing rapid growth in Token call volume. According to disclosures at ByteDance's Volcano Engine spring conference, the average daily Token call volume on Volcano Engine at the end of May was 16.4 trillion (a monthly total of roughly 508T), 137 times the May 2024 level and 4 times that of December 2024, roughly on par with the monthly 480T Tokens Google disclosed for April. Comparing the Token mix in May this year with December last year, consumption by AI tools has risen rapidly, with AI search up 10x and AI programming up 8.4x. In other scenarios, Token consumption in K12 online education rose 12x within five months, and visual-understanding models also drove growth, with new scenarios such as intelligent inspection and video retrieval surpassing 10 billion Tokens per day. We believe that as application scenarios continue to broaden, domestic inference demand is poised to accelerate.
Estimation of Token Call Volume and Computing Power Demand for North American Major Companies
Estimation of Google Token Call Volume and Computing Power Demand
The growth of Google Token call volume is mainly due to the expansion of AI search
Reasons for the rapid increase in Google Token volume: AI search (AI Overview)
- The gap in Token call volume between Google and Microsoft does not stem from Chatbot-type products: Gemini and ChatGPT are broadly similar at the call level, and although Gemini has only about 1/3 as many users as ChatGPT, Google's Token volume is about 6 times Microsoft's. The difference between the two therefore does not come from Chatbot-type applications.
- The high growth in Google's Token call volume is primarily driven by AI search (AI Overview): search is where Google holds its biggest advantage over Microsoft, with roughly 90% market share and an annual search volume of 5 trillion queries. The AI Overview launched in May 2024 is the likely reason for the large Token-volume gap between Google and Microsoft. Google's Token call volume rose sharply this year, with month-on-month growth of 81%/56% in March/April 2025, respectively. According to Google's earnings call, Q1 2025 saw the largest expansion of AI Overview in its history, in both user numbers and the richness of answers. The expansion of AI Overview is the core driver of the rapid Token-volume growth.
Google Token Usage Estimation
According to our calculations, inference Token volume in May and June 2025 will reach 659/870 trillion respectively, month-on-month growth of 37%/32%. Total Token volume in Q2 2025 will reach 2,009 trillion, quarter-on-quarter growth of 223% and nearly a 30-fold increase over the 71 trillion Tokens of Q2 2024.
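As a quick arithmetic check on these aggregates, a minimal sketch using only the figures above (the Q1 total is merely what the 223% quarter-on-quarter growth implies):

```python
# Sanity check on the estimated Google token trajectory (trillions per month).
april, may, june = 480, 659, 870

q2_total = april + may + june                              # 2009T
print(f"May MoM: {may / april - 1:.0%}")                   # ~37%
print(f"June MoM: {june / may - 1:.0%}")                   # ~32%
print(f"Implied Q1 total: {q2_total / (1 + 2.23):.0f}T")   # from +223% QoQ
print(f"Vs. 71T in Q2 2024: {q2_total / 71:.0f}x")         # ~28x, "nearly 30-fold"
```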
Google's inference Tokens mainly consist of three parts: AI search, Gemini 2C applications, and other inference demands. The key indicators for Token volume changes are estimated as follows:
1) AI Search: AI Search Token Volume = Monthly AI Search Count x Token Consumption per Search
Monthly AI Search Count = Monthly Google Search Count x AI Search Penetration Rate. Google disclosed in March 2025 that annual search volume reaches 5 trillion, implying roughly 417 billion searches per month, which we use as the March base. From the user side, AI Overview had 1.5 billion monthly active users in March against about 2 billion for Google Search as a whole; from the keyword side, Semrush data show about 13.14% of all search keywords trigger AI Overview. The penetration rate measured as AI Overview's actual share of all searches should therefore be below 75%; we assume 55% for March.
Token consumption per search: search Token consumption sits between Chat and Agent levels; we assume each AI Overview consumes 1,200 Tokens. AI Mode, launched in March 2025, can break the original question into multiple sub-questions to search, so its Token consumption should be several times that of an ordinary AI Overview; we assume 5,000 Tokens.
2) Gemini: Gemini Token Volume = DAU x 30 x Average Daily Uses per User x Tokens per Use
DAU = MAU x (DAU/MAU ratio). In March 2025, Gemini's monthly and daily active users were 350 million and 35 million respectively (a 10% ratio). According to disclosures at the Google I/O conference, monthly active users reached 400 million in April.
Average daily uses per user: assumed 10 for March, with other months extrapolated along a neutral growth trend.
Tokens per Use: because Token usage per session differs greatly between Agent and Chat, we derive Gemini's average Tokens per session from an assumed mix, taking the Agent share in March 2025 to be 1%.
3) Other inference demand: assumed to remain a roughly constant share of total Tokens (see the combined sketch below).
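Putting the three parts together, a minimal sketch of the March 2025 estimate under the assumptions above. The per-use Gemini token counts and the "other demand" share are hypothetical placeholders (the text gives only the Agent-vs-Chat order of magnitude); the result lands near the ~308T that April's 480T implies at +56% month-on-month:

```python
# Minimal sketch of Google's March 2025 token estimate (three parts above).

# 1) AI search: monthly searches x penetration x blended tokens per AI search.
monthly_searches = 5e12 / 12                 # ~417B/month from 5T per year
ai_mode_share = 0.01                         # hypothetical AI Mode share
blended_tokens = (1 - ai_mode_share) * 1_200 + ai_mode_share * 5_000
ai_search = monthly_searches * 0.55 * blended_tokens       # 55% penetration

# 2) Gemini 2C: DAU x 30 days x uses per day x blended tokens per use.
agent_share = 0.01                           # assumed Agent share, March 2025
tokens_per_use = (1 - agent_share) * 1_000 + agent_share * 100_000  # hypothetical
gemini = 35e6 * 30 * 10 * tokens_per_use

# 3) Other inference demand: assumed a small, stable share of the total.
total = (ai_search + gemini) / 0.98          # hypothetical 2% "other" share

print(f"AI search: {ai_search / 1e12:.0f}T, Gemini: {gemini / 1e12:.0f}T, "
      f"total: ~{total / 1e12:.0f}T")
```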
Core indicators affecting Token volume changes:
- AI search penetration rate: monthly Google search volume exceeds 400 billion, so growth in AI search penetration drives a rapid increase in Token usage.
- Share of AI Mode: AI Mode was tested in March 2025 and officially launched to U.S. users in May 2025. Compared with traditional search, AI Mode runs multiple related searches, anticipates the sub-questions users care about, and generates a comprehensive integrated answer. Its Token usage is therefore several times that of AI Overview, and a rising AI Mode share multiplies the total volume of AI search Tokens.
- Share of Gemini Agent: Agent Token usage can reach hundreds of times that of Chat, so a rising share of Agent usage will significantly drive Gemini Token growth.
Rapid growth of Google Token volume is expected to bring sustained high growth in capital expenditure
We expect Google's demand for inference computing power to increase 223% quarter-on-quarter in the second quarter. Assuming Gemini Pro and Gemini Flash maintain a 50%/50% share of Tokens, the compute required for inference can be estimated with the formula C ≈ 2NBS, where N is the model parameter count and B x S is the number of Tokens processed; this yields a 223% quarter-on-quarter increase in total computing power demand in Q2 2025 over Q1.
Core indicators affecting inference computing power: model parameter count and the usage share of large-parameter models. With a similar split of Token counts, the parameter count directly determines final compute demand; per Token, Gemini Pro requires 17 times the compute of Gemini Flash. If the usage share of large-parameter models rises, or model parameters grow, inference computing power demand will rise in step (a sketch of this estimate follows below).
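A minimal sketch of this estimate. Google does not disclose parameter counts, so the values below are hypothetical, chosen only to reproduce the 17x Pro-vs-Flash per-Token compute ratio cited above:

```python
# Inference compute via C = 2 * N * B * S (N = parameters, B*S = tokens).
N_PRO, N_FLASH = 340e9, 20e9        # hypothetical parameter counts (17x ratio)
PRO_TOKEN_SHARE = 0.5               # assumed 50/50 token split

def inference_flops(tokens: float) -> float:
    n_avg = PRO_TOKEN_SHARE * N_PRO + (1 - PRO_TOKEN_SHARE) * N_FLASH
    return 2 * n_avg * tokens       # FLOPs for the blended model mix

q1_tokens, q2_tokens = 622e12, 2009e12     # Q1 implied, Q2 estimated
growth = inference_flops(q2_tokens) / inference_flops(q1_tokens) - 1
print(f"Q2 compute: {inference_flops(q2_tokens):.2e} FLOPs, QoQ: {growth:.0%}")
```

Because the Pro/Flash mix is held constant, compute demand scales one-for-one with Token volume, which is why the 223% Token growth carries straight through.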
We expect capital expenditure on inference computing power to increase 159% quarter-on-quarter in the second quarter. By our calculations, Token volume growth will bring Google an additional $1.45 billion in chip capital expenditure in Q2 2025, up 159% quarter-on-quarter, mainly on the surge in Token volume. That inference capex grows overall indicates that the growth in inference demand (Token volume) more than offsets the decline in inference costs from chip iteration and algorithm optimization, and we remain optimistic about continued growth in computing power capital expenditure. Unit computing cost is the core indicator affecting inference capital expenditure: unit computing cost = unit price / computing power. Our calculations show it trending down, falling 14%/13%/13% month-on-month in April/May/June. The main factors affecting unit computing cost, illustrated in the sketch after this list, are:
- Iteration of new chips: quantifiable as new-chip computing power relative to price. Taking the TPU line as an example, TPU v7's FP16 computing power is 151% higher than TPU v6's; since the price increase is smaller than the compute increase, unit computing cost falls.
- Algorithm iteration: we currently assume algorithms cut inference costs by one quarter each year. If algorithm iteration slows, more chips will be required than we forecast.
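A minimal sketch of how the two factors combine in the unit-cost identity (cost = price / computing power); only the +151% compute uplift is cited above, so the v7 price assumption here is a hypothetical illustration:

```python
# Unit computing cost = unit price / computing power.
def unit_cost(price: float, compute: float) -> float:
    return price / compute

# Chip iteration: TPU v7 delivers 2.51x the FP16 compute of v6 (+151%);
# if its price rises less than 2.51x (hypothetically +80%), unit cost falls.
v6_cost = unit_cost(1.00, 1.00)              # normalized baseline
v7_cost = unit_cost(1.80, 2.51)
print(f"v7 vs v6 unit cost: {v7_cost / v6_cost - 1:.0%}")   # ~-28%

# Algorithm iteration: assumed 1/4 cost cut per year, ~-2.4% per month.
algo_monthly = 0.75 ** (1 / 12) - 1
print(f"Monthly algorithm-driven decline: {algo_monthly:.1%}")
```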
Microsoft Token Call Volume and Computing Power Demand Estimation
The growth in OpenAI Token call volume mainly stems from the increase in user numbers and the penetration of Deep Research features
Unlike the penetration of AI features in Google's traditional search, the increase in OpenAI's Token call volume is primarily due to the rise in traffic to the ChatGPT webpage and the penetration of Deep Research features.
- ChatGPT: according to Semrush data, OpenAI's total monthly visits as of March 2025 were approximately 6 billion, with an average visit duration of about 12 minutes. Assuming users interact with ChatGPT once every 2 minutes, a 12-minute visit corresponds to 6 questions. Assuming each question consumes 1,000 Tokens, total Token calls for the ChatGPT portion = total visits x (visit duration / minutes per question) x Tokens per question ≈ 35.9T Tokens. ChatGPT's total visits and average visit duration are still rising rapidly; assuming total visits grow about 10% month-on-month, ChatGPT's total Token call volume is expected to reach 153T in Q2 2025, up 85% from Q1 (see the sketch after this list).
- Deep Research: OpenAI's Deep Research feature officially launched on February 2, 2025, with the full version first made available to Pro users and then opened to more subscription tiers. Plus, Team, education, and enterprise users get 10 queries per month, while Pro users get 120. We treat OpenAI's paid users as the core user group for Deep Research. According to OpenAI, WAU (weekly active users) was 300 million in December 2024, 400 million in February 2025, and 500 million in April, a month-on-month growth rate of over 10%. OpenAI's disclosed paid subscriber count corresponds to approximately 0.6% of total WAU (about 3 million users as of May 2025). A single Deep Research query typically reasons for 5-10 minutes, and its output length and number of referenced web pages are generally dozens of times those of an ordinary interaction; we therefore assume a single Deep Research run consumes 50 times the Tokens of an ordinary interaction, i.e., 50,000 Tokens. Based on paid users' monthly Deep Research quotas, we assume each user runs Deep Research 40 times per month. Estimated Deep Research Tokens in March 2025 = number of paid users x Tokens per run x runs per user per month = 4.8T Tokens.
In summary, based on our calculations, OpenAI's total Token usage in March was approximately 40.7T. OpenAI accounts for the largest share of Microsoft's total Token consumption; assuming OpenAI represents 85% of Microsoft's total, we estimate Microsoft's March total at approximately 48T, consistent with Microsoft's disclosure (per the earnings call, Microsoft processed roughly 100T Tokens in calendar Q1 2025, of which 50T in March). We forecast Microsoft's total Token usage to reach 205T in Q2 2025, quarter-on-quarter growth of approximately 100%.
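The arithmetic above can be reproduced in a short sketch; the visit count, durations, and per-run token figures are the stated assumptions, and the paid-user count is backed out from the 4.8T Deep Research estimate:

```python
# ChatGPT web: visits x (visit minutes / minutes per question) x tokens/question.
visits = 5.98e9                       # ~6 billion monthly visits, March 2025
questions_per_visit = 12 / 2          # 12-minute visit, a question every 2 min
chatgpt = visits * questions_per_visit * 1_000            # ~35.9T tokens

# Deep Research: paid users x tokens per run x runs per user per month.
paid_users = 2.4e6                    # ~0.6% of WAU, backed out from the 4.8T figure
deep_research = paid_users * 50_000 * 40                  # ~4.8T tokens

openai_total = chatgpt + deep_research                    # ~40.7T
msft_total = openai_total / 0.85      # OpenAI assumed to be 85% of Microsoft
print(f"OpenAI: {openai_total / 1e12:.1f}T, Microsoft: ~{msft_total / 1e12:.0f}T")
```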
Based on our calculations, Microsoft's inference computing power demand is expected to grow 99% quarter-on-quarter in Q2. OpenAI has not disclosed parameter counts; we assume, similarly to Google, that half of usage runs on a large-parameter model of around 300B (GPT-3 scale) and half on a small-parameter model of around 20B (GPT-4o mini scale). Using the formula C ≈ 2NBS to translate Token growth into compute demand, we calculate overall inference computing power demand of 15.3 trillion TFLOPs (about 1.5 x 10^25 FLOPs) for March 2025, and we forecast Microsoft's total computing power demand at 65.6 trillion TFLOPs in Q2 2025, a quarter-on-quarter increase of 99%.
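A minimal sketch of this step under the stated 50/50 parameter assumption (here 1 trillion TFLOPs = 10^24 FLOPs); the small gap versus the 15.3 in the text comes from rounding the March token count to 48T:

```python
# Microsoft inference compute via C = 2 * N * B * S.
N_AVG = 0.5 * 300e9 + 0.5 * 20e9      # 160B blended parameters (assumed)

def flops(tokens: float) -> float:
    return 2 * N_AVG * tokens

march = flops(48e12)                  # ~1.5e25 FLOPs, i.e. ~15 trillion TFLOPs
q2 = flops(205e12)                    # ~6.6e25 FLOPs, i.e. ~65.6 trillion TFLOPs
print(f"March: {march / 1e24:.1f} vs Q2: {q2 / 1e24:.1f} trillion TFLOPs")
```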
According to our calculations, Microsoft's demand for inference computing cards by year-end will be more than twice the March level. Per NVIDIA's official specifications, a single H100 delivers 989 TFLOPs at FP16 precision. With infrastructure and algorithm optimization, inference MFU (Model FLOPs Utilization) trends upward; we neutrally assume a 1% monthly increase. On our estimates, Microsoft's equivalent-H100 inference demand in March 2025 was about 43,000 units, growing roughly 10% month-on-month thereafter, which puts year-end card demand at more than twice that of March.
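A minimal sketch of the card-count step. The report states only that MFU trends upward by about 1% per month, not its level; the ~14% below is a hypothetical starting value, backed out so that March lands near 43,000 cards:

```python
# Equivalent H100s: monthly FLOPs / (per-card FLOPs/s * seconds * MFU).
H100_FP16 = 989e12                    # FLOPs/s at FP16 (NVIDIA spec)
SECONDS_PER_MONTH = 30 * 24 * 3600

def h100_cards(monthly_flops: float, mfu: float) -> float:
    return monthly_flops / (H100_FP16 * SECONDS_PER_MONTH * mfu)

mfu_march = 0.14                      # hypothetical, backed out from ~43,000 cards
print(f"March 2025: ~{h100_cards(1.53e25, mfu_march):,.0f} H100-equivalents")

# Card demand compounding ~10% per month puts December at >2x March:
print(f"Dec vs. March: {1.10 ** 9:.2f}x")    # ~2.36x
```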
Conclusion: The growth rate of inference computing power demand is faster than the decline in unit computing power costs
The continuous decline in unit inference costs has created divergence over the outlook for computing power expenditure. The factors influencing expenditure split into computing power demand and computing power cost: continuous growth in AI application users and Agent penetration drives demand steadily higher, while hardware iteration and algorithm improvements keep pushing inference costs down. According to Artificial Analysis data, model inference prices have fallen below 1/100 of early-2023 levels, hence the divergence over overall computing power expenditure.
According to our calculations, inference computing power demand is growing faster than unit computing power costs are declining. Comparing Google in April 2025 with March 2025, computing power demand rose 56% month-on-month while unit computing cost fell 14%; the demand effect outweighs the cost effect, so computing power expenditure will continue to grow (a one-line check of the net effect follows below). Moreover, Agents are still at an early stage: apart from Deep Research, more advanced Agent applications are not yet included in our calculation scope. As more general Agents land, raising interaction frequency, task complexity, and usage frequency, and as multimodal scenarios such as screen recognition further increase Token consumption, we remain optimistic about accelerating growth in inference computing power demand.
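The net effect is a one-line check, assuming expenditure scales with demand times unit cost:

```python
# Net expenditure effect of +56% demand vs. -14% unit cost (April vs. March).
print(f"Implied MoM expenditure growth: {(1 + 0.56) * (1 - 0.14) - 1:.0%}")  # ~+34%
```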
Risk Warning and Disclaimer
Markets carry risk; invest with caution. This article does not constitute personalized investment advice and does not take into account the specific investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article suit their particular circumstances. Investment on this basis is at one's own risk.