
Dialogue with Agora: Real-time interaction is driving the emergence of new AI tracks

Agora launched the world's first conversational AI engine, which lets text-based large models be quickly upgraded into multimodal large models and offers five major capabilities, including an ultra-low response latency of 650 ms. The engine is priced at 0.098 yuan per minute. Agora treats AI as a core strategy, is committed to advancing human-computer interaction, and expects AI opportunities to surpass those of the mobile internet. In Q4 2024, Agora's parent company reported revenue of 34.45 million USD, a year-on-year increase of 3.6%.
Author | Liu Baodan
Editor | Zhou Zhiyu
Since the beginning of the year, the wave set off by DeepSeek has been accelerating the real-world deployment of AI. As an AI Infra company, Agora is undoubtedly an important driving force behind that deployment.
"Data cannot be shared, we can only say it exceeded expectations." Yao Guanghua, head of Agora's AI RTE (Real-Time Engagement) product line, told Wall Street Insights that the company opened invitation-only testing of the conversational AI engine's Private Beta version on New Year's Eve, and that the number of new customers, all of them leading clients, exceeded expectations.
Recently, Agora officially launched its conversational AI engine. With five major capabilities, including a 650 ms ultra-low-latency response, graceful interruption, and full model adaptation, it lets any text-based large model be quickly upgraded into a "conversational and articulate" multimodal large model. The engine is also priced favorably, at less than 0.1 yuan per minute: specifically, 0.098 yuan per minute.
This is the world's first conversational AI engine. Agora's product head He Lipeng stated that AI large models are driving interaction between humans and machines, representing a greater expansion for the RTE track. Previously, large models were text-based; now, through RTE, large models can understand and perceive, enriching the scenarios and leading to more applications being implemented. "The opportunities for AI will be greater than those of mobile internet."
Founded in 2014, Agora primarily provides real-time audio and video interaction technology. In Q4 2024, Agora's parent company reported revenue of 34.45 million USD, a year-on-year increase of 3.6%. Agora has now made AI its number one project and is investing heavily in it.
The era of AI is accelerating its arrival. For Agora to seize this rare opportunity that comes once in decades and achieve leapfrog development, it must go all out.
Seizing AI Opportunities
Q: What role is Agora playing in this wave of AI?
He Lipeng: Agora is part of AI Infra. Previously, large models took only text input. In the future, we want large models to understand you better: to take in text, to hear you, and to see you, so that one-dimensional, two-dimensional, and three-dimensional information improves their understanding of you and they can give more back in conversation.
Q: What is Agora's current core competitive barrier?
He Lipeng: Agora previously focused on real-time interaction between people; this time it is real-time interaction between humans and machines. Internally, we quickly adapt to this change based on our previous technological advantages, adjusting our algorithms and extending our capabilities.
If large model vendors provide multimodal capabilities directly, we support that too and maintain a cooperative relationship with them. One good thing overseas is that the division of labor in the industry chain is relatively clear, with each player having its own strengths. OpenAI chose our sister company, and several domestic model vendors have also chosen Agora.
If a large model starts from scratch to do interaction, the requirements are quite high. Using multimodal technology involves another type of internet technology, which can cause delays and reliability issues. Agora has endpoints on every device, and we have adapted to tens of thousands of devices. If large model vendors were to do this now, they would need to readapt these endpoints, which would be costly for them.
Q: When expanding new AI businesses, what level of authority can Agora provide internally? How much determination is there to pursue this matter?
He Lipeng: This is definitely a top priority project, directly led by the boss.
This track is not just a fleeting trend, but a transformation. We must seize this opportunity, and Agora has already built up some experience in this area. Simply put, AI investment is definitely part of the company's strategy, and we will invest heavily when we see opportunities.
Q: DeepSeek is very popular right now. Do you think companies are integrating DeepSeek to ride the wave or as a long-term strategic investment?
He Lipeng: We have experienced many rounds of trends, and the opportunities in AI will be greater than those in mobile internet. The timing is basically ripe, and customers have real needs, such as clear demand in education. We are already addressing needs related to companionship and tools. Integrating AI can help companies reduce costs, and DeepSeek has relatively low costs and high accuracy. Many traditional enterprises involve repetitive labor, and more of that work will be replaced.
Q: Have you tried to communicate and collaborate with DeepSeek?
He Lipeng: When DeepSeek will launch its own multimodal capabilities depends on their priorities, but if they focus on real-time interaction, they will likely need to collaborate with us. In the future, every large model will have its own advantages and strengths. Our conversational engine is designed to schedule in real-time based on scenarios, which is the essence of our product design philosophy.
AI Demand Exceeds Expectations
Q: What is the current market feedback on the conversational AI engine?
Yao Guanghua: We opened the Private Beta version for invitation testing on New Year's Eve. I can't share the data, but I can say it has exceeded expectations. We will directly send unpublished data to existing customers, who have given positive feedback. The number of new customers has also exceeded expectations, and the positive feedback is very solid, coming from top-tier clients.
Q: How have the types of customers for Agora changed since the Spring Festival?
He Lipeng: Internally, we have more than a dozen scenarios, the largest being companionship, including social entertainment, child companionship based on IoT devices, digital humans in educational scenarios, and a lot of outbound calls and AI interviews.
Yao Guanghua: There is a new demand for overseas phone orders for food delivery, where users order takeout, and the other end is an AI robot that takes the order directly in the restaurant's system.
He Lipeng: We strive to provide capabilities while our partners innovate on scenarios. If you ask about breakout AI applications right now, there aren't any yet. My understanding is that we are still in the early stages of innovation, and everyone is trying things out. Once a breakout application emerges, growth will be rapid.
Q: How do you view the market space for real-time interaction?
He Lipeng: Large AI models have driven interaction between humans and machines, leading to greater expansion in the RTE track. Previously, large models were text-based, but now, through RTE, large models can understand and interpret audio and visual inputs, enriching the scenarios and leading to more applications being implemented.
We believe this is a transformation of the human-machine interaction interface. Previously, we always used keyboards, and mobile phones did not have touch screens. The next transformation should be that all touch and keyboard inputs become voice-based. We are already seeing some signs of this; many companies that previously focused on traditional software are now rewriting their code, either adding intelligent assistants or audio input. The interface for human-computer interaction has changed; voice interaction must be real-time, which is a significant shift in the AI landscape.
We also see that current models are cloud-based, and in the future, there will be a combination of edge and cloud. Agora also has its own relatively real-time network that needs better connectivity and coverage, which will contribute to the rapid development and iteration of the AI industry.
Question: At less than 0.1 yuan per minute, what is the future market capacity?
He Lipeng: There are only so many people for human-to-human interaction, but there are more machines than people for human-to-machine interaction, so the space for this track is even larger, providing us with significant growth potential. As for whether we can quickly recover costs, we don't think so; AI is a long-term, significant opportunity, and we must quickly enter this opportunity. Once we reach certain expectations, we will definitely have good revenue.
Question: Is there still room for price reduction?
He Lipeng: We may not necessarily lower prices because we first need to ensure that the experience continues to improve, bringing emotional value in the future. Instead, we want to further enhance quality so that users feel it is worth it. Of course, if everyone thinks the cost is relatively high, we will consider it later, but for now, we need to continuously improve quality.
Yao Guanghua: Because we have already set the price very low.
AI hallucinations cannot be eliminated, but can be reduced
Question: What problems and bottlenecks have you encountered in the process of developing and implementing the conversational AI engine, and how did you solve them?
Yao Guanghua: The conversational engine involves many departments, including algorithms, experience, engineering, testing, products, etc. During the New Year, we specifically found a small dark room where everyone worked overtime, with about a dozen people in total. After DeepSeek was released, all of us working in AI were working overtime, and seeing the positive impact DeepSeek brought to China's technology sector, we also wanted to participate in this wave.
He Lipeng: The current product development process is dynamic, with potential users providing constant feedback and competing with some peers. Our products need to respond quickly and iterate rapidly. Agora has been building this development capability for 11 years, especially in real-time interaction, and we are quite confident in this area.
Question: Have you encountered significant challenges?
Yao Guanghua: Instant interaction is compressed to milliseconds, especially response delays. We aim to achieve 1 second, and then we need to compress further. We set a clear goal to reach world-class standards and ultimately deliver the experience.
He Lipeng: Conversational AI emphasizes experience, including latency, response interruptions, and voice locking. Previously, Agora achieved instant communication between people; this time, it is about communication between humans and machines, which changes the communication model and requires different technical demands. When I communicate with you, I only need to allocate the network, but if the other side is a machine, there may be interruptions and quick responses, which presents many challenges in the implementation engineering aspect.
Yao Guanghua: The AI user experience is like a no-man's land; no one knows which metrics to measure. For example, the point of locking human voice was never mentioned before; it was brought up by customers. To maintain the ability to avoid real-time interruptions, we need to filter the conversation. If there was no noise reduction foundation before, we need to develop a new one. We turn cognition into standards, turn standards into metrics, and then present them in the products released today.
Question: Manus has coded the entire network steps; what is the difference between a voice-based Agent and a graphic-based Agent?
He Lipeng: Human interaction modes are definitely multimodal, and real-time interaction is a very important part. As the industry develops, we are thinking about whether we can eliminate text input. Voice contains emotions, so the information will be richer. I think this is the current form; can we use cameras for interaction, allowing the camera to accomplish certain tasks? Perhaps AI can screen resumes, and after turning on the camera and microphone, can we do other things? These are areas we are particularly focused on.
Looking at it now, there are definitely many forms of Agents and various input methods. In the future, voice may also be included, or multiple people may work on the same task simultaneously. The industry is developing too quickly; we are preparing foundational capabilities to allow everyone to innovate within it.
Yao Guanghua: The reason we call the AI engine product an engine is that we do not create Agents; we only want to build conversational interfaces, and there will be other adjustments in the future. We believe this is a disruptive interaction method. If the emotional value of dialogue can develop very well, and Humanlike performs exceptionally well, it becomes something beyond just a tool, possibly a companion, something between a pet and a friend.
Question: Dialogue products like Minimax and ChatGPT have serious hallucination issues; how can we eliminate hallucinations?
He Lipeng: Reducing hallucinations is certainly something the model itself needs to iterate on. Besides that, we need to be aware of the surrounding noise; if your voice is unclear, it can also lead to misunderstandings. Agora needs to focus on the person's voice and eliminate background noise to keep the original sound clean.
Hallucinations cannot be completely eradicated; they can be reduced. Just like today's interview, communication between people can also lead to misunderstandings, but when you realize there is a misunderstanding, you can inform the other party with more context to let them know they were wrong. There will always be hallucinations in human communication; our knowledge backgrounds are different, and your understanding may differ from mine. However, through several exchanges, we can generally understand what each other means.
Yao Guanghua: I agree; if the model's parameter size is smaller and focuses on a specific vertical, the increasing context will reduce hallucinations.
In the future, we will all be involved in reasoning and decision-making, which is the core. We need to identify the paths that lead to hallucinations and inform the other party that they are mistaken, prompting them to rethink the issue and participate in the final decision-making. This is the only way to eliminate hallucinations.
Question: This reduces the possibility of real-time interaction. As with autonomous driving, waiting for the output of a chain of thought is unlikely to work.
He Lipeng: We are also discussing that we must differentiate scenarios. Some scenarios are real-time and cannot have waiting times. We are currently receiving demands for embodied robots, where the latency requirements are very strict, as they are for customer service calls and the like; we cannot wait half a day for a response. So this is indeed quite segmented; not all scenarios need to use Agora; we still need to find the most suitable one, considering delay, interaction, companionship, etc.
Recently, I have also noticed that the demand for smart hardware is indeed quite high. We are collaborating with chip manufacturers to create different shapes, but they all incorporate conversational AI. After using DeepSeek, children become full of questions, wanting quick interactions. It's not about whether the answers are accurate; they just want to have fun.