Meet the 26-Year-Old Chinese Genius Making Waves in the AI Field


At the end of April 2023, a 26-year-old entrepreneur of Chinese descent appeared on the cover of Forbes magazine: Alexandr Wang.
He heads a technology company valued at up to $7.3 billion.
In 2016, Alexandr founded Scale AI, a company on the hottest track in AI. Seven years later, what he has built has become irreplaceable. According to Forbes, Scale AI now serves many leading self-driving car companies, with Google’s Waymo and Toyota Motor among its customers. Since 2020, it has also won several very large contracts from the US Department of Defense.
In 2022, the U.S. Department of Defense was already using the company’s technology to analyze satellite images of Ukraine.
The path Scale AI has taken is one that leading companies and AI entrepreneurs often overlook: labeled data sets for AI.
Data is the oil of the AI field: only data can keep fueling deep learning. According to one set of figures, as of 2021, English content accounted for 60.4% of the world’s top 10 million websites, while Chinese content accounted for only 1.4%. Chinese AI therefore has to rely on large English data sets for training.
In fact, many companies in China work on AI data sets and data annotation. The listed company Haitian AAC and leading start-ups such as Yunce Data and Datatang are among the best in the industry.
In contrast to the high-end image of the artificial intelligence industry, data work involves tedious cleaning, labeling, processing and other steps. Hence a popular saying in the industry: “There is only as much intelligence as there is manual labor.”
Data from the AI analysis company Cognilytica shows that data-related processing takes up more than 80% of the time in AI projects.
Jia Yuhang, general manager of Yunce Data, summed it up for Nanfengchuang: major Internet companies and start-ups mostly research algorithms, while AI data service companies do the engineering.
Now that the giants are all chasing OpenAI, it is time to pay attention to the first step that supports AI deep learning: data.

Here comes the opportunity

Whether or not their business is tied to large models, domestic AI data service companies have recently received a lot of attention.
Haitian AAC, a listed data set company, gained nearly 33% in just 3 trading days at the end of March. Its share price hit a record high, more than triple its level at the start of the year, even though the company had already issued a risk warning: “The natural language business contributes about 10% to the company as a whole.” “The company has not yet cooperated with OpenAI, and its ChatGPT products and services have not yet brought business income to the company.”
Because of ChatGPT, Jia Yuhang, general manager of Yunce Data, has also received attention and inquiries from various industries about large models and data sets in 2023. “Everyone has their own views on large models, and we learn from each other,” he told Nanfengchuang.
From a technical point of view, the large models represented by ChatGPT take a different technical path from earlier AI that depended on data labeling. In the past, mainstream machine learning relied on humans in the loop, that is, on supervised learning.
Supervised learning relies on a large amount of manual data preprocessing and labeling. A picture of a cat, for example, has to be marked by a human in advance, telling the machine in a form it can understand that this is a cat. The generally accepted rule in the industry is that the more labeled data humans supply, and the more accurate it is, the better machine learning performs.
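The idea of learning from human-labeled examples can be sketched in a few lines of Python. This is only an illustration: the two numeric features, the four labeled examples, and the nearest-centroid rule are all hypothetical choices made for this sketch, not anything used by the companies mentioned in this article.

```python
from collections import defaultdict

# Hypothetical toy data: (ear_pointiness, snout_length) features,
# each paired with a label that a human annotator assigned in advance.
labeled_examples = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.3, 0.8), "dog"),
    ((0.2, 0.9), "dog"),
]

def train(examples):
    """Average the feature vectors of each label into one centroid per label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in examples:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {
        label: (s[0] / counts[label], s[1] / counts[label])
        for label, s in sums.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return (centroid[0] - features[0]) ** 2 + (centroid[1] - features[1]) ** 2
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

centroids = train(labeled_examples)
prediction = predict(centroids, (0.85, 0.25))  # an unlabeled, cat-like input
print(prediction)
```

Without the human-assigned labels there would be nothing to average per class, which is why the quantity and accuracy of labeled data matter so much in this setting.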
The large models represented by ChatGPT use self-supervised learning. Simply put, what is tested is the machine’s ability to learn on its own.
Liu Zhiyuan, an associate professor at the Natural Language Processing Laboratory of the Department of Computer Science, Tsinghua University, told Nanfengchuang: “What sets large models apart is that they do not assume in advance which tasks or specific capabilities are needed. They scour the Internet for as much data as possible, so that the model automatically learns knowledge from that data.”
OpenAI has disclosed that its GPT models are trained on data from public websites, including high-quality text such as Wikipedia, professional forums, e-book websites, and media reports.

Although the demand for data labeling has decreased, the success of ChatGPT has offered a more useful lesson: high-quality data sets are crucial for training large AI models. ChatGPT, which is based on GPT-3.5, uses reinforcement learning from human feedback (RLHF), which also involves a great deal of data labeling.
According to disclosures, the RLHF annotation for ChatGPT requires many professionals. To this end, OpenAI specially recruited dozens of doctoral students to annotate the machine’s answers and instructions and to give feedback grounded in human logic. According to Forbes, OpenAI also used outsourcing services, and Alexandr Wang’s Scale AI took part in training ChatGPT.
Zheng Shuliang, co-founder of Lingxin Intelligent, an AI startup backed by Tsinghua University, told Nanfengchuang that the generative AI represented by ChatGPT places higher demands on data quality.
“Every word and every line of dialogue that AI generates is derived by probability from the previous word, or from the question itself,” Zheng Shuliang said.
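A toy version of this word-by-word, probability-driven generation can be sketched as follows. It is a deliberately simplified bigram model over a made-up corpus, an assumption chosen for illustration; real large models condition on far more than the single previous word.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model is trained on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish ."

def build_bigram_counts(text):
    """Count, for every word, which words follow it and how often."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the follow-up counts for one word into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

def generate(counts, start, max_words=5):
    """Greedily append the most probable next word, one word at a time."""
    out = [start]
    for _ in range(max_words):
        probs = next_word_probs(counts, out[-1])
        if not probs:  # the last word never had a follower in the corpus
            break
        out.append(max(probs, key=probs.get))
    return " ".join(out)

counts = build_bigram_counts(corpus)
print(next_word_probs(counts, "the"))
print(generate(counts, "the"))
```

Because every probability comes straight from counts over the corpus, any noise or error in the text feeds directly into what gets generated.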
In this mode, if the data quality is poor, the output is nonsense and the AI is unreliable. Zheng Shuliang said: “Therefore, on the one hand, we need to collect more and more accurate corpora, and on the other hand, we must strengthen the cleaning and labeling of these corpora.”
According to US media reports, behind the highly intelligent ChatGPT is a group of data labelers from Kenya, Africa. They work 9 hours a day, read 150-200 paragraphs of text, flag content involving sex, violence and hate speech, and are paid about 2,500-3,000 yuan a month.
Behind artificial intelligence there is still human effort. Jia Yuhang’s analysis is that, in the long run, the AI data service industry, which is built on sheer manpower, will not change much.
“Since large models arrived, many people think that data labeling, one of the links in AI data services, will shrink,” he said. “But one point is being overlooked. As AI takes on more and more functions, it often moves into uncharted territory, and in those fields manual processing may be required.”
He believes that data annotation will not decrease with the birth of generative large models; “instead there may be more.”

The “Foxconn” of the AI industry

ChatGPT’s breakout popularity brought domestic data set companies not a raging fire, but rain after a long drought.
Chinese data set companies emerged at about the same time as Scale AI, in 2016-2017. The core goal of such companies is to help AI companies minimize the impact of poor-quality data.
Still, very few data companies make a living by selling data sets. Among the leading companies in China’s AI data sector, only Haitian AAC, which is listed on the Science and Technology Innovation Board, explicitly mentions a data set business on its official website. According to the company’s disclosure, years of technical accumulation in speech recognition and synthesis have given it a deep technical moat in multilingual data. As of the first quarter of 2022, Haitian AAC covered 190 languages and had accumulated more than 10 million entries. Its customers include Alibaba, Tencent, Baidu, Microsoft and other major companies.

At the end of April 2023, Alexandr Wang appeared on the cover of Forbes magazine.

Rather than selling data sets, more companies focus on the next step in the data pipeline: data annotation.
Jia Yuhang told Nanfengchuang that the data set business accounts for a very small part of Yunce Data’s business. It mainly applies at the stage when an artificial intelligence product has only just been initiated. “When a project has just been set up or is being previewed, some open-source or industry-based data sets are needed to quickly verify the algorithm.”
Demand from more enterprises surges at a later stage, when AI products enter formal R&D and continuous iteration.
“At this point the relevant sensor or scenario is clear, and data collection, cleaning, and labeling need to be completed for that specific scenario. Therefore, we provide high-quality, scenario-based data labeling and other services,” Jia Yuhang said.

According to statistics from the Qianzhan Industry Research Institute, Chinese data labeling companies began to emerge in 2014 and peaked in 2017, a year that saw 9 financing events related to data annotation.
That figure was not matched in the years that followed.
The data labeling industry at this stage is labor-intensive. According to 36Kr, a senior figure at a data labeling company revealed that each data crowdsourcing platform in the industry has, on average, tens of thousands of workers. Some therefore describe the data labeling industry as “the Foxconn behind artificial intelligence”.
In 2018, the Shanxi Transformation and Comprehensive Reform Demonstration Zone in Taiyuan partnered with Baidu to create what is billed as “the single data labeling base with the largest workforce and output value in the country”. According to Baidu’s disclosure, the base covers more than 10,000 square meters and has boosted at least 200 companies engaged in data services.
The relatively low technical threshold means that data labeling companies are mostly located in small and medium-sized cities. Baidu, for example, has disclosed that its data crowdsourcing platform Baidu Zhongce has sites not only in Taiyuan but also in Linfen, Shanxi; Fengjie, Chongqing; Dazhou, Sichuan; Jiuquan, Gansu; Xinyu, Jiangxi; Lishui, Zhejiang; Qingyuan, Guangdong; Chenzhou, Hunan; and Harbin, Heilongjiang, among other places.
The flip side of being labor-intensive is a low threshold. The 2021 edition of the “National Occupational Skills Standards for Artificial Intelligence Trainers” describes the abilities required for this occupation as “having certain learning ability, expression ability, and calculation ability”, and gives the general education level as “graduated from junior high school”. According to media reports, many data labelers are graduates of technical secondary schools and junior colleges, and the industry also takes in groups such as mothers and retired soldiers.
With the threshold so low, small data labeling workshops have sprung up everywhere.
Unlike Scale AI, which has reached its Series E round and dominates the overseas market, the bulk of China’s data labeling market is held by small companies that operate as studios.
They are called “guilds” and “teams”, and they usually take orders on crowdsourcing platforms or take subcontracted work from third-party intermediaries.
The “guilds” failed to bring prosperity to the data labeling industry. On the contrary, ever-lower labeling prices have intensified competition within the industry.
Since 2017, financing for AI data companies has declined. In 2018, there were only 5 financing rounds for AI data companies, averaging tens of millions each. By 2021, there were only two a year.
Chu Rufeng, CEO of Yingshi Technology, once said in an interview that competition in data labeling in China is fierce, and that the main reason no unicorn giant like Scale AI has emerged is that “there are too many small workshops doing labeling in China, and the market is not concentrated enough”.

Data set companies rise and fall with the AI industry. As with Scale AI, it was the emergence of a large number of autonomous driving companies that brought a turning point for Chinese data labeling companies.
Wu Hequan, an academician of the Chinese Academy of Engineering, once explained: “Intelligent driving requires the car to recognize the road automatically. But if the video is simply fed to the computer, the computer cannot recognize it; the road has to be manually outlined in the video. After the computer has received this information many times, it gradually learns to recognize the road in videos and photos.”
Intelligent driving has created a great deal of demand. Leading domestic data companies, such as Yunce Data, Datatang and Chinchilla Data, have turned to serving car companies.
According to reports, a number of mainstream domestic OEMs, such as Geely, SAIC, and GAC, have increased their investment in autonomous driving data labeling since 2021. By 2022, the investment budgets of these car companies had more than doubled from a base of hundreds of thousands of yuan.
The relevant person in charge at Datatang also said in an interview in 2022: “(Car companies) still have a gap in data demand, and the market is far from saturated. It is a good time.”

Data labeling methods shown on Scale AI’s official website, including 3D, images and more.

Increasingly fierce competition poses real challenges for the data labeling industry, and data annotation companies have broadly begun to transform themselves.
One direction the industry agrees on is the move from labor-intensive work to AI-assisted labeling.
“Human-computer interaction,” Jia Yuhang summed up.
He explained that over the past few years the types and content of data annotation have become more and more complex. “In the earliest face recognition work, you only needed to draw a box around the face to complete the corresponding training. Now you are also required to mark facial key points, expressions, and certain attributes or poses of the face, for example when half of the face is covered.”

Changes in the market demand more advanced data processing capabilities. International data companies, including Scale AI and Appen, are focusing on data annotation platforms and tools. Manfu Technology, a data labeling company in Hangzhou, once told the media: “Scale AI’s platform tools have largely reduced the decisive role that people play, and that has become the key to the company’s competitiveness.”
Jia Yuhang told Nanfengchuang that, with the current emphasis on quality and efficiency, the trend toward AI engineering in data annotation is becoming more and more obvious.
In other words, how to organize people, have them work with machines, and run the AI data processing pipeline efficiently has become the area in which companies are racing to outdo one another.
Beyond AI that serves AI data, the workforce also needs to adapt. Jia Yuhang told Nanfengchuang: “Now, the requirements for labeling personnel are bound to get higher and higher.”
What is missing now, he said, are professionals who understand specific vertical fields. To improve turnaround and reduce error rates, for example, medical data calls for trained medical students. But such people rarely work in the data industry.
In 2019, the data service platform CrowdFlower also conducted a study.
It surveyed around 80 data scientists and found that they spend:
60% of their time organizing and cleaning data;
19% of their time collecting datasets;
9% of their time mining data;
5% of their time on other tasks.
Most of a data scientist’s time goes to data preparation, that is, collecting, cleaning and labeling data. Of those surveyed, 57% said that cleaning and processing data is the most boring and unpleasant task.
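The survey shares quoted above can be tallied in a short sketch. It uses only the four figures given in this article and assumes, for illustration, that “data preparation” corresponds to the cleaning and collecting categories; the quoted figures do not add up to 100%, so the remainder is simply reported as not itemized.

```python
# Shares of working time as quoted in this article (percent).
time_shares = {
    "organizing and cleaning data": 60,
    "collecting datasets": 19,
    "mining data": 9,
    "other tasks": 5,
}

# Assumption for this sketch: data preparation = cleaning + collecting.
preparation = (
    time_shares["organizing and cleaning data"]
    + time_shares["collecting datasets"]
)
listed_total = sum(time_shares.values())
not_itemized = 100 - listed_total

print(f"data preparation: {preparation}%")
print(f"total of listed categories: {listed_total}%")
print(f"not itemized here: {not_itemized}%")
```

Under that assumption, data preparation comes to 79% of working time, which is in line with the roughly 80% share attributed to Cognilytica earlier in the article.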
And now, with the AI boom brought about by ChatGPT, the “most boring and unpleasant” industry is taking off.
