Artificial intelligence is ushering us into a new era, and this is true along several dimensions.
The first is that a concept called “AI-Generated Content” (AIGC) has begun to gain recognition, distinct from the earlier User-Generated Content (UGC) and the still earlier Professionally-Generated Content (PGC). The shift marks not only a change in who produces content, but also a transfer of production capacity and publishing power: from the individual who symbolized “content democratization” to the “super individual” skilled at AI-assisted creation, and even to AI itself working independently. At the same time, it means the territory where AI can work its magic has crossed a dividing line, from the “discriminative domain” into the “generative domain”.
In the past, AI was considered good only at discriminative work: judging whether the face in a picture belongs to a specific person, whether an email from an unknown address is spam, whether an article shared on a social network carries negative sentiment, or whether the flicker in front of a self-driving car is a real person to be avoided or merely a tree’s shadow that can be ignored.
Two text-to-image products launched in 2022 changed this prejudice about AI’s capabilities. One was DALL·E 2, released by OpenAI, the Silicon Valley startup now known for ChatGPT; the other was Stable Diffusion, from London-based startup Stability AI. The quality of their image generation let the industry see commercial viability for the first time. Previously, the best image-generation tool was the Generative Adversarial Network (GAN), which could only generate one specific kind of image, such as human faces; ask it for a puppy instead and it would fail and need retraining. DALL·E 2 and Stable Diffusion have no such limitation.
The last AI technology that showed the industry commercial potential and went on to great success was image recognition. In 2015, deep-learning-based computer-vision algorithms surpassed human accuracy on the ImageNet database for the first time. Since then, face recognition has rapidly replaced numeric passwords as the newest form of identity; self-checkout systems that recognize merchandise have entered all kinds of offline stores; even autonomous driving, which puts safety first, relies on AI visual judgment.
Evolution of AI technology
The business prospects of Stable Diffusion and DALL·E 2 are beyond doubt, but in opening the new era of AI they are at best heralds. ChatGPT is the protagonist, because only it solves the problem of language, or at least appears to.
Solving the language problem means a new revolution in interaction, which is the second sense in which AI is entering a new era.
After trying ChatGPT, science-fiction writer Ted Chiang compared the model behind it (GPT) to a “lossy compression” of the Internet: in learning the statistical regularities of all the text online, it effectively becomes a compressed copy of the Internet’s information. Some information is lost, but not too much, and, importantly, the file we need to store is much smaller. If aliens attacked and the Internet were destroyed, then as long as GPT survived, we could in theory recover everything once stored online simply by asking it.
In fact, there is no need to fantasize about an alien invasion; the day Ted Chiang imagined may not be far off. When people can communicate with machines in natural language, and the machine not only understands but can converse and act on what is said (answer questions, draw a picture, create a video, generate a game, then revise according to feedback until the person who made the request is satisfied), it is worth reconsidering whether everyone still needs to install so many applications on their computers and phones. Perhaps a single ChatGPT is enough.
By now you should have a good sense of what ChatGPT, or AIGC more broadly, means, and you have probably heard plenty of praise from the industry. Bill Gates believes the AI revolution is no less important than the birth of the Internet, and Microsoft CEO Satya Nadella has said this wave of technology diffusion is comparable to the Industrial Revolution.
We will stop here rather than dwell on the industrial revolution that generative AI, ChatGPT included, may set off; the next few articles will take that up from different angles. Here we step back behind AIGC, and ChatGPT in particular, to see what cornerstone these newest AI stars are standing on.
The Power of Transformers
After ChatGPT was released, members of the OpenAI team said in interviews that the public enthusiasm surprised them, because “most of the technology behind ChatGPT is not new.” The statement is true, and outside observers reached similar conclusions: ChatGPT is a new era of “alchemy,” stitching together a statistical language model with reinforcement learning from human feedback, then adding whatever corpus is available and an empirically chosen number of neural-network layers, and refining the mixture.
But compared with AI models before 2018, there is at least one genuinely new thing about the GPT behind ChatGPT: its perspective on the problem of language.
The next word a person will say is often the statistically most likely next word. This idea has long existed in linguistics, but this was the first time it was developed into a conversational language model. Before that, nearly every robot claiming to converse in natural language, from Baidu’s Xiaodu to Microsoft’s XiaoIce, from Amazon’s Alexa to Apple’s Siri, and even Sophia, who was granted Saudi Arabian citizenship, was essentially a query system built on search trees. The field of natural language processing (NLP) was likewise divided into dozens of tasks such as text classification, machine translation, and reading comprehension, each corresponding to one or several dedicated models.
Behind these seemingly different problems lies the same problem. For example, if a conversational bot is “smart enough” to predict the next word of a movie review, it can also perform a simple positive-or-negative classification, becoming a movie-review classifier.
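The idea can be made concrete with a toy next-word predictor. This is a minimal sketch, not how GPT works internally: it uses simple bigram counts over a tiny invented corpus, and the “classifier” simply reads off which sentiment word the language model predicts next. All names and data here are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Tiny invented corpus of review fragments ending in a sentiment word.
corpus = [
    "this movie was good",
    "this movie was bad",
    "the acting was good",
    "the plot was bad",
    "the film was good",
]

# Count bigrams: how often each word follows the previous one.
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word after `word`."""
    return bigrams[word].most_common(1)[0][0]

def classify(review_prefix):
    """A 'free' sentiment classifier: whichever of good/bad the
    language model thinks is more likely to come next wins."""
    counts = bigrams[review_prefix.split()[-1]]
    return "positive" if counts["good"] >= counts["bad"] else "negative"
```

A model trained on real reviews would make `classify("this movie was")` track genuine sentiment statistics; here the point is only that next-word prediction and classification are the same computation viewed from two angles.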
The key that unlocked all of this is the Transformer, introduced by the Google Brain team in a 2017 paper. Every GPT model is built on this architecture. As it runs, it computes how each word depends on all the words previously input and generated (this is usually called the “self-attention mechanism”). In the latest release, GPT-4, the model can attend to as many as 24,576 words at once.
The Transformer treats the data inside a language as interdependent over long ranges. What it does is convert the “internal dependencies” of existing text into future text; that is, it “generates”.
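The self-attention computation mentioned above can be sketched in a few lines of NumPy. This is a minimal single-head version with randomly initialized weights, purely to show the shape of the mechanism: every position scores its dependency on every other position, and the scores (after a softmax) weight a mixture of value vectors. The dimensions and matrix names are illustrative assumptions.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: each output vector is a weighted
    mix of all positions, weighted by pairwise dependency scores."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise dependencies
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 5, 8                                    # 5 "words", 8-dim embeddings
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # one context-mixed vector per word
```

In a real Transformer the weight matrices are learned, many heads run in parallel, and the block is stacked dozens of times, but the core “every word attends to every other word” operation is exactly this.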
The fundamental elements of information are interdependent and therefore predictable. This perspective on language was later applied to images: in 2021, the Google Brain team introduced a model called the Vision Transformer (ViT), which recognizes images by computing the dependencies among patches of pixels within the same image.
Before that, language and vision were seen as different things: language is linear, sequential data, while vision is spatially structured, parallel data. But the Transformer proved that a picture can also be handled as a sequence problem. A picture is a sentence written in pixels.
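The “picture as sentence” trick is mechanically simple. A minimal sketch of the ViT-style preprocessing step, with an assumed toy image and patch size: cut the image into non-overlapping squares and flatten each square into a vector, so a 2-D image becomes a 1-D sequence of patch “tokens” that a Transformer can read like words.

```python
import numpy as np

def image_to_patch_sequence(img, patch):
    """Cut an H x W image into non-overlapping patch x patch squares,
    flattening each into a vector: a 2-D image becomes a 1-D
    'sentence' of patch tokens."""
    H, W = img.shape
    tokens = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            tokens.append(img[i:i + patch, j:j + patch].ravel())
    return np.stack(tokens)

img = np.arange(16 * 16, dtype=float).reshape(16, 16)  # toy 16x16 "image"
seq = image_to_patch_sequence(img, patch=4)            # 16 tokens of 16 values each
```

In the actual ViT each flattened patch is then linearly projected into an embedding and given a positional encoding, after which the self-attention machinery runs unchanged.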
Not only pictures: most problems can be recast as sequence problems. Do not underestimate this shift in thinking. AlphaFold, which DeepMind released in 2018, can predict protein structure by learning from amino-acid sequences, and the attention mechanism at the heart of the Transformer underpins the architecture behind it.
The Value of Language
Language is the holy grail of human intelligence, and the same holds for artificial intelligence. However hot the term AIGC is now, before ChatGPT solved the language problem, people’s attitude toward AIGC was no different from their earlier attitude toward the metaverse: enthusiastic, but skeptical. At least in the AIGC wave before the end of 2022, no one was mentioning the term Artificial General Intelligence (AGI).
Whether you call it “emergence” or “qualitative change,” ChatGPT proves that machines can get more out of language than we expected. First, it shows that the appearance of reasoning can be imitated by “seeing enough.” Claiming that ChatGPT understands is of course an illusion; we know it merely infers from statistical associations. But “really thinking” and “acting exactly as if thinking” are sometimes separated only by a philosophical distinction.
Second, the “Chain-of-Thought prompting” (CoT) technique built on “Let’s think step by step” shows that the more logically language is used, the more correct things a machine can learn from it; this is not just a word game. An example Amazon used in its CoT-related paper: show the AI a picture of biscuits and fries, then ask what the two have in common, with two options: A, both are soft; B, both are salty. During training, the engineers do not teach the AI a bare association such as “choose A” or “choose B.” Instead they train it to generate a logically complete passage: state the properties of each item. Fries are salty, and some biscuits are salty too; fries deform when squeezed, so fries are soft, while biscuits do not deform when squeezed, so biscuits are not soft. Therefore what fries and biscuits have in common is that both are salty, and the answer is B.
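A chain-of-thought training target for the example above would look roughly like the string below. This is a hypothetical reconstruction of the kind of text the model is trained to emit, not the actual prompt from Amazon’s paper; the wording is an assumption.

```python
# Hypothetical chain-of-thought target, mirroring the biscuits/fries example.
cot_prompt = """Q: What do biscuits and french fries have in common?
Options: (A) both are soft  (B) both are salty
A: Let's think step by step.
French fries taste salty, and some biscuits are salty too.
French fries bend when squeezed, so they are soft;
biscuits snap rather than bend, so they are not soft.
Therefore the shared property is saltiness. The answer is (B)."""
```

The crucial design choice is that the supervision signal is the whole reasoning passage, not the bare label “B”; the model is rewarded for producing the intermediate steps that make the answer follow.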
Multimodal Generative Models
Data source: Organized according to public information
You have probably seen this kind of step-by-step decomposition many times in ChatGPT’s answers. It rests on prompt engineers having broken down enough questions in this way. Teaching that leaps over logical steps often loses students, while rigorous, step-by-step problem-solving lets children generalize from a single example. The same holds for an AI that learns from human language.
Linguistic ability is itself intelligence, but language also carries far more intelligence than linguists previously estimated, from reasoning to mathematics. If you think of each AI model as a student, most earlier AIs could only learn from data with limited information and intelligence, such as product pictures, faces, or traffic lights; even when they did text recognition or translation, the text was to them just another picture or signal. A Transformer-based language model such as GPT is the first to learn directly from the internal structure of language. Whatever language contains, be it geometry, color, taste, speed, or emotion, a model like GPT can learn it, given time and the right education (a better prompt, say). The only limit is what language does not contain.
Distance from AGI
ChatGPT and the Transformer let people glimpse the hope of artificial general intelligence at two levels: user experience and algorithm. Especially after the launch of the multimodal GPT-4, AI seems to have become a genuine all-round assistant, at least on the Internet: it understands people’s natural language; it can write meeting summaries, build slide decks, analyze the stock market, draft advertising copy, and compose novels; it can keep revising an image according to your feedback, and can even turn a rough sketch into web-page code in one step. It seems only a matter of time before AIs begin talking to each other in human language.
Note, however, that none of this means AGI has arrived. All Transformer-based large language models (LLMs) are still, in essence, word games. They cannot eliminate the factual errors inherent in generative techniques, nor can they acquire all of logical reasoning from language alone, for example the counterfactual reasoning of “what if something had not happened.”
When it comes to the world beyond text, how well Transformer-based multimodal models such as GPT-4 can convert information between modalities is also an open question. Conversion among pictures, text, sound, touch, smell, and so on used to be a black box for AI. In an e-commerce setting, how well a generated caption fits a product photo depends on the quality of the paired image-text corpus used in training; once the picture is novel, whether a fitting description can be produced is doubtful. Although paired learning between modalities has been pushed down to the pixel level, the Transformer still cannot convert between text and images well enough.
Protein Prediction (Sequence Generation) Models
Data source: Organized according to public information
Ask ChatGPT to make a poster: you supply the title, some of the copy, and the desired style. What it generates may look like a poster, but none of the “words” on it are real words; they are monsters of jumbled strokes. The reason is simple: when ChatGPT makes a poster, it enters a drawing mode of understanding and moving pixels, not the language mode of emitting text. One hopeful thought is that with more training it might learn to write Chinese characters, since Chinese calligraphy and painting share a common origin; English, being phonetic, may not come so easily.
In any case, an AI that turns text into monsters is hardly AGI. So the Transformer is not necessarily the future of AI.
Alongside the Transformer, another algorithm being chased is the diffusion model. The current star products in text-to-image, from Stability AI’s Stable Diffusion to Google’s Imagen, as well as Midjourney, the product that shares its company’s name, are built not on the Transformer but on diffusion models. (Google’s Parti is the exception: it generates images autoregressively, Transformer-style.)
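The core idea of a diffusion model can be sketched from its forward half: repeatedly blend an image with Gaussian noise until only noise remains, then train a network to run the process in reverse. Below is a minimal sketch of only the forward (noising) process, with an assumed toy image and noise schedule; the learned denoiser that makes generation possible is omitted.

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Forward (noising) half of a diffusion model: at each step,
    shrink the signal slightly and add Gaussian noise, so the image
    gradually dissolves into pure noise. Generation reverses this
    chain with a learned denoising network (not shown)."""
    x = x0.copy()
    for beta in betas:
        noise = rng.normal(size=x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                  # a trivial 8x8 "image"
betas = np.linspace(1e-4, 0.5, 50)    # illustrative noise schedule
xT = forward_diffuse(x0, betas, rng)  # after 50 steps: mostly noise
```

The design reason this matters for generation: because each forward step is a small, known corruption, a network can plausibly learn to undo one step at a time, which is an easier target than producing a full image in one shot.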
In early March, two scholars in Japan applied a diffusion model to functional magnetic resonance imaging (fMRI) data and reconstructed the visual images those data contained, a preliminary indication that the diffusion model, rather than the Transformer, has biological plausibility.
“Humans are not like today’s AI systems, with a generative system on one side and a separate discriminative system on the other. Humans have a single closed-loop system: they build a ‘world model’ internally and then make predictions about every problem,” said Ma Yi, director of the Musketeers Foundation Institute of Data Science at the University of Hong Kong, at an online forum in March. As early as 1950, when Turing first proposed using random questions to judge whether a machine could answer like a human, “whether it is like a human” was already the standard for measuring machine intelligence, and that standard will never be outdated.