Claude 3.5 has officially launched with a new feature, "Computer Control," that makes it a true intelligent agent: one that can understand user intent and carry out tasks autonomously. The new models, Claude 3.5 Sonnet and Claude 3.5 Haiku, improve significantly in reasoning, knowledge, and programming. This upgrade marks a significant advance in AI's ability to explore and solve problems on its own.
At 11 p.m. here, it's already 8 a.m. on the other side of the ocean.
Claude made a grand entrance.
It brought the upgraded Claude 3.5 Sonnet, the new model Claude 3.5 Haiku, and a brand new feature: computer use, which I translate as "computer control."
Let's break it down one by one.
First, the upgraded Claude 3.5 Sonnet.
Claude's models have always been divided into three sizes: Opus, Sonnet, and Haiku, from large to small.
In March, Anthropic officially launched the Claude 3 series, from Opus down to Haiku.
Then in June, Anthropic released Claude 3.5 Sonnet on its own, with no 3.5 Opus or 3.5 Haiku. See my earlier article: "I tried the just-released Claude 3.5 and found that the strongest part is this new feature."
At that time, Claude 3.5 Sonnet already surpassed the largest model of the previous generation.
Today's release is the upgraded Claude 3.5 Sonnet, along with the new Claude 3.5 Haiku.
Interestingly, Claude 3.5 Haiku is still in post-training and has a knowledge cutoff of July, while the upgraded Claude 3.5 Sonnet keeps the same knowledge cutoff as before and adds more reinforcement-learning synthetic data plus training for "computer control."
In terms of overall performance, Claude 3.5 Sonnet basically beats everything else.
Whether it's reasoning, undergraduate-level knowledge, or programming, Claude ranks No. 1. Moreover, Claude's scores are trustworthy, unlike the many models that game their benchmark scores.
After Claude 3.5 Sonnet launched in June, it was a direct technological leap that sent AI programming tools like Cursor sky-high, and I believe no one doubts Claude's coding abilities anymore.
The most notable benchmark is actually SWE-bench Verified, in the seventh row, which roughly tests the real ability to write code that solves problems. OpenAI introduced this benchmark in August, and this round of Claude 3.5 results adopted it directly. GPT-4o scored 33.2% on this task, and o1's score is unknown.
But as far as Claude is concerned, o1 is some dirty thing it isn't acquainted with.
The new Claude 3.5 Sonnet is now live on the official Claude website.
You can see the new label.
I sent it the simplest possible prompt: "Generate a very polished Tetris game for me."
Then the upgraded Claude 3.5 Sonnet whirred into action.
It generated 280 lines of code in one go, and the game is actually playable as-is.
It can also generate an interactive motion simulator whose parameters can be adjusted at any time, which completely changes how people learn.
It's really cool.
Next is Claude 3.5 Haiku.
There's not much to say about this one, just a regular upgrade, but it's currently the fastest and most cost-effective model.
At the same cost and speed as Claude 3 Haiku, it outright beats Claude 3 Opus, the largest model of the previous generation.
In coding tasks, it even manages to beat the pre-upgrade Claude 3.5 Sonnet, which is really surprising.
All I can say is that Anthropic's reinforcement learning paradigm is remarkably advanced, and the quality of its synthetic data is exceptionally high.
Finally, the most important part: Claude's new "computer use" feature, or computer control.
This one is very sci-fi: it can analyze what is on the user's screen in real time and autonomously carry out tasks such as browsing, clicking, and typing.
Let me share an official demo.
Anthropic describes the "computer control" feature as follows: "Claude 3.5 Sonnet can move the cursor on the computer screen, click on relevant positions, and input information through a virtual keyboard according to the user's commands, simulating the way people interact with their computers." This is a true Agent that can understand user intent and help them achieve it independently.
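To make the "move the cursor, click, type" idea concrete, here is a toy sketch. It is not Anthropic's API: `FakeScreen`, `run_actions`, and the action dictionaries are all hypothetical names I made up for illustration, and the "screen" is just an object that records what happened to it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FakeScreen:
    """Toy stand-in for a desktop (hypothetical, not a real screen)."""
    cursor: Tuple[int, int] = (0, 0)
    focused_at: Optional[Tuple[int, int]] = None
    # Maps the position that was clicked to the text typed there.
    typed: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def apply(self, action: dict) -> None:
        """Apply one action: move the cursor, click, or type text."""
        kind = action["type"]
        if kind == "move":
            self.cursor = (action["x"], action["y"])
        elif kind == "click":
            # Clicking focuses whatever is under the cursor.
            self.focused_at = self.cursor
        elif kind == "type":
            if self.focused_at is None:
                raise ValueError("nothing is focused; click somewhere first")
            previous = self.typed.get(self.focused_at, "")
            self.typed[self.focused_at] = previous + action["text"]
        else:
            raise ValueError(f"unknown action type: {kind}")


def run_actions(screen: FakeScreen, actions: List[dict]) -> FakeScreen:
    """Apply the actions in order, the way an agent would act step by step."""
    for action in actions:
        screen.apply(action)
    return screen


# A made-up action list a model might emit for "search for coffee".
actions = [
    {"type": "move", "x": 400, "y": 60},
    {"type": "click"},
    {"type": "type", "text": "coffee"},
]
screen = run_actions(FakeScreen(), actions)
print(screen.typed)
```

In a real setup the action list would come from the model after it looks at a screenshot, and each action would drive an actual mouse and keyboard rather than a Python object.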
Previous Agents, to be honest, looked more like RPA, following preset workflows step by step. But what should a real Agent look like?
In my opinion, it should be like a person: able to understand your complex semantics and translate that complexity into actionable steps. For example, suppose I say right now, "It's 3:30 in the morning and I'm too tired, but the article isn't finished yet. Can you check if there's a coffee shop nearby? If there is, please buy me a cup. If not, never mind."
A person would definitely open Meituan or Ele.me, check whether any nearby coffee shop is open, and see whether it has my favorite iced Americano. If not, they would ask me what I want instead, then place the order and wait for delivery.
If every coffee shop is closed at 3:30, it should also tell me: "There's nowhere nearby to buy one, buddy. You'll have to hang in there a bit longer; you'll be able to sleep soon."
This is AI. This is the coolest kind of AI assistant, one that can enter the lives of ordinary people.
And this kind of AI assistant inevitably needs to learn how to operate a phone or a computer.
We need to teach AI not only to write articles and draw pictures, but also to operate devices.
Only then can it have a strong ability to explore independently and solve problems.
And the upgraded Claude 3.5, after training on some simple software, can operate software that is not too complex, and can even correct itself and keep trying. Isn't this a form of reinforcement learning, of self-play?
Anthropic is really taking self-play to the extreme.
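The "self-correct and keep trying" behavior can be sketched as a plain try, check, retry loop. This is only my illustration of the behavior at run time, not a description of how Anthropic trains the model; `attempt_with_retries`, `propose`, and `check` are hypothetical names, and the menu labels are made up.

```python
from typing import Callable, List, Optional


def attempt_with_retries(
    propose: Callable[[List[str]], str],
    check: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try, observe, retry: every failed attempt is fed back to `propose`."""
    failures: List[str] = []
    for _ in range(max_attempts):
        candidate = propose(failures)
        if check(candidate):
            return candidate
        # Remember what did not work so the next proposal can avoid it.
        failures.append(candidate)
    return None  # Gave up after max_attempts.


def propose(failures: List[str]) -> str:
    """Toy policy: pick the first menu label that has not failed yet."""
    options = ["File", "Edit", "Settings"]
    for option in options:
        if option not in failures:
            return option
    return options[-1]


def check(candidate: str) -> bool:
    """Toy check: pretend the task succeeds only under 'Settings'."""
    return candidate == "Settings"


result = attempt_with_retries(propose, check)
print(result)
```

With a budget of three attempts the toy policy gets there on the last try; with only two it gives up and returns `None`, which is roughly what a low success rate feels like in practice.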
On OSWorld, a benchmark that tests how well models can use computers, Claude currently scores 14.9%.
Human performance is usually 70-75%, so there is still a big gap and some way to go, but this is already far higher than the 7.7% scored by the best of the other AI models currently available.
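To put those numbers side by side, here is a quick back-of-the-envelope calculation. It takes the figures quoted above at face value; the variable names are mine.

```python
# OSWorld figures as quoted in this article (percent of tasks solved).
claude_score = 14.9
other_best_score = 7.7
human_low, human_high = 70.0, 75.0

# How many times higher Claude's score is than the best other model's.
ratio = claude_score / other_best_score

# How many percentage points Claude still trails human performance by.
gap_low = human_low - claude_score
gap_high = human_high - claude_score

print(f"{ratio:.2f}x the best other model")
print(f"{gap_low:.1f} to {gap_high:.1f} points behind humans")
```

So Claude is at roughly twice the score of the runner-up, yet still more than 55 percentage points short of a typical human.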
However, this feature is not yet available to ordinary users; it is only open to developers through the API. Anthropic is still in the early testing phase and is wary of potential risks, so it is asking developers to help test it first.
We also spent a long time integrating the API and conducting some simple tests.
First, we installed something like a simulated system, a sandbox in which all actions run. Anthropic is evidently still concerned about irreversible damage to your real system. I tested many cases, but to be honest, first, it is really slow, and second, the success rate is indeed a bit low.
For example, this case: "Open the Taobao website, find the official flagship store of Xiaomi phones, find a phone around 2000 RMB, and add it to the shopping cart."
Actually, it's not that difficult, to be honest.
But Claude messed up, and the funny part is that the mistake was in typing the store name. The store is clearly called "Xiaomi Official Flagship Store" (小米官方旗舰店), but it insisted on typing "方店" (fangdian). Then it tried again, and this time it did not even manage two characters, just the single character "舰" (jian). It would be a miracle if it found the store like that...
And I had already sped the video up to 2x, so you can feel how slow it is...
However, when asked to play 2048, it was very happy. This time the video is at 3x speed.
It played quite well. I feel like it could keep playing by itself forever. It's quite interesting.
Of course, it can also do some practical things, like installing an ad-blocking plugin in my browser.
It had actually memorized the plugin's address: it typed the address in directly, searched for the plugin, and installed it in one go.
Take off.
The overall success rate on these tasks is still fairly mediocre, but that's okay. After all, Anthropic itself also says the success rate is not that high.
And, this is only the first generation.
Anthropic firmly believes that adapting the model to the tools is inevitable, and that the model can be integrated into the environments we use every day, becoming part of our daily lives.
Its goal is for Claude to use existing computer software just like a human. Just like a human.
It's great. I hope this vision can be achieved in the near future.
I really, really want to have one of my own.
Jarvis