Anthropic released its latest Claude 3.5 Sonnet model recently and claims that it beats ChatGPT 4o and Gemini 1.5 Pro on multiple benchmarks.
So to test that claim, we have come up with this detailed comparison.
Just like our earlier comparison between Claude 3 Opus, GPT-4, and Gemini 1.5 Pro, we have evaluated the reasoning capability, multimodal reasoning, code generation, and more.
On that note, let's begin.
1. Calculate the Drying Time
Although it seems like a basic question, I always start my testing with this tricky reasoning question.
LLMs tend to get it wrong often.
Claude 3.5 Sonnet made the same mistake and approached the question using math.
The model said it would take 1 hour 20 minutes to dry 20 towels, which is wrong.
ChatGPT 4o and Gemini 1.5 Pro got the answer right, saying it would still take 1 hour to dry 20 towels.
Winner: ChatGPT 4o and Gemini 1.5 Pro
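The point of the puzzle is that towels dry in parallel, so the total time does not scale with the towel count. A minimal sketch of that reasoning, assuming all towels fit on the rack at once:

```python
# Towel-drying puzzle: drying happens in parallel, so the time is
# independent of the number of towels (assuming the rack holds them all).
def drying_time(num_towels: int, time_per_batch_min: int = 60) -> int:
    """Return the drying time in minutes; the count does not matter."""
    return time_per_batch_min

print(drying_time(5))   # 60 minutes
print(drying_time(20))  # still 60 minutes, not 1 hour 20 minutes
```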
2. Evaluate the Weights
Next, in this classic reasoning question, I am glad to report that all three models, including Claude 3.5 Sonnet, ChatGPT 4o, and Gemini 1.5 Pro, got the answer right.
A kilogram of feathers, or of anything, will always be heavier than a pound of steel or any other material.
Winner: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
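The underlying arithmetic is simple: a kilogram is a larger unit of mass than a pound (1 lb ≈ 0.4536 kg), so the material being weighed never matters. A quick check:

```python
# A kilogram is always heavier than a pound: 1 lb is about 0.4536 kg.
KG_PER_LB = 0.45359237

def heavier(mass_a_kg: float, mass_b_lb: float) -> str:
    """Compare a mass in kilograms against a mass in pounds."""
    mass_b_kg = mass_b_lb * KG_PER_LB
    return "A" if mass_a_kg > mass_b_kg else "B"

print(heavier(1.0, 1.0))  # "A": 1 kg of feathers outweighs 1 lb of steel
```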
3. Word Puzzle
In the next reasoning test, Claude 3.5 Sonnet correctly answered that David has no brother, and that he is the only brother among the siblings.
ChatGPT 4o and Gemini 1.5 Pro also got the answer right.
Winner: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
4. Stack the Items
After that, I asked all three models to stack a set of items in a stable manner.
Sadly, all three got it wrong.
The models took an identical approach: first place the laptop, then the book, next the bottle, and then 9 eggs on top of the bottle, which is impossible.
For your information, the older GPT-4 model got the answer right.
Winner: None
5. Follow User Instructions
In its blog post, Anthropic notes that Claude 3.5 Sonnet is excellent at following instructions, and it seems to be true.
It generated all 10 sentences ending with the word "AI".
ChatGPT 4o also got it right, scoring 10/10.
However, Gemini 1.5 Pro could only generate 5 such sentences out of 10.
Google has to train the model for better instruction following.
Winner: Claude 3.5 Sonnet and ChatGPT 4o
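The instruction here was to produce 10 sentences that each end with the word "AI". A small grading script (a hypothetical helper, not part of the original test) can score a model's output automatically:

```python
# Count how many sentences end with the word "AI", ignoring trailing
# punctuation such as periods or exclamation marks.
def ends_with_ai(sentences):
    return sum(1 for s in sentences if s.rstrip(" .!?").endswith("AI"))

sample = [
    "The future of work will be shaped by AI.",
    "Researchers keep pushing the frontier of AI.",
    "Some startups build everything around AI.",
]
print(f"{ends_with_ai(sample)}/{len(sample)} sentences end with 'AI'")
```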
6. Find the Needle
Anthropic has been one of the first companies to offer a large context length, going from a 100K token context to now a 200K context window.
So for this test, I fed in a large text having 25K characters and about 6K tokens.
I added a needle somewhere in the middle.
I asked about the needle to all three models, but only Claude 3.5 Sonnet was able to find the out-of-place statement.
ChatGPT 4o and Gemini 1.5 Pro couldn't locate the needle.
So for processing large documents, I think Claude 3.5 Sonnet is a better model.
Winner: Claude 3.5 Sonnet
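A needle-in-a-haystack test like this is easy to reproduce yourself. The sketch below (with hypothetical filler and needle text) buries one out-of-place sentence in the middle of a long document, which you can then paste into a prompt and query:

```python
# Build a needle-in-a-haystack prompt: bury one out-of-place sentence
# in the middle of a long filler document.
def build_haystack(filler_sentence: str, needle: str, total_sentences: int = 400) -> str:
    sentences = [filler_sentence] * total_sentences
    sentences.insert(total_sentences // 2, needle)  # bury the needle mid-document
    return " ".join(sentences)

doc = build_haystack(
    "The quick brown fox jumps over the lazy dog.",
    "The secret passphrase is 'blue-banana-42'.",
)
print(len(doc), "characters; needle present:", "blue-banana-42" in doc)
```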
7. Vision Test
To test the vision capability, I uploaded an image of illegible handwriting to see how well the models can find characters and extract them.
To my surprise, all three models did a great job and correctly identified the text.
As far as OCR is concerned, all three models are quite capable.
Winner: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
8. Create a Mystery Game
Finally, we come to the last test.
In this test, I uploaded an image of the classic Tetris game without revealing the name and simply asked the models to create a game like this in Python.
Well, all three models correctly guessed the game, but only Sonnet's generated code ran successfully.
Both ChatGPT 4o and Gemini 1.5 Pro failed to generate bug-free code.
In one shot, the game ran successfully using Sonnet's code.
I just had to install the pygame library.
Many programmers use ChatGPT 4o for coding help, but it appears that Anthropic's model may become the new favorite among programmers.
Claude 3.5 Sonnet has scored 92% on the HumanEval benchmark, which evaluates coding ability.
In this benchmark, GPT-4o stands at 90.2% and Gemini 1.5 Pro at 84.1%.
Clearly, for coding, there is a new SOTA model in town, and it's the Claude 3.5 Sonnet model.
Winner: Claude 3.5 Sonnet
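For context, HumanEval scores a model by executing its generated functions against unit tests and counting the problems where every test passes. A simplified sketch of that pass/fail check (the real harness sandboxes execution and runs many problems; the `add` example here is purely illustrative):

```python
# HumanEval-style grading: run generated code against unit tests and
# report whether every assertion passes.
def passes_tests(generated_code: str, test_code: str) -> bool:
    scope = {}
    try:
        exec(generated_code, scope)  # define the candidate function
        exec(test_code, scope)       # run the hidden unit tests
        return True
    except Exception:
        return False                 # any error or failed assert counts as a miss

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True
```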
Verdict
After running various tests on all three models, I feel that Claude 3.5 Sonnet is as good as the ChatGPT 4o model, if not better.
In coding particularly, Anthropic's new model is seriously impressive.
The remarkable thing is that the latest Sonnet model is not even the biggest model from Anthropic yet.
The company says Claude 3.5 Opus is arriving later this year, which should do even better.
Google's Gemini 1.5 Pro also did better than in our earlier tests, which means it has been improved significantly.
Overall, I would say that OpenAI is not the only AI lab doing great work in the LLM space.
Anthropic's Claude 3.5 Sonnet is a testament to that fact.