Meta released its largest Llama 3.1 405B model recently and claimed that it beats OpenAI's GPT-4o model in key benchmarks.
It comes with a large context window and can process 128K tokens.
So, in this post, we have compared Llama 3.1 405B vs ChatGPT 4o to evaluate their performance on various reasoning and coding tests.
We have also performed a test to check their memory recall capability.
So, let's not beat around the bush and dive right in!
1. Find the Bigger Number
In the first test, I asked Meta's Llama 3.1 405B and OpenAI's GPT-4o models to find which one is the bigger number: 9.11 or 9.9.
And guess what?
ChatGPT 4o got the answer right and said 9.9 is bigger than 9.11 since the first digit (9) after the decimal point is greater than 1.
I ran the test twice to double-check, and it gave the correct answer again.
On the other hand, Llama 3.1 got it wrong, surprisingly.
I ran the prompt twice on HuggingChat, but it gave an incorrect answer on both runs.
I moved to fireworks.ai to run the prompt again on the Llama 3.1 405B model.
On the first run, it got the answer right, but I re-ran the test just to double-check, and it got the answer wrong again.
Just so you know, out of 5 runs, Llama 3.1 405B got the answer right only once.
It seems Llama 3.1 405B is not consistent when it comes to handling commonsense reasoning questions.
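For reference, the comparison itself is trivial arithmetic. Here is a quick Python sketch (mine, not output from either model) showing the right answer and the likely trap:

```python
# Comparing 9.9 and 9.11 as decimal numbers, not as version strings.
# 9.9 == 9.90, and 0.90 > 0.11, so 9.9 is the bigger number.
print(9.9 > 9.11)  # True

# The likely trap: treating them like software versions,
# where "9.11" comes after "9.9". Correct for versions, wrong for decimals.
print((9, 11) > (9, 9))  # True for version tuples, not for decimal numbers
```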
Winner: ChatGPT 4o
2. Towel Drying Time
In our next test, I threw a tricky question and asked both models to guess the drying time under the sun.
ChatGPT 4o said that drying 20 towels will still take 1 hour, which is correct.
But Llama 3.1 405B started calculating the time mathematically and ended up with 1 hour and 20 minutes, which is wrong.
It seems in this initial "vibe test" at least, Llama 3.1 405B does not look very smart.
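To make the failure mode concrete, here is a small sketch of the two lines of reasoning. I am assuming the prompt stated that 15 towels take 1 hour, which would explain Llama's 1 hour 20 minutes figure; the actual numbers in the prompt may differ:

```python
# Assumption: the prompt said 15 towels dry in 1 hour under the sun.
towels_known, hours_known = 15, 1
towels_asked = 20

# The flawed linear model Llama 3.1 405B appears to have used:
linear_hours = hours_known * towels_asked / towels_known
print(f"Linear scaling: {linear_hours:.2f} hours")  # 1.33 h = 1 h 20 min

# Correct reasoning: towels dry in parallel, so the time stays the same
# (assuming there is room to spread all of them out).
print(f"Parallel drying: {hours_known} hour")
```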
3. Measure the Weight
In this reasoning test, both ChatGPT 4o and Llama 3.1 405B got the answer right.
Both AI models converted the units and said that a kilogram of feathers is heavier than a pound of steel.
In fact, a kilogram of any material will be heavier than a pound of steel or any other material.
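The unit conversion behind the answer is simple enough to verify in a few lines (a quick check of my own, not either model's output):

```python
KG_TO_LB = 2.20462  # 1 kilogram is about 2.2 pounds

feathers_lb = 1 * KG_TO_LB  # a kilogram of feathers, expressed in pounds
steel_lb = 1.0              # a pound of steel

print(feathers_lb > steel_lb)  # True: 2.2 lb of feathers outweighs 1 lb of steel
```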
Winner: ChatGPT 4o and Llama 3.1 405B
4. Locate the Apple
Next, I presented a complex puzzle and asked both AI models to locate the apple.
Well, ChatGPT 4o got it right and clearly said that "The apple would remain in the box on the ground".
On the other hand, Llama 3.1 405B got close and said "onto the ground (or the box, if it's directly below)".
While Llama's answer is correct, ChatGPT 4o's answer has more nuance.
Nevertheless, I am going to give this round to both the models.
5. Stack the Items
After that, I asked both models to stack the following items in a stable manner: a book, 9 eggs, a laptop, a bottle, and a nail.
In this test, both ChatGPT 4o and Llama 3.1 405B got it wrong.
Both models suggested placing the 9 eggs on top of the bottle, which is impossible.
Winner: None
6. Follow User Instructions
As far as following user instructions is concerned, both models are pretty impressive.
The earlier Llama 3 70B model showed great strength in this test, and the larger Llama 3.1 405B follows the same path.
Both ChatGPT 4o and Llama 3.1 405B followed the instructions extremely well and generated 10/10 correct sentences.
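For context, a common form of this test asks for 10 sentences that each end with a specific word; I am assuming the target word was "apple" here. Grading the output can be done mechanically with a few lines like these:

```python
# Hypothetical grader for the instruction-following test; the target
# word and the sample reply below are illustrative, not the real output.
reply = """I bit into a crisp apple.
She painted a still life of an apple.
Nothing pairs with cheddar like an apple."""

target = "apple"
sentences = [s.strip() for s in reply.splitlines() if s.strip()]
correct = sum(s.rstrip(".!?").lower().endswith(target) for s in sentences)
print(f"{correct}/{len(sentences)} sentences end with '{target}'")
```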
7. Find the Needle
The Llama 3.1 405B model comes with a large context window of 128K tokens.
So I threw in a large text having 21K characters and 5K tokens and inserted a needle (a random statement) in between the text.
I asked it to find the needle, and Llama 3.1 405B found it without any issues.
ChatGPT 4o also did a great job and took no time to find the needle.
So for long context memory, both models are remarkable.
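If you want to reproduce this test, the setup is easy to script. Here is a minimal sketch with illustrative filler text and needle (the article's actual text and needle were different):

```python
import random

# Build a long filler text of roughly 21K characters, as in the test above.
filler = "The quick brown fox jumps over the lazy dog. " * 450
needle = "The secret passphrase is 'blue pineapple'. "

# Drop the needle at a random sentence boundary inside the filler.
boundaries = [i + 2 for i in range(len(filler)) if filler.startswith(". ", i)]
pos = random.choice(boundaries)
haystack = filler[:pos] + needle + filler[pos:]

prompt = haystack + "\nWhat is the secret passphrase mentioned above?"
print(f"Haystack length: {len(haystack)} characters")
```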
8. Create a Game
To test the coding ability of both models, I asked them to create a Tetris-like game in Python.
I ran the code generated by Llama 3.1 405B, but couldn't play the game.
The controls were not working at all.
ChatGPT, however, did a splendid job.
It created a complete game in Python with controls, a preview option, a scoring system, colored shapes, and more.
Simply put, in code generation, I feel ChatGPT 4o is much better than the Llama 3.1 405B model.
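For reference, here is roughly what the skeleton of such a game looks like. This is a minimal sketch of my own (requires `pip install pygame`), not the code either model produced; it only shows the core loop and the arrow-key controls that Llama's version got wrong:

```python
import random
import pygame

CELL, COLS, ROWS = 30, 10, 20
pygame.init()
screen = pygame.display.set_mode((COLS * CELL, ROWS * CELL))
pygame.display.set_caption("Falling-block sketch")
clock = pygame.time.Clock()

x, y = COLS // 2, 0  # falling block's position, in grid cells
stack = set()        # cells where blocks have already landed
fall_timer = 0

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN:  # the controls being tested
            if event.key == pygame.K_LEFT and x > 0:
                x -= 1
            elif event.key == pygame.K_RIGHT and x < COLS - 1:
                x += 1

    # Gravity: move the block down one cell every half second.
    fall_timer += clock.tick(60)
    if fall_timer > 500:
        fall_timer = 0
        if y + 1 >= ROWS or (x, y + 1) in stack:
            stack.add((x, y))                 # block lands
            x, y = random.randrange(COLS), 0  # spawn a new block (no game-over logic)
        else:
            y += 1

    screen.fill((20, 20, 20))
    for cx, cy in stack | {(x, y)}:
        pygame.draw.rect(screen, (0, 200, 120),
                         (cx * CELL, cy * CELL, CELL - 1, CELL - 1))
    pygame.display.flip()

pygame.quit()
```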
Llama 3.1 vs ChatGPT 4o: The Verdict
After running the above reasoning tests, it's evident that Llama 3.1 405B doesn't beat ChatGPT 4o at all.
In fact, after having tested multiple models in the past, I can confidently say that Llama 3.1 405B ranks below Claude 3.5 Sonnet and Gemini 1.5 Pro.
Lately, AI companies have been chasing benchmark numbers and trying to outrank the competition based on the MMLU score.
However, in practical tests, they seldom show some spark of intelligence.
Apart from following user instructions and handling long-context memory, which were also the strengths of the older Llama 3 70B model, there is not much else that stands out.
Despite Llama 3.1 405B having 405 billion parameters, its performance is oddly similar to that of Llama 3.1 70B.
Moreover, Llama 3.1 405B is not a multimodal model, as Meta says multimodality isn't ready yet and will be coming sometime in the future.
So, we can't do visual tests on Meta's largest AI model.
To conclude, Llama 3.1 405B is a good addition to the open-source community and can be immensely helpful for fine-tuning, but it doesn't outclass proprietary models yet.