Meta released its largest Llama 3.1 405B model recently and claims that it beats OpenAI's GPT-4o model in key benchmarks.

It comes with a large context window and can process 128K tokens.

So, in this post, we have pitted Llama 3.1 405B against ChatGPT 4o to evaluate their performance on various reasoning and coding tests.


We have also performed a test to check their memory recall capability.

So, let's not beat around the bush and dive right in!

1. Commonsense Reasoning Test

In the first test, I asked Meta's Llama 3.1 405B and OpenAI's GPT-4o models to find which one is the bigger number: 9.11 or 9.9.

And guess what?

ChatGPT 4o got the answer right and said that 9.9 is bigger than 9.11, since the first digit (9) after the decimal is greater than 1.
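Just to sanity-check the arithmetic on our side (this is purely my own illustration, not output from either model), a quick comparison in Python confirms the ordering:

```python
from decimal import Decimal

# Compare the two values as exact decimals rather than binary floats
a, b = Decimal("9.11"), Decimal("9.9")

print(b > a)      # True: 9.9 is the bigger number
print(max(a, b))  # 9.9
```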


I ran the test twice to double-check, and it gave the correct answer again.

On the other hand, Llama 3.1 got it wrong, surprisingly.

I ran the prompt twice on HuggingChat, but it gave an incorrect answer on both runs.


I then moved to fireworks.ai to run the prompt again on the Llama 3.1 405B model.

On the first run, it got the answer right, but I re-ran the test just to double-check, and it got the answer wrong again.

Just so you know, out of 5 runs, Llama 3.1 405B got the answer right only once.


It seems Llama 3.1 405B is not consistent when it comes to handling commonsense reasoning questions.

Winner: ChatGPT 4o

2. Calculate the Towel Drying Time

In our next test, I threw in a tricky question and asked both models to calculate the towel drying time under the sun.


ChatGPT 4o said that drying 20 towels will still take 1 hour, which is correct.

But Llama 3.1 405B started calculating the time mathematically and ended up with 1 hour and 20 minutes, which is wrong.
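To make the mistake concrete: towels dry in parallel under the sun, so the time does not scale with the count. Here is a small sketch of the two lines of reasoning (my own illustration using a hypothetical baseline of 15 towels per hour, not the exact wording of my prompt):

```python
def naive_drying_time(towels, base_towels=15, base_minutes=60):
    # Flawed approach: scale the drying time linearly with the number of towels
    return base_minutes * towels / base_towels

def actual_drying_time(towels, base_minutes=60):
    # Correct approach: all towels dry in parallel, so the time stays constant
    return base_minutes

print(naive_drying_time(20))   # 80.0 minutes, i.e. a "1 hour and 20 minutes" style of answer
print(actual_drying_time(20))  # 60 minutes, matching ChatGPT 4o's answer
```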

It seems that in these initial "vibe tests" at least, Llama 3.1 405B does not look very smart.


3. Evaluate the Weights

In this reasoning test, both ChatGPT 4o and Llama 3.1 405B got the answer right.

Both AI models converted the units and said that a kilogram of feathers is heavier than a pound of steel.


In fact, a kilogram of any material will be heavier than a pound of steel or any other material.
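The unit math behind this is simple enough to verify in a couple of lines (again, my own sketch rather than either model's output):

```python
KG_TO_LB = 2.20462  # pounds in one kilogram

feathers_lb = 1 * KG_TO_LB  # a kilogram of feathers, expressed in pounds
steel_lb = 1.0              # a pound of steel

print(feathers_lb > steel_lb)  # True: ~2.2 lb of feathers outweighs 1 lb of steel
```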

Winner: ChatGPT 4o and Llama 3.1 405B

4. Locate the Apple

Next, I presented a complex puzzle and asked both AI models to locate the apple.


Well, ChatGPT 4o got it right and clearly said that "The apple would remain in the box on the ground".

On the other hand, Llama 3.1 405B got close and said "onto the ground (or the box, if it's directly below)".

While Llama's answer is correct, ChatGPT 4o's answer has more nuance.


Nevertheless, I am going to give this round to both models.

5. Stack the Items

After that, I asked both models to stack the following items in a stable manner: a book, 9 eggs, a laptop, a bottle, and a nail.


In this test, both ChatGPT 4o and Llama 3.1 405B got it wrong.

Both models suggested placing the 9 eggs on top of the bottle, which is impossible.

Winner: None

As far as following user instructions is concerned, both models are fairly impressive.


The earlier Llama 3 70B model showed great strength in this area, and the larger Llama 3.1 405B follows suit.

Both ChatGPT 4o and Llama 3.1 405B followed the instructions extremely well and generated 10/10 correct sentences.

6. Find the Needle

The Llama 3.1 405B model comes with a large context window of 128K tokens.

So I threw in a large text having 21K characters and 5K tokens, and inserted a needle (a random statement) in between the text.
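If you want to reproduce this kind of needle-in-a-haystack test, the setup is simple; the sketch below is a rough illustration of how such a prompt can be assembled (the file name and the needle sentence are placeholders, not the exact text I used):

```python
import random

# Placeholder inputs: a long filler document and a "needle" statement to hide inside it
haystack = open("long_text.txt", encoding="utf-8").read()  # e.g. ~21K characters of filler
needle = "The secret passphrase is blueberry-42."          # hypothetical random statement

# Insert the needle at a random paragraph boundary
paragraphs = haystack.split("\n\n")
position = random.randint(0, len(paragraphs))
paragraphs.insert(position, needle)

# The final prompt asks the model to retrieve the hidden statement
prompt = "\n\n".join(paragraphs) + (
    "\n\nOne sentence in the text above does not belong. "
    "Find it and repeat it word for word."
)
print(f"Needle hidden at paragraph index {position} of {len(paragraphs)}")
```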

I asked it to find the needle, and Llama 3.1 405B found it without any issues.

ChatGPT 4o also did a great job and took no time to find the needle.

So for long-context memory recall, both models are remarkable.

7. Create a Game

To test the coding ability of both models, I asked them to create a Tetris-like game in Python.

I ran the code generated by Llama 3.1 405B but couldn't play the game.

The controls were not working at all.

ChatGPT, however, did a splendid job.

It created a complete game in Python with controls, a preview option, a scoring system, colored shapes, and more.
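I'm not reproducing ChatGPT's program here, but to give a sense of what the task involves, below is a heavily stripped-down sketch of the kind of pygame loop a Tetris-like game is built around (a single square block, no scoring or previews; purely my own illustration, not the code either model generated):

```python
import random
import pygame

CELL, COLS, ROWS = 30, 10, 20
pygame.init()
screen = pygame.display.set_mode((COLS * CELL, ROWS * CELL))
clock = pygame.time.Clock()

grid = [[None] * COLS for _ in range(ROWS)]   # settled blocks, None = empty cell
px, py = COLS // 2, 0                         # position of the falling block
color = random.choice([(200, 60, 60), (60, 200, 60), (60, 60, 200)])
fall_timer, running = 0, True

while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN:
            # Move left/right only if the target cell is inside the grid and empty
            if event.key == pygame.K_LEFT and px > 0 and grid[py][px - 1] is None:
                px -= 1
            elif event.key == pygame.K_RIGHT and px < COLS - 1 and grid[py][px + 1] is None:
                px += 1

    fall_timer += clock.tick(30)
    if fall_timer > 400:                      # drop one row every 400 ms
        fall_timer = 0
        if py + 1 < ROWS and grid[py + 1][px] is None:
            py += 1
        else:
            grid[py][px] = color              # lock the block in place
            # Clear any fully filled rows and pad the grid back to full height
            grid = [row for row in grid if None in row]
            while len(grid) < ROWS:
                grid.insert(0, [None] * COLS)
            # Spawn a new block; if the spawn cell is occupied, the game is over
            px, py = COLS // 2, 0
            color = random.choice([(200, 60, 60), (60, 200, 60), (60, 60, 200)])
            if grid[py][px] is not None:
                running = False

    screen.fill((20, 20, 20))
    for r in range(ROWS):
        for c in range(COLS):
            if grid[r][c]:
                pygame.draw.rect(screen, grid[r][c], (c * CELL, r * CELL, CELL - 1, CELL - 1))
    pygame.draw.rect(screen, color, (px * CELL, py * CELL, CELL - 1, CELL - 1))
    pygame.display.flip()

pygame.quit()
```

A full Tetris-like game with tetromino shapes, rotation, previews, and scoring is considerably longer, which is what makes it a useful quick test of code generation.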

Simply put, in code generation, I feel ChatGPT 4o is much better than the Llama 3.1 405B model.

Llama 3.1 vs ChatGPT 4o: The Verdict

After running the above reasoning tests, it's evident that Llama 3.1 405B doesn't beat ChatGPT 4o at all.

In fact, after having tested multiple models in the past, I can confidently say that Llama 3.1 405B ranks below Claude 3.5 Sonnet and Gemini 1.5 Pro.

Of late, AI companies have been chasing benchmark numbers and trying to outrank the competition based on MMLU scores.

However, in practical tests, they rarely show a spark of intelligence.

Apart from following user instructions and handling long-context memory, which was also the strength of the older Llama 3 70B model, there is not much else that stands out.

Despite Llama 3.1 405B having 405 billion parameters, its performance is oddly similar to that of Llama 3.1 70B.

Moreover, Llama 3.1 405B is not a multimodal model, as Meta says multimodality isn't ready yet and will be coming sometime in the future.

So, we can't perform visual tests on Meta's largest AI model.

To conclude, Llama 3.1 405B is a solid addition to the open-source community and can be immensely helpful for fine-tuning, but it doesn't outclass proprietary models yet.