Llama 3 vs GPT-4: Meta Challenges OpenAI on AI Turf

Meta recently introduced its Llama 3 model in two sizes, 8B and 70B parameters, and open-sourced both models for the AI community. Despite its smaller 70B size, Llama 3 has shown impressive capability, as is evident from the LMSYS leaderboard. So we have compared Llama 3 with OpenAI’s flagship GPT-4 model to evaluate how they perform across a variety of tests. On that note, let’s go through our comparison between Llama 3 and GPT-4.
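Since the weights are openly available, these prompts can be reproduced locally. Below is a minimal sketch of one way to do that, assuming the Hugging Face transformers library, approved access to the gated meta-llama/Meta-Llama-3-70B-Instruct repository, and hardware that can hold a 70B model; the question in the example is just a placeholder.

```python
# Minimal sketch: querying the open-weight Llama 3 70B Instruct model locally.
# Assumes the Hugging Face `transformers` library, approved access to the gated
# meta-llama repo, and enough GPU memory for a 70B model (the 8B variant works
# the same way on smaller hardware).
import torch
import transformers

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

pipe = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Placeholder question; swap in any of the test prompts from this article.
messages = [{"role": "user", "content": "Which is heavier: a kilo of feathers or a pound of steel?"}]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Llama 3 uses <|eot_id|> to end an assistant turn, so include it as a stop token.
terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

output = pipe(prompt, max_new_tokens=256, eos_token_id=terminators, do_sample=False)
print(output[0]["generated_text"][len(prompt):])
```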

1. Magic Elevator Test

Let’s first run the magic elevator test to evaluate the logical reasoning capability of Llama 3 in comparison to GPT-4. And guess what? Llama 3 surprisingly passes the test, whereas the GPT-4 model fails to provide the correct answer. This is pretty surprising since Llama 3 has only 70 billion parameters, whereas GPT-4 reportedly has around 1.7 trillion.

Keep in mind that we ran the test on the GPT-4 model hosted on ChatGPT (available to paid ChatGPT Plus users), which seems to be using an older GPT-4 Turbo model. When we ran the same test on the recently released GPT-4 model (gpt-4-turbo-2024-04-09) via the OpenAI Playground, it passed. OpenAI says it is rolling out the latest model to ChatGPT, but perhaps it’s not available on our account yet.
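For reference, here is a rough sketch of how the same prompt can be sent to gpt-4-turbo-2024-04-09 directly, assuming the official openai Python SDK and an OPENAI_API_KEY in the environment; the prompt string is only a placeholder for the magic elevator question.

```python
# Rough sketch: querying gpt-4-turbo-2024-04-09 via the OpenAI API instead of ChatGPT.
# Assumes the official `openai` Python SDK and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Placeholder prompt; substitute the actual magic elevator question here.
prompt = "<magic elevator test question>"

response = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the reasoning test as deterministic as possible
)

print(response.choices[0].message.content)
```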

Winner: Llama 3 70B, and gpt-4-turbo-2024-04-09

Note: GPT-4 loses on ChatGPT Plus

2. Calculate Drying Time

Next, we ran the classic reasoning question to test the intelligence of both models. In this test, both Llama 3 70B and GPT-4 gave the correct answer without delving into mathematics. Good job, Meta!

Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

3. Find the Apple

After that, I asked another question to compare the reasoning capability of Llama 3 and GPT-4. In this test, the Llama 3 70B model comes close to giving the right answer but misses out on mentioning the box. The GPT-4 model, on the other hand, rightly answers that “the apples are still on the ground inside the box”. I am going to give this round to GPT-4.

Winner: GPT-4 via ChatGPT Plus

4. Which is Heavier?

While the question seems quite simple, many AI models fail to get the right answer. However, in this test, both Llama 3 70B and GPT-4 gave the correct answer. That said, Llama 3 sometimes generates the wrong output, so keep that in mind.

Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

5. Find the Position

Next, I asked a simple logical question and both models gave a correct response. It’s interesting to see a much smaller Llama 3 70B model rivaling the top-tier GPT-4 model.

Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

6. Solve a Math Problem

Next, we ran a complex math problem on both Llama 3 and GPT-4 to find out which model wins this test. Here, GPT-4 passes the test with flying colors, but Llama 3 fails to come up with the right answer. That’s not surprising, though; the GPT-4 model has scored well on the MATH benchmark. Keep in mind that I explicitly asked ChatGPT not to use Code Interpreter for the calculation.

Winner: GPT-4 via ChatGPT Plus

7. Follow User Instructions

Following user instructions is very important for an AI model, and Meta’s Llama 3 70B model excels at it. It generated all 10 sentences ending with the word “mango”. GPT-4 could only generate eight such sentences.
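If you want to score this test yourself rather than counting by hand, a few lines of Python along these lines will do; the naive sentence-splitting regex is an assumption for illustration, not how the outputs were graded here.

```python
# Rough sketch: count how many sentences in a model's output end with "mango".
# The regex-based sentence split is a simplification assumed for illustration.
import re

def count_mango_endings(text: str) -> int:
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Strip trailing punctuation before checking the final word.
    return sum(
        1 for s in sentences
        if re.sub(r"[^\w]+$", "", s).lower().endswith("mango")
    )

sample_output = "I sliced a ripe mango. Nothing beats a chilled mango!"
print(count_mango_endings(sample_output))  # -> 2
```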

Winner: Llama 3 70B

8. NIAH Test

Although Llama 3 currently doesn’t have a long context window, we still ran the NIAH (needle in a haystack) test to check its retrieval capability. The Llama 3 70B model supports a context length of up to 8K tokens. So I placed a needle (a random statement) inside a 35K-character-long text (roughly 8K tokens) and asked the model to find the information. Surprisingly, Llama 3 70B found the statement in no time. GPT-4 also had no problem finding the needle.

Of course, this is a small context, but when Meta releases a Llama 3 model with a much larger context window, I will test it again. But for now, Llama 3 shows great retrieval capability.
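For anyone who wants to recreate the setup, the sketch below captures the general idea of the needle-in-a-haystack test; the filler text, the needle sentence, and the prompt wording are illustrative assumptions, not the exact strings used in our run.

```python
# Illustrative sketch of a simple needle-in-a-haystack (NIAH) setup.
# The filler, needle, and prompt are placeholders, not the article's exact strings.
import random

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passphrase is purple-alpaca-42."

def build_haystack(target_chars: int = 35_000) -> str:
    # Repeat filler until we reach roughly 35K characters (about 8K tokens).
    text = (FILLER * (target_chars // len(FILLER) + 1))[:target_chars]
    # Drop the needle at a random sentence boundary somewhere in the middle.
    cut = text.rindex(". ", 0, random.randint(1_000, target_chars - 1_000)) + 2
    return text[:cut] + NEEDLE + " " + text[cut:]

haystack = build_haystack()
prompt = (
    "The following text contains one out-of-place statement. "
    "Find it and repeat it verbatim.\n\n" + haystack
)
# `prompt` would then be sent to Llama 3 70B and GPT-4 via their respective APIs.
print(len(haystack), NEEDLE in haystack)
```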

Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

Llama 3 vs GPT-4: The Verdict

In almost all of the tests, the Llama 3 70B model has shown impressive capabilities, be it advanced reasoning, following user instructions, or retrieval. Only in mathematical calculations does it lag behind the GPT-4 model. Meta says that Llama 3 has been trained on a larger coding dataset, so its coding performance should also be great.

Bear in mind that we are comparing a much smaller model with the GPT-4 model. Also, Llama 3 is a dense model, whereas GPT-4 is reportedly built on an MoE architecture consisting of 8x 222B expert models. It goes to show that Meta has done a remarkable job with the Llama 3 family of models. When the larger 400B+ Llama 3 model drops in the future, it should perform even better and may beat the best AI models out there.

It’s safe to say that Llama 3 has upped the game, and by open-sourcing the model, Meta has significantly closed the gap between proprietary and open-source models. We ran all these tests on the Instruct model; fine-tuned variants of Llama 3 70B should deliver even better performance. Apart from OpenAI, Anthropic, and Google, Meta has now officially joined the AI race.

