Meta's new model, Llama 3, is available in 8B and 70B sizes and has been released openly to the AI community. Although Llama 3 70B is much smaller than GPT-4, it has proven to be a compelling model, as the LMSYS leaderboard demonstrates. We compared the performance of Llama 3 70B and GPT-4 in the following tests:
In the magic lift test, which checks logical reasoning ability, Llama 3 gave the correct answer, while GPT-4 did not. This comes as a surprise, since Llama 3 70B has 70 billion parameters, while GPT-4 is reported to have around 1.7 trillion.
Note: The test was run on GPT-4 hosted on ChatGPT, which at the time still used the older GPT-4 Turbo model. The recently released GPT-4 model was tested via the OpenAI Playground and passed the test as well. According to OpenAI, the most recent model is being rolled out to ChatGPT.
This is a classic reasoning question that requires no arithmetic. On another test comparing the two models' reasoning skills, Llama 3 70B came close to the right answer but omitted the box, while GPT-4 answered correctly.
Both Llama 3 and GPT-4 gave correct answers to a straightforward logic question. It is notable that the much smaller Llama 3 70B competes here with the top-of-the-line GPT-4. On a complicated mathematical problem, however, GPT-4 passed the test flawlessly, while Llama 3 failed to provide the correct answer.
Following user instructions is crucial for any AI model, and Meta's Llama 3 70B is no exception. For the prompt "Generate 10 sentences ending with mango," Llama 3 generated all 10 sentences ending with "mango," whereas GPT-4 generated only eight such sentences.
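A test like this is easy to score automatically by checking the final word of each returned sentence. Here is a minimal sketch; the sample sentences are made-up illustrations, not actual model outputs:

```python
import re

def score_ending_word(sentences, word="mango"):
    """Count how many sentences end with the required word,
    ignoring trailing punctuation and letter case."""
    passed = 0
    for s in sentences:
        tokens = re.findall(r"[A-Za-z]+", s)
        if tokens and tokens[-1].lower() == word.lower():
            passed += 1
    return passed

# Hypothetical sample output: two of three sentences end correctly.
sample = [
    "I sliced a ripe mango.",
    "Nothing beats a cold mango smoothie.",  # ends with "smoothie"
    "For dessert we shared a mango!",
]
print(score_ending_word(sample))  # → 2
```

Scoring each model's 10 sentences this way removes any judgment calls from the instruction-following comparison.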
Llama 3 does not currently support a long context window; Llama 3 70B supports up to 8K tokens. Even so, it performed well in needle-in-a-haystack (NIAH) testing of its retrieval capability. When a needle (a random statement) was inserted into roughly 35K characters of text (about 8K tokens) and the model was asked to locate the information, Llama 3 found the needle in a short amount of time. GPT-4 found the needle as well.
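A NIAH run like the one above can be reproduced by embedding a random statement at a chosen depth in filler text sized to the model's context budget, then prompting the model to retrieve it. A minimal sketch of the haystack construction; the filler text, needle, and the ~4.4 characters-per-token heuristic are illustrative assumptions, and the actual test would send the prompt to a model API:

```python
def build_haystack(filler: str, needle: str, depth: float, n_chars: int) -> str:
    """Repeat filler text up to n_chars, then insert the needle at a
    fractional depth (0.0 = start, 1.0 = end) on a sentence boundary."""
    assert 0.0 <= depth <= 1.0
    text = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = int(len(text) * depth)
    # Snap forward to a sentence boundary so the needle isn't mid-word.
    cut = text.find(". ", pos)
    cut = cut + 2 if cut != -1 else pos
    return text[:cut] + needle + " " + text[cut:]

# Roughly 8K tokens of English is ~35K characters (≈4.4 chars/token).
filler = "The quick brown fox jumps over the lazy dog. "
needle = "The secret passphrase is 'blue-armadillo'."
haystack = build_haystack(filler, needle, depth=0.5, n_chars=35_000)

# The retrieval prompt would then ask the model to find the needle:
prompt = f"{haystack}\n\nWhat is the secret passphrase mentioned above?"
print(needle in haystack)  # → True
```

Sweeping `depth` from 0.0 to 1.0 is the usual way to check that retrieval holds up regardless of where in the context the needle lands.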
Llama 3 70B held its own against GPT-4 in nearly all of the tests. Whether it's advanced reasoning, following user instructions, or retrieval capability, the model trails GPT-4 only in mathematical calculation. According to Meta, Llama 3 was trained on a large number of code examples, so it should also perform well at coding.
It is important to note that this is a comparison between GPT-4 and Llama 3 70B, a model much smaller than GPT-4. It is also worth noting that Llama 3 is a dense model, whereas GPT-4 is reportedly based on a mixture-of-experts (MoE) architecture consisting of 8 x 222B expert models. Meta has clearly done an excellent job with this family of models, and when the 400B+ model comes out in the future, we can expect it to be even better and perhaps outperform the best AI models on the market.