- Open-source development continues at its usual pace: the WizardLM developers have released WizardCoder-34B, a fine-tuned Code Llama model that scores 73.2 percent on HumanEval (pass@1).
- WizardLM also notes that the GPT-4 version tested via the API at the end of August reached 82 percent on HumanEval, and GPT-3.5 reached 72.5 percent.
- WizardCoder-34B is available on GitHub.
Shortly after the release of Meta's Code Llama code model, the open-source community began fine-tuning it - and immediately achieved a new top score, surpassing OpenAI's GPT-4.
Phind, an AI co-programming startup, has announced a new high score on HumanEval, an important benchmark for AI programming tasks, achieved with a fine-tuned 34B variant of Meta's just-released Code Llama.
On the first attempt (pass@1), the fine-tuned standard and Python models scored 67.6 and 69.5 percent, respectively. OpenAI's GPT-4 scored 67 percent on the same benchmark when it was released in March. The standard Code Llama model with 34 billion parameters scored 48.8 percent, according to Meta, while the Python variant scored 53.7 percent.
| Model | HumanEval result |
|---|---|
| Phind 34B standard model | 67.6% |
| Phind 34B Python model | 69.5% |
| GPT-4 (OpenAI, March 2023) | 67% |
| Meta Code Llama 34B | 48.8% |
| Meta Code Llama 34B Python | 53.7% |
| Meta Unnatural Code Llama (not released) | 62% |
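HumanEval checks whether generated code passes a set of unit tests, and pass@k estimates the probability that at least one of k sampled solutions per task is correct; the scores above are pass@1. A minimal sketch of the standard unbiased estimator from the HumanEval paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per task
    c: number of samples that pass the unit tests
    k: number of samples considered
    """
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 completions sampled per task, 7 pass the tests
print(f"pass@1 ~ {pass_at_k(10, 7, 1):.2f}")  # 0.70
```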
The two Phind models were natively fine-tuned on a custom dataset of about 80,000 high-quality programming tasks and solutions. According to Phind, Meta has already fine-tuned Code Llama itself: the unreleased Unnatural Code Llama reaches 62 percent on HumanEval, although Meta used only 15,000 examples for that refinement.
The Phind models were trained in three hours on 32 Nvidia A100 GPUs (80 GB each) with a sequence length of 4,096 tokens. The researchers used DeepSpeed ZeRO 3 and Flash Attention 2 to make training faster and more efficient.
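Phind has not published its training script here, but a comparable setup can be assembled with Hugging Face Transformers. The following is a hypothetical sketch only: the dataset file, hyperparameters, and the DeepSpeed config path are assumptions; just the reported building blocks (Code Llama 34B, 4,096-token sequences, DeepSpeed ZeRO 3, Flash Attention 2) come from the article.

```python
# Hypothetical fine-tuning sketch, not Phind's actual script.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base_model = "codellama/CodeLlama-34b-hf"  # base Code Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn 2.x package
)

# Placeholder: ~80k instruction/solution pairs in a local JSONL file (assumed format)
dataset = load_dataset("json", data_files="programming_tasks.jsonl", split="train")

def tokenize(example):
    text = example["instruction"] + "\n" + example["solution"]
    return tokenizer(text, truncation=True, max_length=4096)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="codellama-34b-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=2,              # assumed, not reported by Phind
    bf16=True,
    deepspeed="ds_zero3.json",       # placeholder ZeRO stage 3 config file
    logging_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```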
Phind has published both models under the Llama license on Hugging Face.
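The released weights can be loaded like any other Hugging Face checkpoint. The repository ID below is an assumption based on Phind's Hugging Face namespace and may differ for the Python variant:

```python
# Hypothetical usage sketch; the repo ID is assumed, not confirmed by the article.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Phind/Phind-CodeLlama-34B-v1"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write a Python function that checks whether a string is a palindrome.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```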
Open-source community accelerates Meta's AI development
The Llama license allows both scientific and commercial use, but commercial use is restricted: a special license is required for deployment in very widely used applications. In addition, data generated with Llama 2 may not be used to train new AI models.
Meta's Llama 2 language model already has numerous fine-tuned variants that outperform Meta's original release in benchmarks. That is likely Meta's goal: to have its models improved faster by the open-source community.