Summary
  • Open-source development continues at its usual pace: the WizardLM developers have released the fine-tuned Code Llama model WizardCoder-34B, which scores 73.2 percent on the first pass of HumanEval.
  • WizardLM also notes that the GPT-4 version tested via API at the end of August achieved 82 percent on HumanEval, and GPT-3.5 72.5 percent.
  • WizardCoder-34B is available on GitHub.

Shortly after the release of Meta's Code Llama code model, the open-source community began fine-tuning it - and immediately achieved a new top score, surpassing OpenAI's GPT-4.

Phind, an AI co-programming startup, has announced that it has achieved a new high score on the HumanEval benchmark, an important evaluation test for AI programming tasks, with a fine-tuned 34B variant of Meta's just-released Code Llama.

On the first pass, the fine-tuned standard and Python models scored 67.6 and 69.5 percent, respectively. OpenAI's GPT-4 scored 67 percent on the same benchmark when it was released in March. The standard Code Llama model with 34 billion parameters scored 48.8 percent, according to Meta, while the Python variant scored 53.7 percent.

Model | HumanEval result
Phind CodeLlama-34B (standard) | 67.6%
Phind CodeLlama-34B-Python | 69.5%
GPT-4 (OpenAI, March 2023) | 67%
Meta Code Llama 34B | 48.8%
Meta Code Llama 34B Python | 53.7%
Meta Unnatural Code Llama (not released) | 62%
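The scores above are first-pass (pass@1) results on HumanEval's programming problems. As a rough sketch only - not Phind's actual evaluation harness - pass@k is typically computed per task with the unbiased estimator from the original HumanEval paper and then averaged across tasks:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated for a task,
    c of them correct; returns the probability that at least one
    of k randomly drawn samples is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Aggregate by averaging per-task estimates. The (n, c) counts
# below are hypothetical, for illustration only.
tasks = [(10, 7), (10, 0), (10, 10)]
score = sum(pass_at_k(n, c, k=1) for n, c in tasks) / len(tasks)
```

With k=1 and a single sample per task, this reduces to the plain fraction of tasks solved on the first try.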

The two Phind models were fine-tuned on a custom dataset of about 80,000 high-quality programming tasks and solutions. According to Phind, Meta had already fine-tuned Code Llama itself: the unreleased Unnatural Code Llama reached 62 percent on HumanEval, but Meta used only 15,000 examples for that refinement.


The Phind models were trained in three hours on 32 A100-80GB GPUs with a sequence length of 4,096 tokens. The researchers used DeepSpeed ZeRO-3 and Flash Attention 2 for faster, more memory-efficient training.
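Phind has not published its exact training configuration. As an illustrative sketch only, a DeepSpeed ZeRO stage 3 setup of the kind described is usually expressed as a config dict like the following; every value here is an assumption, not Phind's actual setting:

```python
# Hypothetical DeepSpeed ZeRO-3 configuration sketch. The batch sizes,
# precision, and flags are illustrative assumptions, not Phind's values.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # per-GPU batch size (assumed)
    "gradient_accumulation_steps": 8,      # effective batch multiplier (assumed)
    "bf16": {"enabled": True},             # bfloat16 mixed precision on A100s
    "zero_optimization": {
        "stage": 3,                        # ZeRO-3: shard parameters, gradients,
                                           # and optimizer states across GPUs
        "overlap_comm": True,              # overlap communication with compute
        "contiguous_gradients": True,
    },
}
```

With Hugging Face Transformers, such a dict can be passed via `TrainingArguments(deepspeed=ds_config)`; Flash Attention 2 is enabled separately when loading the model (e.g. `attn_implementation="flash_attention_2"`).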

Phind has published both models on Hugging Face under the Llama license.

Open-source community accelerates Meta's AI development

The Llama license allows both scientific and commercial use, but commercial use is restricted: widely used applications require a special license from Meta. In addition, output generated with Llama 2 may not be used to train other AI models.

Meta's Llama 2 language model likewise now has numerous fine-tuned variants that outperform the original release in benchmarks. This is likely Meta's goal: to have the open-source community improve its models faster than Meta could on its own.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • Phind's fine-tuned CodeLlama-34B and CodeLlama-34B-Python models outperform GPT-4 on the HumanEval benchmark.
  • The developers used a custom dataset of 80,000 high-quality programming tasks and solutions, and applied OpenAI's decontamination procedure to ensure the validity of the results.
  • Both models are now available on Hugging Face for independent verification and further development by the open-source community.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.