Summary
  • Open-source development continues at its usual pace: the WizardLM developers have released the fine-tuned Code Llama model WizardCoder-34B, which scores 73.2 percent on the first pass of HumanEval.
  • WizardLM also notes that the GPT-4 version tested via API at the end of August achieved 82 percent on HumanEval, and GPT-3.5 72.5 percent.
  • WizardCoder-34B is available on GitHub.

Shortly after the release of Meta's Code Llama code model, the open-source community began fine-tuning it - and immediately achieved a new top score, surpassing OpenAI's GPT-4.

Phind, an AI co-programming startup, has announced that it has achieved a new high score on the HumanEval benchmark, an important evaluation test for AI programming tasks, with a fine-tuned 34B variant of Meta's just-released Code Llama.

On the first pass, the fine-tuned standard and Python models scored 67.6 and 69.5 percent, respectively. OpenAI's GPT-4 scored 67 percent on the same benchmark when it was released in March. The standard Code Llama model with 34 billion parameters scored 48.8 percent, according to Meta, while the Python variant scored 53.7 percent.

Model | HumanEval result
Phind CodeLlama-34B (standard) | 67.6%
Phind CodeLlama-34B-Python | 69.5%
GPT-4 (OpenAI, March 2023) | 67%
Meta Code Llama 34B | 48.8%
Meta Code Llama 34B Python | 53.7%
Meta Unnatural Code Llama (not released) | 62%
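The scores above are first-pass (pass@1) results on HumanEval's programming problems. As a rough sketch only - not Phind's actual evaluation harness - pass@k is typically computed per task with the unbiased estimator from the original HumanEval paper and then averaged across tasks:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated for a task,
    c of them correct; returns the probability that at least one
    of k randomly drawn samples is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Aggregate by averaging per-task estimates. The (n, c) counts
# below are hypothetical, for illustration only.
tasks = [(10, 7), (10, 0), (10, 10)]
score = sum(pass_at_k(n, c, k=1) for n, c in tasks) / len(tasks)
```

With k=1 and a single sample per task, this reduces to the plain fraction of tasks solved on the first try.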

The two Phind models were fine-tuned on a custom dataset of about 80,000 high-quality programming tasks and solutions. According to Phind, Meta had already fine-tuned Code Llama itself: the unreleased Unnatural Code Llama reached 62 percent on HumanEval, but Meta used only 15,000 examples for that refinement.


The Phind models were trained in three hours on 32 A100-80GB GPUs with a sequence length of 4,096 tokens. The researchers used DeepSpeed ZeRO-3 and Flash Attention 2 for faster, more memory-efficient training.
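Phind has not published its exact training configuration. As an illustrative sketch only, a DeepSpeed ZeRO stage 3 setup of the kind described is usually expressed as a config dict like the following; every value here is an assumption, not Phind's actual setting:

```python
# Hypothetical DeepSpeed ZeRO-3 configuration sketch. The batch sizes,
# precision, and flags are illustrative assumptions, not Phind's values.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # per-GPU batch size (assumed)
    "gradient_accumulation_steps": 8,      # effective batch multiplier (assumed)
    "bf16": {"enabled": True},             # bfloat16 mixed precision on A100s
    "zero_optimization": {
        "stage": 3,                        # ZeRO-3: shard parameters, gradients,
                                           # and optimizer states across GPUs
        "overlap_comm": True,              # overlap communication with compute
        "contiguous_gradients": True,
    },
}
```

With Hugging Face Transformers, such a dict can be passed via `TrainingArguments(deepspeed=ds_config)`; Flash Attention 2 is enabled separately when loading the model (e.g. `attn_implementation="flash_attention_2"`).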

Phind has published both models on Hugging Face under the Llama license.

Open-source community accelerates Meta's AI development

The Llama license allows both scientific and commercial use, but commercial use is restricted: widely used applications require a special license from Meta. In addition, output generated with Llama 2 may not be used to train other AI models.

Meta's Llama 2 language model likewise now has numerous fine-tuned variants that outperform the original release in benchmarks. This is likely Meta's goal: to have the open-source community improve its models faster than Meta could on its own.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • Phind's fine-tuned CodeLlama-34B and CodeLlama-34B-Python models outperform GPT-4 on the HumanEval benchmark.
  • The developers used a custom dataset of 80,000 high-quality programming tasks and solutions, and applied OpenAI's decontamination procedure to ensure the validity of the results.
  • Both models are now available on Hugging Face for independent verification and further development by the open-source community.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.