Summary

  • The Allen Institute for AI (Ai2) has launched RewardBench 2, a benchmark offering a more holistic view of reward model performance, assessing how well models align with an enterprise’s goals and standards and rewarding those that mirror the behaviour of the model they are meant to train.
  • RewardBench 2 covers six domains: factuality, precise instruction following, math, safety, focus, and ties. The best overall performers were variants of Llama-3.1 Instruct.
  • Using RewardBench 2, enterprises can see how models perform in their domain and select the ones that best fit their own needs.
  • However, Ai2 cautions that such evaluations should be treated as a guide rather than a predictor: what counts as a good response depends heavily on the user’s context and goals, and human preferences can be highly nuanced.

By Emilia David