Found 4 results for “AI Model Evaluation”

MMLU

MMLU, short for Massive Multitask Language Understanding, is a benchmark for evaluating the language understanding capabilities of large models. Introduced in September 2020 by researchers at UC Berkeley, it is currently one of the best-known semantic understanding benchmarks for large models.
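Since MMLU is a multiple-choice benchmark, scoring a model on it amounts to counting correct choice indices. Below is a minimal sketch of evaluating one subject, assuming the Hugging Face datasets library and the community cais/mmlu mirror of the benchmark; pick_choice is a hypothetical stand-in for a real model, implemented here as a random-guess baseline.

```python
# Minimal sketch: score a model on one MMLU subject.
# Assumes the Hugging Face "datasets" library and the community
# "cais/mmlu" mirror; pick_choice is a hypothetical stand-in for a model.
import random
from datasets import load_dataset

SUBJECT = "abstract_algebra"  # one of MMLU's 57 subjects
test = load_dataset("cais/mmlu", SUBJECT, split="test")

def pick_choice(question: str, choices: list[str]) -> int:
    # Hypothetical stand-in for a real model: a random-guess baseline,
    # which should land near 25% on four-way multiple choice.
    return random.randrange(len(choices))

correct = 0
for item in test:
    pred = pick_choice(item["question"], item["choices"])
    correct += int(pred == item["answer"])  # "answer" is the gold choice index

print(f"{SUBJECT}: {correct / len(test):.1%} accuracy")
```

Replacing pick_choice with an actual model call (and looping over all 57 subjects) gives the aggregate MMLU score usually reported on leaderboards.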

FlagEval

FlagEval (Tiancheng) is a scientific, fair, and open large-model evaluation system and platform launched by the Beijing Academy of Artificial Intelligence (BAAI). It provides researchers with tools and methods to comprehensively evaluate the performance of foundation models and training algorithms. FlagEval adopts a three-dimensional "Capability-Task-Metric" evaluation framework to assess the cognitive abilities of large models across multiple dimensions, covering application scenarios such as dialogue, question answering, and sentiment analysis, with more than 22 datasets and 80,000 evaluation questions.
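As a toy illustration of the Capability-Task-Metric idea (not FlagEval's actual API or taxonomy), each evaluation result can be thought of as a point in a three-dimensional grid; the capability, task, and metric names below are invented for the example.

```python
# Toy illustration of a "Capability-Task-Metric" evaluation grid.
# Assumes nothing about FlagEval's real API; all names are invented.
from dataclasses import dataclass

@dataclass
class EvalCell:
    capability: str        # cognitive ability being probed
    task: str              # application scenario, e.g. QA or dialogue
    metric: str            # how performance on the task is scored
    score: float | None = None

grid = [
    EvalCell("language understanding", "question answering", "accuracy"),
    EvalCell("language understanding", "sentiment analysis", "F1"),
    EvalCell("dialogue ability", "open-domain chat", "human preference"),
]

# A model's results fill in scores along the three axes.
grid[0].score = 0.83
for cell in grid:
    print(cell)
```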

H2O EvalGPT

H2O EvalGPT is an open tool from H2O.ai for evaluating and comparing large language models (LLMs). It provides a platform for understanding model performance across a wide range of tasks and benchmarks. Whether you want to use LLMs to automate workflows or to handle specific tasks, H2O EvalGPT offers detailed rankings of popular, high-performing open-source models to help you choose the most effective model for your project.

LLMEval3

LLMEval is a large-model evaluation benchmark from the Fudan University NLP Laboratory. The latest version, LLMEval-3, focuses on evaluating professional knowledge and covers the 13 disciplinary categories designated by China's Ministry of Education (philosophy, economics, law, education, literature, history, science, engineering, agriculture, medicine, military science, management, and arts), as well as more than 50 secondary disciplines, totaling about 200,000 standard generative question-answering items.