NLG Evaluation
23 papers with code • 0 benchmarks • 0 datasets
Evaluating the text generated by NLG (Natural Language Generation) systems, such as large language models
Most implemented papers
Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference for one model's output over another's is often necessary.
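The core NND check can be read as: given pairs of a higher-quality output and a "near-negative" output drawn from existing human evaluation data, count how often the model under test scores the better one higher. A minimal sketch of that pass-rate computation, assuming a user-supplied `score_fn` (e.g., model log-likelihood); the toy scorer and example pairs are illustrative, not from the paper:

```python
# Near-negative-distinction style test: over (better, worse) output
# pairs, measure how often the model scores the better output higher.
def nnd_pass_rate(score_fn, pairs):
    """pairs: list of (better_text, worse_text) tuples."""
    passed = sum(1 for better, worse in pairs
                 if score_fn(better) > score_fn(worse))
    return passed / len(pairs)

if __name__ == "__main__":
    # Toy scorer that penalizes a marked error token (illustrative only).
    toy_score = lambda text: -text.count("<err>")
    pairs = [("the cat sat", "the cat sat <err>"),
             ("a dog ran", "a dog <err> ran")]
    print(nnd_pass_rate(toy_score, pairs))  # 1.0
```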
EffEval: A Comprehensive Evaluation of Efficiency for MT Evaluation Metrics
In this work, we provide a comprehensive evaluation of efficiency for MT evaluation metrics.
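Measuring a metric's efficiency, as in the excerpt above, largely reduces to timing it over a batch of hypothesis-reference pairs. A minimal sketch using `time.perf_counter`; the stand-in metric is invented for illustration and is not one of the metrics the paper benchmarks:

```python
import time

# Wall-clock cost of a metric over a batch of (hypothesis, reference)
# pairs; reports the best of several repeats to reduce timer noise.
def time_metric(metric_fn, pairs, repeats=3):
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for hyp, ref in pairs:
            metric_fn(hyp, ref)
        best = min(best, time.perf_counter() - start)
    return best / len(pairs)  # seconds per segment

# Trivial stand-in metric: token-overlap count (illustrative only).
toy_metric = lambda hyp, ref: len(set(hyp.split()) & set(ref.split()))
pairs = [("a b c", "a b d")] * 1000
print(f"{time_metric(toy_metric, pairs) * 1e6:.1f} us/segment")
```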
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
Is it possible to build a general and automatic natural language generation (NLG) evaluation metric?
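The title's "stratified error synthesis" suggests training data built by corrupting clean references at graded severities. A hedged sketch of that general idea with simple token-level perturbations; the operations and severity levels here are invented for illustration and are not the paper's recipe:

```python
import random

# Corrupt a reference with n_errors random token-level edits, producing
# a (severity, corrupted_text) pair that could supervise a learned metric.
def synthesize_errors(reference, n_errors):
    tokens = reference.split()
    for _ in range(n_errors):
        op = random.choice(["drop", "repeat", "swap"])
        i = random.randrange(len(tokens))
        if op == "drop" and len(tokens) > 1:
            tokens.pop(i)
        elif op == "repeat":
            tokens.insert(i, tokens[i])
        elif op == "swap" and len(tokens) > 1:
            j = random.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

reference = "the quick brown fox jumps over the lazy dog"
for severity in (1, 2, 4):
    print(severity, synthesize_errors(reference, severity))
```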
CLSE: Corpus of Linguistically Significant Entities
Using the CLSE's entities and a small number of human translations, we create a linguistically representative NLG evaluation benchmark in three languages: French (high-resource), Marathi (low-resource), and Russian (highly inflected language).
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
In detail, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions to prompt ChatGPT to evaluate the generated results of NLG models.
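A minimal sketch of this kind of aspect-specific prompting, assuming the official OpenAI Python client; the prompt template, model name, and 1-5 scale are illustrative, not the paper's exact instructions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask an LLM to rate one aspect of a summary; the wording is an
# illustrative template, not the paper's prompt.
def llm_rate(source, summary, aspect="relevance"):
    prompt = (
        f"Score the following summary for {aspect} on a 1-5 scale.\n"
        f"Article: {source}\nSummary: {summary}\n"
        "Reply with a single integer."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```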
Describe me an Aucklet: Generating Grounded Perceptual Category Descriptions
Human speakers can generate descriptions of perceptual concepts, abstracted from the instance-level.
Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory
We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics.
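One elementary instance of "evaluating evaluation metrics" is checking a metric's validity by correlating its scores with human judgments on the same outputs. The sketch below shows that generic meta-evaluation step with SciPy; it is standard practice, not the paper's measurement-theory framework itself, and the scores are toy values:

```python
from scipy.stats import kendalltau, pearsonr

# Toy metric scores and human ratings for the same five outputs.
metric_scores = [0.71, 0.42, 0.90, 0.55, 0.63]
human_scores = [4.0, 2.5, 4.5, 3.0, 3.5]  # e.g., mean annotator ratings

tau, _ = kendalltau(metric_scores, human_scores)
r, _ = pearsonr(metric_scores, human_scores)
print(f"Kendall tau={tau:.2f}, Pearson r={r:.2f}")
```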
DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
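A hedged sketch of the decomposition idea: replace one holistic quality question with per-sentence yes/no subquestions, answer each with a QA model, and aggregate. `yes_probability` is a hypothetical stand-in for an instruction-tuned model's probability of answering "yes"; the decomposition below is illustrative, not DecompEval's exact formulation:

```python
# Score a text by averaging yes-probabilities over per-sentence
# subquestions about one evaluation aspect.
def decomposed_score(generated_text, aspect, yes_probability):
    sentences = [s.strip() for s in generated_text.split(".") if s.strip()]
    subquestions = [
        f"Is the sentence '{s}' consistent with the {aspect} "
        f"of the full text?" for s in sentences
    ]
    answers = [yes_probability(q, context=generated_text)
               for q in subquestions]
    return sum(answers) / len(answers)

# Placeholder model for demonstration; a real setup would query an LM.
toy = lambda q, context: 0.5
print(decomposed_score("A fox runs. It is fast.", "coherence", toy))
```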
LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models
Current developments in large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks.
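Comparative assessment reduces to asking an LLM judge which of two outputs is better and aggregating the pairwise outcomes into a ranking. A minimal win-count sketch; `judge(a, b)` is a hypothetical LLM call, replaced here by a toy length-based preference for demonstration:

```python
from itertools import permutations

# Rank candidates by wins over all ordered pairs; evaluating both
# orders (a, b) and (b, a) mitigates the position bias of LLM judges.
def rank_by_wins(candidates, judge):
    wins = {c: 0 for c in candidates}
    for a, b in permutations(candidates, 2):
        if judge(a, b):  # True if the judge prefers a over b
            wins[a] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Toy judge preferring longer outputs, for demonstration only.
print(rank_by_wins(["short", "a bit longer", "mid len"],
                   lambda a, b: len(a) > len(b)))
```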
Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation
To address this issue, we propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
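Standard overlap metrics already support the multi-reference setting: NLTK's `sentence_bleu`, for example, accepts a list of references and matches n-grams against all of them, so adding diverse references can only increase the matched counts. This sketch illustrates the general multi-reference idea only, not the paper's specific metrics or data:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoid zero-count warnings on short texts
hypothesis = "the cat sits on the mat".split()
single_ref = ["a cat is on the mat".split()]
multi_refs = single_ref + ["the cat sits on a rug".split(),
                           "there is a cat on the mat".split()]

# BLEU against one reference vs. several; the multi-reference score is
# typically at least as high because n-grams can match any reference.
print(sentence_bleu(single_ref, hypothesis, smoothing_function=smooth))
print(sentence_bleu(multi_refs, hypothesis, smoothing_function=smooth))
```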