NLG Evaluation

23 papers with code • 0 benchmarks • 0 datasets

Evaluation of the text generated by NLG (Natural Language Generation) systems, such as large language models.

Most implemented papers

Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

salesforce/nnd_evaluation 13 May 2022

Precisely assessing progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference for one model's output over another's is often necessary.

EffEval: A Comprehensive Evaluation of Efficiency for MT Evaluation Metrics

nl2g/effeval 20 Sep 2022

In this work, we provide a comprehensive evaluation of efficiency for MT evaluation metrics.
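Much of such an efficiency study comes down to measuring the wall-clock cost of a metric over many hypothesis-reference pairs. The sketch below is a minimal illustration of that kind of timing loop, not EffEval's actual harness; the score function is a hypothetical placeholder metric (token-overlap F1) standing in for whatever metric is being profiled.

    # Minimal sketch of timing an MT evaluation metric's wall-clock cost.
    # `score` is a hypothetical placeholder metric, not EffEval's API.
    import time

    def score(hypothesis: str, reference: str) -> float:
        # Placeholder metric: token-level F1 overlap.
        hyp, ref = hypothesis.split(), reference.split()
        common = len(set(hyp) & set(ref))
        if not hyp or not ref or common == 0:
            return 0.0
        precision, recall = common / len(hyp), common / len(ref)
        return 2 * precision * recall / (precision + recall)

    pairs = [("the cat sat on the mat", "a cat is sitting on the mat")] * 1000

    start = time.perf_counter()
    scores = [score(h, r) for h, r in pairs]
    elapsed = time.perf_counter() - start
    print(f"{len(pairs)} pairs in {elapsed:.3f}s "
          f"({len(pairs) / elapsed:.0f} pairs/s), "
          f"mean score {sum(scores) / len(scores):.3f}")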

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis

xu1998hz/sescore 10 Oct 2022

Is it possible to build a general and automatic natural language generation (NLG) evaluation metric?

CLSE: Corpus of Linguistically Significant Entities

google-research-datasets/clse 4 Nov 2022

Using the CLSE's entities and a small number of human translations, we create a linguistically representative NLG evaluation benchmark in three languages: French (high-resource), Marathi (low-resource), and Russian (highly inflected language).

Is ChatGPT a Good NLG Evaluator? A Preliminary Study

krystalan/chatgpt_as_nlg_evaluator 7 Mar 2023

In detail, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions to prompt ChatGPT to evaluate the generated results of NLG models.
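A minimal sketch of that prompting setup is below, assuming a summarization task scored for relevance on a 1-5 scale. The prompt wording is illustrative rather than the paper's template, and query_llm is a hypothetical placeholder for whichever chat-completion API is used.

    # Sketch of prompting a chat LLM as an NLG evaluator: a task-specific plus
    # aspect-specific instruction asks for a 1-5 score.
    # `query_llm` is a hypothetical stand-in for a chat API call.

    def build_prompt(source: str, summary: str, aspect: str = "relevance") -> str:
        return (
            f"Score the following summary with respect to {aspect} "
            "on a scale of 1 to 5, where 5 is best. Reply with only the number.\n\n"
            f"Source document:\n{source}\n\n"
            f"Summary:\n{summary}\n\nScore:"
        )

    def query_llm(prompt: str) -> str:
        # Hypothetical placeholder: call your chat-completion API of choice
        # here and return the model's text response.
        raise NotImplementedError

    def evaluate(source: str, summary: str, aspect: str = "relevance") -> int:
        reply = query_llm(build_prompt(source, summary, aspect))
        digits = [c for c in reply if c.isdigit()]   # tolerate chatty replies
        return int(digits[0]) if digits else 0       # default 0 if unparsable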

Describe me an Aucklet: Generating Grounded Perceptual Category Descriptions

gu-clasp/describe-me-an-auklet 7 Mar 2023

Human speakers can generate descriptions of perceptual concepts, abstracted away from the instance level.

Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory

isle-dev/metriceval 24 May 2023

We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics.

DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering

kepei1106/decompeval 13 Jul 2023

Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
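DecompEval's title points at the underlying idea: turn an evaluation question into smaller yes/no subquestions that an off-the-shelf model can answer, then aggregate the answers. The sketch below illustrates that pattern for a consistency-style dimension; answer_yes_no is a hypothetical placeholder, and the question wording is illustrative rather than the paper's prompts.

    # Sketch of decomposed-QA evaluation: break one evaluation question into
    # per-sentence yes/no subquestions, answer each with a QA model, aggregate.
    # `answer_yes_no` is a hypothetical stand-in for an instruction-tuned model.

    def answer_yes_no(question: str, context: str) -> bool:
        # Hypothetical placeholder: query a QA model and map its answer
        # to True ("yes") or False ("no").
        raise NotImplementedError

    def decomposed_consistency_score(summary: str, source: str) -> float:
        sentences = [s.strip() for s in summary.split(".") if s.strip()]
        if not sentences:
            return 0.0
        answers = [
            answer_yes_no(
                f"Is the statement '{sent}' consistent with the document?",
                context=source,
            )
            for sent in sentences
        ]
        return sum(answers) / len(answers)  # fraction of "yes" answers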

LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models

adianliusie/comparative-assessment 15 Jul 2023

Current developments in large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks.
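The title describes the core recipe: ask an LLM to compare two candidate outputs at a time and derive a ranking from the pairwise preferences. A minimal sketch of that idea follows; query_llm is a hypothetical placeholder for a chat model call, and counting wins over both presentation orders is one simple way to reduce position bias, not necessarily the paper's exact aggregation.

    # Sketch of zero-shot pairwise comparison: ask an LLM which of two
    # candidates is better, tally wins over all ordered pairs, rank by wins.
    # `query_llm` is a hypothetical placeholder; the prompt is illustrative.
    from itertools import permutations

    def query_llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder for a chat model call

    def prefers_first(context: str, cand_a: str, cand_b: str) -> bool:
        prompt = (
            f"Context:\n{context}\n\nSummary A:\n{cand_a}\n\n"
            f"Summary B:\n{cand_b}\n\n"
            "Which summary is better? Answer with 'A' or 'B' only."
        )
        return query_llm(prompt).strip().upper().startswith("A")

    def rank_candidates(context: str, candidates: list[str]) -> list[str]:
        wins = {c: 0 for c in candidates}
        for a, b in permutations(candidates, 2):  # both orders of each pair
            if prefers_first(context, a, b):
                wins[a] += 1
            else:
                wins[b] += 1
        return sorted(candidates, key=wins.get, reverse=True)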

Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation

sefazeng/llm-ref 6 Aug 2023

To address this issue, we propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
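In practice this usually means scoring a hypothesis against every available reference and aggregating, so that a valid paraphrase is not penalized for matching only one reference. The sketch below uses a placeholder Jaccard-overlap metric and max aggregation purely for illustration; the paper's metrics and aggregation may differ.

    # Sketch of multi-reference scoring: evaluate against each reference and
    # keep the best match. `single_ref_score` is a placeholder metric.

    def single_ref_score(hypothesis: str, reference: str) -> float:
        hyp, ref = set(hypothesis.split()), set(reference.split())
        return len(hyp & ref) / len(hyp | ref) if hyp | ref else 0.0  # Jaccard

    def multi_ref_score(hypothesis: str, references: list[str]) -> float:
        return max(single_ref_score(hypothesis, r) for r in references)

    references = ["the cat sat on the mat",
                  "a cat was sitting on the mat",
                  "there is a cat on the mat"]
    print(multi_ref_score("the cat is on the mat", references))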