Biased by Design? Evaluating Bias and Behavioral Diversity in LLM Annotation of Real-World and Synthetic Hotel Reviews

Voutsa, Maria C.; Tsapatsoulis, Nicolas; Djouvas, Constantinos

doi:10.3390/ai6080178

Biased by Design? Evaluating Bias and Behavioral Diversity in LLM Annotation of Real-World and Synthetic Hotel Reviews

Journal

AI

Date Issued

August 4, 2025

Author(s)

Voutsa, Maria C.

Tsapatsoulis, Nicolas

Djouvas, Constantinos

DOI

10.3390/ai6080178

Abstract

As large language models (LLMs) gain traction among researchers and practitioners, particularly in digital marketing for tasks such as customer feedback analysis and automated communication, concerns remain about the reliability and consistency of their outputs. This study investigates annotation bias in LLMs by comparing human and AI-generated annotation labels across sentiment, topic, and aspect dimensions in hotel booking reviews. Using the HRAST dataset, which includes 23,114 real user-generated review sentences and a synthetically generated corpus of 2000 LLM-authored sentences, we evaluate inter-annotator agreement between a human expert and three LLMs (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini) as a proxy for assessing annotation bias. Our findings show high agreement among LLMs, especially on synthetic data, but only moderate to fair alignment with human annotations, particularly in sentiment and aspect-based sentiment analysis. LLMs display a pronounced neutrality bias, often defaulting to neutral sentiment in ambiguous cases. Moreover, annotation behavior varies notably with task design, as manual, one-to-one prompting produces higher agreement with human labels than automated batch processing. The study identifies three distinct AI biases—repetition bias, behavioral bias, and neutrality bias—that shape annotation outcomes. These findings highlight how dataset complexity and annotation mode influence LLM behavior, offering important theoretical, methodological, and practical implications for AI-assisted annotation and synthetic content generation.

Subjects

AI bias

annotation bias

large language models...

sentiment analysis

topic modeling

aspect classification...

synthetic data

inter-coder reliabili...

ChatGPT

hospitality reviews

neutrality bias

Biased by Design? Evaluating Bias and Behavioral Diversity in LLM Annotation of Real-World and Synthetic Hotel Reviews

Explore by

Useful Links

Copyright Policies

Deposit your work to Ktisis