Biased by Design? Evaluating Bias and Behavioral Diversity in LLM Annotation of Real-World and Synthetic Hotel Reviews

Voutsa, Maria C.; Tsapatsoulis, Nicolas; Djouvas, Constantinos

doi:10.3390/ai6080178

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14279/35667

Title:	Biased by Design? Evaluating Bias and Behavioral Diversity in LLM Annotation of Real-World and Synthetic Hotel Reviews
Authors:	Voutsa, Maria C.; Tsapatsoulis, Nicolas; Djouvas, Constantinos
Major Field of Science:	Social Sciences
Field Category:	Media and Communications
Keywords:	AI bias; annotation bias; large language models; sentiment analysis; topic modeling; aspect classification; synthetic data; inter-coder reliability; ChatGPT; hospitality reviews; neutrality bias
Issue Date:	4-Aug-2025
Source:	AI, 2025
Volume:	6
Issue:	8
Journal:	AI
Abstract:	As large language models (LLMs) gain traction among researchers and practitioners, particularly in digital marketing for tasks such as customer feedback analysis and automated communication, concerns remain about the reliability and consistency of their outputs. This study investigates annotation bias in LLMs by comparing human and AI-generated annotation labels across sentiment, topic, and aspect dimensions in hotel booking reviews. Using the HRAST dataset, which includes 23,114 real user-generated review sentences and a synthetically generated corpus of 2000 LLM-authored sentences, we evaluate inter-annotator agreement between a human expert and three LLMs (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini) as a proxy for assessing annotation bias. Our findings show high agreement among LLMs, especially on synthetic data, but only moderate to fair alignment with human annotations, particularly in sentiment and aspect-based sentiment analysis. LLMs display a pronounced neutrality bias, often defaulting to neutral sentiment in ambiguous cases. Moreover, annotation behavior varies notably with task design: manual, one-to-one prompting produces higher agreement with human labels than automated batch processing. The study identifies three distinct AI biases (repetition bias, behavioral bias, and neutrality bias) that shape annotation outcomes. These findings highlight how dataset complexity and annotation mode influence LLM behavior, offering important theoretical, methodological, and practical implications for AI-assisted annotation and synthetic content generation.
URI:	https://hdl.handle.net/20.500.14279/35667
ISSN:	2673-2688
DOI:	10.3390/ai6080178
Rights:	Attribution-NonCommercial-NoDerivatives 4.0 International
Type:	Article
Affiliation:	Cyprus University of Technology
Publication Type:	Peer Reviewed
Appears in Collections:	Άρθρα/Articles
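
The abstract describes measuring inter-annotator agreement between a human expert and LLM annotators. The record does not say which agreement statistic the authors used, so the following is only a minimal sketch that assumes Cohen's kappa, a common chance-corrected agreement measure for two annotators. The function name `cohen_kappa` and the two short label lists are hypothetical, made up purely for illustration; they are not taken from the HRAST dataset or from the paper's results.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: this statistic is chosen for illustration only; the
    repository record does not name the metric used in the study.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (NOT from the paper): the model label list leans
# toward "neutral", loosely mimicking the neutrality bias the abstract names.
human = ["positive", "negative", "negative", "positive", "neutral", "negative"]
model = ["positive", "neutral", "negative", "neutral", "neutral", "neutral"]

kappa = cohen_kappa(human, model)
print(f"kappa = {kappa:.3f}")
```

In this toy case the two annotators agree on half of the items, but kappa is lower than the raw agreement because part of that overlap is expected by chance. A comparable computation could be run per task (sentiment, topic, aspect) and per annotator pair.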