Data Lake Semantic Enrichment via Traditional Systems and LLMs
Date Issued
May 2025
Author(s)
Advisor
Abstract
In today's era, Big Data is often referred to as the "new oil". Businesses rely heavily on Data Lakes to
store massive amounts of heterogeneous data; however, without proper metadata mechanisms in place,
these repositories turn into data swamps. This thesis explores alternative systems for traditional semantic
enrichment of data sources by adapting a pre-existing semantic blueprint model in Apache Hive and comparing
it, in terms of insertion time, query performance, and storage efficiency, against a well-established
system, Apache Jena. Experimental results show that Hive offers significant scalability and storage efficiency
benefits over Jena; however, Jena is more suitable for small to medium-sized Data Lakes that require
dynamic schema evolution, involve complex relationships between data sources, and perform infrequent queries
for metadata retrieval. Additionally, this thesis explores the feasibility of LLM-driven approaches to
semantic enrichment by proposing two novel pipelines and evaluating four different configurations. The
results demonstrate that LLMs can be used as an alternative solution but often rely on high-quality metadata
to achieve maximum accuracy. Expert-curated metadata produced the highest accuracy and low
response times, while LLM-generated metadata offered a promising, semi-automated alternative with
important trade-offs. Finally, FAISS-based retrieval excelled at reducing operational costs as well as
response times.
File(s)
Name
Papageorgiou-BSC-2025-abstract.pdf
Size
139.81 KB
Format
Adobe PDF
Checksum (MD5)
bc2a4fff77934b2000badfb6c7395dde

