An Evaluation of Apache Jena and Apache Hive for Data Meshes: Using Semantic Enrichment and LLMs for data products recommendations
Date Issued
May 2025
Author(s)
Advisor
Abstract
This thesis investigates two key challenges in the world of Data Lakes: the semantic organization of
metadata and the automated generation of meaningful data products using an LLM. The first part of
the thesis evaluates which system is most suitable for implementing a metadata enrichment mechanism,
specifically comparing Apache Hive and Apache Jena in terms of scalability, query efficiency and storage
performance. Experimental results demonstrate that Hive outperforms Jena in large-scale environments.
The second part of the thesis proposes a framework in which an LLM autonomously suggests data products
when provided with a user-defined concept, metadata retrieved from Hive and sample records
from HDFS. The purpose of this study is to examine whether the LLM can act as a domain expert by
reasoning over structured and unstructured input and by performing external web searches. Specifically,
the framework is evaluated across two domains and three complexity levels, measuring the precision and
quality of the suggested data products. The results show that while the LLM performs well in simple
scenarios, its effectiveness declines as concept complexity and dataset pool size increase.
File(s)
Name
AF-BSC-2025-abstract.pdf
Size
138.42 KB
Format
Adobe PDF
Checksum (MD5)
7185ede6c31aa2af1673a339e7bcf2a0

