An Evaluation of Apache Jena and Apache Hive for Data Meshes: Using Semantic Enrichment and LLMs for data products recommendations

Photiou, Artemis

Date Issued

May 2025

Author(s)

Photiou, Artemis

Advisor

Andreou, Andreas S.

Abstract

This thesis investigates two key challenges in Data Lakes: the semantic organization of
metadata and the automated generation of meaningful data products using a large language model (LLM). The first part of
the thesis evaluates which system is most suitable for implementing a metadata enrichment mechanism,
specifically comparing Apache Hive and Apache Jena in terms of scalability, query efficiency and storage
performance. Experimental results demonstrate that Hive outperforms Jena in large-scale environments.
The second part of the thesis proposes a framework in which an LLM autonomously suggests data products
from three inputs: a user-defined concept, metadata retrieved from Hive, and sample records
from HDFS. The purpose of this study is to examine whether the LLM can act as a domain expert by
reasoning over structured and unstructured input and by performing external web searches. Specifically,
the framework is evaluated across two domains and three complexity levels, measuring the precision and
quality of the suggested data products. The results show that while the LLM performs well in simple
scenarios, its effectiveness declines as concept complexity and dataset pool size increase.
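
As an illustration of the precision measure mentioned in the abstract, the following is a minimal sketch that assumes the conventional definition: the share of suggested data products that a reviewer judged relevant. The abstract does not state how the thesis computes precision, so the function name, the dataset names and the relevance judgments below are all hypothetical.

```python
def precision(suggested, relevant):
    """Fraction of suggested data products that were judged relevant.

    Assumption: standard set-based precision; the thesis may define it differently.
    """
    suggested = set(suggested)
    if not suggested:
        # No suggestions at all: report 0.0 instead of dividing by zero.
        return 0.0
    return len(suggested & set(relevant)) / len(suggested)


# Hypothetical LLM suggestions for one concept, and hypothetical reviewer judgments.
suggested = ["sales_by_region", "customer_churn", "weather_daily", "store_footfall"]
relevant = ["sales_by_region", "customer_churn", "store_footfall", "promo_calendar"]

score = precision(suggested, relevant)
print(f"precision = {score:.2f}")
```

Here three of the four suggestions are in the relevant set, so the score is 0.75. Repeating the calculation per domain and per complexity level would give the kind of comparison the abstract describes.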

Subjects

Semantic Enrichment

Apache Hive

Apache Jena

Data Lakes

Large Language Models...

File(s)

Name

AF-BSC-2025-abstract.pdf

Size

138.42 KB

Format

Adobe PDF

Checksum (MD5)

7185ede6c31aa2af1673a339e7bcf2a0
