Cyprus University of Technology

Data Lake Semantic Enrichment via Traditional Systems and LLMs

Date Issued
May 2025
Author(s)
Papageorgiou, Panagiotis  
Advisor
Andreou, Andreas S.  
Abstract
In today's era, Big Data is often referred to as the "new oil". Businesses rely heavily on Data Lakes to
store massive amounts of heterogeneous data; however, without proper metadata mechanisms in place,
these repositories turn into data swamps. This thesis explores alternative systems for traditional semantic
enrichment of data sources by adapting a pre-existing semantic blueprint model in Apache Hive and comparing
it, in terms of insertion time, query performance and storage efficiency, against a well-established
system, Apache Jena. Experimental results show that Hive offers significant scalability and storage-efficiency
benefits over Jena, whereas Jena is more suitable for small to medium-sized Data Lakes that require
dynamic schema evolution, involve complex relationships between data sources and issue metadata-retrieval
queries infrequently. Additionally, this thesis explores the feasibility of LLM-driven approaches to
semantic enrichment by proposing two novel pipelines and evaluating four different configurations. The
results demonstrate that LLMs can be used as an alternative solution but often rely on high-quality metadata
to reach their best accuracy. Expert-curated metadata produced the highest accuracy and low
response times, while LLM-generated metadata offered a promising, semi-automated alternative with
important trade-offs. Finally, FAISS-based retrieval excelled at reducing operational costs as well as
response times.
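
The abstract does not describe how the retrieval step is built, so the following is only a minimal sketch of the general idea behind similarity-based metadata retrieval: each data source's description is turned into a vector, and a query is matched to the closest descriptions. It deliberately swaps in a toy bag-of-words embedding and a brute-force cosine search in place of a learned embedding model and a FAISS index. The names `embed`, `cosine`, `top_k`, and the `SOURCES` catalogue are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words vector (assumption: stands in for a learned embedding)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, descriptions, k=2):
    """Return the names of the k sources whose descriptions best match the query.

    Brute-force search over all descriptions (assumption: a vector index such
    as FAISS would replace this loop at scale). Ties are broken by name.
    """
    vocab = sorted({w for d in descriptions.values() for w in d.lower().split()})
    q = embed(query, vocab)
    scored = [(cosine(q, embed(d, vocab)), name) for name, d in descriptions.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

# Hypothetical metadata catalogue: source name -> short description.
SOURCES = {
    "sales_2024": "daily retail sales transactions per store",
    "sensor_logs": "temperature and humidity sensor readings per machine",
    "hr_records": "employee contracts and payroll records",
}

print(top_k("temperature sensor readings", SOURCES, k=1))
```

In a pipeline of this shape, only the few best-matching descriptions would be passed on to the LLM rather than the whole catalogue, which is one plausible reason retrieval can lower cost and response time.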
Subjects

Large Language Models...

FAISS

Data Lakes

Semantic Enrichment

File(s)
Name
Papageorgiou-BSC-2025-abstract.pdf
Size
139.81 KB
Format
Adobe PDF
Checksum (MD5)
bc2a4fff77934b2000badfb6c7395dde
