Design and Implementation of Data Lakehouse Architecture for Self-Service Analytics

Soon Kien Yuan Soon; Nor Azuana Ramli; Mohd Zaid Waqiyuddin Mohd Zulkifli

doi:10.58915/amci.v15i2.2267

Authors

Soon Kien Yuan Soon, Centre for Mathematical Science, Universiti Malaysia Pahang Al-Sultan Abdullah, Lebuhraya Persiaran Tun Khalil Yaakob, 26300 Gambang, Pahang, Malaysia.
Nor Azuana Ramli, Centre for Mathematical Science, Universiti Malaysia Pahang Al-Sultan Abdullah, Lebuh Persiaran Tun Khalil Yaakob, 26300 Kuantan, Pahang, Malaysia. https://orcid.org/0000-0002-4158-2890
Mohd Zaid Waqiyuddin Mohd Zulkifli, Credence, 1 Jalan Damansara, Damansara Kim, 60000 W.P. Kuala Lumpur, Malaysia.

Keywords:

architecture, data lakehouse, data warehouse, cloud, data management

Abstract

This paper focuses on designing a data lakehouse architecture for self-service analytics. The objectives are to create a collaborative analytics environment, streamline the management of multiple extract, transform, and load (ETL) processes, adopt a cost-effective and non-proprietary architecture, integrate with business intelligence (BI) tools, ensure high query performance for interactive visualization, enable data warehousing capabilities, and offer a self-service data discovery and metadata platform. The research followed an iterative development methodology comprising requirement gathering and planning, design, implementation, testing, deployment, and maintenance phases. The logical design comprises six layers: data ingestion, storage, catalog, semantics, processing, and consumption. For the physical design, Dremio was used as the core component, while Apache Iceberg was used for the data format and query processing. The case study presented in this paper adopted an Integrated Multi-Zone Analytics Framework to handle data tasks and workloads. The paper concludes by suggesting future enhancements, such as considering the Dremio Enterprise Edition for its advanced features and exploring Databricks and MLflow if extensive machine learning workloads are expected. These enhancements could further improve the architecture and its outcomes.
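
As a minimal illustration of the six-layer logical design, the Python sketch below lists the layers in the order given in the abstract and checks which of them a set of components leaves uncovered. The names Layer, missing_layers, and example are hypothetical, and the placement of Dremio and Apache Iceberg in particular layers is an assumption made for illustration only; the abstract does not state this mapping.

```python
from enum import Enum


class Layer(Enum):
    """The six logical layers, in the order the abstract lists them."""
    INGESTION = 1
    STORAGE = 2
    CATALOG = 3
    SEMANTICS = 4
    PROCESSING = 5
    CONSUMPTION = 6


def missing_layers(components: dict[str, Layer]) -> list[Layer]:
    """Return the layers that no component in `components` is assigned to."""
    covered = set(components.values())
    # Iterating over an Enum yields members in definition order.
    return [layer for layer in Layer if layer not in covered]


# Hypothetical assignment: only Dremio and Apache Iceberg are named in the
# abstract, and placing them in these layers is an assumption.
example = {
    "Apache Iceberg": Layer.STORAGE,
    "Dremio": Layer.PROCESSING,
}
```

Calling missing_layers(example) returns the layers that still need a component, which is one simple way to check a candidate physical design against the logical one.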