Design and Implementation of Data Lakehouse Architecture for Self-Service Analytics
DOI:
https://doi.org/10.58915/amci.v15i2.2267
Keywords:
architecture, data lakehouse, data warehouse, cloud, data management
Abstract
This paper presents the design of a data lakehouse architecture for self-service analytics. The objectives are to create a collaborative analytics environment, streamline the management of multiple extract, transform, and load (ETL) processes, adopt a cost-effective and non-proprietary architecture, integrate with business intelligence (BI) tools, ensure high query performance for interactive visualization, enable data warehousing capabilities, and offer a self-service data discovery and metadata platform. The research followed an iterative development methodology comprising requirement gathering and planning, design, implementation, testing, deployment, and maintenance phases. The logical design comprises six layers: data ingestion, storage, catalog, semantics, processing, and consumption. For the physical design, Dremio was used as the core component, while Apache Iceberg was used for data format and query processing. The case study adopted an Integrated Multi-Zone Analytics Framework to handle data tasks and workloads. The paper concludes by suggesting future enhancements that could further improve the architecture and its outcomes, such as considering the Dremio Enterprise Edition for advanced features and exploring Databricks and MLflow if extensive machine learning workloads are expected.


