How to build an EDW on Hadoop?

How to build an EDW on Hadoop using Talend?

Hadoop can serve both as a destination data warehouse and as an efficient staging and ETL source for an existing data warehouse.

Model 1 – Main steps to build an Exploratory DW:

  • Copy the source tables into HDFS with Talend for Big Data
  • Declare the schema (Hive or Impala); no data is copied or reloaded, this step only creates the metadata in HCatalog (see the sketch after this list)
  • Use SQL or plug in BI tools
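
As a rough illustration of the "declare the schema" step, the DDL below shows what a Hive external table over the landed files could look like; the path, table name, and columns are purely illustrative, and the real structure depends on the source tables copied by the Talend job.

```sql
-- Minimal sketch: expose a raw HDFS staging directory through Hive/HCatalog.
-- No data is copied or reloaded; the table is only metadata over the files
-- that the Talend job already wrote. Path, table, and columns are illustrative.
CREATE EXTERNAL TABLE staging_orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  STRING,
  amount      DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/staging/orders';

-- The table is now queryable from Hive, Impala, or any BI tool
-- that reads the HCatalog metadata.
SELECT COUNT(*) FROM staging_orders;
```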

Model 2 – Main steps to build a High-Performance DW on HDFS (x10)

  • Copy the source tables into HDFS with Talend for Big Data
  • ***Create Parquet columnar HDFS files*** (see the sketch after this list)
  • Declare the schema (Hive or Impala); no data is copied or reloaded, this step only creates the metadata in HCatalog
  • Use SQL or plug in BI tools
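
As a minimal sketch of the Parquet step, and assuming a raw staging table like the staging_orders example above, a one-off CTAS statement can rewrite the data into Parquet columnar files; table and column names are illustrative.

```sql
-- Minimal sketch: rewrite the raw staging data into a Parquet-backed table.
-- This is a one-off conversion; subsequent Hive/Impala queries read the
-- columnar Parquet files instead of the raw text files.
CREATE TABLE dw_orders
STORED AS PARQUET
AS
SELECT order_id,
       customer_id,
       order_date,
       amount
FROM   staging_orders;
```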

Hadoop DW High performance

Good to know:

  • Parquet is not a database; it is a columnar file format
  • Parquet data can be updated and its schema can be modified
  • Querying Parquet data with Hive or Impala gives at least a 10x performance gain over raw HDFS files
  • Hive launches MapReduce jobs: ideal for ETL and for transfers to a conventional EDW
  • Impala runs individual queries in memory: ideal for interactive queries in a Hadoop destination DW, with roughly a 10x performance gain over Hive (an example query follows this list)
  • Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
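
For instance, an interactive aggregation over the Parquet table sketched above could be run directly from impala-shell (table and column names are illustrative):

```sql
-- Illustrative interactive query against the Parquet table.
-- Impala executes it in memory, without launching MapReduce jobs.
SELECT customer_id,
       COUNT(*)    AS nb_orders,
       SUM(amount) AS total_amount
FROM   dw_orders
GROUP  BY customer_id
ORDER  BY total_amount DESC
LIMIT  20;
```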

DW schema

Building + populating the Dim table

DimTableSchema
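
In the article this step is built as a Talend job; as a rough SQL equivalent, and with hypothetical table and column names, building and populating a customer dimension in Parquet could look like this:

```sql
-- Illustrative sketch only: a customer dimension stored as Parquet,
-- populated from a hypothetical staging_customers table. The actual
-- job in this walkthrough is built with Talend for Big Data.
CREATE TABLE dim_customer (
  customer_key  BIGINT,
  customer_id   BIGINT,
  customer_name STRING,
  country       STRING
)
STORED AS PARQUET;

INSERT INTO TABLE dim_customer
SELECT ROW_NUMBER() OVER (ORDER BY customer_id) AS customer_key,
       customer_id,
       customer_name,
       country
FROM   staging_customers;
```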

Preparing the queries
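
A typical query against the resulting star schema then joins the fact table with its dimensions; continuing the illustrative names used above, for example:

```sql
-- Illustrative star-schema query: join the fact table with the customer
-- dimension and aggregate revenue per country.
SELECT d.country,
       SUM(f.amount) AS revenue
FROM   dw_orders    f
JOIN   dim_customer d
  ON   f.customer_id = d.customer_id
GROUP  BY d.country
ORDER  BY revenue DESC;
```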