Tutorial : Load Parquet files using Talend and Impala

Tutorial : Load Parquet files using Talend and Impala

Tutorial : Load Parquet files using Talend and Impala

Impala helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is especially good for queries scanning particular columns within a table, for example to query “wide” tables with many columns, or to perform aggregation operations such as SUM() and AVG() that need to process most or all of the values from a column. Each data file contains the values for a set of rows (the “row group”). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column. Queries against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.

In this example above, we will do the following :

 

 

 

Leave a Reply