How to configure a Hive metastore if you don't have one?

If you don't have a Hive metastore, this step-by-step guide will walk you through creating one. Before we jump into that, it's important to understand what the Hive metastore brings to the table.

The Hive metastore is the component that stores all the structural information (metadata) about objects such as tables and partitions in the warehouse, including column and column-type information. To learn more about the Hive metastore and its architecture, please refer to this article: https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore

Now let's set up the Hive metastore.

We have tried to simplify this process as much as possible, but if you still face any issues, you can email us at hello@e6data.com.

  1. We have created Terraform scripts so that you can launch a Hive metastore in your own environment.

  2. Alternatively, you can build your own metastore (on AWS or GCP) or use GCP's managed metastore service, Dataproc Metastore.

  3. Once the Hive metastore is set up, you need to:

    1. Create a schema SQL file according to your S3/GCS data

    2. Connect Presto with the Hive metastore

    3. Run the schema SQL in Presto to populate the schema and statistics in the Hive metastore

Create a schema SQL file according to your S3/GCS data

You need to prepare the SQL file according to the data you want to query. We have provided sample SQL files for you to try out.
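
For illustration, here is a minimal sketch of such a schema SQL file, assuming Parquet files under a hypothetical s3://my-bucket/events/ path partitioned by a ds date column. Every name and path here is a placeholder; adapt them to your own bucket and layout, and use a gs:// path on GCS.

-- Create a database (schema) for the tables (placeholder location).
CREATE SCHEMA IF NOT EXISTS hive.demo_db
WITH (location = 's3://my-bucket/warehouse/');

-- Create an external, partitioned table over the existing data.
-- Partition columns must come last in the column list.
CREATE TABLE IF NOT EXISTS hive.demo_db.events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type VARCHAR,
    ds         VARCHAR
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['ds'],
    external_location = 's3://my-bucket/events/'
);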

Connect Presto with the Hive metastore

Presto, Spark, and many other engines can add schemas to the Hive metastore; we prefer Presto. If you don't have Presto, we have already created Terraform scripts so that you can launch it in your own environment.
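
As a rough sketch of the connection itself: Presto talks to the Hive metastore through its Hive connector, which is configured by a catalog properties file pointing at the metastore's Thrift endpoint. Assuming a metastore reachable at a hypothetical metastore-host on the default port 9083, etc/catalog/hive.properties on the Presto coordinator and workers would look something like this (on EMR and Dataproc this file is usually already configured for you; restart Presto after editing it):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083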

Run the schema SQL in Presto to populate the schema and statistics in the Hive metastore

Based on the schema SQL file, the following operations will be performed (a sketch of a complete file follows the list).

  1. Create Database

  2. Create Table

  3. Repair Table (only if the table is partitioned)

  4. Analyze Table: collect statistics about the table so they can be used for cost-based optimization
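
To make these steps concrete, here is a hedged sketch of what a complete input.sql might look like, continuing the placeholder names from the earlier example. Note that Presto exposes Hive's MSCK REPAIR TABLE as the Hive connector's system.sync_partition_metadata procedure, and that running the file with --catalog hive lets the unqualified names resolve against the Hive catalog.

-- 1. Create database
CREATE SCHEMA IF NOT EXISTS demo_db;

-- 2. Create table over the existing data (placeholder columns and path)
CREATE TABLE IF NOT EXISTS demo_db.events (
    event_id BIGINT,
    ds       VARCHAR
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['ds'],
    external_location = 's3://my-bucket/events/'
);

-- 3. Repair table (only for partitioned tables): register the partitions
--    that already exist in S3/GCS with the metastore.
CALL system.sync_partition_metadata('demo_db', 'events', 'ADD');

-- 4. Analyze table: collect statistics for the cost-based optimizer.
ANALYZE demo_db.events;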

For AWS: Log in to the Presto machine on EMR and run the following command:

./presto-cli --catalog hive --file input.sql

For GCP: Log in to the Presto machine on Dataproc and run the following command:

./presto --catalog hive --file input.sql
