Apache Hive is a data warehouse system built on top of Hadoop and is used for...

Question

Apache Hive is a data warehouse system built on top of Hadoop and is used for analysing structured and semi-structured data. It provides a mechanism to project structure onto the data and perform queries written in a similar way to SQL statements.

Suppose the company “XYZ” has installed Apache Hive on top of the Hadoop cluster using default metastore configuration. Discuss in detail the concept of metastore in Hive and the significance for analysing company XYZ’s data. Explain what will happen if they have multiple clients trying to access Hive at the same time?

engineering Computer-Science

Answer 1

Discuss in detail the concept of metastore in Hive and the significance for analysing company XYZ’s data.

Ans: The metastore is the central repository of Hive metadata. It stores the metadata for Hive tables, such as their schemas and storage locations, in a relational database, and it gives clients access to this information through the metastore service API.
The Hive metastore consists of two fundamental units:

A service that provides metastore access to other Apache Hive services.
Disk storage for the Hive metadata which is separate from HDFS storage.

When you run a Hive query with the default Derby database, a new sub-directory, metastore_db, appears in your current working directory; the metastore is created there if it does not already exist. The property that controls this is javax.jdo.option.ConnectionURL.

The default value of this property is jdbc:derby:;databaseName=metastore_db;create=true. This value specifies that you will be using the embedded Derby as your Hive metastore, and the location of the metastore is metastore_db.
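To make the structure of that default value concrete, here is a minimal Python sketch that splits an embedded-Derby JDBC URL into its attributes. The helper name parse_derby_url is hypothetical and is not part of Hive or Derby, and the sketch ignores Derby sub-protocols such as in-memory databases.

```python
def parse_derby_url(url: str) -> dict:
    """Split an embedded-Derby JDBC URL into its attributes (illustrative helper)."""
    prefix = "jdbc:derby:"
    if not url.startswith(prefix):
        raise ValueError(f"not a Derby JDBC URL: {url!r}")
    # After the prefix the URL is ';'-separated: an optional database name
    # comes first, followed by key=value attributes.
    first, *attrs = url[len(prefix):].split(";")
    parsed = {"databaseName": first} if first else {}
    for attr in attrs:
        if attr:
            key, _, value = attr.partition("=")
            parsed[key] = value
    return parsed


default_url = "jdbc:derby:;databaseName=metastore_db;create=true"
print(parse_derby_url(default_url))
# {'databaseName': 'metastore_db', 'create': 'true'}
```

Because databaseName is a relative path, the metastore_db directory ends up wherever Hive was started, which is why each working directory can end up with its own separate metastore.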

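We can also configure the directory in which Hive stores table data. The hive.metastore.warehouse.dir property sets the warehouse location, which defaults to /user/hive/warehouse on the default file system (file:///user/hive/warehouse when Hive runs against the local file system). These properties can be overridden in the hive-site.xml file, which is also where a local or remote metastore is configured.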

There are three modes of Hive metastore deployment:

a) Embedded Metastore: By default, the metastore service runs in the same JVM as the Hive service and uses an embedded Derby database stored on the local file system. The limitation of this mode is that only one embedded Derby instance can access the database files on disk at any one time, so only one Hive session can be open at a time.

b) Local Metastore: Hive is a data-warehousing framework, so a single session is rarely sufficient. The Local Metastore mode was introduced to overcome this limitation of the Embedded Metastore. It allows many Hive sessions, i.e. many users can use the metastore at the same time. This is achieved by using any JDBC-compliant database, such as MySQL, which runs in a separate process or on a different machine from the Hive service and the metastore service, while those two services still run together in the same JVM.

c) Remote Metastore: In this mode, the metastore service runs in its own JVM, separate from the Hive service JVM. Other processes communicate with the metastore server through the Thrift network API. One or more metastore servers can be run in this mode to provide higher availability. It also improves manageability and security, because the database tier can be completely firewalled off and the database credentials no longer need to be shared with each Hive user.
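The three modes can be told apart by two configuration properties, hive.metastore.uris and javax.jdo.option.ConnectionURL. The following is a rough Python sketch of that decision, not Hive's own logic: the function metastore_mode and the host names are made up for illustration, and it assumes the properties have already been read from hive-site.xml into a dictionary.

```python
DEFAULT_URL = "jdbc:derby:;databaseName=metastore_db;create=true"


def metastore_mode(props: dict) -> str:
    """Guess the metastore deployment mode from Hive properties (illustrative)."""
    # A non-empty hive.metastore.uris means clients reach a separate
    # metastore server over Thrift.
    if props.get("hive.metastore.uris", "").strip():
        return "remote"
    url = props.get("javax.jdo.option.ConnectionURL", DEFAULT_URL)
    # An embedded Derby URL has no network location; the Derby network
    # client form starts with "jdbc:derby://".
    if url.startswith("jdbc:derby:") and not url.startswith("jdbc:derby://"):
        return "embedded"
    # Any other JDBC database (MySQL, PostgreSQL, ...) with the metastore
    # service still inside the Hive JVM.
    return "local"


# Hypothetical host names, for illustration only.
print(metastore_mode({}))
print(metastore_mode({"javax.jdo.option.ConnectionURL": "jdbc:mysql://dbhost/metastore"}))
print(metastore_mode({"hive.metastore.uris": "thrift://metastore-host:9083"}))
# embedded
# local
# remote
```

An empty dictionary stands for an untouched installation such as company XYZ's, which is why it is classified as embedded.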

Since company XYZ has installed Apache Hive with the default metastore configuration, it is running the embedded metastore. The metastore is what makes analysis possible: because it records the schema and location of every table, Hive can map XYZ’s raw files to tables and run SQL-like queries over them. This analysis matters because it lets the company extract information from the raw data and use it for future decisions, predictions and so on.

If a graphical client is used, data in the Hive tables can also be analyzed by clicking the settings icon and choosing the analyze data option from the list that appears.

Explain what will happen if they have multiple clients trying to access Hive at the same time?

Ans: The default (embedded) metastore configuration allows only one Hive session to be open at a time. If multiple clients try to access the metastore at the same time, the first one succeeds and the others get an error, because the embedded Derby database files are already in use. To allow concurrent access by multiple clients, company XYZ has to use a standalone database for the metastore, i.e. the local or remote metastore configuration.
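The single-session behaviour can be pictured with a toy model. The Python sketch below is not Derby's actual locking code; it only mimics the idea of an exclusive lock on the database directory, and the class EmbeddedStore and the file name session.lock are invented for the illustration.

```python
import os
import tempfile


class EmbeddedStore:
    """Toy stand-in for an embedded database that allows one owner at a time."""

    def __init__(self, db_dir: str):
        self.lock_path = os.path.join(db_dir, "session.lock")  # hypothetical name
        self.fd = None

    def open(self) -> None:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            self.fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise RuntimeError("metastore already in use by another session") from None

    def close(self) -> None:
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.lock_path)
            self.fd = None


def try_two_sessions() -> list:
    """Open two sessions on one store and record what happens to each."""
    events = []
    with tempfile.TemporaryDirectory() as db_dir:
        first, second = EmbeddedStore(db_dir), EmbeddedStore(db_dir)
        first.open()
        events.append("first: opened")
        try:
            second.open()
            events.append("second: opened")
        except RuntimeError as err:
            events.append(f"second: {err}")
        first.close()
        second.open()  # succeeds once the first session has released the lock
        events.append("second: opened after first closed")
        second.close()
    return events


for line in try_two_sessions():
    print(line)
```

With a local or remote metastore, the database server arbitrates concurrent connections itself, so no such directory-level exclusivity applies.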

Hope I answered the questions.

If you have any doubts or queries, feel free to ask by commenting below. I will respond within 24 hours.

And if you like my answer, please upvote it; your feedback really matters a lot to me.

STAY HOME STAY SAFE
