An Australia based higher education institute “XYZ” has international students across the 10 countries of continent Asia. The institute stores student’s details and has branches across all the 10 countries in Hive. The data stored in Hive includes records such as student_id, student_name, student_school, home_country, and enrollment_year. Each table in hive can have one or more partition columns to organize Hive table and optimize it. Propose column(s) that can be picked up as partition key to the given table in Hive? Give justification for your selection.
Answer:
As mentioned above, we have huge amount of raw data of Australia based higher education institute, the data is collected from 10 countries of continent Asia.
Now we need to choose a partition column/s in order to organize Hive table and optimize it.
From the given records (i.e. tudent_id, student_name, student_school, home_country, and enrollment_year), we can make home_country as a partition column.
If we take home_country as partition key and perform partitions, we can able to get a number of partitions (10 partitions) which is equal to number of countries (10) present in the data. Such that each country's data can be viewed separately in partitions tables. With the help of this partition, we can look how many number of students are present in each country and can analyze which country has highest number of students.
Similarly, enrollment_year can also be used as a partition column, which helps in partitioning the data based on distinct number of years present in the data and we can check how many students enrolled in a specific year.
Hope I answered the questions.
If you have any doubts/queries, feel free to ask by commenting down below. I will respond within 24 hours
And if you like my answer, then please do upvote for it, your feedback really matters alot to me.
STAY HOME STAY SAFE
An Australia based higher education institute “XYZ” has international students across the 10 countries of continent...