Big Data Architect’s Handbook

Apache HBase

HBase is another Apache project, designed to serve as a NoSQL data store. It is built to take advantage of the Hadoop ecosystem's features, including reliability and fault tolerance, and it uses HDFS as its file system for storage. NoSQL covers several data models, and Apache HBase belongs to the column-oriented one. HBase was originally based on Google's Bigtable, which also follows the column-oriented model for unstructured data.

HBase stores everything in the form of key-value pairs. The important thing to note is that in HBase, both the key and the value are bytes. So, to store any information in HBase, you have to convert it into bytes first. (In other words, its API doesn't accept anything other than a byte array.) Be careful here: HBase does not record the original type of the data you store, so you have to remember it yourself. A value always comes back as a byte array, and if you convert it back using the wrong type (for example, reading bytes that were originally a string as a number), you will get a wrong value or an error. The result is a bug that can crash your application.
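The pitfall described above can be sketched in a few lines of Python. This is only an illustration, not HBase client code: a plain dictionary stands in for a table, and the helper names (`put`, `encode_long`, and so on) and the row keys are hypothetical. The 8-byte big-endian layout for integers is an assumption chosen for the example.

```python
import struct

# A plain dict stands in for an HBase table: keys and values are raw bytes only.
table = {}

def put(row_key: bytes, value: bytes) -> None:
    """Store a value, rejecting anything that is not already bytes."""
    if not isinstance(row_key, bytes) or not isinstance(value, bytes):
        raise TypeError("keys and values must be bytes")
    table[row_key] = value

# Hypothetical helpers: the caller must pick the matching pair when reading back.
def encode_str(s: str) -> bytes:
    return s.encode("utf-8")

def decode_str(b: bytes) -> str:
    return b.decode("utf-8")

def encode_long(n: int) -> bytes:
    return struct.pack(">q", n)  # 8 bytes, big-endian (assumed layout)

def decode_long(b: bytes) -> int:
    return struct.unpack(">q", b)[0]

put(b"user1:name", encode_str("Alice"))
put(b"user1:visits", encode_long(42))

# Correct: decode each value with the type it was stored as.
name = decode_str(table[b"user1:name"])
visits = decode_long(table[b"user1:visits"])

# Wrong type, silent failure: the number's bytes decode to a meaningless string.
garbled = decode_str(table[b"user1:visits"])

# Wrong type, loud failure: "Alice" is 5 bytes, but a long needs exactly 8.
try:
    decode_long(table[b"user1:name"])
    failed = False
except struct.error:
    failed = True
```

Nothing in the stored bytes says which decoder is the right one, so in practice the type of each column is usually recorded somewhere outside the data itself, such as in application code or a schema document.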

HBase is written in Java, and it exposes an API for Java integration. Other programming languages have to use its RESTful web services or the Thrift gateway.

HBase has no built-in SQL tool, so as an alternative you have to use another application or a third-party tool to write SQL queries that run directly against HBase. Projects that provide this include Apache Phoenix, Hive, and Presto.

When we think about the design of an HBase database, we should not treat it as a relational database in which you try to find the relationships between objects. If you think of it that way, you will end up with a poorly performing, unoptimized database. In relational database design, you first identify the relationships between entities, then build indexes and tune performance so that your SQL queries return results quickly. In HBase, it is the opposite: you work out your queries first, so that you know how the data will be accessed. You then normalize or denormalize your data based on which entities are accessed together, not on the basis of their relationships.
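As a rough sketch of this query-first approach, consider a hypothetical access pattern: "show a customer together with all of their orders". The example below is not HBase code. Plain dictionaries stand in for tables, and the table names, row keys, and `family:qualifier` style column names are all made up for illustration.

```python
# Relational-style layout: one table per entity, combined at read time.
customers = {b"c1": {b"info:name": b"Alice"}}
orders = {
    b"o1": {b"customer": b"c1", b"total": b"30"},
    b"o2": {b"customer": b"c2", b"total": b"12"},
    b"o3": {b"customer": b"c1", b"total": b"55"},
}

def read_normalized(customer_id: bytes) -> dict:
    """Answer the query with a lookup plus a scan over every order."""
    row = dict(customers[customer_id])
    for order_id, order in orders.items():
        if order[b"customer"] == customer_id:
            row[b"order:" + order_id] = order[b"total"]
    return row

# Query-first layout: data that is read together is stored together,
# so each customer gets one wide row that already contains the orders.
customer_rows = {
    b"c1": {
        b"info:name": b"Alice",
        b"order:o1": b"30",
        b"order:o3": b"55",
    },
}

def read_denormalized(customer_id: bytes) -> dict:
    """Answer the same query with a single lookup by row key."""
    return customer_rows[customer_id]
```

Both functions return the same data for customer `c1`, but the second layout was shaped by the query rather than by the relationship between customers and orders. The trade-off is that a different access pattern, such as "find all orders above a certain total", may need its own layout.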

Apache HBase is a free and open-source tool for unstructured data management. We will discuss it in detail in the next chapter, which is all about NoSQL databases and their related frameworks.