Workings of Sqoop
For your data lake, you will almost certainly have to ingest data from traditional applications and data sources. Because of its volume, the ingested data will typically land in the Hadoop store. Apache Sqoop is one technology that lets you ingest data from these traditional enterprise data stores into Hadoop with ease.
SQL to Hadoop == SQOOP
The figure below (Figure 03) shows the basic workings of Apache Sqoop. In Sqoop's own terminology, its import tool moves data from an RDBMS into the Hadoop filesystem, and its export tool moves data from the Hadoop filesystem back to an RDBMS.
In our use case, we will be importing the data stored in an RDBMS (PostgreSQL) into the Hadoop Distributed File System (HDFS). We will not be looking at Sqoop's export capability in detail, but we will briefly cover it in this chapter so that you have a good understanding of the different capabilities of this tool.
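To make the two directions concrete, here is a minimal Python sketch that only assembles the command lines for the two Sqoop 1 tools and prints one of them; it does not run Sqoop. On the Sqoop command line, the import tool copies an RDBMS table into HDFS and the export tool writes HDFS files back to a table. The helper function names, the JDBC URL, the user, the table name, and all paths are hypothetical placeholders chosen for illustration.

```python
import shlex  # shlex.join requires Python 3.8 or later


def build_sqoop_import_args(jdbc_url, username, password_file,
                            table, target_dir, num_mappers=4):
    """Argument list for `sqoop import` (RDBMS table -> HDFS directory)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source
        "--username", username,
        "--password-file", password_file,  # file holding the database password
        "--table", table,                  # source table in the RDBMS
        "--target-dir", target_dir,        # destination directory in HDFS
        "--num-mappers", str(num_mappers), # number of parallel map tasks
    ]


def build_sqoop_export_args(jdbc_url, username, password_file,
                            table, export_dir):
    """Argument list for `sqoop export` (HDFS directory -> RDBMS table)."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,                  # destination table in the RDBMS
        "--export-dir", export_dir,        # source directory in HDFS
    ]


# Hypothetical values, for illustration only.
import_args = build_sqoop_import_args(
    jdbc_url="jdbc:postgresql://localhost:5432/sourcedb",
    username="sqoop_user",
    password_file="/user/sqoop/pg.password",
    table="customer",
    target_dir="/datalake/raw/customer",
)
export_args = build_sqoop_export_args(
    jdbc_url="jdbc:postgresql://localhost:5432/sourcedb",
    username="sqoop_user",
    password_file="/user/sqoop/pg.password",
    table="customer_summary",
    export_dir="/datalake/processed/customer_summary",
)

# Print the import command as it would be typed at a shell prompt.
print(shlex.join(import_args))
```

Building the command as a list rather than a single string keeps each option paired with its value and avoids quoting mistakes if the list is later handed to a process launcher.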
As of writing this book, Sqoop comes in two flavours, named after their major versions: Sqoop 1 and Sqoop 2. The sections below explain both and compare the two for easy understanding. In this book, as detailed earlier, we will be working with Sqoop 1, as Sqoop 2 is still a work in progress and we do not want to work around its inherent problems while constructing the code for our use case.
Below is a figure taken from the official Sqoop documentation; it shows the architecture of Apache Sqoop 1.
The workings of Sqoop are pretty straightforward, as shown conceptually in the preceding figure. The user interacts with Sqoop from the command prompt, using its various commands. When executed, these commands kick off map tasks in Hadoop, which connect to the supplied RDBMS (using JDBC, Java Database Connectivity), then connect to the Hadoop filesystem and store the data. One fundamental problem with Sqoop 1 stems from its use of JDBC for connectivity, which can be quite clunky for some use cases.
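The way several map tasks can share one table read can be illustrated with a small Python sketch. Sqoop's documentation describes parallel imports as dividing the range between the minimum and maximum of a splitting column (by default the primary key) among the map tasks, each of which then reads its own slice over JDBC. The code below is a simplified illustration of that idea, not Sqoop's actual implementation; the function names, table name, and column name are hypothetical, and the exact boundaries Sqoop computes may differ.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous
    half-open ranges (lo, hi), at most one per map task."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    lo = min_id
    for i in range(num_mappers):
        # The first `extra` tasks take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more tasks than keys: leave this task without work
        hi = lo + size
        ranges.append((lo, hi))
        lo = hi
    return ranges


def mapper_queries(table, split_column, min_id, max_id, num_mappers):
    """Build the bounded SELECT each map task would issue in this sketch."""
    return [
        f"SELECT * FROM {table} WHERE {split_column} >= {lo} AND {split_column} < {hi}"
        for lo, hi in split_ranges(min_id, max_id, num_mappers)
    ]


# Hypothetical table and key range, for illustration only.
for query in mapper_queries("customer", "id", 1, 100, 4):
    print(query)
```

Each range is disjoint from the others, so the tasks can run independently and write their results as separate files in HDFS. A column with unevenly distributed values would give the tasks unequal amounts of work under this scheme.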
The next section gives the reader a glimpse of Sqoop 2, which is the logical next step after Sqoop 1.