Lambda Architecture-driven Data Lake
As discussed in earlier sections, there are multiple ways of processing data, but they can be broadly classified into batch and real-time processing. In some scenarios one of them alone provides the desired outcome; other scenarios need data from both the batch and the real-time processing components. This leads to the problem of merging batch data with real-time data. The Lambda Architecture pattern addresses this problem and is discussed in further detail in the next chapter. Here, we give an initial view of a Lambda Architecture-driven data lake.
Lambda Architecture, as a pattern, provides a way to perform highly scalable and performant distributed computing on large data sets while (eventually) providing consistent data with the required processing, both in batch and in near real time. It defines how to enable a scale-out architecture across the various data load profiles in an enterprise, with low latency expectations.
Figure 02: Layers in a Data Lake
The Lambda Architecture pattern achieves this by dividing the overall architecture into layers. Each of these layers is covered at a high level in the following sections of this chapter.
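To make the idea of merging batch data with real-time data concrete, the following minimal Python sketch may help. It assumes the commonly described split of a Lambda Architecture into a batch layer, a speed layer, and a serving layer; the layers shown in the figure may be named or grouped differently. All names here (build_batch_view, SpeedView, query, the sample events, and the cutoff) are hypothetical and chosen only for illustration. A real data lake would use distributed processing frameworks rather than in-memory counters.

```python
from collections import Counter

def build_batch_view(master_events, cutoff):
    """Batch layer (illustrative): recompute counts from all events up to the cutoff."""
    view = Counter()
    for ts, key in master_events:
        if ts <= cutoff:
            view[key] += 1
    return view

class SpeedView:
    """Speed layer (illustrative): incrementally counts events newer than the cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.counts = Counter()

    def update(self, ts, key):
        # Only events not yet covered by the batch view are counted here.
        if ts > self.cutoff:
            self.counts[key] += 1

def query(batch_view, speed_view, key):
    """Serving layer (illustrative): merge the batch and real-time views at query time."""
    return batch_view[key] + speed_view.counts[key]

# Hypothetical sample data: (timestamp, page) pairs.
events = [(1, "page_a"), (2, "page_b"), (3, "page_a"), (4, "page_a"), (5, "page_b")]
cutoff = 3  # the last timestamp included in the most recent batch run

batch_view = build_batch_view(events, cutoff)
speed_view = SpeedView(cutoff)
for ts, key in events:
    speed_view.update(ts, key)

total_a = query(batch_view, speed_view, "page_a")
total_b = query(batch_view, speed_view, "page_b")
print(total_a, total_b)
```

The design choice in this sketch is that the batch view and the real-time view cover disjoint time ranges, so the merge is a simple sum. When the next batch run completes, the cutoff moves forward and the real-time view can be discarded and rebuilt from that point.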