Data Lake for Enterprises

Speed layer

The speed layer, also known as the real-time layer, caters (as the name suggests) to real-time analysis requirements. The batch layer runs at specified intervals, and between the completion of one batch execution and the start of the next, business users still want analysis conducted on the latest data. Complementing the batch views with this real-time data is the responsibility of the speed layer. The batch processing window can be reduced, but because the batch layer deals with a large amount of data, its processing usually takes time, and the business cannot wait out this lag. To make data available for analysis in near real time, new data is fed incrementally to the speed layer in a low-latency fashion. Once the batch layer has executed and caught up with the data held in the speed layer, the corresponding speed layer views are discarded and the process continues.
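The discard step at the end of this cycle can be sketched as follows. This is a minimal illustration, not the book's implementation: the `SpeedView` class, its method names, and the numeric `batch_watermark` (the timestamp up to which the batch layer has processed data) are all assumptions made for the example.

```python
from collections import Counter


class SpeedView:
    """Hypothetical in-memory speed view that counts events per key."""

    def __init__(self):
        # Each entry is (event_timestamp, key); kept so old entries can be dropped.
        self._events = []

    def append(self, timestamp, key):
        """Record one new event as it arrives (low-latency path)."""
        self._events.append((timestamp, key))

    def discard_up_to(self, batch_watermark):
        """Drop events the batch layer has already absorbed into its views."""
        self._events = [(ts, key) for ts, key in self._events if ts > batch_watermark]

    def counts(self):
        """Return the current real-time counts per key."""
        return dict(Counter(key for _, key in self._events))


view = SpeedView()
view.append(100, "customer-1")
view.append(101, "customer-2")
view.append(205, "customer-1")

# Assume the batch layer has now processed everything up to timestamp 200.
view.discard_up_to(200)
remaining = view.counts()  # only the event at timestamp 205 is left
```

Keeping the event timestamp alongside each entry is one simple way to know which part of the speed view the batch layer has made redundant.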

When a user fires a query, it runs against both the batch and speed layers. The two results are then merged to give the user an answer that matches the parameters sent with the query.
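As a rough sketch of this merge, assume both layers expose their views as plain dictionaries that map a customer ID to a count. The function name `query_counts`, the dictionary layout, and the sample data are hypothetical; a real deployment would query actual view stores.

```python
def query_counts(batch_view, speed_view, customer_ids):
    """Merge batch and speed results for the requested customers.

    Assumes the two views cover disjoint time ranges, so counts can be added.
    """
    return {
        cid: batch_view.get(cid, 0) + speed_view.get(cid, 0)
        for cid in customer_ids
    }


batch_view = {"customer-1": 120, "customer-2": 45}  # computed by the last batch run
speed_view = {"customer-1": 3, "customer-3": 1}     # data received since that run

# The query parameters here are simply the customer IDs of interest.
result = query_counts(batch_view, speed_view, ["customer-1", "customer-3"])
```

Simple addition works only because the metric is a count; other metrics would need their own merge rule.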

To achieve fault tolerance and recover from errors, either layer can be rebuilt at any moment. The batch layer can be recomputed from the raw data or rolled back to a previous state, while the speed layer can simply be flushed and its views regenerated.
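These recovery options can be sketched as below. Everything here is an assumption for illustration: the event tuples, the `build_view` helper, and the choice to keep earlier batch views in a list so that one can be restored.

```python
from collections import Counter


def build_view(events):
    """Compute a per-customer count view from (customer_id, action) events."""
    return dict(Counter(customer_id for customer_id, _ in events))


raw_data = [("customer-1", "login"), ("customer-2", "login"), ("customer-1", "purchase")]
recent_data = [("customer-2", "purchase")]  # not yet seen by the batch layer

# Batch layer: keep earlier views so that a bad run can be rolled back.
batch_view_history = [build_view(raw_data[:2])]  # view from an earlier run
batch_view_history.append({"customer-1": 999})   # simulate a faulty run

# Option 1: roll back to the previous batch view.
batch_view_history.pop()
rolled_back = batch_view_history[-1]

# Option 2: recompute the batch view from the raw data.
recomputed = build_view(raw_data)

# Speed layer: flush the view and regenerate it from the recent data.
speed_view = {"customer-2": 42}  # simulate a corrupted speed view
speed_view.clear()
speed_view.update(build_view(recent_data))
```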

To achieve its basic objective, this layer relies on two important concepts:

  • Incremental computation: A new real-time view is created from the existing real-time view and the new data. As described earlier, this reduces the time taken to make the data available for analysis in near real time.
  • Eventual consistency: Some computations are too complex and time consuming to carry out exactly in real time. In those cases, the system settles for an approximation of the correct answer. After some time (usually not long), the data becomes correct.
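Incremental computation can be illustrated with a running average: the new view is derived from the existing view plus the new data, without re-reading older records. The `AverageView` type and the `update_view` function below are hypothetical names used only for this sketch.

```python
from typing import Iterable, NamedTuple


class AverageView(NamedTuple):
    """Hypothetical real-time view holding just enough state to update an average."""
    count: int
    total: float

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def update_view(current: AverageView, new_values: Iterable[float]) -> AverageView:
    """Build the new view from the existing view and the new data only."""
    values = list(new_values)
    return AverageView(current.count + len(values), current.total + sum(values))


view = AverageView(count=0, total=0.0)
view = update_view(view, [10.0, 20.0])  # first set of new data
view = update_view(view, [30.0])        # later arrival; older values are not re-read
```

Storing the count and the total, rather than the average itself, is what makes the update possible without going back to the older data.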

The store used by the speed layer should support both random reads and random writes to cater to incremental updates. Unlike the batch layer, the speed layer allows mutation, but it deals with a very small dataset (say, one day of data if the batch process runs daily) compared with the batch layer, which deals with a huge amount of data (spanning months or years).

Following the generic approach explained in the previous section, the speed layer also creates a so-called speed view (an intermediate view) that caters to the requirement, as shown in this diagram:

Figure 06: Lambda Architecture - speed layer

With respect to the single customer view, this is how the speed layer can be realized:

Figure 07: Single customer view - speed layer

The new data flows to the speed layer, where the appended data is processed and the appropriate speed views are created. When required by the serving layer, the speed views are merged with the batch views and the results are sent across.