Mastering Apache Spark 2.x (Second Edition)

RDDs versus DataFrames versus Datasets

To be clear, we discourage you from using RDDs unless you have a strong reason to do so. Our reasons are as follows:

  • RDDs sit at an abstraction level comparable to that of assembler or machine code in systems programming
  • RDDs express how to do something rather than what is to be achieved, which leaves no room for optimizers
  • RDDs have proprietary syntax; SQL is more widely known
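
To make the "how" versus "what" distinction concrete, here is a minimal sketch that computes the same per-group average three ways: with an RDD, with the DataFrame API, and with SQL. The object name, the sample data, and the column names (dept, salary) are assumptions made purely for illustration, and the sketch assumes a Spark 2.x or later dependency on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object RddVersusDataFrame {
  // Hypothetical sample data: (department, salary)
  val sample: Seq[(String, Double)] =
    Seq(("eng", 100.0), ("eng", 120.0), ("ops", 80.0))

  // RDD version: we spell out HOW the average is computed
  // (pair each salary with a count, sum both, then divide).
  def averageWithRdd(spark: SparkSession,
                     rows: Seq[(String, Double)]): Map[String, Double] =
    spark.sparkContext
      .parallelize(rows)
      .mapValues(salary => (salary, 1L))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }
      .collect()
      .toMap

  // DataFrame version: we state WHAT we want and leave the plan to Spark.
  def averageWithDataFrame(spark: SparkSession,
                           rows: Seq[(String, Double)]): Map[String, Double] = {
    import spark.implicits._
    rows.toDF("dept", "salary")
      .groupBy("dept")
      .agg(avg("salary").as("avg_salary"))
      .collect()
      .map(r => r.getString(0) -> r.getDouble(1))
      .toMap
  }

  // SQL version: the same declarative query in the more widely known syntax.
  def averageWithSql(spark: SparkSession,
                     rows: Seq[(String, Double)]): Map[String, Double] = {
    import spark.implicits._
    rows.toDF("dept", "salary").createOrReplaceTempView("salaries")
    spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM salaries GROUP BY dept")
      .collect()
      .map(r => r.getString(0) -> r.getDouble(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-vs-dataframe")
      .getOrCreate()
    println(averageWithRdd(spark, sample))
    println(averageWithDataFrame(spark, sample))
    println(averageWithSql(spark, sample))
    spark.stop()
  }
}
```

All three functions should return the same result. In the RDD version the lambdas are opaque to Spark, so it can only execute them as written. The DataFrame and SQL versions describe the result, which gives the optimizer freedom to choose the execution plan.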

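Whenever possible, use Datasets: their static typing catches many errors at compile time and can also make them faster. The typed Dataset API is available only in statically typed languages such as Java or Scala, so if you use one of those, you are fine. In dynamically typed languages such as Python or R, you have to stick with DataFrames.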
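
As a small illustration of what static typing means for a Dataset, the following sketch builds a Dataset[Person] in Scala and filters it with ordinary typed lambdas. The Person case class, the object name, and the adultNames helper are hypothetical names chosen for this example, and the sketch again assumes Spark 2.x or later on the classpath.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type. It is declared at the top level so that
// Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object TypedDatasetSketch {
  // Returns the names of all people aged 18 or over.
  def adultNames(spark: SparkSession, people: Seq[Person]): Seq[String] = {
    import spark.implicits._
    val ds: Dataset[Person] = people.toDS()

    // Typed API: the compiler checks that `age` and `name` exist on Person
    // and have the expected types. A typo such as `p.agee` would not compile.
    ds.filter(p => p.age >= 18)
      .map(p => p.name)
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-dataset-sketch")
      .getOrCreate()
    val people = Seq(Person("Ada", 36), Person("Tim", 12), Person("Grace", 45))
    println(adultNames(spark, people))
    spark.stop()
  }
}
```

For comparison, the untyped DataFrame equivalent, ds.toDF().filter("age >= 18"), refers to the column by name in a string. A misspelled column name there compiles without complaint and only fails when the query is analyzed at runtime.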