
The cluster structure
The size and structure of your big data cluster affect performance. If you use a cloud-based cluster, your I/O and latency will suffer in comparison to a cluster of unshared hardware: you will be sharing the underlying hardware with multiple customers, and the cluster hardware may be remote. There are some exceptions. The IBM cloud, for instance, offers dedicated bare-metal high-performance cluster nodes with an InfiniBand network connection, which can be rented on an hourly basis.
Additionally, the placement of cluster components on servers may cause resource contention. For instance, think carefully about where to locate Hadoop NameNodes, Spark servers, ZooKeeper, Flume, and Kafka servers in a large cluster. Under high workloads, you might consider segregating some servers onto individual systems. You might also consider using an Apache system such as Mesos, which provides better distribution and assignment of resources to individual processes.
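As a sketch of the Mesos option, a Spark application can be submitted against a Mesos master instead of a standalone or YARN cluster, letting Mesos arbitrate CPU and memory across frameworks. The master URL, class name, jar path, and resource figures below are illustrative assumptions, not values from this section:

```shell
# Submit a Spark application with Mesos as the resource manager.
# Host name, application class, jar, and resource limits are hypothetical.
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --class com.example.MyApp \
  --total-executor-cores 32 \
  --executor-memory 4G \
  myapp.jar
```

Capping `--total-executor-cores` is what allows Mesos to leave headroom for other frameworks (Kafka, Flume, and so on) running on the same cluster.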
Consider potential parallelism as well. For large datasets, the greater the number of workers in your Spark cluster, the greater the opportunity for parallelism. One rule of thumb is one worker per hyper-thread or virtual core.
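That rule of thumb can be turned into a quick sizing estimate. A minimal sketch, assuming a hypothetical cluster of 8 nodes, each exposing 16 virtual cores (hyper-threads):

```python
def max_spark_workers(nodes: int, vcores_per_node: int) -> int:
    """Rule of thumb: one worker per virtual core (hyper-thread)."""
    return nodes * vcores_per_node

# Hypothetical cluster: 8 nodes x 16 virtual cores each.
print(max_spark_workers(8, 16))  # 128
```

In practice you would leave a core or two per node for the operating system and the Hadoop/Spark daemons, so treat this figure as an upper bound rather than a target.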