As the size of the available data grows, massive data sets can no longer be stored in their entirety in the memory of a single machine. Moreover, since a single node has only limited memory and computational power, the data and the computation must be distributed across multiple machines.
However, transferring big data between machines is very expensive; in fact, it often costs more than the computation performed on the data itself. In the distributed model, therefore, the amount of communication plays an important role in the total cost of an algorithm, and the aim is to minimize the communication among the processors (CPUs). This is one of the main motivations for studying the theory of Communication Complexity in the context of big data processing.
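To make the cost difference concrete, here is a toy sketch (not from the original text) in which plain Python lists stand in for machines and the number of values transferred to a coordinator serves as a crude proxy for communication. Computing a global sum by shipping every raw element costs communication proportional to the data size, while sending only local aggregates costs one message per machine:

```python
# Hypothetical illustration: "machines" are lists, and each value sent to
# the coordinator counts as one unit of communication.

from typing import List, Tuple

def global_sum_naive(partitions: List[List[int]]) -> Tuple[int, int]:
    """Ship every raw element to the coordinator. Returns (sum, values sent)."""
    received = [x for part in partitions for x in part]  # one transfer per element
    return sum(received), len(received)

def global_sum_aggregated(partitions: List[List[int]]) -> Tuple[int, int]:
    """Each machine sends only its local sum. Returns (sum, values sent)."""
    local_sums = [sum(part) for part in partitions]      # one transfer per machine
    return sum(local_sums), len(local_sums)

if __name__ == "__main__":
    parts = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(global_sum_naive(parts))       # (45, 9) -- communication grows with data size
    print(global_sum_aggregated(parts))  # (45, 3) -- communication grows with #machines
```

Both protocols compute the same answer, but the aggregated one communicates a number of values proportional to the number of machines rather than the size of the data, which is exactly the kind of saving communication complexity studies.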