Big Data, Communication and Lower Bounds

As the size of the available data grows, massive data sets can no longer be stored in their entirety in the memory of a single machine. Furthermore, because a single node has only limited memory and computational power, both the data and the computation must be distributed across multiple machines when processing such tasks.

However, transferring data between machines is very expensive, often more expensive than the local computations performed on it. Thus, in the distributed model, communication plays a major role in the total cost of an algorithm, and the aim is to minimize the amount of communication among processors (CPUs). This is one of the main motivations, arising from Big Data processing, for studying the theory of Communication Complexity. Continue reading
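As a toy illustration of this principle (my own sketch in Python, not part of the post), the "machines" below compute a global average by exchanging only small (sum, count) summaries, so the communication cost depends on the number of machines rather than on the total size of the data:

    # Minimal sketch: k "machines" each hold one chunk of the data.
    # Shipping all n records to a coordinator costs O(n) communication;
    # sending one (sum, count) summary per machine costs only O(k).
    from typing import List, Tuple

    def local_summary(chunk: List[float]) -> Tuple[float, int]:
        # Runs locally on each machine; no raw data leaves the node.
        return sum(chunk), len(chunk)

    def coordinator_average(summaries: List[Tuple[float, int]]) -> float:
        # The coordinator receives only k small messages.
        total = sum(s for s, _ in summaries)
        count = sum(c for _, c in summaries)
        return total / count

    # Hypothetical data split across three machines.
    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
    summaries = [local_summary(c) for c in chunks]
    print(coordinator_average(summaries))  # 5.0

Shipping summaries instead of raw records is exactly the kind of saving that communication complexity makes precise, by asking how few bits must be exchanged to compute a given function of distributed inputs.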

Software Packages for Theoreticians by Theoreticians

Brown University’s ICERM recently hosted a workshop titled “Electrical Flows, Graph Laplacians, and Algorithms,” where top researchers convened to present and discuss their recent progress in spectral graph theory and algorithms. Richard Peng opened the workshop with an overview talk on efficient solvers for linear systems with graph Laplacians as the coefficient matrix. He presented a thorough history of the topic and set the stage for a series of technical talks on fast algorithms for graph sparsification, spectral clustering, and computing max flow, as well as a variety of other local and approximation algorithms.

His talk, as well as many of the others, is archived and available thanks to ICERM. I will focus on one highlight, a point that resonated with the conclusion of Richard Peng’s talk: a call for more software implementing these new, fast algorithms. In this light, I’d like to briefly discuss some of the software packages for spectral graph theory and the analysis of large graphs that are being developed by theoreticians active in the area. Continue reading
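To make concrete the kind of problem these solvers and packages target, here is a minimal sketch (my own illustration using NetworkX and SciPy, not code from any package mentioned above) that builds a graph Laplacian and solves Lx = b with off-the-shelf conjugate gradient; the specialized solvers discussed at the workshop aim to answer the same question in near-linear time:

    # Sketch: solve the Laplacian system L x = b on a small graph.
    # b must sum to zero, since the all-ones vector spans the Laplacian's nullspace.
    import numpy as np
    import networkx as nx
    from scipy.sparse.linalg import cg

    G = nx.path_graph(5)                       # small example graph
    L = nx.laplacian_matrix(G).astype(float)   # sparse graph Laplacian

    b = np.array([1.0, 0.0, 0.0, 0.0, -1.0])   # right-hand side, sums to zero

    # Conjugate gradient applies because L is symmetric positive semidefinite;
    # the near-linear-time Laplacian solvers replace this generic iterative step.
    x, info = cg(L, b)
    print(info)   # 0 indicates the iteration converged
    print(x)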

The Thesis

All PhD candidates around the world know about the thesis. You have always known about the thesis. It marks the beginning of the end of your life as a PhD candidate, and if you actually finish it, you get that cool “Dr.” title you always wanted on your business card. What is the problem, then? Why does it seem so frustrating when you sit down to write it? The following is based on a true story, my own: how I managed to write it down and track my progress. Continue reading

The Evolution of Local Graph Partitioning

As a follow-up to my previous post discussing where theory and experimentation meet in the study of large-scale networks, I would like to discuss in more detail one of the empirically best-performing algorithms that also has a sound theoretical foundation: spectral partitioning. In this post I will trace the history of the problem, outline some key results, and present some ideas for future work. Continue reading
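For readers new to the topic, here is a minimal sketch of the basic spectral bisection step that the post builds on (my own illustration in Python, not code from the post): split the vertices by the sign of the Fiedler vector, the eigenvector of the graph Laplacian associated with its second-smallest eigenvalue.

    # Sketch: classic spectral bisection via the Fiedler vector.
    import numpy as np
    import networkx as nx

    # Two 10-node cliques joined by a single edge: a toy well-clustered graph.
    G = nx.barbell_graph(10, 0)

    L = nx.laplacian_matrix(G).toarray().astype(float)
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector of the 2nd-smallest eigenvalue

    # Partition the vertices by the sign of their Fiedler-vector entry.
    part_a = {v for v, val in zip(G.nodes(), fiedler) if val >= 0}
    part_b = set(G.nodes()) - part_a

    print(sorted(part_a), sorted(part_b))
    print("edges cut:", nx.cut_size(G, part_a, part_b))   # 1: only the bridge edge

On a well-clustered graph like this barbell example, the sign split recovers the two cliques and cuts only the bridge edge; local partitioning algorithms refine this global idea so that the running time depends on the size of the cluster found rather than on the whole graph.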

Theory Behind Big Data

As a PhD student who does research on theory and algorithms for massive data analysis, I am interested in exploring current and future challenges in this area, which I’d like to share here. There are two major points of view when we talk about big data problems:

One is more focused on the industry and business aspects of big data, and includes many IT companies that work on analytics. These companies believe that the potential of big data lies in its ability to solve business problems and provide new business opportunities. To get the most from big data investments, they focus on the questions that businesses would like answered. They view big data not as a technological problem but as a business solution, and their main goals are to visualize, explore, discover and predict.

Continue reading