Working with Big Data: What is #MapReduce
Throughout this blog I have written many things about databases, programming in different languages, but also, I have written about Big Data, one of the most fascinating topics if you like data.
Unfortunately, I have not had work options to work with Big Data tools, and in my research work at the University, some tools would be good for me.
MapReduce is the parallel computing paradigm on large data collections, and is based on two functions: map and reduce, very common in many programming languages such as #Python.
The most classic of MapReduce’s examples is the word count within a text to analyze what is the most repeated.
Google and the PageRank
Google was the driver of this model to implement the calculation of PageRank, which is an algorithm to apply to web pages for the calculation of relevance.
How the implementation of this model is not a set of data, but to the global of all websites, it is necessary to have a file distribution system, so before starting, we have already raised a serious problem: how to organize information .
Reading and processing data
Executing a process by a CPU is a process that involves, first, reading the information from disk to memory, and then applying the algorithm. This “simple” process can be terribly slow when the volume of data is large (as in the case of PageRank calculation), because loading data into memory takes time (a few data is very fast, but when we go to several million data, ….. ummm ….. it’s not that fast anymore).
The reason for loading the data from disk to memory is because it is in memory where the algorithm is executed, but once loaded, they can be used as many times as necessary. But … what if the volume of data is so large that it cannot be loaded even in memory? Yes, this can also happen. This is where Data Mining, which allows you to load portions of data into memory, process them, dump the results to disks, and reload new data, … and so on until all are processed. But even this process has bandwidth limitation.
Obviously, for the scenario we are painting here, it is clear that we will need several disks and several CPUs to execute the work, in parallel. Therefore, the infrastructure to execute the work is not feasible for any company, but it is necessary to have a certain size and physically space to house the servers and disks necessary for the task. Few companies can boast about it, the most important being Google, Amazon, Yahoo, Microsoft.
But this model is not perfect, since even having 1 million servers (it is estimated that Google has), there will be failures, so it is necessary to distribute the information so that it is persistent and even with failures, the data is still available , and in case of failure, that the process or processes continue working. Fortunately, these failures only affect these companies (I also assume that what I have told is quite superficial regarding the real problems arising from these infrastructures).
MapReduce helps to solve cloud computing
And it does so by challenging the 3 big issues that we have raised so far:
- Store reducing data on multiple nodes to ensure persistence and availability.
- Perform calculations near the data to minimize data movement.
- Apply a simple programming model to hide the complexity.
It is essential to use a Distributed File System to handle this problem, such as Google GFS or Hadoop HDFS. In addition, it is almost necessary that the data be rarely updated (never change), but it is very frequent read, and additions of new data.
MapReduce, it is simple but complex at the same time
As you can guess, the MapReduce algorithm is simple in concept, but with many intrinsic complications due to its execution, due to the problems involved in having high volumes of data, …
Have a nice day!