Making Sense of the Wild World of Hadoop PART II: Clustering

In my last blog I introduced us to Hadoop, which allows companies to process, store and analyze Petabytes, Exabytes and even Yottabytes worth of data. And not just the kind of data you’d find on a spreadsheet–Hadoop can handle a wide variety of data (some would say just about any kind of data). As promised, this blog dives into its ‘secret sauce’ to explore the major computing principles Hadoop uses to perform its magic: clustering, schema-on-read and map + reduce. As these are in-depth topics, in this blog we’ll focus on clustering.

Dividing and Conquering: Four regular guys are stronger than one muscleman
Traditional computing is done on one or two computers through the ‘client/server’ model. This means that one computer is directing either another part of itself, or another computer (in most cases referred to as a ‘server’) to do something. And that model has served (no pun intended) us well for several decades now. But, in the era of massive datasets and applications with millions of worldwide users, the demands are often beyond a single computer to handle. Hadoop gets around this by using multiple computers to do a job. 

Think of it like any other job. If you need to lift a piano up a flight of stairs, even the world’s strongest muscleman will throw out his back. But for four or five regular-sized guys, this is no problem. Computing works pretty much the same way. Hadoop uses what’s known as a ‘clustering’ paradigm, which uses groups of servers (computers) vs. a single server for storage, processing and computation. This goes by multiple names, and you may have heard of it referred to as ‘distributed computing’, or ‘scaling out’. With Hadoop, rather than ‘scaling up’, i.e. moving data to a more powerful server, organizations are able to ‘scale out’ by simply adding servers to the job. The really nice thing about this is that with this model, those servers (computers) can be ‘commodity’, which is a fancy way of saying that they’re just normal, run of the mill servers (regular guys vs. musclemen).

You boss around the ‘name node’, which bosses around the data nodes
Tying back to the previous post, these are working under the coordination of Hadoop’s core components, HDFS and YARN (we’ll get to MapReduce shortly). The ‘secret sauce’ involves dividing and conquering the work, so that each server is only involved in part of the task. Just like building a house–rather than a single builder doing everything, a team does various jobs in concert. Each server is described as a ‘node’, and you can start with a single node cluster and add computers as you go (technically a single node isn’t a cluster yet–multiple nodes make up a cluster, but no one will yell at you for referring to a single node cluster). 

Each cluster has a kind of a ‘boss’ server, referred to as a name node, that receives your requests and coordinates all the activity, and one or more data nodes that actually do the work according to the direction provided by the name node. Carrying on with the house building analogy, the name node is like the contractor or foreman, and the data nodes are the various construction workers and specialists. In the traditional ‘client/server’ model, you (the client) are directing a single builder. In Hadoop you’re directing the contractor, who then in turn directs an entire group of workers.

Blowing the lid off the limits of computing power
The ability to run programs on a cluster, vs. just one computer, is key to Hadoop’s ability to handle Big Data, and Hadoop can literally ‘scale out’ from a single server to thousands of servers. Essentially, this has blown the lid off the limits of computing power and enabled things to be done that we couldn’t have dreamed of a decade ago. Facebook? Not possible without clustering. Google? Don’t even think about trying anything akin to this on a single computer. 

Why not use Hadoop for everything? Experts are quick to point out that Hadoop does sacrifice some efficiency in favor of massive scalability, and may present problems when you’re dealing with smaller scale data.  However, things are changing fast, and in a few years, who knows? Hadoop may be a viable solution for small datasets as well, but for now we’ll have to wait and see.

So how does all of this stuff get done? How do four, five or even a thousand computers work together to perform a job? They do it with MapReduce, and with schema-on-read, which we’ll talk about in my next blog–stay tuned!