Big Data — What Is Apache Hadoop
At present, there is a lot of hype surrounding Big Data. In this article, I will attempt to clarify what people mean by Big Data and provide a high-level overview of Apache Hadoop, a core technology used in industry for storing and processing large quantities of data.
The term Big in Big Data refers to the amount of data. With the advent of technologies like e-commerce, social media and IoT devices, the amount of data produced by society as a whole has been increasing exponentially since the start of the 21st century. It is estimated that the Global Datasphere will grow from 33 Zettabytes (one zettabyte is equivalent to a trillion gigabytes) in 2018 to 175 ZB by 2025.
Many organizations are turning to Apache Hadoop and other big data technologies to harness the value of these large quantities of data and drive business decisions.
The more data you have, the more computing power you need to process it. When it comes to getting more compute power, you have two options:
- Vertical scaling (i.e. adding more RAM, cores, or HDDs to a single machine)
- Horizontal scaling (i.e. distributed computing across many machines)
Vertical scaling has two main problems. First, due to physical constraints, processor clock speeds have plateaued around 4 GHz and are likely to stay there for the foreseeable future. Second, the cost of higher-grade components doesn't scale linearly: two mid-range systems typically end up costing you less than a single system with twice the specs.
As a result, companies primarily use distributed computing for their data processing (e.g. business analytics, training machine learning models). Distributed computing sounds simple enough, but in practice it raises a lot of questions: how do the different systems share information, how do we break the problem up so that every system can work on it concurrently, and what do we do when one or more of those nodes goes down?
According to the official documentation, Apache Hadoop is an open source framework that allows for the distributed processing of large data sets across clusters of computers. The Apache Hadoop project is broken down into three core modules: HDFS, YARN and MapReduce.
HDFS (Hadoop Distributed File System)
Suppose you were working as a data engineer at a startup and were responsible for setting up the infrastructure that would store all of the data produced by the customer-facing application. You decide to use hard disk drives as the main non-volatile storage medium, since solid state drives are too expensive and tape is too slow for any kind of processing. Traditionally, you'd keep storage (where the data is at rest) and compute (where the data gets processed) separate, and move the data over a very fast network to the computers that process it. As it turns out, moving data over a network is really expensive in terms of time. The alternative is to process the data where it's stored. This idea, known as data locality, is a radical shift from separating storage and compute, which at one point was the standard way of handling large-scale data systems, and Hadoop was one of the first technologies to adopt it. For obvious reasons, shipping the MapReduce code over the network is much faster than trying to send petabytes of data to where the program is. Data locality fundamentally removes the network as a bottleneck, which is what enables linear scalability.
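To get a feel for why data locality matters, here is a back-of-the-envelope sketch. The numbers are purely illustrative assumptions (1 PB of data, a 10 Gb/s link, a ~50 MB application jar), not benchmarks, but the gap speaks for itself.

```java
// Rough estimate: moving the data to the code vs. moving the code to the data.
// All figures below are illustrative assumptions, not measurements.
public class DataLocalityEstimate {
    public static void main(String[] args) {
        double dataBytes = 1e15;              // 1 PB of data at rest
        double linkBytesPerSec = 10e9 / 8;    // 10 Gb/s network ≈ 1.25 GB/s
        double jarBytes = 50e6;               // ~50 MB of application code

        double moveDataSecs = dataBytes / linkBytesPerSec;
        double moveCodeSecs = jarBytes / linkBytesPerSec;

        System.out.printf("Moving the data: ~%.1f days%n", moveDataSecs / 86400);
        System.out.printf("Moving the code: ~%.2f seconds%n", moveCodeSecs);
    }
}
```

With these assumptions, shipping the data takes on the order of nine days, while shipping the code takes a fraction of a second.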
Now, the question arises: how can we seamlessly access data stored on different nodes? Obviously we wouldn't want to have to remote into a computer every time we want to access the files on its HDD. The Hadoop Distributed File System takes care of this for us. HDFS is implemented as a master and slave architecture made up of a NameNode (the master) and one or more DataNodes (the slaves). The NameNode is responsible for telling clients which node to send data to, or in the case of retrieval, which node contains the data they're looking for. The client can then connect to the DataNode and begin transferring data without any further involvement from the NameNode. This process is what enables Hadoop to scale horizontally so effectively.
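The interaction above is exactly what the HDFS Java client hides from you. The following is a minimal sketch using Hadoop's FileSystem API; the NameNode address and file path are made-up examples, and in practice the client would normally pick up the cluster address from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");

        // Write: the client asks the NameNode where to place the blocks,
        // then streams the bytes directly to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns the block locations, and the client
        // reads from the DataNodes without further NameNode involvement.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```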
By default, HDFS stores three copies of each file across the cluster. If a failure occurs and there are temporarily only two copies, you'll never notice, because the NameNode handles re-replication behind the scenes.
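The replication factor is normally set cluster-wide via the dfs.replication property in hdfs-site.xml, but it can also be adjusted per file. A small sketch, assuming the same cluster setup and file path as the example above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep three copies of this particular file (the default).
        fs.setReplication(new Path("/tmp/hello.txt"), (short) 3);
        fs.close();
    }
}
```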
Files stored in HDFS are effectively write-once, read-many: once you write a file, it keeps that content until it gets deleted. This design decision means HDFS doesn't have to coordinate concurrent modifications, which are a major source of problems in distributed systems.
YARN (Yet Another Resource Negotiator)
There isn't much to say about YARN other than that it manages the cluster's compute resources. Prior to YARN, resource negotiation was baked into the MapReduce engine itself, which led to several inefficiencies. The community therefore introduced YARN to spread workloads over the cluster more intelligently, telling each individual computer what it should run and how many resources it should be given.
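As a rough sketch of YARN's role, the client API below asks the ResourceManager for a report of every running node and its capacity (memory and vcores), which is the information YARN uses to place work. It assumes the cluster address is available from a yarn-site.xml on the classpath.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Each NodeReport describes one machine's total and used capacity.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capacity=" + node.getCapability()
                    + " used=" + node.getUsed());
        }

        yarn.stop();
    }
}
```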
MapReduce
In programming, map typically means applying some function to every element of a list and returning the results of those individual operations as a new list. It's important to note that the order in which the function is applied doesn't matter. Say you wanted to apply a map function to a list of a trillion integers: performed sequentially by a single computer, it would take a while. Instead, we can distribute the load across many nodes in the cluster and process it concurrently, with every node executing the map function on a subset of the list. Because the order of application doesn't matter, we avoid one of the largest challenges in distributed systems: getting different computers to agree on what order things get done.
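Here is a tiny illustration of that idea in plain Java (not Hadoop itself): because each application of the function is independent, the same map produces the same result whether it runs sequentially or in parallel.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapExample {
    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4, 5);

        // Sequential map: square every element.
        List<Integer> sequential = numbers.stream()
                .map(n -> n * n)
                .collect(Collectors.toList());

        // Parallel map: the work is split across threads, yet the result is identical.
        List<Integer> parallel = numbers.parallelStream()
                .map(n -> n * n)
                .collect(Collectors.toList());

        System.out.println(sequential); // [1, 4, 9, 16, 25]
        System.out.println(parallel);   // [1, 4, 9, 16, 25]
    }
}
```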
The traditional reduce operation takes a function and a list as arguments, applies the function to the first two elements of the list, takes the result, then reapplies the function to that result and the next element, continuing the process until it reaches the end of the list. In Hadoop, reduce takes the results from the map phase, groups them together by key and performs aggregation operations such as summation, count and average.
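The classic way to see both phases together is word counting. The condensed sketch below shows a Mapper that emits (word, 1) pairs and a Reducer that sums the counts Hadoop has grouped by key; it follows the well-known WordCount pattern rather than anything specific to this article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get(); // aggregate all counts for this word
            }
            result.set(sum);
            context.write(key, result); // emit (word, total)
        }
    }
}
```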
One of the major benefits of using Hadoop is that it abstracts away the complexities of running code on a distributed system. The exact same MapReduce code can run on a 10,000-node cluster or on a single laptop, which makes development easier and leaves fewer opportunities for errors.
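A minimal driver for the WordCount classes sketched above illustrates this portability: the job definition never mentions the cluster, so the same code can be submitted to a local runner or a full cluster, with only the configuration files changing. Input and output paths here are assumed to come from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Whether this runs locally or on a cluster is decided by *-site.xml, not the code.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // e.g. hadoop jar wordcount.jar WordCountDriver /input /output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```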
Final Thoughts
Hadoop facilitates the storage and processing of data across the different nodes of a cluster. It takes advantage of data locality to process large quantities of data without ever sending them over the network. HDFS takes care of many of the complexities surrounding distributed systems, such as handling the case when a node inevitably goes down. Using MapReduce, we can break a workload down and distribute it across multiple computers to be executed concurrently, then obtain the aggregated result from the reduce phase.