Once a business or any computing operation grows beyond what a single machine can handle, the natural next step is to spread the workload across multiple computers working together as a single network. This has become so commonplace that most large-scale work in fields like data science now runs on networked clusters. While demanding tasks certainly run more efficiently this way, the setup is also complicated: each computer has to be configured individually, and the whole network has to be managed as it works through your tasks. This is where programs like Hadoop come in to save the day.
- What Is Apache Hadoop?
- Hadoop Use Cases
- Hadoop Architecture Overview
- Install Hadoop on Ubuntu 20.04 - Step by Step Guide
- Prerequisites
- Step 1: Create Non-Root User For Hadoop
- Step 2: Set Up SSH Keys
- Step 3: Download and Install Hadoop on Ubuntu
- Step 4: Configure Hadoop Environment
- Step 5: Edit Configuration Files
- Step 6: Format HDFS
- Step 7: Start Hadoop Services
- Step 8: Verify Installation
- Step 9: Access Web Interfaces
- Step 10: Run a MapReduce Example
- Step 11: Set Environment Variables
- Conclusion
- FAQ
Hadoop is a suite of tools and programs released by Apache that makes the whole process of pooling a group of computers into a single cluster far easier and more efficient. So in this article, I will review Hadoop, examine its use cases, and provide an overview of its architecture, before moving on to a step-by-step guide on how to install Hadoop on Ubuntu 20.04 to finish this Hadoop tutorial.
What Is Apache Hadoop?
Hadoop, an Apache suite of tools for distributed storage and processing, has been transforming how clusters are set up and used for over 15 years. Its resource efficiency lets users put the computing power they already have toward demanding tasks without the need for expensive upgrades. The suite consists of four modules: HDFS, YARN, MapReduce, and Hadoop Common, each designed for specific use cases.
The appeal of Hadoop lies in its resourcefulness: it lets individuals and organizations combine the computing power they already own into a single cohesive cluster capable of tackling substantial computational challenges. Without it, they would be forced into the costly pursuit of acquiring ever more powerful machines.
Hadoop Use Cases
Now we know what Hadoop is. But how do its use cases play out in the real world? Understanding a program on paper is all well and good, but it is no substitute for seeing it at work in a serious operation. So here are some examples before we move on to the Hadoop tutorial itself.
Risk Analysis
As already mentioned, Hadoop lets you harness several computer systems as a single networked unit to work through large batches of data and analyze them faster than a single machine could. Every business has risks that need to be analyzed and quantified, and Hadoop is extremely handy here. So much so, in fact, that many hospitals use it to analyze the risks of different treatments and estimate the likely outcomes of procedures for their patients. You can learn more about Hadoop’s role in healthcare here.
Detecting Security Breaches
As the number of networked devices in a business grows, so does the number of potential security breaches to watch out for. One of Hadoop’s essential uses is assessing an entire operation by analyzing large batches of data and highlighting the potential weak points of the system.
Review Mapping
Many businesses rely on the review feedback they receive on their products to improve them or shape new market strategies. Where a human would take ages to work through a large set of reviews, Hadoop’s networked computers yield results much faster.
Market Analysis
Speaking of market strategies, review mapping pales in comparison to the resources needed to analyze a market and assess the potential for a brand-new product entering it. This is another use case where Hadoop shines, as it allows even small up-and-coming businesses to evaluate the market with a handful of computers, efficiently and within a reasonable timeframe.
Assessing Log Files
Another aspect of a business that grows more complicated over time is the amount of software it uses across the board. More software means more potential bugs and pain points, and usually a dedicated employee to manage log files and handle the issues. That takes a lot of time, but with a few straightforward jobs, a business can use Hadoop to quickly review and assess its log files, find those bugs, and get rid of them.
There are plenty of other Hadoop use cases and applications, but to keep this article focused, we will leave it at that.
Hadoop Architecture Overview
Let’s say you have heard about Hadoop, its use cases, and what it does; and even if you hadn’t, this article has probably covered that for you so far. Now you need a deeper understanding of what Hadoop is actually made of and how its parts work together. As mentioned before, Hadoop has four general layers: HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), MapReduce, and Hadoop Common. Since Hadoop Common is mostly a set of shared libraries with little to explain on its own, I will cover ZooKeeper in its place, as it is the coordination service most commonly used alongside Hadoop. So in this section, I will boil the Hadoop architecture and ecosystem down to basic terms, before finally moving on to how to install Hadoop on Ubuntu 20.04.
HDFS
HDFS, the Hadoop Distributed File System, is the storage layer that all Hadoop components and applications use to access, transfer, and save data. While Hadoop as a whole is an open-source suite of tools, HDFS is specifically the file system responsible for storing the data of a single Hadoop cluster. It is an incredibly resilient file system that splits data into blocks of 128 MB by default, replicates them across nodes, and optimizes them for large sequential reads and writes.
In practice, an HDFS cluster is organized around a NameNode that keeps the file system metadata and a set of DataNodes, spread across server racks, that hold the actual data blocks. The NameNode tracks its metadata through an fsimage file and edit logs, while a Secondary NameNode (or, in high-availability setups, standby NameNodes backed by JournalNodes and the Quorum Journal Manager, QJM) keeps that metadata safe and up to date. Rack awareness lets HDFS place block replicas on different racks, so your data analysis operation can survive the loss of a node or even an entire rack.
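To see that block structure for yourself on a running cluster, HDFS ships with reporting tools. This is a quick sketch that assumes the install location used later in this guide, and /input/localfile.txt is just a hypothetical file already stored in HDFS:
/usr/local/hadoop/bin/hdfs dfsadmin -report
/usr/local/hadoop/bin/hdfs fsck /input/localfile.txt -files -blocks -locations
The dfsadmin report summarizes the DataNodes and their capacity, while fsck shows how the file was split into blocks and where each replica lives.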
YARN
YARN is the resource-management layer of Hadoop, used to assign the desired amounts of computing resources to specific applications within the Hadoop ecosystem. In essence, it gives your clients a ResourceManager that allocates those resources across a set of nodes to different tasks and applications. YARN also keeps a record of all allocated resources and running applications, much as HDFS keeps track of its metadata. YARN itself is divided into three components: the ResourceManager, the ApplicationMaster, and the NodeManager.
Each of these runs as one instance per cluster, per application, and per node, respectively. Not only can you allocate resources to different tasks with YARN, you can also schedule how those resources change over time to build advanced workflows. And YARN is not limited to its own components: in many cases you will use it in conjunction with other architectural layers, like HDFS and ZooKeeper, to allocate resources and monitor your overall operation, as sketched below.
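If you want to see what YARN is tracking on a running cluster, its command-line client can list the NodeManagers it knows about and the applications currently registered with the ResourceManager. A quick sketch, assuming the install location used later in this guide:
/usr/local/hadoop/bin/yarn node -list
/usr/local/hadoop/bin/yarn application -list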
Hadoop MapReduce
Hadoop MapReduce is another major component of the Hadoop ecosystem. Once you install Hadoop on Ubuntu, you can use it to have a huge batch of data analyzed in a distributed manner by several different computers. In essence, Hadoop MapReduce works like this: you feed a large set of input data into the program; it is split and distributed across your networked computers, where map tasks process it; the intermediate results are then shuffled, and functions known as reducers boil them down to their most essential components. Each such run is known as a job.
Let’s say the data you want to analyze with MapReduce is the three-word sentence Bear Hunt Rabbit. The map phase splits the input and emits each word with a count of one; the shuffle phase groups identical words coming from all of your input splits; and the reduce phase sums the counts, producing a final, compact result, stripped of unnecessary data, that can easily be analyzed.
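Once you have Hadoop installed (see the step-by-step guide below), you can reproduce this with the word-count example that ships with Hadoop. This is a sketch that assumes the version 3.3.6 install location used later in this article:
echo "Bear Hunt Rabbit Bear" > words.txt
/usr/local/hadoop/bin/hdfs dfs -mkdir -p /wordcount/input
/usr/local/hadoop/bin/hdfs dfs -put words.txt /wordcount/input
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /wordcount/input /wordcount/output
/usr/local/hadoop/bin/hdfs dfs -cat /wordcount/output/*
The output should list each word with its count: Bear 2, Hunt 1, Rabbit 1.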
Zookeeper
ZooKeeper is another part of the Hadoop ecosystem, one that came into prominence and common use with the release of Hadoop 2.0. Its main job is to coordinate the different services running as part of a single Hadoop deployment. As such, ZooKeeper is almost always used in conjunction with YARN’s ResourceManager and the various HDFS components. Its primary role in these operations is detecting and remedying potential points of failure, which it does with two tools: the ZKFailoverController (ZKFC) and the ZooKeeper quorum.
In a high-availability setup, the NameNodes managed by HDFS are categorized as active or standby. Each NameNode is monitored by a ZKFailoverController, and the ZooKeeper quorum coordinates which one is active, so that areas of difficulty and potential failures are detected and an automatic failover can take place.
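For reference, ZooKeeper-based automatic failover is typically wired into an HA HDFS cluster through properties like the following. This is a minimal sketch with placeholder hostnames; a full HA setup also needs nameservice, NameNode, and JournalNode configuration.
In hdfs-site.xml:
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
In core-site.xml:
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>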
Install Hadoop on Ubuntu 20.04 – Step by Step Guide
And finally, after learning about the Hadoop architecture, it is time to get to the meat of the matter: how to install Hadoop on Ubuntu 20.04, the final part of this Hadoop tutorial. Let’s cover the prerequisites before moving on to the step-by-step guide. Keep in mind that this guide also works for Ubuntu 18.04.
Prerequisites
The prerequisites needed to install Hadoop on Ubuntu are pretty simple. All you need is an Ubuntu-powered computer with root access, either locally available or remotely accessible through a VPS server. Regarding prerequisite programs, ensure you already have Java 11 and SSH installed. If you don’t have them, run the following commands one at a time to install them:
sudo apt update && sudo apt upgrade -y
sudo apt install openssh-server openssh-client -y
sudo apt install openjdk-11-jdk -y
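You can quickly confirm that both prerequisites are in place before moving on:
java -version
ssh -V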
As for the license, you will not need any, since Hadoop is free and open-source. That’s all you need. Let’s move on to step one.
Step 1: Create Non-Root User For Hadoop
Create a non-root user for Hadoop and switch to it using the following commands. This is part of the pre-configuration we need to do before actually downloading and installing Hadoop:
sudo adduser hdoop
su - hdoop
Step 2: Set Up SSH Keys
Hadoop manages its daemons over SSH, so the hdoop user you just created needs to be able to connect to localhost without a password. Use this command to generate an SSH key pair and save it:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Once the keys are generated, the following line appends the public key to the authorized_keys file in your SSH directory:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now set the permissions SSH requires on the key file and directory:
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
Confirm the changes, and you will be able to connect to localhost at any time with the user you made:
ssh localhost
Step 3: Download and Install Hadoop on Ubuntu
You can visit the Apache Hadoop website to see a list of versions with their recent changelogs. Select the version of your liking and you will be presented with a link that can be used with the following command to download Hadoop on Ubuntu. Here I am choosing version 3.3.6; replace ‘3.3.6’ with the latest stable version if necessary:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once the download is over, extract the archive, move it into place, and hand ownership of it to the hdoop user:
tar xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
sudo chown -R hdoop:hdoop /usr/local/hadoop
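Optionally, you can verify that the tarball downloaded intact by comparing its SHA-512 hash against the checksum file Apache publishes alongside each release. The URL below assumes version 3.3.6; compare the two printed values by eye:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
cat hadoop-3.3.6.tar.gz.sha512
sha512sum hadoop-3.3.6.tar.gz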
Step 4: Configure Hadoop Environment
Set JAVA_HOME in /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
echo 'export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
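To confirm the line was appended and that it resolves to a real JDK directory, you can check as follows (on Ubuntu with OpenJDK 11, the resolved path is typically /usr/lib/jvm/java-11-openjdk-amd64/):
tail -n 1 /usr/local/hadoop/etc/hadoop/hadoop-env.sh
readlink -f /usr/bin/java | sed "s:bin/java::"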
Step 5: Edit Configuration Files
For a single-node (pseudo-distributed) setup, you need to edit four XML files under /usr/local/hadoop/etc/hadoop: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Open each one with nano and place your settings between its <configuration> tags; minimal example contents are sketched below the command.
nano /usr/local/hadoop/etc/hadoop/core-site.xml
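The following is a minimal sketch of what typically goes into each file for a single-node setup. The property names are standard Hadoop settings; the local paths (such as /home/hdoop/hadoopdata) and the localhost:9000 address are assumptions you can adapt to your system.
In core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
In hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hdoop/hadoopdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hdoop/hadoopdata/datanode</value>
  </property>
</configuration>
In mapred-site.xml (the HADOOP_MAPRED_HOME properties tell MapReduce jobs where Hadoop lives, which Hadoop 3.x generally needs when running on YARN):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
</configuration>
In yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>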
Step 6: Format HDFS
Initialize the Hadoop filesystem namespace.
/usr/local/hadoop/bin/hdfs namenode -format
Step 7: Start Hadoop Services
Launch HDFS and YARN services.
/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh
Step 8: Verify Installation
Check the running Java processes to confirm Hadoop is running.
jps
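If everything started correctly, jps should list the Hadoop daemons along with their process IDs. The numbers will differ on your system, but the output should look roughly like this:
12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps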
Step 9: Access Web Interfaces
Open a web browser and visit Hadoop’s NameNode and ResourceManager interfaces:
NameNode: http://localhost:9870
ResourceManager: http://localhost:8088
Step 10: Run a MapReduce Example
Execute a sample MapReduce job to validate the setup. Replace localfile.txt with any text file on your machine; the example job greps it for strings matching ‘dfs[a-z.]+’:
/usr/local/hadoop/bin/hdfs dfs -mkdir /input
/usr/local/hadoop/bin/hdfs dfs -put localfile.txt /input
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep /input /output 'dfs[a-z.]+'
/usr/local/hadoop/bin/hdfs dfs -cat /output/*
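Note that MapReduce will refuse to overwrite an existing output directory, so if you want to re-run the job, remove /output first:
/usr/local/hadoop/bin/hdfs dfs -rm -r /output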
Step 11: Set Environment Variables
Add Hadoop’s bin and sbin directories to the system PATH.
echo 'export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin' >> ~/.bashrc
source ~/.bashrc
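With the PATH updated, you can confirm that the hadoop command now resolves from anywhere and prints its version information:
hadoop version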
And that’s it! You have just managed to successfully configure and install Apache Hadoop on Ubuntu 20.04!
Conclusion
In summary, installing Hadoop on Ubuntu 20.04 is a thorough process that demands careful attention and a willingness to dig into the details of the setup. By following the steps in this guide, Ubuntu users can tap into Hadoop’s substantial capabilities and get the most out of their data analytics work.
My recommendation is to deploy Hadoop as a single-node, pseudo-distributed setup if you only intend to learn and experiment with it. For this purpose, a VPS will work perfectly for you. Cloudzy offers a host of different Linux VPS services, including an ironclad, reliable Ubuntu VPS that can be configured in no time to become the perfect Hadoop learning playground. Starting at $4.95 per month, you can get your own Ubuntu VPS with more than 15 locations and 24/7 caring support!
The majority of Linux-run servers are using Ubuntu; why not you? Discover why everybody loves Ubuntu — get an optimized Ubuntu VPS
FAQ
What are HDFS vs. MapReduce differences?
While both modules reside in the Hadoop ecosystem, they serve distinct purposes. HDFS functions as a distributed file system, facilitating data accessibility. On the other hand, MapReduce excels at breaking down and efficiently analyzing large data chunks.
Is Hadoop considered a database?
Hadoop is not a database, although this misconception is common. Rather, it operates as a distributed file system that enables storage and processing of voluminous data using a network of interconnected computers. It should not be used as a direct replacement for a traditional database system.
What are the four primary components of Hadoop?
Hadoop consists of four core components: HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), MapReduce, and Hadoop Common. Additionally, some resources consider ZooKeeper as a component, although it is not officially recognized as such.
Where is Hadoop typically utilized?
Hadoop finds applications in various domains where managing, storing, processing, and analyzing large-scale data is essential. It caters to operations ranging from medium-sized businesses and hospitals to burgeoning startups, providing data-driven solutions.