
Time for action – configuring the pseudo-distributed mode
Take a look in the conf directory within the Hadoop distribution. There are many configuration files, but the ones we need to modify are core-site.xml, hdfs-site.xml, and mapred-site.xml.
- Modify core-site.xml to look like the following code:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```
- Modify hdfs-site.xml to look like the following code:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
- Modify mapred-site.xml to look like the following code:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```
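Before moving on, it can be worth double-checking that the property values were saved as intended. The following is a minimal sketch that parses Hadoop-style configuration XML and extracts the name/value pairs; the file contents are embedded as strings here for illustration, but in practice you would read them from the files in the conf directory.

```python
# Sketch: parse Hadoop-style configuration XML and collect the
# <name>/<value> pairs from each <property> element.
import xml.etree.ElementTree as ET

def properties(xml_text):
    """Return a dict of property name -> value from a Hadoop config file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

# Embedded copies of the three configurations set above.
core_site = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""

hdfs_site = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

mapred_site = """<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>"""

print(properties(core_site)["fs.default.name"])    # hdfs://localhost:9000
print(properties(hdfs_site)["dfs.replication"])    # 1
print(properties(mapred_site)["mapred.job.tracker"])  # localhost:9001
```

To check the real files, replace the embedded strings with, for example, `open("conf/core-site.xml").read()`.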
What just happened?
The first thing to note is the general format of these configuration files. They are obviously XML and contain multiple property specifications within a single configuration element.
The property specifications always contain name and value elements with the possibility for optional comments not shown in the preceding code.
We set three configuration variables here:
- The fs.default.name variable holds the location of the NameNode and is required by both HDFS and MapReduce components, which explains why it's in core-site.xml and not hdfs-site.xml.
- The dfs.replication variable specifies how many times each HDFS block should be replicated. Recall from Chapter 1, What It's All About, that HDFS handles failures by ensuring each block of filesystem data is replicated to a number of different hosts, usually 3. As we only have a single host and one DataNode in the pseudo-distributed mode, we change this value to 1.
- The mapred.job.tracker variable holds the location of the JobTracker, just as fs.default.name holds the location of the NameNode. Because only the MapReduce components need to know this location, it is in mapred-site.xml.
The network addresses for the NameNode and the JobTracker specify the ports on which the actual system requests should be directed. These are not user-facing locations, so don't bother pointing your web browser at them. There are web interfaces that we will look at shortly.
Configuring the base directory and formatting the filesystem
Whether the pseudo-distributed or fully distributed mode is chosen, there are two steps that need to be performed before we start our first Hadoop cluster:
- Set the base directory where Hadoop files will be stored.
- Format the HDFS filesystem.
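As a preview of the first step, the base directory is controlled by the hadoop.tmp.dir property, which can be overridden in core-site.xml alongside fs.default.name. The following is a sketch of such an override; the /var/lib/hadoop path is just an illustrative choice, not a requirement.

```xml
<!-- Illustrative override: store all Hadoop data under /var/lib/hadoop
     instead of the default temporary location. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop</value>
</property>
```

Setting this explicitly is useful because the default lives under a temporary directory that may be cleared on reboot. Both steps are covered in detail in the sections that follow.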