
Time for action – configuring the pseudo-distributed mode
Take a look in the conf directory within the Hadoop distribution. There are many configuration files, but the ones we need to modify are core-site.xml, hdfs-site.xml, and mapred-site.xml.
- Modify core-site.xml to look like the following code:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```
- Modify hdfs-site.xml to look like the following code:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
- Modify mapred-site.xml to look like the following code:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```
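Before moving on, it can be worth double-checking that the property values were saved as intended. The following is a minimal sketch that parses Hadoop-style configuration XML and extracts the name/value pairs; the file contents are embedded as strings here for illustration, but in practice you would read them from the files in the conf directory.

```python
# Sketch: parse Hadoop-style configuration XML and collect the
# <name>/<value> pairs from each <property> element.
import xml.etree.ElementTree as ET

def properties(xml_text):
    """Return a dict of property name -> value from a Hadoop config file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

# Embedded copies of the three configurations set above.
core_site = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""

hdfs_site = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

mapred_site = """<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>"""

print(properties(core_site)["fs.default.name"])    # hdfs://localhost:9000
print(properties(hdfs_site)["dfs.replication"])    # 1
print(properties(mapred_site)["mapred.job.tracker"])  # localhost:9001
```

To check the real files, replace the embedded strings with, for example, `open("conf/core-site.xml").read()`.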
What just happened?
The first thing to note is the general format of these configuration files. They are obviously XML and contain multiple property specifications within a single configuration element.
The property specifications always contain name and value elements with the possibility for optional comments not shown in the preceding code.
We set three configuration variables here:
- The fs.default.name variable holds the location of the NameNode and is required by both HDFS and MapReduce components, which explains why it's in core-site.xml and not hdfs-site.xml.
- The dfs.replication variable specifies how many times each HDFS block should be replicated. Recall from Chapter 1, What It's All About, that HDFS handles failures by ensuring each block of filesystem data is replicated to a number of different hosts, usually 3. As we only have a single host and one DataNode in the pseudo-distributed mode, we change this value to 1.
- The mapred.job.tracker variable holds the location of the JobTracker, just as fs.default.name holds the location of the NameNode. Because only the MapReduce components need to know this location, it is in mapred-site.xml.
The network addresses for the NameNode and the JobTracker specify the ports on which the actual system requests should be directed. These are not user-facing locations, so don't bother pointing your web browser at them. There are web interfaces that we will look at shortly.
Configuring the base directory and formatting the filesystem
Whether the pseudo-distributed or fully distributed mode is chosen, there are two steps that need to be performed before we start our first Hadoop cluster:
- Set the base directory where Hadoop files will be stored.
- Format the HDFS filesystem.
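As a preview of the first step, the base directory is controlled by the hadoop.tmp.dir property, which can be overridden in core-site.xml alongside fs.default.name. The following is a sketch of such an override; the /var/lib/hadoop path is just an illustrative choice, not a requirement.

```xml
<!-- Illustrative override: store all Hadoop data under /var/lib/hadoop
     instead of the default temporary location. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop</value>
</property>
```

Setting this explicitly is useful because the default lives under a temporary directory that may be cleared on reboot. Both steps are covered in detail in the sections that follow.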