
While performing a distcp, it is frequently necessary to specify Namenode HA information for the remote cluster. Recently, I had a requirement to move data from a production cluster to a DR cluster using distcp. In my case, both the production and DR clusters use Namenode HA.

To get distcp to recognize the HA information for the remote cluster, the hdfs-site.xml file needs to be modified. There are two ways of achieving this:

  1. Creating a copy of the hdfs-site.xml file – this can be done if the developer doesn't have access to modify the default hdfs-site.xml for the cluster. To use this option, a new config directory should be created, and at a minimum the following files need to be copied into it:
    hdfs-site.xml
    hadoop-env.sh

 Now, the config directory can be specified on the command line, for example:

hadoop --config <configdir> distcp hdfs://<serviceId1>/path1 hdfs://<serviceId2>/path2
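The full sequence for this option might look like the sketch below; /etc/hadoop/conf and /home/user/distcp-conf are assumed locations, so adjust the paths and service IDs for your environment.

# Create a separate config directory and seed it with the cluster defaults
mkdir -p /home/user/distcp-conf
cp /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/hadoop-env.sh /home/user/distcp-conf/

# Edit /home/user/distcp-conf/hdfs-site.xml to add the remote nameservice
# (see the property entries below), then point hadoop at the new directory
hadoop --config /home/user/distcp-conf distcp hdfs://serviceId1/path1 hdfs://serviceId2/path2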

  2. Modifying the default hdfs-site.xml file for the cluster – this allows any hadoop command to accept an HA Namenode for the remote cluster. Using Ambari, the new properties can be added to the Custom hdfs-site section.

    Note: this only works with Ambari 2.4.0 and above.
    https://issues.apache.org/jira/browse/AMBARI-15507

In either case, the following changes need to be made to the hdfs-site.xml file.

The dfs.nameservices property should be modified to include multiple nameservices, listed as comma-separated values:

 <property>
    <name>dfs.nameservices</name>  
    <value>serviceId1,serviceId2</value>
 </property>

For any new Nameservice, the following entries should be added to hdfs-site.xml:

  <property>
      <name>dfs.client.failover.proxy.provider.serviceId2</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
      <name>dfs.ha.namenodes.serviceId2</name>    
      <value>nn1,nn2</value>
  </property>
  <property>
      <name>dfs.namenode.rpc-address.serviceId2.nn1</name>
      <value>nn1.com:8020</value>
  </property>
  <property>
      <name>dfs.namenode.servicerpc-address.serviceId2.nn1</name>
      <value>nn1.com:54321</value>
  </property>
  <property>
      <name>dfs.namenode.http-address.serviceId2.nn1</name>
      <value>nn1.com:50070</value>
  </property>
  <property>
      <name>dfs.namenode.https-address.serviceId2.nn1</name>
      <value>nn1.com:50470</value>
  </property>
  <property>
      <name>dfs.namenode.rpc-address.serviceId2.nn2</name>
      <value>nn2.com:8020</value>
  </property>
  <property>
      <name>dfs.namenode.servicerpc-address.serviceId2.nn2</name>
      <value>nn2.com:54321</value>
  </property>
  <property>
      <name>dfs.namenode.http-address.serviceId2.nn2</name>
      <value>nn2.com:50070</value>
  </property>
  <property>
      <name>dfs.namenode.https-address.serviceId2.nn2</name>  
      <value>nn2.com:50470</value>
  </property>
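Once the remote Nameservice is defined, it can be referenced by its service ID in any HDFS path. As a quick sanity check before the copy (using the serviceId1/serviceId2 names from the snippets above; substitute your own service IDs and paths):

# List the root of the remote nameservice to confirm the HA config resolves
hdfs dfs -ls hdfs://serviceId2/

# Run the copy between the two HA nameservices
hadoop distcp hdfs://serviceId1/path1 hdfs://serviceId2/path2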

Voilà! These techniques can be extended to get multiple Nameservices recognized by any hadoop command. This can be very beneficial when working with multiple clusters from a single machine.
