Updated on 2022-06-01 GMT+08:00

Rules

Configure an HDFS NameNode Metadata Storage Path

The default path for storing NameNode metadata is ${BIGDATA_DATA_HOME}/namenode/data. The ${BIGDATA_DATA_HOME} parameter is used to configure the path for storing HDFS metadata.

Enable NameNode Image Backup for HDFS

Set the fs.namenode.image.backup.enable parameter to true so that the system periodically backs up NameNode data.

Set an HDFS DataNode Data Storage Path

The default storage path of DataNode data is ${BIGDATA_DATA_HOME}/hadoop/dataN/dn/datadir (N ≥ 1), where N indicates the number of directories for storing data.

Example: ${BIGDATA_DATA_HOME}/hadoop/data1/dn/datadir, ${BIGDATA_DATA_HOME}/hadoop/data2/dn/datadir

After the storage path is set, data is stored in the corresponding directory of each mounted disk on a node.

Improve HDFS Read/Write Performance

The data write process is as follows: after receiving service data and obtaining the block ID and DataNode locations from the NameNode, the HDFS client contacts the DataNodes and establishes a write pipeline with them. The HDFS client then writes data to DataNode1 over a proprietary protocol, and DataNode1 forwards the data to DataNode2 and DataNode3 (three replicas). After the data is written, an acknowledgment is returned to the HDFS client.

  1. Set a proper block size. For example, set dfs.blocksize to 268435456 (256 MB). A client-side sketch follows this list.
  2. Caching large volumes of data that will not be read again in the operating system buffer is of no benefit. Set the following parameters to false:

    dfs.datanode.drop.cache.behind.reads and dfs.datanode.drop.cache.behind.writes
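
The following is a minimal sketch of item 1 only; it is not part of the original example set. The target path /tmp/largefile.dat, the replication factor, and the I/O buffer size are illustrative values, and the block size can be set either in the client Configuration or per file when the file is created.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Applies to files created by this client when no explicit block size is passed.
        conf.setLong("dfs.blocksize", 268435456L);   // 256 MB
        FileSystem hdfs = FileSystem.get(conf);
        // Alternatively, pass the block size explicitly when creating a file.
        FSDataOutputStream out = hdfs.create(new Path("/tmp/largefile.dat"),   // illustrative path
                true,              // overwrite
                4096,              // I/O buffer size
                (short) 3,         // replication
                268435456L);       // block size: 256 MB
        try {
            out.writeBytes("sample data");
        } finally {
            out.close();
        }
    }
}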

Set a Path for Storing MapReduce Intermediate Files

Only one path is used to store MapReduce intermediate files by default, that is, ${hadoop.tmp.dir}/mapred/local. You are advised to store intermediate files on each disk, for example, /hadoop/hdfs/data1/mapred/local, /hadoop/hdfs/data2/mapred/local, and /hadoop/hdfs/data3/mapred/local. Directories that do not exist are automatically ignored.

Release Applied Resources in finally During Java Development

Release the HDFS resources that the application has applied for in the finally block of a try/finally statement; releasing them only after the try statement is not sufficient, because a resource leak occurs if an exception is thrown. A minimal sketch follows.
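
The following sketch is not part of the original example set. The file path /tmp/input.txt is a placeholder, and org.apache.hadoop.io.IOUtils.closeStream is used here only as one convenient way to close the resources in the finally block.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithFinally {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = null;
        FSDataInputStream in = null;
        try {
            hdfs = FileSystem.get(conf);
            in = hdfs.open(new Path("/tmp/input.txt"));   // placeholder path
            // ... read from the stream ...
        } finally {
            // Executed even if an exception is thrown above, so the resources are always released.
            IOUtils.closeStream(in);
            IOUtils.closeStream(hdfs);
        }
    }
}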

Overview of HDFS File Operation APIs

In Hadoop, all file operation classes are in the org.apache.hadoop.fs package. These APIs support operations such as opening, reading, writing, and deleting files. FileSystem is the API class that the Hadoop class library provides for users. FileSystem is an abstract class; concrete instances can be obtained only through its get method. The get method has multiple overloaded versions, of which the following is commonly used:

static FileSystem get(Configuration conf);
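
For reference, the standard FileSystem API also provides, for example, an overload that accepts an explicit URI, which is useful when the target file system is not the default one in the configuration:

static FileSystem get(URI uri, Configuration conf);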

This class encapsulates almost all file operations, such as mkdirs and delete. Based on the preceding analysis, a general framework for file operations in an application can be derived:

// General framework for operating on an HDFS file
void operate() throws IOException {
    // Obtain the Configuration object.
    Configuration conf = new Configuration();
    // Obtain the FileSystem object.
    FileSystem fs = FileSystem.get(conf);
    // Perform operations on the file.
}

HDFS Initialization Method

HDFS initialization is a prerequisite for using APIs provided by HDFS.

To initialize HDFS, load the HDFS service configuration file, implement Kerberos security authentication, and instantiate FileSystem. Obtain the keytab file for Kerberos authentication in advance. For details, see Preparing a Development User.

Correct example:

private void init() throws IOException {
    Configuration conf = new Configuration();
    // Read the configuration file.
    conf.addResource("user-hdfs.xml");
    // In security mode, perform Kerberos security authentication first.
    if ("kerberos".equalsIgnoreCase(conf.get("hadoop.security.authentication"))) {
        String PRINCIPAL = "username.client.kerberos.principal";
        String KEYTAB = "username.client.keytab.file";
        // Set the keytab file.
        conf.set(KEYTAB, System.getProperty("user.dir") + File.separator + "conf"
                + File.separator + conf.get(KEYTAB));
        // Set the Kerberos configuration file path.
        String krbfilepath = System.getProperty("user.dir") + File.separator + "conf"
                + File.separator + "krb5.conf";
        System.setProperty("java.security.krb5.conf", krbfilepath);
        // Perform login authentication.
        SecurityUtil.login(conf, KEYTAB, PRINCIPAL);
    }

    // Instantiate FileSystem.
    fSystem = FileSystem.get(conf);
}

Upload Local Files to HDFS

You can use FileSystem.copyFromLocalFile(Path src, Path dst) to upload local files to a specified directory in HDFS. src and dst indicate complete file paths.

Correct example:

public class CopyFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        // Local files
        Path src = new Path("D:\\HebutWinOS");
        // To the HDFS
        Path dst = new Path("/");
        hdfs.copyFromLocalFile(src, dst);
        System.out.println("Upload to " + conf.get("fs.default.name"));
        FileStatus files[] = hdfs.listStatus(dst);
        for (FileStatus file : files) {
            System.out.println(file.getPath());
        }
    }
}

Create a Directory in HDFS

You can use FileSystem.mkdirs(Path f) to create a directory in HDFS. f indicates a full path.

Correct example:

public class CreateDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path dfs = new Path("/TestDir");
        hdfs.mkdirs(dfs);
    }
}

Query the Last Modification Time of the HDFS File

You can use FileSystem.getFileStatus(Path f) and FileStatus.getModificationTime() to query the last modification time of a specified HDFS file.

Correct example:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    Path fpath = new Path("/user/hadoop/test/file1.txt");
    FileStatus fileStatus = hdfs.getFileStatus(fpath);
    long modiTime = fileStatus.getModificationTime();
    System.out.println("The modification time of file1.txt is " + modiTime);
}

Read All Files in an HDFS Directory

You can use FileSystem.listStatus(Path f) to list all files in a specified HDFS directory and FileStatus.getPath() to obtain the path of each file.

Correct example:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);

    Path listf = new Path("/user/hadoop/test");

    FileStatus stats[] = hdfs.listStatus(listf);
    for (int i = 0; i < stats.length; ++i) {
        System.out.println(stats[i].getPath().toString());
    }
    hdfs.close();
}

Query the Location of a File in an HDFS Cluster

You can use FileSystem.getFileBlockLocations(FileStatus file, long start, long len) to query the block locations of a specified file in an HDFS cluster. file indicates the FileStatus of the file, and start and len specify the byte range (start offset and length) to query.

Correct example:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    Path fpath = new Path("/user/hadoop/cygwin");

    FileStatus filestatus = hdfs.getFileStatus(fpath);
    BlockLocation[] blkLocations = hdfs.getFileBlockLocations(filestatus, 0, filestatus.getLen());

    int blockLen = blkLocations.length;
    for (int i = 0; i < blockLen; i++) {
        String[] hosts = blkLocations[i].getHosts();
        System.out.println("block_" + i + "_location:" + hosts[0]);
    }
}

Obtain the Names of All Nodes in the HDFS Cluster

You can use DistributedFileSystem.getDataNodeStats() and DatanodeInfo.getHostName() to obtain the names of all DataNodes in the HDFS cluster.

Correct example:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    DistributedFileSystem hdfs = (DistributedFileSystem) fs;
    DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();

    for (int i = 0; i < dataNodeStats.length; i++) {
        System.out.println("DataNode_" + i + "_Name:" + dataNodeStats[i].getHostName());
    }
}

Multithread Security Login Mode

If multiple threads perform login operations, only the first login of the application uses the login mode; after the first successful login, all subsequent logins, from any thread, must use the relogin mode. A sketch combining the two modes is provided after the relogin sample code.

Login sample code:

  private Boolean login(Configuration conf){
    boolean flag = false;
    UserGroupInformation.setConfiguration(conf);
    
    try {
      UserGroupInformation.loginUserFromKeytab(conf.get(PRINCIPAL), conf.get(KEYTAB));
      System.out.println("UserGroupInformation.isLoginKeytabBased(): " +UserGroupInformation.isLoginKeytabBased());
      flag = true;
    } catch (IOException e) {
      e.printStackTrace();
    }
    return flag;
    
  }

Relogin sample code:

public Boolean relogin(){
        boolean flag = false;
        try {
            
          UserGroupInformation.getLoginUser().reloginFromKeytab();
          System.out.println("UserGroupInformation.isLoginKeytabBased(): " +UserGroupInformation.isLoginKeytabBased());
          flag = true;
        } catch (IOException e) {
            e.printStackTrace();
        }
        return flag;
    }
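
As a minimal sketch (not part of the original samples), the two methods above could be combined behind a synchronized entry point so that only the first caller performs the keytab login and every later caller, regardless of thread, relogs in. The loggedIn flag and the ensureLogin method name are illustrative, and the login and relogin methods are the ones defined in the samples above.

  // Illustrative only: guards the first login and routes later calls to relogin.
  private volatile boolean loggedIn = false;

  public synchronized Boolean ensureLogin(Configuration conf) {
    if (!loggedIn) {
      // The first successful login of the application uses the login mode.
      loggedIn = login(conf);
      return loggedIn;
    }
    // Every subsequent login, from any thread, uses the relogin mode.
    return relogin();
  }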