Updated on 2022-06-01 GMT+08:00

MapReduce Introduction

Hadoop MapReduce is an easy-to-use parallel computing software framework. Applications developed on MapReduce can run on large clusters of thousands of servers and process terabyte-scale data sets concurrently, with built-in fault tolerance.

A MapReduce job (also called an application) splits the input data set into independent data blocks, which are processed in parallel by Map tasks. The framework sorts the Map outputs, passes them to Reduce tasks, and returns the final result to the client. Both input and output data are stored in HDFS. The framework also schedules and monitors all tasks and re-executes any that fail.
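The Map, sort/shuffle, and Reduce phases described above can be sketched in miniature. The following is an illustrative, single-process word-count model, not Hadoop code: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `run_job`) are hypothetical, and a real MapReduce job would implement Mapper and Reducer classes in Java and read its input from HDFS.

```python
from itertools import groupby

def map_phase(record):
    """Map task: emit a (word, 1) pair for every word in an input record."""
    for word in record.split():
        yield (word, 1)

def shuffle_and_sort(mapped_pairs):
    """Framework step: sort Map output by key and group the values per key."""
    pairs = sorted(mapped_pairs, key=lambda kv: kv[0])
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]

def reduce_phase(key, values):
    """Reduce task: aggregate all values emitted for one key."""
    return key, sum(values)

def run_job(records):
    """Run the whole pipeline: map every record, shuffle, then reduce."""
    mapped = [kv for record in records for kv in map_phase(record)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(mapped))

counts = run_job(["to be or not", "to be"])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In a real cluster, each input data block is handled by its own Map task and each key range by its own Reduce task, so the same three phases run spread across many servers rather than in one process.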

MapReduce has the following characteristics:

  • Large-scale parallel computing
  • Large data set processing
  • High fault tolerance and reliability
  • Efficient resource scheduling