To prepare next year's plan, I surveyed the research topics and technologies around Hadoop YARN and HDFS. My notes follow.
Since Hadoop YARN was introduced, the next-generation technologies built on it have been discussed continuously. To understand how YARN works, please refer to the post [1].
YARN [2][3] ships with a Capacity Scheduler, org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler [4], which lets the Hadoop ecosystem manage its resources. It also provides a resource calculator, org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator [5], to compare and allocate the resources of each compute node. Note that DefaultResourceCalculator takes only memory into account; to account for CPU (vcores) as well, the scheduler has to be switched to DominantResourceCalculator.
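To make the role of the resource calculator more concrete, here is a minimal sketch of how many containers fit on one node under a memory-only policy versus a memory-plus-vcores policy. It is a toy model and does not use the Hadoop classes: the class name, the method names and the node and container sizes are all hypothetical. In a real cluster the calculator is chosen with the yarn.scheduler.capacity.resource-calculator property in capacity-scheduler.xml.

```java
// Toy model only: these are NOT the Hadoop ResourceCalculator classes.
public class ResourceCalculatorSketch {

    /** Memory-only policy, in the spirit of DefaultResourceCalculator. */
    static long memoryOnly(long availMemMb, long reqMemMb) {
        return availMemMb / reqMemMb;
    }

    /** Memory-and-vcores policy, in the spirit of DominantResourceCalculator. */
    static long memoryAndVcores(long availMemMb, int availVcores,
                                long reqMemMb, int reqVcores) {
        long byMemory = availMemMb / reqMemMb;
        long byVcores = (long) (availVcores / reqVcores);
        // The scarcer resource limits how many containers fit.
        return Math.min(byMemory, byVcores);
    }

    public static void main(String[] args) {
        // Hypothetical node: 64 GB of memory and 8 vcores.
        long availMemMb = 65536;
        int availVcores = 8;
        // Hypothetical container request: 2 GB of memory and 1 vcore.
        long reqMemMb = 2048;
        int reqVcores = 1;

        // Memory alone would allow 32 containers.
        System.out.println("memory only:     " + memoryOnly(availMemMb, reqMemMb));
        // Taking vcores into account limits the node to 8 containers.
        System.out.println("memory + vcores: "
                + memoryAndVcores(availMemMb, availVcores, reqMemMb, reqVcores));
    }
}
```

With these numbers the memory-only policy oversubscribes the CPU by a factor of four, which is the usual reason for changing the calculator on CPU-bound clusters.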
System administrators and developers who are curious about how resource allocation and the schedulers operate should see the reports [6], [7] and [8].
On the research side, Project HaSTE [9] proposes a new Hadoop YARN scheduling algorithm that aims to use resources efficiently when scheduling map/reduce tasks and to reduce the makespan of MapReduce jobs.
On the other hand, HDFS [10] is the file system most commonly used with Hadoop. However, a client normally reads and writes data through a TCP socket connection, so I/O performance is lower than reading directly from the local disk without a network connection. For this reason, HDFS provides a feature called Short-Circuit Local Reads [11] and also provides a native library for accessing the HDFS file system directly. According to the report [12], short-circuit reads give better I/O performance than reads over TCP.
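As a rough illustration of what enabling short-circuit reads involves, the sketch below renders the two properties described in [11], dfs.client.read.shortcircuit and dfs.domain.socket.path, as an hdfs-site.xml fragment. The helper class and its methods are hypothetical and use only the JDK, and the socket path is just an example value that has to match the actual DataNode setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: prints an hdfs-site.xml fragment for short-circuit reads.
public class ShortCircuitConfigSketch {

    /** The two properties that turn on short-circuit local reads. */
    static Map<String, String> shortCircuitProps(String domainSocketPath) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("dfs.client.read.shortcircuit", "true");
        props.put("dfs.domain.socket.path", domainSocketPath);
        return props;
    }

    /** Renders properties in the Hadoop *-site.xml layout (values are not XML-escaped). */
    static String toHadoopXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        // Example socket path only; adjust it to the local installation.
        System.out.print(toHadoopXml(shortCircuitProps("/var/lib/hadoop-hdfs/dn_socket")));
    }
}
```

The same two properties need to be visible to both the DataNode and the client, and the native libhadoop library has to be available, as described in [11].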
P.S. Another trick for improving the I/O performance of HDFS is CombineFileInputFormat [13][14], which packs many small files into fewer input splits. But I don’t think this method is better than using Short-Circuit.
References
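The idea behind combining inputs can be sketched as a greedy packing of small files into splits with a size cap. This is a simplified model, not the actual CombineFileInputFormat algorithm, which also considers node and rack locality and works at block level. The class name, method name and file sizes below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of combining small files into fewer input splits.
public class CombineSplitsSketch {

    /**
     * Greedily packs files, in order, into splits of at most maxSplitBytes.
     * A single file larger than the cap gets a split of its own.
     */
    static List<List<Long>> combine(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would push it over the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Eight hypothetical 16 MB files with a 64 MB cap per split.
        long[] files = {16 * mb, 16 * mb, 16 * mb, 16 * mb,
                        16 * mb, 16 * mb, 16 * mb, 16 * mb};
        List<List<Long>> splits = combine(files, 64 * mb);
        // Eight files end up in 2 splits, so 2 map tasks instead of 8.
        System.out.println("splits: " + splits.size());
    }
}
```

Fewer splits mean fewer map tasks and less per-task overhead, but every byte is still read through the normal HDFS read path.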
[1] Karthik Kambatla, Wing Yew Poon, and Vikram Srivastava, “How Apache Hadoop YARN HA Works,” Cloudera. Available: [Online] http://blog.cloudera.com/blog/2014/05/how-apache-hadoop-yarn-ha-works/
[2] Hadoop, “Hadoop MapReduce Next Generation – Capacity Scheduler,” Apache Hadoop. Available: [Online] http://hadoop.apache.org/docs/r2.5.2/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Configuration
[3] skyWalker_ONLY, “Learning Hadoop 2.4.1: The Capacity Scheduler” (in Chinese). Available: [Online] http://blog.csdn.net/skywalker_only/article/details/41351147
[4] GrepCode, “CapacityScheduler,” GrepCode.com. Available: [Online] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.6.0/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java/
[5] GrepCode, “DefaultResourceCalculator,” GrepCode.com. Available: [Online] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-common/2.6.0/org/apache/hadoop/yarn/util/resource/DefaultResourceCalculator.java/
[6] Vinod Kumar Vavilapalli, “Resource Localization in YARN: Deep Dive,” Hortonworks. Available: [Online] http://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/
[7] SEQUENCEIQ, “YARN Schedulers demystified – Part 1: Capacity.” Available: [Online] http://blog.sequenceiq.com/blog/2014/07/22/schedulers-part-1/
[8] SEQUENCEIQ, “YARN Schedulers demystified – Part 2: Fair.” Available: [Online] http://blog.sequenceiq.com/blog/2014/09/09/yarn-schedulers-demystified-part-2-fair/
[9] Bo Sheng, “Project HaSTE: Hadoop YARN Scheduling Based on Task-Dependency and Resource-Demand,” The 7th IEEE International Conference on Cloud Computing, Anchorage, AK, June 2014. Available: [Online] http://www.cs.umb.edu/~shengbo/research/haste.html
[10] Hadoop, “HDFS User Guide,” Apache Hadoop. Available: [Online] http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Related_Documentation
[11] Hadoop, “HDFS Short-Circuit Local Reads,” Apache Hadoop. Available: [Online] https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
[12] Colin McCabe, “How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop,” Cloudera. Available: [Online] http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
[13] Dhruba Borthakur, “HDFS block replica placement in your hands now!” Available: [Online] http://hadoopblog.blogspot.de/2009/09/hdfs-block-replica-placement-in-your.html
[14] Hadoop, “Class CombineFileInputFormat<K,V>,” Apache Hadoop. Available: [Online] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html