HDFS
- suitable
- very large size, terabyte, petabyte
- write once and read many times
- handle node failure without noticeable interruption
- not suitable for some applications with,
- low-latency data access, HBase
- lots of small files
- multiple writers and arbitrary modification. only append to the end of file
- Block
replication on block
- namenode and datanode
- namenode - filesystem namespace image and edit logs which are written to multiple file systems or NFS. they are critical and cannot be lost
- namenode - block pool having blocks reported from datanodes when system startup
- secondary namenode - merge edit logs to the namespace image. when primary namenode is down, copy namespace image and edit logs from NFS and make it as primary namenode. Then,
- load the image to memory
- replay edit logs
- receive enough block reports from datanodes to leave safe mode
- block can be cached in datanode's memory on per-file basis
- HDFS Federation
- map file namespaces to different namenodes, like /usr and /share
- block pools in namenodes are not partitioned. they get block reports from same datanodes if they register with the namenodes.
- client uses mount table to map path to namenodes
- HA (High Availability)
- primary namenode and standby namenode share the storage for the image and edit logs
- datanodes send block reports to the both namenodes
- standby namenode does the merge of edit logs
- zookeeper to select namenode. failover and fencing
- Java API
- FileSystem, Path, FSDataInputStream, FSDataOutputStream, FileStatus
- fis = fs.open, fos = fs.create, fos = fs.append
- Anatomy of file read
- read block by block.
- close connection to datanode when its block reading is done
- network topology need to be set for hadoop, same node -> same rack -> same data center -> different data centers
- anatomy of file write
- arrange blocks (by DataStreamer), the first replica is in local, the second is in off-rack, the third is in different node in the same rack as the second.
- write completes on minimum replica requirement, usually 1. data queue and ack queue.
- coherency
- use hflush or hsync to make content visible to reader. close calls hflush