HDFS is a distributed file system and has the following properties:
1. It is optimized for streaming access of large files. You would typically store files that are in the 100s of MB upwards on HDFS and access them through MapReduce to process them in batch mode.
2. HDFS files are write once files. You can append to files in some of the recent versions but that is not a feature that is very commonly used. Consider HDFS files as write-once and read-many files. There is no concept of random writes.
3. HDFS doesn’t do random reads very well.
HBase on the other hand is a database that stores it’s data in a distributed filesystem. The filesystem of choice typically is HDFS owing to the tight integration between HBase and HDFS. Having said that, it doesn’t mean that HBase can’t work on any other filesystem. It’s just not proven in production and at scale to work with anything except HDFS.
HBase provides you with the following:
1. Low latency access to small amounts of data from within a large data set. You can access single rows quickly from a billion row table.
2. Flexible data model to work with and data is indexed by the row key.
3. Fast scans across tables.
4. Scale in terms of writes as well as total volume of data.
An analogous comparison would be between MySQL and Ext4.