The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
The goals of HDFS
- Hardware Failure Is the Norm Rather than the Exception
- Streaming Data Access
- Large Data Sets
- Simple Coherency Model
- Moving Computation is Cheaper than Moving Data
- Portability Across Heterogeneous Hardware and Software Platforms
HDFS is designed to reliably store very large files across machines in a large cluster.
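To make this concrete, the sketch below writes and then reads a file through Hadoop's Java FileSystem API. It is a minimal sketch, not a full client: the NameNode URI (hdfs://namenode:9000) and the file path are placeholder assumptions rather than values from this text.

```java
// Minimal sketch: write and read one file via the HDFS Java API.
// The NameNode address and the path below are placeholder assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster; hdfs://namenode:9000 is an assumed address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write a file; HDFS splits it into blocks and replicates each block
        // across DataNodes according to the configured replication factor.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back as a stream, the access pattern HDFS is optimized for.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```

Behind the create and open calls, the client asks the NameNode for metadata and then streams block data directly to and from DataNodes, which is what lets the file system scale across a large cluster.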
MapReduce Software Framework
Offers a clean abstraction between data-analysis tasks and the underlying systems challenges involved in ensuring reliable, large-scale computation.
- Processes large jobs in parallel across many nodes and combines results.
- Eliminates the bottlenecks imposed by monolithic storage systems.
- Collates and digests the results into a single output after each piece has been analyzed (see the word-count sketch after this list).
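As an illustration of that split-process-collate flow, here is the canonical word-count job written against the Hadoop MapReduce Java API. It is a minimal sketch: the input and output paths are taken from the command line and assumed to be HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper processes one split of the input in parallel
    // and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups values by key, so each reducer
    // collates the partial counts for one word into a single result.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each mapper works on one input split in parallel, typically on the node that already holds that block of data; the framework then shuffles the (word, count) pairs by key, and the reducer collates them into the single final output described above.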