Storage and Data Formats
Traditional data warehousing relied on relational databases as both the primary storage system and the primary data format. A key concept of the data lake is the ability to reliably store a large amount of data. Such volumes are typically much larger than a traditional relational database can handle, or at least larger than it can handle cost-effectively. To this end, the underlying data storage must be scalable and reliable.
The Hadoop Distributed File System (HDFS) has matured into one of the leading storage technologies for reliably persisting large-scale data. However, other storage technologies can also serve as the data-store backend for the data lake. Open source systems such as Cassandra, HBase, and MongoDB can provide reliable storage. Alternatively, cloud-based storage services can be used as the backend, including Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage.
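Because several backends can fill the same role, one way to picture the arrangement is a thin storage interface that the rest of the data lake codes against. The following is a minimal sketch under stated assumptions, not a description of any particular product: the names `ObjectStore`, `LocalDirStore`, `InMemoryStore`, and `ingest` are hypothetical, and the two backends shown are local stand-ins. An HDFS- or S3-backed class would implement the same two methods using that system's own client library.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ObjectStore(ABC):
    """Hypothetical minimal interface a data lake might require of a backend."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Persist data under the given key."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the bytes previously stored under the given key."""


class LocalDirStore(ObjectStore):
    """Stand-in backend: each key becomes a file under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        # Create intermediate directories so keys may contain slashes.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryStore(ObjectStore):
    """Stand-in backend: keeps objects in a dictionary (useful for tests)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def ingest(store: ObjectStore, key: str, payload: bytes) -> int:
    """Write payload through whichever backend was supplied.

    Reads the object back and returns its size in bytes, so the caller
    can confirm the round trip without knowing which backend is in use.
    """
    store.put(key, payload)
    return len(store.get(key))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for backend in (LocalDirStore(root), InMemoryStore()):
            size = ingest(backend, "raw/events/sample.json", b'{"a":1}')
            print(type(backend).__name__, size)
```

The calling code in `ingest` never names a concrete backend, which is the property that lets a data lake swap a distributed file system for a cloud object store without changing its ingestion logic.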