HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
- HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
- HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is https://hadoop.apache.org/hdfs/.
Assumptions and Goals
- Tolerate hardware failures.
- Streaming data access; emphasizes high throughput over low-latency access.
- Large data sets, from gigabytes up to petabytes. Supports large files, millions of files in a single instance, and hundreds of nodes in a single cluster.
- Write-once, read-many: files are not updated after they are written.
- To reduce data movement, interfaces are provided to move computation close to the data.
- Highly portable across platforms (written in Java), which broadens its applicability.
架构
- Master/slave architecture
  - One master node
    - Manages the file system namespace
    - Regulates file access by clients
  - Multiple slave nodes
    - One HDFS node per machine node in the cluster
    - Manage the actual file storage
    - Operate on blocks as the unit of storage
The File System Namespace
- Supports a traditional hierarchical file organization
- Features not yet supported (though the architecture does not preclude adding them):
  - Storage quotas
  - Hard and soft links
Data Replication
- HDFS is designed to reliably store very large files across the machines of a cluster
- Files are stored as a sequence of blocks; all blocks except the last one are the same size
- Blocks are replicated to improve tolerance of hardware failures
- The block size and replication factor of a file are configured independently
- An application can set the replication factor of a file
- The replication factor is specified when the file is created and can be changed later (see the sketch after this list)
- The only step the master node does not take part in is the data transfer itself; block data moves directly between data nodes
- Data to be written is split into portions, which are further split into blocks
- Steps
  - Write: choosing where to place replicas requires a lot of tuning and experience
  - Read: read from the nearest replica, preferring the same rack, then the same data center
  - Safe mode: while the master node is in safe mode, block replication does not happen. Once the replica counts reported by data nodes have met the configured minimum and held for 30 seconds, the master node exits safe mode and then determines which blocks (if any) still need additional replicas.
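A minimal sketch of controlling the replication factor through the Hadoop FileSystem Java API; the file path and concrete values are assumptions for illustration, not part of the notes above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up fs.defaultFS from the classpath
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/data.txt");    // hypothetical path
            // create(path, overwrite, bufferSize, replication, blockSize):
            // the replication factor is set per file at creation time
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
                out.writeUTF("hello hdfs");
            }
            // ...and can be changed later; the master node schedules the
            // extra copies (or removals) asynchronously
            fs.setReplication(file, (short) 2);
        }
    }
}
```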
Communication Protocols
- Layered on top of TCP/IP; uses RPC
- The master node never initiates RPC connections; it only responds to requests from data nodes and clients
Robustness
- Data nodes are kept alive via heartbeats; once heartbeats are lost, the data on that node is considered unavailable
- Disk failures, lost heartbeats, and an increased replication factor can all lead the master node to start re-replicating blocks
- Rebalancing: automatically moving data based on remaining space, or raising the replication factor of individual files in high demand, is not yet implemented
- Checksums of each block detect data corruption caused by disks or the network
- FsImage and EditLog are the central data structures of HDFS
  - The master node can maintain multiple copies of them
  - An update to any copy is applied to all of them synchronously
- A master node failure is the single point of failure of HDFS; manual intervention is required when it happens
- HDFS does not currently support snapshots, but will in the future
Data Management
How to access data
- Java API (see the sketch after this list)
- WebDAV (under development)
- FS Shell
- DFSAdmin
- Browser interface
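A minimal sketch of reading a file through the Java API; the cluster address and path are assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed master node address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/demo/data.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);                    // stream the file line by line
            }
        }
    }
}
```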
Space Reclamation
- A deleted file is first moved to the /trash directory; after a timeout (6 hours by default) it is permanently removed and the space on the data nodes is reclaimed
- When the replication factor of a file is lowered, the master node tells data nodes to drop the excess replicas in its heartbeat responses, so the freed space does not show up immediately
Rack vs Data center
A Data Center is a collection of Racks. A Rack is a collection of Servers.
Cluster -> Data center(s) -> Rack(s) -> Server(s)
Hadoop
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
Core Components
- HDFS
- MapReduce
- YARN
- Submarine
Related Projects
- Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A Scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- Submarine: A unified AI platform which allows engineers and data scientists to run Machine Learning and Deep Learning workload in distributed cluster.
- Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
- ZooKeeper™: A high-performance coordination service for distributed applications.
ZooKeeper
Because Coordinating Distributed Systems is a Zoo
- ZooKeeper is a distributed, open-source coordination service for distributed applications.
- It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.
- Easy to program against; the namespace is organized like a file system directory tree
- Written in Java; client bindings exist for Java and C
- Data is kept in memory
  - The servers maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
- Every transaction is stamped with a sequence number.
- Performs best when reads dominate writes, at a ratio of around 10:1.
Znode
- Carries a stat structure with version numbers
- Reads and writes are atomic
- Has an ACL
- Ephemeral nodes exist only as long as the session that created them is active; used for session management?
- Supports watches; when a watch is triggered, a notification is sent to the watcher. Used for notifications?
Guarantees
- Sequential consistency
- Atomicity
- Single system image
- Reliability
- Timeliness
API
- create
- delete
- exists
- get data
- set data
- get children
- sync
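A minimal sketch of these operations with the ZooKeeper Java client; the server address and znode path are assumptions, and a real client would wait for the connected session event before issuing requests.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // connect; the Watcher passed here receives session events
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000,
                event -> System.out.println("session event: " + event));

        // create: a persistent znode holding a small data register
        zk.create("/demo", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists with a watch: the callback fires once when /demo changes
        zk.exists("/demo", event -> System.out.println("changed: " + event));

        // get data / set data: reads and writes are atomic and versioned
        byte[] data = zk.getData("/demo", false, null);
        System.out.println(new String(data));
        zk.setData("/demo", "v2".getBytes(), -1);   // -1 matches any version

        // delete the znode and close the session
        zk.delete("/demo", -1);
        zk.close();
    }
}
```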
Terminology
- The name space consists of data registers - called znodes
- ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
- We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
- ZooKeeper also has the notion of ephemeral nodes.
YARN
- The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons.
- The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
- The ResourceManager and the NodeManager form the data-computation framework.
- The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The ResourceManager has two main components: Scheduler and ApplicationsManager.
- The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc.
- The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
- The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
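A minimal sketch of talking to the ResourceManager through the YarnClient API, just to list the applications it currently tracks; the cluster configuration (yarn-site.xml) is assumed to be on the classpath.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
        yarnClient.start();
        // ask the ResourceManager for the applications it knows about
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "\t"
                    + report.getName() + "\t"
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```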
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster-node, and an MRAppMaster per application.
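A lightly condensed version of the well-known WordCount example illustrates the map and reduce phases; the input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);                  // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));      // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```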
HBase
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware.
Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features
- Linear and modular scalability
- Strictly consistent reads and writes, rather than eventual consistency
- Automatic and configurable sharding of tables
- Automatic failover between RegionServers
- Convenient base classes for backing MapReduce jobs with HBase tables
- Easy-to-use Java API for client access (see the sketch after this list)
- Block cache and Bloom filters for real-time queries
- Query predicate push-down via server-side filters
- Thrift gateway and REST-ful web service supporting XML, Protobuf, and binary encodings
- Metrics can be exported to files or to Ganglia
- NoSQL
- Suited to big data
- Lacks the typed columns, secondary indexes, transactions, and SQL of an RDBMS
- Needs at least 5 data nodes plus 1 master node
- Can run on a laptop for development
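A minimal sketch of the HBase Java client API, writing and reading back one cell; the table name, column family, and ZooKeeper quorum address are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");   // assumed quorum address
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) {
            // write one cell: row "row1", column family "cf", qualifier "q"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
            table.put(put);

            // random, realtime read of the same cell
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```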
Technical Details
TODO