大数据开源组件概览

HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is https://hadoop.apache.org/hdfs/.

假设和目标

容忍硬件故障。
流式处理，低硬件要求换取高吞吐量。
大数据集，GB到PB级别。支持大文件。单个实例支持百万个文件。单个集群支持数百个节点。
写一次，读多次，不可更新。
为了减少数据移动，提供接口尽量使计算接近数据。
平台间可移植性高，提高适用性。（Java编写）

架构

主从模式
- 一个主节点
  - 文件系统命名空间
  - 规范文件访问
- 多个从节点
  - 集群中一个机器节点对应一个 HDFS 节点
  - 管理实际文件存储
  - 以块（block）为操作单位

文件系统命名空间

支持传统层级式文件系统
不排除会克服的特性
- 不支持存储空间配额
- 不支持软硬链接

数据复制

HDFS被设计于在集群机器间可靠存储非常大文件
以一系列块的形式存储。除了最后一个，所有的块大小都是相同的
数据复制是为了提高硬件故障容忍度
一个文件里的块大小和复制数是单独配置的
应用可以设置文件的复制数
复制数在文件创建的时候指定，可修改
主节点只有数据复制过程是不参与的
数据被拆成多个部分，进一步拆成多个块，
步骤
- 写入：需要大量调优经验
- 读取：就近读取，从 rack 到数据中心
- 安全模式：主节点处于安全模式时，块复制不会发生。在数据节点登记的块复制数超过配置的数量并且持续30s之后，主节点退出此模式，然后决定哪些节点需要接下来被复制。

通信协议

依赖于 TCP/IP，使用RPC
主节点永远不会主动发起RPC连接

健壮性

数据节点通过心跳保活，丢失心跳后上面数据被认为不可访问
硬盘故障、心跳丢失、复制数比率增加都可能造成主节点决定发起新的块复制
负载均衡：未实现根据剩余空间和对单个文件的访问需求自动调整复制数
通过块的哈希检查可能因硬盘或网络造成的数据不完整性
FsImage 和 EditLog 是 HDFS 的中心数据结构
- 主节点维护多份这样的数据
- 任意一份的更新都会被同步
主节点故障是HDFS的单点故障，发生后需要人工介入
HDFS 当前不支持快照，但将来会支持

数据管理

如何访问数据

JavaAPI
正在开发的 WebDAV
FS Shell
DFSAdmin
浏览器界面

空间释放

文件删除后会放到 /trash 目录，超时（默认6小时）删除并释放数据节点空间
降低复制数后，主节点在心跳响应里携带释放空间的消息，所以不是马上见效的

Rack vs Data center

A Data Center is a collection of Racks. A Rack is a collection of Servers.

Cluster -> Data center(s) -> Rack(s) -> Server(s)

Hadoop

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

内部组件

HDFS
MapReduce
YARN
Submarine

ZooKeeper

Because Coordinating Distributed Systems is a Zoo

ZooKeeper is a distributed, open-source coordination service for distributed applications.
It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.
易于编程，使用文件夹式的结构
Java 编写，有 Java 和 C 的接口绑定
数据存储于内存
They maintain an in-memory image of state, along with a transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
事物带有序号。
读写比是10:1。

Znode

带有状态版本号
读写都是原子的
有ACL
短暂节点被用于事物管理？
支持watch（监听），触发后向监听者发送消息。被用于通知？

保证

串行化一致性
原子性
单系统镜像
可靠性
及时性

API

create
delete
exists
get data
set data
get children
sync

术语

The name space consists of data registers - called znodes
ZooKeeper itself is intended to be replicated over a sets of hosts called an ensemble.
We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
ZooKeeper also has the notion of ephemeral nodes.

YARN

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons.
The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager and the NodeManager form the data-computation framework.
- The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The ResourceManager has two main components: Scheduler and ApplicationsManager.
  - The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc.
  - The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
- The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.

Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster-node, and MRAppMaster per application

HBase

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware.

Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

特性

可线性或模块化扩展
读写强一致性，而不是最终一致性
自动化或配置化分表
地区服务器间故障转移
有对MapReduce和HBase的封装类
易用的Java接口
块缓存、布隆过滤查询
预定义查询的服务端结果推送
支持 Thrift、REST-ful，支持 XML、Protobuf、二进制编码格式
可导出metrics到文件或Ganglia
NoSQL
适用于大数据
缺乏RDMS的有类型的字段、二级索引、事务、SQL
需要至少5个数据节点和1一个主节点
可在笔记本上运行用作开发

技术细节

TODO