why's blog

GlusterFS

Date: March 20, 2019 Category: Linux Services

Introduction to GlusterFS

What is GlusterFS

GlusterFS

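Created: 2006

Founder: Anand Babu Periasamy

Goal: replace the open-source Lustre and the commercial product GPFS

Features:

Distributed file system (POSIX compatible, that is, it follows the Portable Operating System Interface as it applies to file systems)
Decentralized architecture (no metadata server; all nodes play the same role)
Scale-out horizontal scaling (capacity and performance)
Resource pool (aggregated storage and memory)
Global unified namespace (physically dispersed storage resources are virtualized into a single pool)
Replication and automatic self-healing
Easy to deploy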

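History:

2006-2009: versions 1.0-3.0. Distributed file system, self-healing, synchronous replicas, striping, elastic hash algorithm
2010: version 3.1. Elastic cloud capabilities
2011: version 3.2. Remote replication (geo-replication), monitoring, quota. Acquired by Red Hat; this is also when more companies began running it in production
2012: version 3.3. Object storage, HDFS compatibility, proactive self-healing, fine-grained locking, replication optimizations
2013: version 3.4. POSIX ACL support, synchronous replication, virtual machine storage optimizations, quorum mechanism, libgfapi

Starting with version 3.4 in 2013, GlusterFS came into widespread use.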

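Advantages:

Software-defined storage
Decentralized architecture
Stacked user-space design

How GlusterFS implements storage

Elastic hash algorithm

No centralized metadata: removes the performance bottleneck and improves reliability
Files are located by a hash algorithm: based on path and file name, using a consistent-hashing DHT
Elastic volume management: files are stored in logical volumes carved out of the physical storage pool, and logical volumes can be grown or shrunk online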

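A hash function has two properties

One-way: it is irreversible; the output can be derived from the input, but not the input from the output
Collision resistance: it is hard to find an input whose hash equals a given, already known output

For a hash function used in production, a large enough sample yields effectively random and evenly distributed output

The lookup flow is

Each brick is assigned a hash range (reassigned when the cluster is expanded); a 32-bit hash value is computed with the Davies-Meyer algorithm, using the file name as input
The hash value selects the subvolume (storage server) in the cluster that holds the file
Data is accessed on the selected subvolume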
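
The three steps above can be sketched in shell. This is only an illustration under loud assumptions, not GlusterFS's actual code: the CRC-32 from cksum stands in for the Davies-Meyer hash, the 32-bit space is split into equal ranges, and the function names hash_name and pick_brick are made up for this example.

```shell
#!/usr/bin/env bash
# Simplified sketch of hash-range placement (NOT the real GlusterFS algorithm).
# Assumption: cksum's CRC-32 stands in for the Davies-Meyer hash.

# hash_name NAME -> 32-bit unsigned hash of the file name
hash_name() {
    printf '%s' "$1" | cksum | cut -d' ' -f1
}

# pick_brick NAME BRICK... -> prints the brick whose hash range contains NAME
pick_brick() {
    local name=$1
    shift
    local nbricks=$#
    local hash range index
    hash=$(hash_name "$name")
    # Split the 32-bit space into nbricks equal ranges. Ceiling division
    # keeps the computed index strictly below nbricks.
    range=$(( (4294967296 + nbricks - 1) / nbricks ))
    index=$(( hash / range ))
    shift "$index"
    printf '%s\n' "$1"
}

# Brick names mirror the hosts used later in this post.
for f in 0 1 2 3; do
    echo "$f -> $(pick_brick "$f" test01:/brick1/date test02:/brick1/date)"
done
```

Because the brick is derived from the file name alone, any client can locate a file without asking a metadata server, which is the point of the design.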

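After adding a node

Existing data stays where it is and new data is spread across all nodes (existing directories still map only to the original nodes; only newly created directories span both the original and the new nodes)
Running rebalance redistributes the data (hash ranges are reassigned and data is migrated)

After a file is renamed

A link file is used and access is redirected when the link is resolved (similar to a symlink; a later rebalance migrates the data)

Capacity-aware placement

Set a capacity threshold and prefer bricks with enough free space
Create a link file on the brick the hash points to
Accesses resolve the link and are redirected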
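
A rough sketch of that decision, assuming a much simpler rule than GlusterFS really applies: if the hashed brick has less free space than the threshold, the data goes to the brick with the most free space and a link is left on the hashed brick. The function place_file and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Simplified sketch of capacity-aware placement (an illustration only).
# place_file HASHED_INDEX THRESHOLD_PCT FREE_PCT...
#   HASHED_INDEX  index of the brick chosen by the hash
#   THRESHOLD_PCT minimum free space (percent) a brick needs to accept new data
#   FREE_PCT...   free space (percent) of brick0, brick1, ...
place_file() {
    local hashed=$1
    local threshold=$2
    shift 2
    local i=0 best=0 best_free=-1 hashed_free=0 free
    for free in "$@"; do
        if [ "$i" -eq "$hashed" ]; then hashed_free=$free; fi
        if [ "$free" -gt "$best_free" ]; then best=$i; best_free=$free; fi
        i=$((i + 1))
    done
    if [ "$hashed_free" -ge "$threshold" ] || [ "$best" -eq "$hashed" ]; then
        # Enough room (or nothing better available): store on the hashed brick.
        echo "data=brick$hashed link=none"
    else
        # Store elsewhere and leave a link on the hashed brick for lookups.
        echo "data=brick$best link=brick$hashed"
    fi
}

place_file 0 10 5 40 80   # prints: data=brick2 link=brick0
place_file 1 10 5 40 80   # prints: data=brick1 link=none
```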

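Metadata without a metadata server

File attributes: size, owner and group are written to the inode, which has limited room
Extended attributes: metadata is stored in extended attributes, next to the data itself; extended attributes are written and read with setfattr and getfattr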
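
As an illustration, the hash range of a directory on a brick is kept in the trusted.glusterfs.dht extended attribute. The helper below decodes such a value. It is a sketch under assumptions: decode_dht_layout is a made-up name, the sample value is invented, and the layout (16 bytes, with the last two 32-bit words holding the start and end of the range) is how the value is commonly described, so check it against your version.

```shell
#!/usr/bin/env bash
# Sketch: decode a directory's hash range from its trusted.glusterfs.dht xattr.
# On a brick, the raw value can be read (as root) with something like:
#   getfattr -n trusted.glusterfs.dht -e hex /brick1/date/test2disk
# Assumption: the value is 16 bytes and the last two 32-bit words are the
# start and end of the hash range this brick owns for that directory.

decode_dht_layout() {
    local hex=${1#0x}    # strip the leading 0x
    local start end
    if [ "${#hex}" -ne 32 ]; then
        echo "unexpected xattr length: ${#hex} hex digits" >&2
        return 1
    fi
    start=$(printf '%s' "$hex" | cut -c17-24)
    end=$(printf '%s' "$hex" | cut -c25-32)
    printf 'start=0x%s end=0x%s\n' "$start" "$end"
}

# Invented sample value covering the lower half of the 32-bit hash space.
decode_dht_layout 0x0000000100000000000000007fffffff
# prints: start=0x00000000 end=0x7fffffff
```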

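Stacked architecture:

Implemented by stacking layers behind a common interface

Basic concepts

Node/Peer: a node (server) in the cluster
Brick: the unit of storage, a directory on a node that holds the data
Volume: the file system, a logical collection of bricks
Translator: a feature layer that can be stacked on others

Open question: how does a network file system exchange data? (Over long-lived TCP connections.)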

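Volume types

Distributed (hash) volume

Essentially file-level RAID 0 with no fault tolerance (not identical to RAID 0: when one disk in a RAID 0 fails all data is lost, whereas here only the files on the failed brick are affected)

Replicated volume

Files are replicated synchronously to multiple bricks
File-level RAID 1 with fault tolerance (replication spans nodes, so unlike hardware RAID it also covers a node failure that makes its disks unreachable)

Writes are synchronous rather than asynchronous, so write performance suffers while read performance improves

Striped volume

A single file is spread across multiple bricks, which supports very large files
Similar to RAID 0, distributed in round-robin fashion
Typically used in HPC (high-performance or parallel computing) for highly concurrent access to very large files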
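
The round-robin layout of a striped volume comes down to a little arithmetic. The sketch below assumes fixed-size stripes handed out to bricks in order; stripe_brick is a hypothetical helper, and 131072 bytes corresponds to a 128KB stripe block.

```shell
#!/usr/bin/env bash
# Sketch: which brick holds a given byte of a striped file?
# stripe_brick OFFSET STRIPE_SIZE NBRICKS -> index of the brick (0-based)
stripe_brick() {
    local offset=$1
    local stripe_size=$2
    local nbricks=$3
    # Stripe number = offset / stripe_size; stripes rotate over the bricks.
    echo $(( (offset / stripe_size) % nbricks ))
}

for offset in 0 131072 262144 393216; do
    echo "offset $offset -> brick $(stripe_brick "$offset" 131072 3)"
done
# prints:
# offset 0 -> brick 0
# offset 131072 -> brick 1
# offset 262144 -> brick 2
# offset 393216 -> brick 0
```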

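GlusterFS access interfaces

File

FUSE: direct mount
NFS: network file system
SMB: file shares, using the CIFS protocol

Block

QEMU: natively supported by QEMU, for virtualization
Cinder: used directly by OpenStack

Transport

IP
RDMA: Remote Direct Memory Access. Data moves directly between the memory of two hosts over the network, bypassing the kernel network stack and most CPU involvement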

Swift

libgfapi

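Read/write data flow

Comparison of mainstream distributed systems

File system categories

Distributed file system: client-server architecture or network file system; the data is not directly attached to the local machine
Cluster file system: a subset of distributed file systems; multiple nodes serve cooperatively and there is no single point of failure
Parallel file system: supports parallel applications such as MPI; reads and writes are concurrent, and all nodes can read and write the same file at the same time

GlusterFS meets all of the conditions above

FastDFS is better suited to upload and download workloads

Simple to deploy and easy to manage; uses ext4 or ZFS underneath; suited to large-file storage, object storage and similar scenarios; light on memory and other resources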

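Use cases

Unstructured data (structured data: databases; unstructured data: files; semi-structured data: key-value data)
Archiving and disaster recovery
Virtual machine storage (virtualized block devices such as OpenStack's also count as unstructured data)
Cloud storage
Content cloud (network drives and the like)
Big data

Test plan

Functional testing
Data consistency testing (md5, diff)
POSIX semantics compliance testing
Deployment testing (automated installation, cluster deployment, etc.)
Availability testing (power loss, pulling disks, unplugging network cables)
Scalability testing
Stability testing (long runs at full load, checking that functionality and performance stay normal, e.g. LTP, IOzone, Postmark)
Stress testing (IOzone, Postmark, etc., observed with iostat, sar, etc.)
Performance testing (IOzone, Postmark, etc.; bandwidth is the metric for sequential large-file reads and writes, IOPS for random small-file reads and writes; cover directory creation and deletion, bulk small-file I/O and large-file I/O, and relate the results to load, iowait, etc.)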
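
The data consistency item can be as simple as comparing checksums before and after a copy. A minimal sketch, assuming md5sum is available; same_md5 is a made-up helper, and the temporary directory stands in for the Gluster mount so the snippet runs anywhere.

```shell
#!/usr/bin/env bash
# Sketch of an md5-based consistency check.
# same_md5 FILE_A FILE_B -> succeeds when both files have the same MD5 digest
same_md5() {
    local a b
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        return 2    # a missing or unreadable file is an error, not a match
    fi
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

tmp=$(mktemp -d)
printf 'hello gluster\n' > "$tmp/src"
# In a real test the copy would be written to the Gluster mount, e.g. /mnt.
cp "$tmp/src" "$tmp/copy"
if same_md5 "$tmp/src" "$tmp/copy"; then
    echo "consistent"
else
    echo "MISMATCH"
fi
rm -rf "$tmp"
# prints: consistent
```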

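Functional testing

Basic functional tests

Create, start, stop and delete volumes

File system level

File operations and control: fstest
System API calls: LTP
Lock usage: locktest

For storage, stability and reliability matter more than performance, except in scenarios that depend heavily on performance

Using Gluster

Installing Gluster

Download location

You may want to configure the yum repository yourself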

[glusterfs]
name=glusterfs
baseurl=https://buildlogs.centos.org/centos/7/storage/x86_64/gluster-3.12/
gpgcheck=0
enabled=1

Install the services

$ yum install rpcbind libaio lvm2-devel glusterfs glusterfs-cli glusterfs-libs glusterfs-api glusterfs-fuse glusterfs-server

Format the disks

$ mkfs.ext4 -L /brick1 /dev/vdb
$ mkfs.ext4 -L /brick2 /dev/vdc

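Mount the partitions automatically

# Create the mount points, register both bricks by label, then mount them
mkdir /brick1 /brick2
cat >> /etc/fstab << EOF
LABEL=/brick1 /brick1 ext4 defaults 1 2
LABEL=/brick2 /brick2 ext4 defaults 1 2
EOF
mount -a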

Start GlusterFS

$ systemctl start glusterd.service 
$ systemctl status glusterd.service

Add nodes to the cluster

$ gluster peer probe test02
peer probe: success. 
$ gluster peer probe test03
peer probe: success. 

$ gluster peer status
Number of Peers: 2

Hostname: test02
Uuid: ebaa94f9-7349-4ab2-af2c-1f5f7cbe3923
State: Peer in Cluster (Connected)

Hostname: test03
Uuid: 24f6c8a2-5029-4742-8c5b-77900702b7a6
State: Peer in Cluster (Connected)

Testing a distributed (hash) volume

Create the volume

$ gluster volume create testvol test01:/brick1/date test02:/brick1/date
volume create: testvol: success: please start the volume to access data
$ gluster volume info testvol

Volume Name: testvol
Type: Distribute
Volume ID: c82559eb-f7e0-453e-b712-b8c905ac6f46
Status: Created
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: test01:/brick1/date
Brick2: test02:/brick1/date
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

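The brick path given here must be a directory under a mounted brick file system (here /brick1/date). A path on the root file system is rejected unless force is appended to the command.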
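
A quick way to check a candidate brick path before creating the volume is to ask df which mount it belongs to. This is a sketch; brick_on_root_fs is a made-up helper and /data is a hypothetical path used only for contrast.

```shell
#!/usr/bin/env bash
# Sketch: does a brick path live on the root file system?
# brick_on_root_fs PATH -> succeeds (exit 0) when PATH would be on "/"
brick_on_root_fs() {
    local dir=$1
    local mnt
    # The brick directory may not exist yet, so climb to the nearest
    # existing ancestor before asking df.
    while [ ! -e "$dir" ]; do
        dir=$(dirname "$dir")
    done
    # df -P prints the mount point in the last column of the second line.
    mnt=$(df -P "$dir" | awk 'NR == 2 { print $NF }')
    [ "$mnt" = "/" ]
}

for p in /brick1/date /data; do
    if brick_on_root_fs "$p"; then
        echo "$p: on the root file system, not a good brick path"
    else
        echo "$p: on its own mount"
    fi
done
```

The result depends on how the machine is partitioned, so run it on each node that will host a brick.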

Start the volume

$ gluster volume start testvol
volume start: testvol: success

Mount

$ mount -t glusterfs test01:/testvol /mnt/
$ df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/vda1         50G  1.6G   46G   4% /
devtmpfs         486M     0  486M   0% /dev
tmpfs            496M   24K  496M   1% /dev/shm
tmpfs            496M  512K  496M   1% /run
tmpfs            496M     0  496M   0% /sys/fs/cgroup
tmpfs            100M     0  100M   0% /run/user/0
/dev/vdb          50G   53M   47G   1% /brick1
/dev/vdc          50G   53M   47G   1% /brick2
test01:/testvol   99G  105M   94G   1% /mnt

Create some test data

cd /mnt/
mkdir test2disk
cd test2disk/
touch {0..9}

Check the two nodes

test01$ ll /brick1/date/test2disk/
total 16
-rw-r--r-- 2 root root 0 Mar 19 11:49 2
-rw-r--r-- 2 root root 0 Mar 19 11:49 3
-rw-r--r-- 2 root root 0 Mar 19 11:49 4
-rw-r--r-- 2 root root 0 Mar 19 11:49 6

test02$ ll /brick1/date/test2disk/
total 24
-rw-r--r-- 2 root root 0 Mar 19 11:49 0
-rw-r--r-- 2 root root 0 Mar 19 11:49 1
-rw-r--r-- 2 root root 0 Mar 19 11:49 5
-rw-r--r-- 2 root root 0 Mar 19 11:49 7
-rw-r--r-- 2 root root 0 Mar 19 11:49 8
-rw-r--r-- 2 root root 0 Mar 19 11:49 9

Files were written to both bricks

Now add a new brick

$ gluster volume add-brick testvol test03:/brick1/date 
volume add-brick: success

Create new files

$ touch 1{0..9}

Then check again

test01$ ll /brick1/date/test2disk/
total 40
-rw-r--r-- 2 root root 0 Mar 19 12:05 10
-rw-r--r-- 2 root root 0 Mar 19 12:05 12
-rw-r--r-- 2 root root 0 Mar 19 12:05 14
-rw-r--r-- 2 root root 0 Mar 19 12:05 15
-rw-r--r-- 2 root root 0 Mar 19 12:05 16
-rw-r--r-- 2 root root 0 Mar 19 12:05 17
-rw-r--r-- 2 root root 0 Mar 19 11:49 2
-rw-r--r-- 2 root root 0 Mar 19 11:49 3
-rw-r--r-- 2 root root 0 Mar 19 11:49 4
-rw-r--r-- 2 root root 0 Mar 19 11:49 6
test02$ ll /brick1/date/test2disk/
total 40
-rw-r--r-- 2 root root 0 Mar 19 11:49 0
-rw-r--r-- 2 root root 0 Mar 19 11:49 1
-rw-r--r-- 2 root root 0 Mar 19 12:05 11
-rw-r--r-- 2 root root 0 Mar 19 12:05 13
-rw-r--r-- 2 root root 0 Mar 19 12:05 18
-rw-r--r-- 2 root root 0 Mar 19 12:05 19
-rw-r--r-- 2 root root 0 Mar 19 11:49 5
-rw-r--r-- 2 root root 0 Mar 19 11:49 7
-rw-r--r-- 2 root root 0 Mar 19 11:49 8
-rw-r--r-- 2 root root 0 Mar 19 11:49 9
test03$ ll /brick1/date/test2disk/
total 0

The newly added node still holds no data

Run a rebalance

$ gluster volume rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 3c4aef78-8211-462a-843c-4aca77f59349

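The two steps of a rebalance can also be run separately: fix-layout and migrate-data. A full rebalance is the combination of the two. Note that migrate-data was a subcommand in older releases; if your version rejects it, a plain start after fix-layout performs the data migration.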

gluster volume rebalance testvol fix-layout start
gluster volume rebalance testvol migrate-data start

Then check again

test01$ ll /brick1/date/test2disk/
total 16
-rw-r--r-- 2 root root 0 Mar 19 12:05 16
-rw-r--r-- 2 root root 0 Mar 19 12:05 18
-rw-r--r-- 2 root root 0 Mar 19 12:05 19
-rw-r--r-- 2 root root 0 Mar 19 11:49 7
test02$ ll /brick1/date/test2disk/
total 28
-rw-r--r-- 2 root root 0 Mar 19 11:49 0
-rw-r--r-- 2 root root 0 Mar 19 11:49 1
-rw-r--r-- 2 root root 0 Mar 19 12:05 11
-rw-r--r-- 2 root root 0 Mar 19 12:05 13
-rw-r--r-- 2 root root 0 Mar 19 11:49 5
-rw-r--r-- 2 root root 0 Mar 19 11:49 8
-rw-r--r-- 2 root root 0 Mar 19 11:49 9
test03$ ll /brick1/date/test2disk/
total 36
-rw-r--r-- 2 root root 0 Mar 19 12:05 10
-rw-r--r-- 2 root root 0 Mar 19 12:05 12
-rw-r--r-- 2 root root 0 Mar 19 12:05 14
-rw-r--r-- 2 root root 0 Mar 19 12:05 15
-rw-r--r-- 2 root root 0 Mar 19 12:05 17
-rw-r--r-- 2 root root 0 Mar 19 11:49 2
-rw-r--r-- 2 root root 0 Mar 19 11:49 3
-rw-r--r-- 2 root root 0 Mar 19 11:49 4
-rw-r--r-- 2 root root 0 Mar 19 11:49 6

After the rebalance the files have been redistributed across the bricks

To remove a brick, use remove-brick

$ gluster volume remove-brick testvol test01:/brick1/date start
volume remove-brick start: success
ID: 5d60df94-1da6-4dfe-98c6-86d4d1d9d9f8
$ ll /brick1/date/test2disk/
total 0

Then check again

test01$ ll /brick1/date/test2disk/
total 0
test02$ ll /brick1/date/test2disk/
total 40
-rw-r--r-- 2 root root 0 Mar 19 11:49 0
-rw-r--r-- 2 root root 0 Mar 19 11:49 1
-rw-r--r-- 2 root root 0 Mar 19 12:05 11
-rw-r--r-- 2 root root 0 Mar 19 12:05 13
-rw-r--r-- 2 root root 0 Mar 19 12:05 18
-rw-r--r-- 2 root root 0 Mar 19 12:05 19
-rw-r--r-- 2 root root 0 Mar 19 11:49 5
-rw-r--r-- 2 root root 0 Mar 19 11:49 7
-rw-r--r-- 2 root root 0 Mar 19 11:49 8
-rw-r--r-- 2 root root 0 Mar 19 11:49 9
test03$ ll /brick1/date/test2disk/
total 40
-rw-r--r-- 2 root root 0 Mar 19 12:05 10
-rw-r--r-- 2 root root 0 Mar 19 12:05 12
-rw-r--r-- 2 root root 0 Mar 19 12:05 14
-rw-r--r-- 2 root root 0 Mar 19 12:05 15
-rw-r--r-- 2 root root 0 Mar 19 12:05 16
-rw-r--r-- 2 root root 0 Mar 19 12:05 17
-rw-r--r-- 2 root root 0 Mar 19 11:49 2
-rw-r--r-- 2 root root 0 Mar 19 11:49 3
-rw-r--r-- 2 root root 0 Mar 19 11:49 4
-rw-r--r-- 2 root root 0 Mar 19 11:49 6

Commit the brick removal

$ gluster volume remove-brick testvol test01:/brick1/date  status 
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                4        0Bytes             4             0             0            completed        0:00:00
$ gluster volume remove-brick testvol test01:/brick1/date  commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success
Check the removed bricks to ensure all files are migrated.
If files with data are found on the brick path, copy them via a gluster mount point before re-purposing the removed brick. 
$ gluster volume info testvol

Volume Name: testvol
Type: Distribute
Volume ID: c82559eb-f7e0-453e-b712-b8c905ac6f46
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: test02:/brick1/date
Brick2: test03:/brick1/date
Options Reconfigured:
performance.client-io-threads: on
transport.address-family: inet
nfs.disable: on
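
Since commit can lose data if the migration has not finished, it is worth checking the status output before committing. A small sketch, assuming the status table keeps the layout shown above; migration_done is a made-up helper that reads the status text on stdin.

```shell
#!/usr/bin/env bash
# Sketch: succeed only when every node row of the remove-brick status
# output reports "completed".
# Intended use (not run here):
#   gluster volume remove-brick testvol test01:/brick1/date status | migration_done
migration_done() {
    awk '
        # skip the header line, the dashed ruler and blank lines
        /Rebalanced-files/ || /^ *--/ || NF == 0 { next }
        { rows++; if ($0 !~ /completed/) pending++ }
        END { exit (rows > 0 && pending == 0) ? 0 : 1 }
    '
}

# Status text taken from the run above (whitespace condensed).
migration_done <<'EOF' && echo "safe to commit"
     Node Rebalanced-files    size  scanned  failures  skipped     status  run time in h:m:s
---------      -----------  ------  -------  --------  -------  ---------  -----------------
localhost                4  0Bytes        4         0        0  completed            0:00:00
EOF
# prints: safe to commit
```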

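Migrating a brick

Open issue: this failed during testing. As the usage output below shows, this version only accepts commit force, not start.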

$ gluster  volume  replace-brick  testvol  test02:/brick1/date test01:/brick2/date  start

Usage:
volume replace-brick <VOLNAME> <SOURCE-BRICK> <NEW-BRICK> {commit force}

Delete the volume

$ gluster volume stop testvol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: testvol: success
$ gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: testvol: success

Set volume options

$ gluster volume set 

Usage:
volume set <VOLNAME> <KEY> <VALUE>

Option	Description	Default	Valid values
auth.allow	IP addresses allowed to access the volume	* (allow all)	IP address
cluster.min-free-disk	Free disk space threshold	10%	percentage
cluster.stripe-block-size	Stripe size	128KB	bytes
network.frame-timeout	Request timeout	1800s	0-1800
network.ping-timeout	Client wait time	42s	0-42
nfs.disable	Disable the NFS service	off	off/on
performance.io-thread-count	Number of I/O threads	16	0-65
performance.cache-refresh-timeout	Cache revalidation interval	1s	0-61
performance.cache-size	Read cache size	32MB	bytes

auth.allow has a counterpart, auth.reject

Replicated volume

$ gluster volume create testvol repl 2 test01:/brick1/date test02:/brick1/date
$ gluster volume info testvol

Volume Name: testvol
Type: Replicate
Volume ID: 0fca067b-ea58-4f49-9b25-f31b96a4c146
Status: Created
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: test01:/brick1/date
Brick2: test02:/brick1/date
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

test01$ ll /brick1/date/test2disk/
total 40
-rw-r--r-- 2 root root 0 Mar 19 15:57 0
-rw-r--r-- 2 root root 0 Mar 19 15:57 1
-rw-r--r-- 2 root root 0 Mar 19 15:57 2
-rw-r--r-- 2 root root 0 Mar 19 15:57 3
-rw-r--r-- 2 root root 0 Mar 19 15:57 4
-rw-r--r-- 2 root root 0 Mar 19 15:57 5
-rw-r--r-- 2 root root 0 Mar 19 15:57 6
-rw-r--r-- 2 root root 0 Mar 19 15:57 7
-rw-r--r-- 2 root root 0 Mar 19 15:57 8
-rw-r--r-- 2 root root 0 Mar 19 15:57 9
test02$ ll /brick1/date/test2disk/
total 40
-rw-r--r-- 2 root root 0 Mar 19 15:57 0
-rw-r--r-- 2 root root 0 Mar 19 15:57 1
-rw-r--r-- 2 root root 0 Mar 19 15:57 2
-rw-r--r-- 2 root root 0 Mar 19 15:57 3
-rw-r--r-- 2 root root 0 Mar 19 15:57 4
-rw-r--r-- 2 root root 0 Mar 19 15:57 5
-rw-r--r-- 2 root root 0 Mar 19 15:57 6
-rw-r--r-- 2 root root 0 Mar 19 15:57 7
-rw-r--r-- 2 root root 0 Mar 19 15:57 8
-rw-r--r-- 2 root root 0 Mar 19 15:57 9

Yan, big data engineer at Huoyan Zhengxin (火眼征信)