Kubernetes cluster node NotReady

Date: Jan. 4, 2021

Problem investigation

At the time, k8s worker nodes would occasionally go NotReady.

Checking the kubelet logs with journalctl -u kubelet.service showed a large number of PLEG errors:

Dec 29 19:29:25 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:25.008549   47927 pod_workers.go:191] Error syncing pod 763fb9c3-8fe1-423e-ad2d-503935380b9c ("gridshop-68584886f-ncq2v_stra
Dec 29 19:29:28 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:28.007247   47927 kubelet.go:1845] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m0.8692148
Dec 29 19:29:28 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:28.107507   47927 kubelet.go:1845] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m0.9694697
Dec 29 19:29:28 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:28.307701   47927 kubelet.go:1845] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m1.1696783
Dec 29 19:29:28 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:28.612767   47927 remote_runtime.go:351] ExecSync 823e12efc6c7570011b78bf5df26eca6828da661f268025a63c97b58122dc7c5 './grpc_h
Dec 29 19:29:28 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:28.692309   47927 remote_runtime.go:351] ExecSync 823e12efc6c7570011b78bf5df26eca6828da661f268025a63c97b58122dc7c5 './grpc_h
Dec 29 19:29:28 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:28.707861   47927 kubelet.go:1845] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m1.5698356
Dec 29 19:29:29 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:29.508105   47927 kubelet.go:1845] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m2.3700767
Dec 29 19:29:31 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:31.108349   47927 kubelet.go:1845] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m3.9703190
Dec 29 19:29:34 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:34.309115   47927 kubelet.go:1845] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m7.1710830
Dec 29 19:29:35 k8s-worker01.whysdomain.com kubelet[47927]: I1229 19:29:35.715940   47927 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-12-29 19:29:35.7158
Dec 29 19:29:39 k8s-worker01.whysdomain.com kubelet[47927]: E1229 19:29:39.309538   47927 kubelet.go:1845] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m12.171494
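
A quick way to gauge how often this happens is to count the PLEG-unhealthy lines over a recent window (a sketch; adjust the time range as needed):

$ journalctl -u kubelet.service --since "1 hour ago" | grep -c "PLEG is not healthy"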

PLEG trouble points at the container runtime, i.e. Docker. Sending the docker daemon a USR1 signal makes it dump its goroutine stacks under /var/run/docker. The daemon's main job is to accept HTTP requests, route them and hand them to handler functions, and those handlers can end up waiting on a mutex.
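
If the daemon PID is not already at hand, it can be looked up first (a sketch, assuming docker runs under systemd):

$ pidof dockerd
$ systemctl show --property MainPID docker.service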

$ kill -USR1 47534
$ ll /run/docker/
-rw-r--r-- 1 root root 1672003 Dec 29 22:41 goroutine-stacks-2020-12-29T224154+0800.log
-rw-r--r-- 1 root root 1681425 Dec 29 22:48 goroutine-stacks-2020-12-29T224815+0800.log
-rw-r--r-- 1 root root 1681441 Dec 29 22:48 goroutine-stacks-2020-12-29T224833+0800.log
-rw-r--r-- 1 root root 1684623 Dec 29 22:51 goroutine-stacks-2020-12-29T225121+0800.log
-rw-r--r-- 1 root root 1685505 Dec 29 22:51 goroutine-stacks-2020-12-29T225131+0800.log
-rw-r--r-- 1 root root 1687790 Dec 29 22:52 goroutine-stacks-2020-12-29T225233+0800.log
-rw-r--r-- 1 root root 1687833 Dec 29 22:53 goroutine-stacks-2020-12-29T225306+0800.log
-rw-r--r-- 1 root root 1710028 Dec 29 23:08 goroutine-stacks-2020-12-29T230833+0800.log

The dump contains a large amount of lock-related output:

$ grep mux /run/docker/goroutine-stacks-2020-12-29T224154+0800.log
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000bc8300, 0x55f4083d27c0, 0xc002293b20, 0xc002b80900)
    /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:212 +0xe5
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000bc8300, 0x55f4083d27c0, 0xc00c222e00, 0xc00a7a1b00)
    /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:212 +0xe5
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000bc8300, 0x55f4083d27c0, 0xc009d79dc0, 0xc00964c200)
    /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:212 +0xe5
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000bc8300, 0x55f4083d27c0, 0xc0041d6e00, 0xc0036f7100)
    /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:212 +0xe5
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000bc8300, 0x55f4083d27c0, 0xc00253eee0, 0xc001f0eb00)
    /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:212 +0xe5
...
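
A rough measure of how many HTTP handler goroutines are stuck is simply to count the ServeHTTP frames in the dump (a sketch):

$ grep -Fc 'mux.(*Router).ServeHTTP' /run/docker/goroutine-stacks-2020-12-29T224154+0800.log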

Dump containerd's goroutine stacks the same way with kill -SIGUSR1 <containerd-pid>; the confirmation line shows up in /var/log/messages.
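
Roughly (a sketch, assuming pidof resolves the right process):

$ kill -SIGUSR1 "$(pidof containerd)"
$ grep "goroutine stack dump" /var/log/messages | tail -n 1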

The output in /var/log/messages is written without line breaks; the last line reads:

Dec 30 16:13:22 VM-49-5-centos containerd: time="2020-12-30T16:13:22.478162686+08:00" level=info msg="goroutine stack dump written to /tmp/containerd.46182.stacks.log"

Looking at the dump under /tmp/, there are a lot of goroutines waiting:

$ grep -i process /tmp/containerd.46182.stacks.log
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).processStreamingRPC(0xc000510900, 0x55ff2140be20, 0xc0004d9b00, 0xc000598300, 0xc00042a3f0, 0x55ff21f79fc0, 0x0, 0x0, 0x0)
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).processStreamingRPC(0xc000510900, 0x55ff2140be20, 0xc0004d9500, 0xc00023a000, 0xc00042a3f0, 0x55ff21f79fc0, 0x0, 0x0, 0x0)
os.(*Process).blockUntilWaitable(0xc0005a8360, 0xc0005cdec0, 0x55ff1f9d9932, 0x0)
os.(*Process).wait(0xc0005a8360, 0x1, 0x0, 0x0)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0004863f0, 0x0, 0x0, 0x0)
os.(*Process).wait(0xc0004863f0, 0x0, 0x0, 0x0)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0002d2990, 0xc0005c6eb0, 0x55ff1f8bfc1e, 0xc000510f00)
os.(*Process).wait(0xc0002d2990, 0x0, 0x61746e6f632e6f69, 0x6e732e6472656e69)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0001f0090, 0x0, 0x0, 0x0)
os.(*Process).wait(0xc0001f0090, 0x0, 0x0, 0x0)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0004863c0, 0x0, 0x0, 0x0)
os.(*Process).wait(0xc0004863c0, 0x0, 0x0, 0x0)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0003ec480, 0x0, 0x0, 0x0)
os.(*Process).wait(0xc0003ec480, 0x0, 0x0, 0x1)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0003ecd80, 0xc0006fb6c0, 0x55ff1f9d9932, 0x0)
os.(*Process).wait(0xc0003ecd80, 0x55ff21391f01, 0x55ff210ec1a0, 0xc0006f9b60)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0001f0510, 0x0, 0x0, 0x0)
os.(*Process).wait(0xc0001f0510, 0x0, 0x0, 0x0)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0000449c0, 0xc0007064e0, 0x55ff22000420, 0xffffffffffffffff)
os.(*Process).wait(0xc0000449c0, 0x55ff21391fb0, 0x55ff210ec1a0, 0xc000706540)
os.(*Process).Wait(...)
os.(*Process).blockUntilWaitable(0xc0005a87b0, 0x0, 0x0, 0x0)
os.(*Process).wait(0xc0005a87b0, 0x0, 0x0, 0x0)
os.(*Process).Wait(...)
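
Most of these goroutines are sitting in os.(*Process).Wait, i.e. waiting for child processes (the runc execs shown below) to exit; counting the blocked waits gives a rough number (a sketch):

$ grep -c 'blockUntilWaitable' /tmp/containerd.46182.stacks.log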

Root cause

The kubelet periodically syncs node status (covering both the node and its containers) to the apiserver.

Inside the kubelet, PLEG periodically relists container state; the health threshold is 3 minutes, and once it is exceeded the kubelet reports the node as NotReady to the apiserver.

PLEG is the Pod Lifecycle Event Generator, which the kubelet uses to watch the state of the container runtime, i.e. Docker.
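
The same condition can be read back from the apiserver; the Ready condition carries the PLEG message the kubelet logged (a sketch; the node name is the one from the logs above):

$ kubectl describe node k8s-worker01.whysdomain.com | grep -A 10 Conditions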

Docker is made up of the daemon, containerd, containerd-shim and runC:

  • daemon (dockerd): accepts and routes API requests and handles the high-level management of images, containers, networks and volumes, delegating container lifecycle to containerd
  • containerd: exposes a gRPC interface to the daemon and manages the container lifecycle on the node (including images, storage, networking, etc.)
  • containerd-shim: starts and stops containers and monitors their state
  • runC: creates the container process; once the container is running, runC exits

On the affected node, the containerd-shim of the affected container and its runc children looked like this:
$ ps -ef | grep containerd
....
root       3283  46182  0 12月29 ?      00:00:33 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/54c4a1baac74b5026a8439a0bdfbf49afe302ea2112a1930acc700baec29ea1c -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc -systemd-cgroup
root      91251   3283  0 00:50 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/54c4a1baac74b5026a8439a0bdfbf49afe302ea2112a1930acc700baec29ea1c/log.json --log-format json --systemd-cgroup exec --process /tmp/runc-process856557927 --detach --pid-file /run/containerd/io.containerd.runtime.v1.linux/moby/54c4a1baac74b5026a8439a0bdfbf49afe302ea2112a1930acc700baec29ea1c/cf510b693d00cb1962e474ad5086267423f49ddc987f890ca19a08889362006c.pid 54c4a1baac74b5026a8439a0bdfbf49afe302ea2112a1930acc700baec29ea1c
root      91265   3283  8 00:50 ?        01:20:05 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/54c4a1baac74b5026a8439a0bdfbf49afe302ea2112a1930acc700baec29ea1c/log.json --log-format json --systemd-cgroup update --resources - 54c4a1baac74b5026a8439a0bdfbf49afe302ea2112a1930acc700baec29ea1c
work      91267  91251  0 00:50 ?        00:00:00 runc init

This containerd-shim had, at the same time, started:

  • runc exec
  • runc update --resources - (reading the resource spec from standard input)
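
What the leftover runc init is actually blocked on can be checked through /proc (a sketch; the PID is the one from the ps output above and the commands need root):

$ cat /proc/91267/stack
$ ls -l /proc/91267/fd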

Contents of the /tmp/runc-process856557927 file:

{"user":{"uid":1000,"gid":1000},"args":["./grpc_health_probe","--addr=:8609"],"env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","HOSTNAME=nav-pause-5844f4558b-gd57s","ContainerPort=8609","LivenessProbe=/","ReadinessProbe=/","Run_Env=Prod","DATA_PAUSE_PORT=tcp://10.112.56.102:80","NAV_SERVICE_PORT_HTTP=80","KUBERNETES_PORT=tcp://10.112.0.1:443","DATA_PAUSE_PORT_80_TCP=tcp://10.112.56.102:80","DATA_SERVICE_PORT=80","DATA_PORT_80_TCP_PORT=80","NAV_SERVICE_HOST=10.121.36.204","NAV_SERVICE_PORT=80","NAV_PAUSE_PORT=tcp://10.117.82.106:80","KUBERNETES_SERVICE_PORT_HTTPS=443","DATA_PAUSE_PORT_80_TCP_PORT=80","DATA_SERVICE_HOST=10.127.64.120","NAV_PORT_80_TCP_PROTO=tcp","NAV_PAUSE_PORT_80_TCP_PORT=80","NAV_PAUSE_PORT_80_TCP_ADDR=10.117.82.106","DATA_SERVICE_PORT_HTTP=80","NAV_PORT_80_TCP_PORT=80","NAV_PAUSE_SERVICE_PORT=80","DATA_PORT_80_TCP=tcp://10.127.64.120:80","DATA_PORT_80_TCP_ADDR=10.127.64.120","NAV_PORT_80_TCP_ADDR=10.121.36.204","NAV_PAUSE_SERVICE_PORT_HTTP=80","KUBERNETES_SERVICE_HOST=10.112.0.1","KUBERNETES_PORT_443_TCP_PORT=443","KUBERNETES_SERVICE_PORT=443","DATA_PAUSE_SERVICE_HOST=10.112.56.102","DATA_PAUSE_SERVICE_PORT=80","DATA_PAUSE_SERVICE_PORT_HTTP=80","DATA_PAUSE_PORT_80_TCP_PROTO=tcp","DATA_PAUSE_PORT_80_TCP_ADDR=10.112.56.102","DATA_PORT_80_TCP_PROTO=tcp","NAV_PAUSE_PORT_80_TCP_PROTO=tcp","KUBERNETES_PORT_443_TCP=tcp://10.112.0.1:443","KUBERNETES_PORT_443_TCP_PROTO=tcp","KUBERNETES_PORT_443_TCP_ADDR=10.112.0.1","DATA_PORT=tcp://10.127.64.120:80","NAV_PORT=tcp://10.121.36.204:80","NAV_PORT_80_TCP=tcp://10.121.36.204:80","NAV_PAUSE_SERVICE_HOST=10.117.82.106","NAV_PAUSE_PORT_80_TCP=tcp://10.117.82.106:80"],"cwd":"/home/work/nav","capabilities":{"bounding":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"inheritable":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"]},"oomScoreAdj":-998}

The same exec can be replayed directly from the command line with runc, as sketched below.
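
A sketch of replaying it by hand, with the root directory, process file and container ID taken from the ps output above:

$ runc --root /var/run/docker/runtime-runc/moby \
      exec --process /tmp/runc-process856557927 \
      54c4a1baac74b5026a8439a0bdfbf49afe302ea2112a1930acc700baec29ea1c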

The gRPC health check was implemented as an exec probe, which has to enter the container and run the corresponding command. The kubelet drives this through the runtime, which ends up invoking runc against the container, and some of those probe runc invocations never completed.
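
The probe behind that process file was presumably defined roughly like this (a sketch reconstructed from the command and port in the JSON above):

livenessProbe:
  exec:
    command: ["./grpc_health_probe", "--addr=:8609"]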

Resolution

The runc processes had not exited normally. After killing them, the next stack dump is much smaller and contains no lock-related entries:

$ kill 91251 91265 91267
$ kill -USR1 47534
$ ll /run/docker/

-rw-r--r--  1 root root 1672003 Dec 29 22:41 goroutine-stacks-2020-12-29T224154+0800.log
-rw-r--r--  1 root root 1681425 Dec 29 22:48 goroutine-stacks-2020-12-29T224815+0800.log
-rw-r--r--  1 root root 1681441 Dec 29 22:48 goroutine-stacks-2020-12-29T224833+0800.log
-rw-r--r--  1 root root 1684623 Dec 29 22:51 goroutine-stacks-2020-12-29T225121+0800.log
-rw-r--r--  1 root root 1685505 Dec 29 22:51 goroutine-stacks-2020-12-29T225131+0800.log
-rw-r--r--  1 root root 1687790 Dec 29 22:52 goroutine-stacks-2020-12-29T225233+0800.log
-rw-r--r--  1 root root 1687833 Dec 29 22:53 goroutine-stacks-2020-12-29T225306+0800.log
-rw-r--r--  1 root root 1710028 Dec 29 23:08 goroutine-stacks-2020-12-29T230833+0800.log
-rw-r--r--  1 root root  106479 Dec 29 23:10 goroutine-stacks-2020-12-29T231044+0800.log
$ grep mux goroutine-stacks-2020-12-29T231044+0800.log

The node went back to Ready.

To prevent this kind of PLEG problem, the following improvements were made:

  • switched the probes from exec to tcp (a sketch of the change follows this list)
  • added monitoring of the kubelet PLEG relist time
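
For the first item, the change amounts to replacing the exec probe with a tcpSocket probe on the same port (a sketch; the port is the one grpc_health_probe was pointed at):

livenessProbe:
  tcpSocket:
    port: 8609

A tcpSocket probe only verifies that the port accepts connections, but it avoids spawning a runc exec inside the container on every probe, which is what was piling up here.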

Monitoring data

      - record: Kubelet_PLEGRlist_Gt4s_Num
        ## expr:  sum by (instance,cluster) (rate(kubelet_pleg_relist_interval_seconds_bucket{le="+Inf"}[10m])) - sum by (instance,cluster) (rate(kubelet_pleg_relist_interval_seconds_bucket{le="5"}[10m]))
        expr: (sum by (instance,cluster) (kubelet_pleg_relist_interval_seconds_bucket{le="+Inf"}) - sum by (instance,cluster) (kubelet_pleg_relist_interval_seconds_bucket{le="5"}))-(sum by (instance,cluster) (kubelet_pleg_relist_interval_seconds_bucket{le="+Inf"} offset 1m) - sum by (instance,cluster) (kubelet_pleg_relist_interval_seconds_bucket{le="5"} offset 1m))
        labels:
          common: "Kubelet的PLEG进行rlist操作时间超过4s的数量

Reference: https://www.cnblogs.com/ryanyangcs/p/12294899.html