CoreDNS resolution failures when forwarding to upstream DNS
Problem description
A service newly deployed on Kubernetes fails to resolve some domain names.
Observed symptoms
$ nslookup api-push.vivo.com.cn
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find api-push.vivo.com.cn: SERVFAIL
$ nslookup www.sina.com.cn
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find www.sina.com.cn: SERVFAIL
$ nslookup www.baidu.com
Server: 10.96.0.10
Address: 10.96.0.10#53
Non-authoritative answer:
www.baidu.com canonical name = www.a.shifen.com.
Name: www.a.shifen.com
Address: 180.101.49.11
Name: www.a.shifen.com
Address: 180.101.49.12
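Here www.baidu.com resolves normally, while the other two domains do not.
On the host, all three domains resolve normally.
Troubleshooting
Packet capture
A capture on the host shows that the upstream DNS servers return correct answers, but the same query is re-sent many times within the lifetime of a single DNS request.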
$ tcpdump -nnn port 53
11:16:23.111282 IP 192.168.1.200.53062 > 192.168.1.100.53: 30973+ A? api-push.vivo.com.cn. (38)
11:16:23.113128 IP 192.168.1.100.53 > 192.168.1.200.53062: 30973 20/0/0 A 182.92.255.169, A 112.126.11.70, A 112.126.122.234, A 39.105.250.172, A 112.126.122.207, A 112.126.12.147, A 112.126.123.53, A 39.106.77.118, A 39.105.252.165, A 39.105.250.170, A 112.126.12.58, A 39.107.240.232, A 39.105.251.19, A 39.106.66.100, A 39.105.250.174, A 123.57.207.113, A 112.126.12.13, A 39.105.253.59, A 39.105.180.52, A 39.105.250.169 (758)
11:16:23.113434 IP 192.168.1.200.17143 > 192.168.1.101.53: 30973+ A? api-push.vivo.com.cn. (38)
11:16:23.113518 IP 192.168.1.200.40943 > 192.168.1.100.53: 64593+ NS? . (17)
11:16:23.115875 IP 192.168.1.101.53 > 192.168.1.200.17143: 30973 20/0/0 A 39.105.252.119, A 39.105.252.228, A 39.105.252.165, A 39.106.77.118, A 39.105.180.52, A 182.92.255.169, A 112.126.122.207, A 39.105.253.59, A 39.105.250.169, A 123.57.207.113, A 112.126.12.147, A 39.105.250.172, A 39.106.66.100, A 39.105.250.174, A 112.126.12.13, A 39.105.252.250, A 39.105.251.19, A 39.107.240.232, A 112.126.11.70, A 39.105.250.170 (758)
11:16:23.115899 IP 192.168.1.100.53 > 192.168.1.200.40943: 64593 13/0/0 NS g.root-servers.net., NS k.root-servers.net., NS a.root-servers.net., NS e.root-servers.net., NS d.root-servers.net., NS m.root-servers.net., NS c.root-servers.net., NS b.root-servers.net., NS h.root-servers.net., NS i.root-servers.net., NS l.root-servers.net., NS f.root-servers.net., NS j.root-servers.net. (420)
11:16:23.116168 IP 192.168.1.200.39677 > 192.168.1.102.53: 30973+ A? api-push.vivo.com.cn. (38)
11:16:23.116256 IP 192.168.1.200.53370 > 192.168.1.101.53: 183+ NS? . (17)
11:16:23.118521 IP 192.168.1.102.53 > 192.168.1.200.39677: 30973 20/0/0 A 39.105.253.59, A 39.105.252.250, A 112.126.12.13, A 39.105.250.174, A 39.105.252.165, A 112.126.12.147, A 39.106.77.118, A 39.105.252.228, A 39.105.252.119, A 112.126.122.207, A 112.126.11.70, A 39.105.250.172, A 39.105.250.170, A 39.105.251.19, A 182.92.255.169, A 123.57.207.113, A 39.107.240.232, A 39.106.66.100, A 39.105.180.52, A 39.105.250.169 (758)
11:16:23.118535 IP 192.168.1.101.53 > 192.168.1.200.53370: 183 13/0/0 NS b.root-servers.net., NS h.root-servers.net., NS i.root-servers.net., NS l.root-servers.net., NS f.root-servers.net., NS j.root-servers.net., NS g.root-servers.net., NS k.root-servers.net., NS a.root-servers.net., NS e.root-servers.net., NS d.root-servers.net., NS m.root-servers.net., NS c.root-servers.net. (420)
11:16:23.118745 IP 192.168.1.200.48507 > 192.168.1.100.53: 30973+ A? api-push.vivo.com.cn. (38)
11:16:23.118837 IP 192.168.1.200.27376 > 192.168.1.102.53: 40007+ NS? . (17)
11:16:23.119816 IP 192.168.1.102.53 > 192.168.1.200.27376: 40007 13/0/0 NS j.root-servers.net., NS g.root-servers.net., NS k.root-servers.net., NS a.root-servers.net., NS e.root-servers.net., NS d.root-servers.net., NS m.root-servers.net., NS c.root-servers.net., NS b.root-servers.net., NS h.root-servers.net., NS i.root-servers.net., NS l.root-servers.net., NS f.root-servers.net. (420)
11:16:23.120716 IP 192.168.1.100.53 > 192.168.1.200.48507: 30973 20/0/0 A 39.105.250.174, A 39.105.180.52, A 39.105.252.250, A 112.126.12.147, A 39.107.240.232, A 39.105.250.172, A 39.105.252.228, A 39.105.250.170, A 39.105.251.19, A 39.105.253.59, A 39.105.252.119, A 112.126.122.207, A 123.57.207.113, A 39.105.250.169, A 39.106.66.100, A 39.106.77.118, A 112.126.12.13, A 39.105.252.165, A 112.126.11.70, A 182.92.255.169 (758)
11:16:23.120955 IP 192.168.1.200.4269 > 192.168.1.101.53: 30973+ A? api-push.vivo.com.cn. (38)
11:16:23.120983 IP 192.168.1.200.30322 > 192.168.1.100.53: 9184+ NS? . (17)
11:16:23.121808 IP 192.168.1.100.53 > 192.168.1.200.30322: 9184 13/0/0 NS j.root-servers.net., NS g.root-servers.net., NS k.root-servers.net., NS a.root-servers.net., NS e.root-servers.net., NS d.root-servers.net., NS m.root-servers.net., NS c.root-servers.net., NS b.root-servers.net., NS h.root-servers.net., NS i.root-servers.net., NS l.root-servers.net., NS f.root-servers.net. (420)
11:16:23.122700 IP 192.168.1.101.53 > 192.168.1.200.4269: 30973 20/0/0 A 182.92.255.169, A 112.126.11.70, A 112.126.12.13, A 112.126.123.53, A 39.105.252.165, A 39.105.250.174, A 112.126.12.147, A 112.126.122.207, A 39.105.180.52, A 112.126.122.234, A 39.106.66.100, A 112.126.12.58, A 39.105.250.169, A 39.105.251.19, A 39.105.253.59, A 39.105.250.170, A 39.105.250.172, A 39.107.240.232, A 39.106.77.118, A 123.57.207.113 (758)
11:16:23.122912 IP 192.168.1.200.58375 > 192.168.1.102.53: 30973+ A? api-push.vivo.com.cn. (38)
...
CoreDNS logs
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 api-push.vivo.com.cn. A: dns: buffer size too small
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
[ERROR] plugin/errors: 2 www.sina.com.cn. A: dns: overflow unpacking uint32
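According to the "buffer size too small" issue, the likely cause is that the DNS response carries too many records to fit in the receive buffer.
Solution
See the official documentation for the bufsize plugin.
bufsize was introduced in CoreDNS 1.6.6, and we run 1.6.7. The code where the size takes effect is plugin/forward/connect.go in CoreDNS: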
func (p *Proxy) Connect(ctx context.Context, state request.Request, opts options) (*dns.Msg, error) {
	...
	pc, cached, err := p.transport.Dial(proto)
	if err != nil {
		return nil, err
	}
	// Set buffer size correctly for this client.
	pc.c.UDPSize = uint16(state.Size())
	if pc.c.UDPSize < 512 {
		pc.c.UDPSize = 512
	}
	...
}
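UDP is used because forward only relays the request and, by default, does not switch protocols. The default buffer size is 512 bytes; after raising it to the maximum of 4096, resolution recovered.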
apiVersion: v1
data:
  Corefile: |
    .:53 {
        bufsize 4096
        errors
        health {
            lameduck 5s
        }
        ready
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30 {
            success 10240
            denial 51200
        }
        template ANY AAAA {
            rcode NXDOMAIN
        }
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
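The difference between the 512-byte default and bufsize 4096 can be sketched in a few lines of Go. This is only an illustration of the clamp shown in Connect, applied to the 758-byte response from the capture; the helpers clientUDPSize and fits are hypothetical names, not CoreDNS functions, and the sketch assumes a response can only be unpacked when it fits entirely in the receive buffer.

```go
package main

import "fmt"

// minUDPSize is the classic DNS-over-UDP payload limit (RFC 1035).
const minUDPSize = 512

// clientUDPSize mirrors the clamp in Connect: the advertised size is used
// unless it is below 512, in which case 512 is used.
// Hypothetical helper, not part of CoreDNS.
func clientUDPSize(advertised uint16) uint16 {
	if advertised < minUDPSize {
		return minUDPSize
	}
	return advertised
}

// fits reports whether a response of respLen bytes fits in the receive
// buffer derived from the advertised size (assumption of this sketch).
func fits(respLen int, advertised uint16) bool {
	return respLen <= int(clientUDPSize(advertised))
}

func main() {
	const respLen = 758 // response size reported by tcpdump in the capture
	for _, size := range []uint16{0, 512, 4096} {
		fmt.Printf("advertised=%d buffer=%d fits=%v\n",
			size, clientUDPSize(size), fits(respLen, size))
	}
}
```

With the default the 758-byte answer does not fit, which is consistent with the unpack errors in the log; with 4096 it does.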
The endless retrying comes from this code:
/plugin/forward/forward.go
func (f *Forward) ServeDNS(ctx context.Context, w dns.ResponseWriter, r *dns.Msg) (int, error) {
	...
	opts := f.opts
	for {
		ret, err = proxy.Connect(ctx, state, opts)
		if err == ErrCachedClosed { // Remote side closed conn, can only happen with TCP.
			continue
		}
		// Retry with TCP if truncated and prefer_udp configured.
		if ret != nil && ret.Truncated && !opts.forceTCP && opts.preferUDP {
			opts.forceTCP = true
			continue
		}
		break
	}
	...
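Because the response overflows the buffer, unpacking fails and ret is nil. The loop shown above exits in that case, so the retries appear to come from the surrounding code elided here: the request is tried again against the next upstream until it times out. This matches the capture, where the same query rotates across 192.168.1.100, 192.168.1.101 and 192.168.1.102.
If a domain's responses ever exceed even 4096 bytes, configure the force_tcp option of forward so that queries for that domain go over TCP; see the forward plugin documentation.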
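A per-domain TCP setup could look roughly like the following Corefile fragment, added next to the existing .:53 block. It is only a sketch: the zone vivo.com.cn is used as an example, the plugin list is trimmed, and it assumes the upstream servers in /etc/resolv.conf accept DNS over TCP.

```
# Hypothetical server block for a single zone; adjust the zone name as needed.
vivo.com.cn:53 {
    errors
    cache 30
    forward . /etc/resolv.conf {
        force_tcp
    }
}
```

Queries for names under that zone are handled by this more specific block, while everything else still goes through .:53 over UDP.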
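Root cause investigation
For the production service, the DNS responses reaching CoreDNS use message compression and are about 400 bytes.
Converting the pointer bytes between bases: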
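c00c (hex) = 1100 0000 0000 1100 (binary)
- The top two bits, 11, mark a compression pointer
- The remaining 14 bits are the offset: binary 1100 is decimal 12, meaning byte offset 12 in the DNS message; see section 4.1.4, Message Compression, of https://www.ietf.org/rfc/rfc1035.txt
- 08 is the length of this label, api-push (8 characters); the 61 that follows is the letter a, which you can check in Python with chr(0x61)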
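The bit arithmetic in the list above can be checked with a short Go sketch. The function decodePointer is a hypothetical helper written for this illustration; it only handles the two-byte pointer form described in RFC 1035 section 4.1.4.

```go
package main

import "fmt"

// decodePointer interprets two bytes as an RFC 1035 compression pointer.
// It returns the 14-bit offset and true when the top two bits are both 1;
// otherwise the first byte is an ordinary label length and ok is false.
func decodePointer(b0, b1 byte) (offset int, ok bool) {
	if b0&0xC0 != 0xC0 {
		return 0, false
	}
	return int(b0&0x3F)<<8 | int(b1), true
}

func main() {
	offset, ok := decodePointer(0xC0, 0x0C)
	fmt.Printf("bits=%016b pointer=%v offset=%d\n", 0xC00C, ok, offset)

	// 0x08 is a plain label length, and 0x61 is the first label byte.
	_, isPtr := decodePointer(0x08, 0x61)
	fmt.Printf("0x08 is pointer: %v, 0x61 is %q\n", isPtr, rune(0x61))
}
```

For c0 0c the decoded offset is 12, the position right after the 12-byte DNS header where the question name begins.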
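Offset 12 is the first byte after the 12-byte DNS header, which is where the question name starts.
This way 2 bytes (c0 0c) replace 22 bytes (the wire form 8api-push4vivo3com2cn0 is 22 bytes long), saving 20 bytes per record. With 20 A records in the answer, that saves 400 bytes in total.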
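The arithmetic can be reproduced with a small Go sketch. It assumes a plain A-record response with no EDNS0 OPT record (the capture shows 20/0/0, so no additional section) and counts only header, question and answers; encodeName and responseSize are hypothetical helpers for this illustration and do not validate label lengths.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeName converts a dotted name to DNS wire format: every label is
// prefixed by its length, and a zero byte terminates the name.
func encodeName(name string) []byte {
	name = strings.TrimSuffix(name, ".")
	if name == "" {
		return []byte{0} // root name
	}
	var out []byte
	for _, label := range strings.Split(name, ".") {
		out = append(out, byte(len(label)))
		out = append(out, label...)
	}
	return append(out, 0)
}

// responseSize estimates the size of a response carrying `answers` A records
// for `name`: header + question + answers, without an OPT record.
func responseSize(name string, answers int, compressed bool) int {
	const header = 12      // fixed DNS header
	const qFixed = 4       // QTYPE + QCLASS
	const rrFixed = 10 + 4 // TYPE, CLASS, TTL, RDLENGTH + 4-byte IPv4 address
	nameLen := len(encodeName(name))
	ownerLen := nameLen
	if compressed {
		ownerLen = 2 // a pointer such as c0 0c replaces the full name
	}
	return header + nameLen + qFixed + answers*(ownerLen+rrFixed)
}

func main() {
	const name = "api-push.vivo.com.cn"
	fmt.Println("name bytes:  ", len(encodeName(name)))
	fmt.Println("query:       ", responseSize(name, 0, false))
	fmt.Println("compressed:  ", responseSize(name, 20, true))
	fmt.Println("uncompressed:", responseSize(name, 20, false))
}
```

Under these assumptions the query comes to 38 bytes and the uncompressed response to 758 bytes, the same sizes tcpdump reports, and the compressed response is 400 bytes smaller.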
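The other DNS server, developed in-house, does not compress its responses: the packet is about 800 bytes, which is 400 bytes more than the compressed one and exceeds the 512-byte default.
Most likely it does not set the Compress field; see https://github.com/miekg/dns/blob/68df4402de4a2303c8dad5aea62450485d589d81/msg.go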
type Msg struct {
	MsgHdr
	Compress bool       `json:"-"` // If true, the message will be compressed when converted to wire format.
	Question []Question // Holds the RR(s) of the question section.
	Answer   []RR       // Holds the RR(s) of the answer section.
	Ns       []RR       // Holds the RR(s) of the authority section.
	Extra    []RR       // Holds the RR(s) of the additional section.
}