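1. Overview
Node Feature Discovery (NFD) is a project originally created by Intel that helps a Kubernetes cluster manage node resources more intelligently. It detects the features and capabilities of each node (for example the CPU model and instruction-set flags, the kernel version, and PCI devices such as GPUs) and reports them as labels to the Kubernetes API (kube-apiserver), which applies them to the Node objects. The scheduler (kube-scheduler) can then use these labels to pick the node best suited to run the pods of a particular workload.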
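To make the scheduling idea concrete, here is a minimal Python sketch of the equality matching that a pod's nodeSelector applies to node labels. It only models the concept; the real kube-scheduler applies many more filters. The two label keys are taken from the node output shown later in this article, and the function and variable names are illustrative.

```python
def node_matches(node_selector, node_labels):
    """Return True when every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Labels copied from the node shown later in this article.
node_labels = {
    "kubernetes.io/os": "linux",
    "feature.node.kubernetes.io/kernel-version.major": "3",
    "feature.node.kubernetes.io/pci-0300_15ad.present": "true",
}

wants_pci = {"feature.node.kubernetes.io/pci-0300_15ad.present": "true"}
wants_kernel5 = {"feature.node.kubernetes.io/kernel-version.major": "5"}

print(node_matches(wants_pci, node_labels))      # selector satisfied by this node
print(node_matches(wants_kernel5, node_labels))  # node reports major=3, so no match
```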
GitHub: https://github.com/kubernetes-sigs/node-feature-discovery
Docs: https://kubernetes-sigs.github.io/node-feature-discovery/master/get-started/index.html
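2. Component architecture
NFD consists of two components, nfd-master and nfd-worker:
nfd-master: a Deployment pod responsible for talking to the Kubernetes API server. It receives node features from nfd-worker and updates the Node objects (labels and annotations) accordingly.
nfd-worker: a DaemonSet pod that detects the features of the node it runs on and passes them to nfd-master. One nfd-worker should run on every node.
The hardware feature sources that can be detected include: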
- cpu
- iommu
- kernel
- memory
- network
- pci
- storage
- system
- usb
- custom (rule-based custom features)
- local (hooks for user-specific features)
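The source names in this list reappear as the first part of the label names that NFD publishes, as the node output later in this article shows (for example cpu-model.family or kernel-version.major). The following Python sketch groups label keys by source under that assumption. It is illustrative only: labels created by the custom or local sources may not follow this naming, and the helper name is hypothetical.

```python
from collections import defaultdict

NFD_PREFIX = "feature.node.kubernetes.io/"


def group_by_source(label_keys):
    """Map each feature source to the feature names found under it."""
    groups = defaultdict(list)
    for key in label_keys:
        if not key.startswith(NFD_PREFIX):
            continue  # not an NFD label in the default namespace
        # Assumption: the text before the first "-" is the feature source.
        source, _, feature = key[len(NFD_PREFIX):].partition("-")
        groups[source].append(feature)
    return dict(groups)


# Keys copied from the node output shown later in this article.
sample_keys = [
    "kubernetes.io/os",
    "feature.node.kubernetes.io/cpu-hardware_multithreading",
    "feature.node.kubernetes.io/cpu-model.family",
    "feature.node.kubernetes.io/kernel-version.major",
    "feature.node.kubernetes.io/pci-0300_15ad.present",
]

for source, features in sorted(group_by_source(sample_keys).items()):
    print(source, features)
```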
3. Installation
(1) Check the cluster node status before installation
[root@master-10 ~]# kubectl get nodes
NAME                  STATUS   ROLES                         AGE   VERSION
master-10.20.31.105   Ready    control-plane,master,worker   31h   v1.21.5
Node details; the parts to watch are the labels and annotations.
[root@master-10 ~]# kubectl describe nodes master-10.20.31.105
Name:               master-10.20.31.105
Roles:              control-plane,master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=master-10.20.31.105
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"c6:fb:4b:8a:bb:12"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.20.31.105
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 12 Mar 2024 21:01:31 -0400
Taints:
........
(2) Install the components
[root@master-10 opt]# kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.14.2
namespace/node-feature-discovery created
customresourcedefinition.apiextensions.k8s.io/nodefeaturerules.nfd.k8s-sigs.io created
customresourcedefinition.apiextensions.k8s.io/nodefeatures.nfd.k8s-sigs.io created
serviceaccount/nfd-master created
serviceaccount/nfd-worker created
role.rbac.authorization.k8s.io/nfd-worker created
clusterrole.rbac.authorization.k8s.io/nfd-master created
rolebinding.rbac.authorization.k8s.io/nfd-worker created
clusterrolebinding.rbac.authorization.k8s.io/nfd-master created
configmap/nfd-master-conf created
configmap/nfd-worker-conf created
service/nfd-master created
deployment.apps/nfd-master created
daemonset.apps/nfd-worker created
(3) Check the component status
[root@master-10 opt]# kubectl get pods -n=node-feature-discovery
NAME                          READY   STATUS    RESTARTS   AGE
nfd-master-5c4684f5cb-hvjjb   1/1     Running   0          4m11s
nfd-worker-cpwx6              1/1     Running   0          4m11s
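If you want to script this check, a small Python sketch such as the one below can scan the table printed by kubectl get pods and report any pod that is not running. It works on pasted text rather than calling kubectl itself, and the function name is hypothetical.

```python
def pods_not_running(table):
    """Return the names of pods whose STATUS column is not 'Running'."""
    rows = [line.split() for line in table.strip().splitlines()]
    header, pods = rows[0], rows[1:]
    # Locate the STATUS column by name instead of assuming its position.
    status_col = [name.upper() for name in header].index("STATUS")
    return [pod[0] for pod in pods if pod[status_col].lower() != "running"]


# Output copied from the command above.
sample = """
NAME                          READY   STATUS    RESTARTS   AGE
nfd-master-5c4684f5cb-hvjjb   1/1     Running   0          4m11s
nfd-worker-cpwx6              1/1     Running   0          4m11s
"""

print(pods_not_running(sample))  # an empty list means every pod is running
```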
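(4) Check the component logs
The nfd-worker log shows that, by default, it runs feature discovery once a minute (the sleep interval in the logged configuration is 60000000000 ns, that is, 60 s).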
[root@master-10 ~]# kubectl logs -f -n=node-feature-discovery nfd-worker-rlf5t
i0314 06:30:32.003264 1 main.go:66] "-server is deprecated, will be removed in a future release along with the deprecated grpc api"
i0314 06:30:32.003372 1 nfd-worker.go:219] "node feature discovery worker" version="v0.14.2" nodename="master-10.20.31.105" namespace="node-feature-discovery"
i0314 06:30:32.003589 1 nfd-worker.go:520] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-worker.conf"
i0314 06:30:32.004500 1 nfd-worker.go:552] "configuration successfully updated" configuration={"core":{"klog":{},"labelwhitelist":{},"nopublish":false,"featuresources":["all"],"sources":null,"labelsources":["all"],"sleepinterval":{"duration":60000000000}},"sources":{"cpu":{"cpuid":{"attributeblacklist":["bmi1","bmi2","clmul","cmov","cx16","erms","f16c","htt","lzcnt","mmx","mmxext","nx","popcnt","rdrand","rdseed","rdtscp","sgx","sgxlc","sse","sse2","sse3","sse4","sse42","ssse3","tdx_guest"]}},"custom":[],"fake":{"labels":{"fakefeature1":"true","fakefeature2":"true","fakefeature3":"true"},"flagfeatures":["flag_1","flag_2","flag_3"],"attributefeatures":{"attr_1":"true","attr_2":"false","attr_3":"10"},"instancefeatures":[{"attr_1":"true","attr_2":"false","attr_3":"10","attr_4":"foobar","name":"instance_1"},{"attr_1":"true","attr_2":"true","attr_3":"100","name":"instance_2"},{"name":"instance_3"}]},"kernel":{"kconfigfile":"","configopts":["no_hz","no_hz_idle","no_hz_full","preempt"]},"local":{},"pci":{"deviceclasswhitelist":["03","0b40","12"],"devicelabelfields":["class","vendor"]},"usb":{"deviceclasswhitelist":["0e","ef","fe","ff"],"devicelabelfields":["class","vendor","device"]}}}
i0314 06:30:32.004796 1 metrics.go:70] "metrics server starting" port=8081
i0314 06:30:32.019135 1 nfd-worker.go:562] "starting feature discovery..."
i0314 06:30:32.019364 1 nfd-worker.go:577] "feature discovery completed"
i0314 06:31:32.021520 1 nfd-worker.go:562] "starting feature discovery..."
i0314 06:31:32.021695 1 nfd-worker.go:577] "feature discovery completed"
i0314 06:32:32.027970 1 nfd-worker.go:562] "starting feature discovery..."
i0314 06:32:32.028141 1 nfd-worker.go:577] "feature discovery completed"
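The duration values in the logged configuration are integer nanosecond counts, which are awkward to read. As a small aid, this Python sketch converts such a value into a readable interval; the function name is illustrative.

```python
from datetime import timedelta


def ns_to_timedelta(nanoseconds):
    """Convert an integer nanosecond count to a timedelta (microsecond precision)."""
    return timedelta(microseconds=nanoseconds // 1000)


# Worker sleep interval taken from the configuration line in the log above.
print(ns_to_timedelta(60_000_000_000))
```

This agrees with the log, where successive "starting feature discovery..." lines are one minute apart.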
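Judging from the nfd-master log, it updates the Node object (labels and annotations) shortly after startup, within the first minute, and then once an hour (the resync period in the logged configuration is 3600000000000 ns). In other words, if someone accidentally changes the node's feature labels or annotations by hand, it can take up to one hour before nfd-master corrects them automatically.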
[root@master-10 ~]# kubectl logs -n=node-feature-discovery nfd-master-5c4684f5cb-hvjjb
i0314 06:23:08.190218 1 nfd-master.go:213] "node feature discovery master" version="v0.14.2" nodename="master-10.20.31.105" namespace="node-feature-discovery"
i0314 06:23:08.190356 1 nfd-master.go:1214] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
i0314 06:23:08.190912 1 nfd-master.go:1274] "configuration successfully updated" configuration=<
denylabelns: {}
enabletaints: false
extralabelns: {}
klog: {}
labelwhitelist: {}
leaderelection:
leaseduration:
duration: 15000000000
renewdeadline:
duration: 10000000000
retryperiod:
duration: 2000000000
nfdapiparallelism: 10
nopublish: false
resourcelabels: {}
resyncperiod:
duration: 3600000000000
>
i0314 06:23:08.190928 1 nfd-master.go:1338] "starting the nfd api controller"
i0314 06:23:08.191105 1 node-updater-pool.go:79] "starting the nfd master node updater pool" parallelism=10
i0314 06:23:08.860810 1 metrics.go:115] "metrics server starting" port=8081
i0314 06:23:08.861033 1 component.go:36] [core][server #1] server created
i0314 06:23:08.861050 1 nfd-master.go:347] "grpc server serving" port=8080
i0314 06:23:08.861084 1 component.go:36] [core][server #1 listensocket #2] listensocket created
i0314 06:23:09.860886 1 nfd-master.go:694] "will process all nodes in the cluster"
i0314 06:23:09.923362 1 nfd-master.go:1086] "node updated" nodename="master-10.20.31.105"
i0314 07:23:09.224254 1 nfd-master.go:1086] "node updated" nodename="master-10.20.31.105"
i0314 08:23:09.081362 1 nfd-master.go:1086] "node updated" nodename="master-10.20.31.105"
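To check the hourly cadence against the log itself, the Python sketch below extracts the timestamps of the "node updated" lines and computes the gaps between them. It assumes all lines fall on the same day and that each line begins with the one-letter severity, the month and day, and a time with microseconds, as in the output above. The function name is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Severity letter + MMDD, then HH:MM:SS.ffffff (format as seen in the log above).
KLOG_TIME = re.compile(r"^[A-Za-z]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def update_gaps(log_text):
    """Return the time gaps between consecutive "node updated" log lines."""
    times = []
    for line in log_text.splitlines():
        if '"node updated"' not in line:
            continue
        match = KLOG_TIME.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Lines copied from the nfd-master log above.
sample_log = '''
i0314 06:23:09.860886 1 nfd-master.go:694] "will process all nodes in the cluster"
i0314 06:23:09.923362 1 nfd-master.go:1086] "node updated" nodename="master-10.20.31.105"
i0314 07:23:09.224254 1 nfd-master.go:1086] "node updated" nodename="master-10.20.31.105"
i0314 08:23:09.081362 1 nfd-master.go:1086] "node updated" nodename="master-10.20.31.105"
'''

for gap in update_gaps(sample_log):
    print(gap)
```

Both gaps come out just under one hour, which matches the logged resync period.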
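(5) View the node feature information
NFD has now written the node's feature information into the node's labels and annotations. The default label prefix is feature.node.kubernetes.io/.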
[root@master-10 opt]# kubectl describe node master-10.20.31.105
Name:               master-10.20.31.105
Roles:              control-plane,master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSR=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
                    feature.node.kubernetes.io/cpu-cpuid.HLE=true
                    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
                    feature.node.kubernetes.io/cpu-cpuid.LAHF=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.RTM=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
                    feature.node.kubernetes.io/cpu-cpuid.X87=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=false
                    feature.node.kubernetes.io/cpu-model.family=6
                    feature.node.kubernetes.io/cpu-model.id=85
                    feature.node.kubernetes.io/cpu-model.vendor_id=Intel
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true
                    feature.node.kubernetes.io/kernel-version.full=3.10.0-1160.105.1.el7.x86_64
                    feature.node.kubernetes.io/kernel-version.major=3
                    feature.node.kubernetes.io/kernel-version.minor=10
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-0300_15ad.present=true
                    feature.node.kubernetes.io/system-os_release.ID=centos
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=7
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=7
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=master-10.20.31.105
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"c6:fb:4b:8a:bb:12"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.20.31.105
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
                    nfd.node.kubernetes.io/master.version: v0.14.2
                    nfd.node.kubernetes.io/worker.version: v0.14.2
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 12 Mar 2024 21:01:31 -0400
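In the output above, the nfd.node.kubernetes.io/feature-labels annotation appears to record the names of the labels that NFD manages, without the default prefix. Assuming that reading is right, the Python sketch below compares the annotation with the labels actually present and reports any that have gone missing, for instance after a manual edit. The input is a trimmed-down stand-in for the JSON printed by kubectl get node <name> -o json, and the helper name is hypothetical.

```python
NFD_PREFIX = "feature.node.kubernetes.io/"
FEATURE_LABELS_ANNOTATION = "nfd.node.kubernetes.io/feature-labels"


def missing_managed_labels(node):
    """List label names recorded in the NFD annotation but absent from the node."""
    metadata = node.get("metadata", {})
    labels = metadata.get("labels", {})
    recorded = metadata.get("annotations", {}).get(FEATURE_LABELS_ANNOTATION, "")
    missing = []
    for name in filter(None, recorded.split(",")):
        # Assumption: names without a "/" live under the default NFD prefix.
        key = name if "/" in name else NFD_PREFIX + name
        if key not in labels:
            missing.append(name)
    return missing


# Illustrative node object; cpu-model.family has been removed from the labels.
node = {
    "metadata": {
        "labels": {
            "feature.node.kubernetes.io/kernel-version.major": "3",
            "feature.node.kubernetes.io/pci-0300_15ad.present": "true",
        },
        "annotations": {
            FEATURE_LABELS_ANNOTATION:
                "cpu-model.family,kernel-version.major,pci-0300_15ad.present",
        },
    }
}

print(missing_managed_labels(node))
```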
4. Application scenarios
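One typical scenario is restricting a workload to nodes that expose a particular feature label. The Python sketch below builds a minimal Pod manifest with a nodeSelector and prints it as JSON. The pod name and image are hypothetical placeholders, and the label key is taken from the node output in this article; substitute whichever label your workload actually depends on.

```python
import json


def pod_with_node_selector(name, image, selector):
    """Build a minimal Pod manifest that only schedules onto matching nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": dict(selector),
            "containers": [{"name": name, "image": image}],
        },
    }


manifest = pod_with_node_selector(
    name="feature-demo",                       # hypothetical pod name
    image="registry.example.com/demo:latest",  # hypothetical image
    selector={"feature.node.kubernetes.io/pci-0300_15ad.present": "true"},
)

print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.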
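5. Summary
If your Kubernetes cluster needs to schedule workloads according to node hardware features, or needs to be aware of and make use of node hardware resources, installing Node Feature Discovery (NFD) is worthwhile. If all nodes in the cluster have similar hardware and hardware differences do not matter for scheduling, you do not need NFD.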
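One rough way to judge how similar the nodes are, once NFD has labelled them, is to compare their feature labels. The Python sketch below reports the NFD labels whose values are not identical on every node. The node names and label sets are made up for illustration, and the function name is hypothetical.

```python
NFD_PREFIX = "feature.node.kubernetes.io/"


def differing_features(labels_by_node):
    """Return {label: {node: value}} for NFD labels that differ between nodes."""
    keys = {
        key
        for labels in labels_by_node.values()
        for key in labels
        if key.startswith(NFD_PREFIX)
    }
    result = {}
    for key in sorted(keys):
        # A node that lacks the label is recorded with the value None.
        values = {node: labels.get(key) for node, labels in labels_by_node.items()}
        if len(set(values.values())) > 1:
            result[key] = values
    return result


# Hypothetical two-node cluster: only node-a reports the PCI device.
cluster = {
    "node-a": {
        "feature.node.kubernetes.io/kernel-version.major": "3",
        "feature.node.kubernetes.io/pci-0300_15ad.present": "true",
    },
    "node-b": {
        "feature.node.kubernetes.io/kernel-version.major": "3",
    },
}

print(differing_features(cluster))
```

An empty result suggests the nodes look alike from NFD's point of view; any entries point at hardware differences that feature-aware scheduling could exploit.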