The Sunburst Process chart provides an up-to-the-second view of the cluster-wide share of CPU resources consumed by each namespace.
The Sunburst DNS chart shows a real-time view of network activity generated by each namespace. The chart is produced by looking up DNS names for network addresses observed in packet flows using the Kubernetes DNS service. The domain names carry information about the namespace, service, and node generating the packets. Most traffic is exchanged between nodes within the cluster (identified as local); the external (not local) traffic is also broken out by DNS name.
The Sunburst Protocols chart shows the different network protocols being used to communicate between nodes in the cluster. In this example, the chart highlights the IP-over-IP tunnel traffic used for network virtualization.
Clicking on a segment in the Sunburst Protocols chart allows the selected traffic to be examined in detail using the Flow Browser. In this example, DNS names are again used to translate raw packet flow data into inter-namespace flows. See Defining Flows for information on the flow analytics capabilities that can be explored using the browse-flows application.
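Flows like these are defined programmatically through the sFlow-RT REST API. The following is a minimal sketch, with the flow name pair chosen for illustration and localhost:8008 standing in for the address of the REST API; it defines a flow keyed on source and destination address, then queries the largest currently active flows:

curl -X PUT -H "Content-Type:application/json" \
  -d '{"keys":"ipsource,ipdestination","value":"bytes"}' \
  http://localhost:8008/flow/pair/json

curl "http://localhost:8008/activeflows/ALL/pair/json?maxFlows=10"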
The Discard Browser provides a detailed view of any network packets dropped in the cluster. In this chart, inter-namespace dropped packets are displayed, identifying the haproxy service as the largest source of dropped packets.
The final chart shows an up-to-the-second view of the average power consumed per GPU in the cluster (approximately 250 watts).
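The metric behind this chart can also be queried directly once the monitoring stack described below is running. A sketch, assuming sFlow-RT is reachable on localhost:8008 and the GPU-enabled agents export the nvml_power metric (see the note on the NVIDIA build later in this article):

curl http://localhost:8008/metric/ALL/avg:nvml_power/json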
The diagram shows the elements of the monitoring solution. Host sFlow agents deployed on each node in the Kubernetes cluster stream standard sFlow telemetry to an instance of the sFlow-RT real-time analytics software. sFlow-RT provides cluster-wide metrics through a REST API, where they can be viewed directly or imported into a time series database such as Prometheus and trended in dashboards using tools like Grafana.
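For example, once the stack is running, a cluster-wide total of the standard load_one metric can be retrieved with a single REST call. A minimal sketch, assuming the REST API is reachable on localhost:8008 (sum: is one of the statistics the /metric API accepts):

curl http://localhost:8008/metric/ALL/sum:load_one/json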
Note: sFlow is widely supported by network switches and routers. Enable sFlow monitoring in the physical network infrastructure for end-to-end visibility.
Create the following sflow-rt.yml file to deploy the pre-built sflow/prometheus Docker image, bundling sFlow-RT with the applications used in this article:
apiVersion: v1
kind: Service
metadata:
  name: sflow-rt-sflow
spec:
  type: NodePort
  selector:
    name: sflow-rt
  ports:
    - protocol: UDP
      port: 6343
---
apiVersion: v1
kind: Service
metadata:
  name: sflow-rt-rest
spec:
  type: LoadBalancer
  selector:
    name: sflow-rt
  ports:
    - protocol: TCP
      port: 8008
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sflow-rt
spec:
  replicas: 1
  selector:
    matchLabels:
      name: sflow-rt
  template:
    metadata:
      labels:
        name: sflow-rt
    spec:
      containers:
        - name: sflow-rt
          image: sflow/prometheus:latest
          ports:
            - name: http
              protocol: TCP
              containerPort: 8008
            - name: sflow
              protocol: UDP
              containerPort: 6343
Run the following command to deploy the service:
kubectl apply -f sflow-rt.yml
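To confirm that both services were created (assuming the default namespace):

kubectl get services sflow-rt-sflow sflow-rt-rest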
Now create the following host-sflow.yml file to deploy the pre-built sflow/host-sflow Docker image:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: host-sflow
spec:
  selector:
    matchLabels:
      name: host-sflow
  template:
    metadata:
      labels:
        name: host-sflow
    spec:
      restartPolicy: Always
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: host-sflow
          image: sflow/host-sflow:latest
          env:
            - name: COLLECTOR
              value: "sflow-rt-sflow"
            - name: SAMPLING
              value: "10"
            - name: NET
              value: "host"
            - name: DROPMON
              value: "enable"
          volumeMounts:
            - mountPath: /var/run/docker.sock
              name: docker-sock
              readOnly: true
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
Run the following command to deploy the agents:
kubectl apply -f host-sflow.yml
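To confirm that the DaemonSet scheduled an agent on every node, list its pods (again assuming the default namespace); there should be one host-sflow pod per node:

kubectl get pods -l name=host-sflow -o wide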
Telemetry should start streaming immediately as a Host sFlow agent starts on each node in the cluster.
Note: Exporting GPU performance metrics from the NVIDIA GPUs in the Nautilus cluster requires a special build of the Host sFlow agent, created using the NVIDIA-supplied Docker image that includes the GPU drivers; see https://gitlab.nrp-nautilus.io/prp/sflow/
Access the sFlow-RT web user interface to confirm that telemetry is being received.
The sFlow-RT Status page confirms that telemetry is being received from all 180 nodes in the cluster.

Note: If you don't currently have access to a production Kubernetes cluster, you can experiment with this solution using Docker Desktop, see Kubernetes testbed.
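The set of agents streaming telemetry can also be retrieved through the REST API. A sketch, with localhost:8008 standing in for the address of the sflow-rt-rest service:

curl http://localhost:8008/agents/json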
The charts shown in this article are accessed via the sFlow-RT Apps tab.
The sFlow-RT applications are designed to explore the available metrics, but they don't provide persistent storage. Prometheus export functionality allows metrics to be recorded in a time series database to drive operational dashboards; see Flow metrics with Prometheus and Grafana.
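As a sketch of that integration, a scrape job along the following lines in prometheus.yml pulls the metrics into Prometheus; the target address is a placeholder for wherever the sflow-rt-rest service is exposed, and the metrics path is the one published by the sflow/prometheus image:

- job_name: 'sflow-rt-metrics'
  metrics_path: /prometheus/metrics/ALL/ALL/txt
  static_configs:
    - targets: ['sflow-rt-rest:8008']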
Hello Peter, thank you so much for everything, I learned a lot from your blogs. I have a small problem: I can't find nvml_power in the browse-metrics application. Is it a new item? Should I update my sFlow-RT, or do I need to add some configuration?
You need a custom build of the Host sFlow agent. See the note in the article for a link to the Dockerfile used to build Host sFlow with the NVIDIA drivers.