
# Rancher Monitoring and Alerting

This chart is based on the upstream kube-prometheus-stack chart. The chart deploys Prometheus Operator and its CRDs along with Grafana, Prometheus Adapter and additional charts / Kubernetes manifests to gather metrics. It allows users to monitor their Kubernetes clusters, view metrics in Grafana dashboards, and set up alerts and notifications.

For more information on how to use this feature, refer to the Rancher documentation.

The chart installs the following components:

- **Prometheus Operator** - The operator provides easy monitoring definitions for Kubernetes services, manages Prometheus and Alertmanager instances, and adds default scrape targets for some Kubernetes components.
- **kube-prometheus** - A collection of community-curated Kubernetes manifests, Grafana dashboards, and PrometheusRules that deploy a default end-to-end cluster monitoring configuration.
- **Grafana** - Grafana allows a user to create and view dashboards based on the cluster metrics collected by Prometheus.
- **node-exporter / kube-state-metrics / rancher-pushprox** - These charts monitor various Kubernetes components across different Kubernetes cluster types.
- **Prometheus Adapter** - The adapter allows a user to expose custom metrics, resource metrics, and external metrics from the default Prometheus instance to the Kubernetes API server.

For more information, review the Helm README of this chart.

## Upgrading from 100.0.0+up16.6.0 to 100.1.0+up19.0.3

Notable changes:

Grafana:

- `sidecar.dashboards.searchNamespace`, `sidecar.datasources.searchNamespace`, and `sidecar.notifiers.searchNamespace` now accept a list of namespaces instead of a single namespace.
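
For example, the Grafana dashboard sidecar can now be pointed at several namespaces at once. A minimal values sketch (the namespace names below are illustrative, not defaults from the chart):

```yaml
grafana:
  sidecar:
    dashboards:
      # Previously a single namespace string; now a list is accepted.
      # Namespace names here are examples only.
      searchNamespace:
        - cattle-dashboards
        - team-a-dashboards
```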

kube-state-metrics:

- The type of `collectors` has changed from a dictionary to a list.
- `kubeStateMetrics.serviceMonitor.namespaceOverride` was replaced by `kube-state-metrics.namespaceOverride`.
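
As a sketch of both changes (collector names and the namespace value are illustrative examples, not a complete list):

```yaml
# Before (100.0.0+up16.6.0): collectors expressed as a dictionary of flags
kubeStateMetrics:
  serviceMonitor:
    namespaceOverride: example-namespace

kube-state-metrics:
  collectors:
    pods: true
    deployments: true
```

```yaml
# After (100.1.0+up19.0.3): collectors expressed as a list,
# and namespaceOverride moved under the kube-state-metrics key
kube-state-metrics:
  namespaceOverride: example-namespace
  collectors:
    - pods
    - deployments
```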

Known issues:

- Occasionally, the upgrade fails with errors related to the `prometheusrulemutate.monitoring.coreos.com` webhook. This is a known upstream issue; the workaround is to trigger the upgrade one more time. See issue #32416.