February 23, 2025

Mastering Kubernetes Observability — Multi-Cluster Monitoring with VictoriaMetrics, Loki & Grafana

Originally published on Medium.

Introduction: Navigating Multi-Cluster Complexity with Observability

In today’s cloud-native world, applications no longer run in a single cluster. Organizations adopt multi-cluster architectures for scalability, resilience, and geographic distribution. But with this power comes complexity — and the need for robust monitoring.

Imagine managing Kubernetes clusters across multiple regions, each hosting critical microservices. Without a unified view, troubleshooting becomes a needle-in-a-haystack challenge. This is where multi-cluster monitoring proves essential.

Why Multi-Cluster Monitoring Matters

“More Clusters, More Chaos — Here’s How to Stay Ahead”

Monitoring one cluster is hard, and scaling to several is harder still. Each cluster generates its own logs, metrics, and traces, and the combined volume quickly becomes overwhelming. Without the right tools, you risk missing key insights and reacting too late.

Open-source solutions like Prometheus, Thanos, Grafana, and OpenTelemetry offer cost-effective, flexible observability, helping detect anomalies and ensure reliability in multi-cluster environments.

Our Solution

“From Costly to Cost-Effective: Our Multi-Cluster Monitoring Solution”

After evaluating our options, we decided to implement a high-availability, open-source monitoring and logging setup that could scale with our infrastructure while keeping costs in check. Here’s how we did it:

Infrastructure Details

Average Node Count: 100–150
Number of Clusters: 4
Number of Regions: 4
Container Orchestration: Kubernetes
Cloud Provider: Google Cloud Platform (GCP)

The Challenge: Exploring Our Monitoring Options

Paid Alternatives: Are They Worth It?

“The Truth About Paid Monitoring: Are You Paying for a Black Box?”

Paid monitoring platforms like Datadog and New Relic provide seamless multi-cluster visibility but come with trade-offs:

Licensing fees, per-node pricing, and additional charges add up quickly.
These black-box solutions limit customization, making them less suited for dynamic environments.

Given our cluster size of 120+ nodes and high data intake, the cost of these platforms became prohibitive. Instead of paying a premium for a one-size-fits-all tool, a more cost-effective approach is to build a tailored observability stack using open-source solutions.

What About Open-Source Alternatives?

“Prometheus, Grafana & Loki: The Ultimate Observability Trio?”

Prometheus + Grafana + Loki form a powerful open-source solution, but they come with their own challenges:

Lack of built-in clustering capabilities
High maintenance requirements

Prometheus, designed as a monolith with local storage, is not ideal for scalable setups. However, integrations like Thanos or Cortex can enable scalability by:

Making Prometheus stateless by transferring metrics (older than two hours) to remote storage
Implementing a global query layer
Adding remote storage support (e.g., S3, GCS)

While Thanos or Cortex enable sharding and unified querying, they also add complexity. As your infrastructure scales, resource consumption increases, leading to higher operational overhead and potential performance bottlenecks.

So, what’s the alternative?

What Did We Choose?

“Our Winning Stack: VictoriaMetrics + Grafana + Loki!”

A Scalable and Performance-Oriented Monitoring Alternative: VictoriaMetrics + Grafana + Loki

VictoriaMetrics comes with built-in clustering and is compatible with Prometheus scrape configs, the Prometheus remote write protocol, and the Prometheus querying API. It emerged as the top performer in our resource and performance benchmark, which was crucial for scaling.

VictoriaMetrics Overview

VictoriaMetrics is Prometheus-compatible and, unlike Prometheus, offers built-in clustering, making it an ideal choice for our monitoring needs. Our setup comprises:

vmcluster: The VictoriaMetrics cluster itself, made up of vmstorage, vmselect, and vminsert.
vmstorage: Stores data with replication and sharding for scalability.
vmselect: Manages query operations; Grafana connects here.
vminsert: The write path; handles data ingestion and is exposed through an internal load balancer.
vmagent: Scrapes Prometheus metrics using scrape configs and pushes data to the cluster.
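
To make the component roles concrete, here is a minimal sketch of the two URLs that tie them together: the remote write endpoint that vmagent pushes to on the ingestion component (named vminsert in upstream VictoriaMetrics), and the base URL that a Prometheus-type Grafana data source would use against vmselect. The host names are hypothetical, and the ports and tenant ID 0 are the upstream cluster defaults rather than values taken from our deployment.

```python
from urllib.parse import urlunsplit

# Upstream default ports for the cluster components (assumed, not from our setup).
VMINSERT_PORT = 8480
VMSELECT_PORT = 8481


def remote_write_url(vminsert_host: str, account_id: int = 0) -> str:
    """URL that vmagent pushes scraped samples to (Prometheus remote write)."""
    path = f"/insert/{account_id}/prometheus/api/v1/write"
    return urlunsplit(("http", f"{vminsert_host}:{VMINSERT_PORT}", path, "", ""))


def grafana_datasource_url(vmselect_host: str, account_id: int = 0) -> str:
    """Base URL for a Prometheus-type Grafana data source backed by vmselect."""
    path = f"/select/{account_id}/prometheus"
    return urlunsplit(("http", f"{vmselect_host}:{VMSELECT_PORT}", path, "", ""))


# Hypothetical internal load balancer names, for illustration only.
print(remote_write_url("vminsert.monitoring.internal"))
print(grafana_datasource_url("vmselect.monitoring.internal"))
```

Every workload cluster would point its vmagent at the first URL, while only the central Grafana needs the second.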

Loki Overview

Loki provides three deployment modes (monolithic, simple scalable, and microservices), and for our setup we opted for the simple scalable mode to efficiently handle clustering. For storage, we use Google Cloud Storage (GCS) to ensure scalability and durability.

Loki deployment consists of three main components:

Loki-Backend — Handles indexing and storage operations with GCS.
Loki-Read — Manages log querying and retrieval.
Loki-Write — Handles incoming log ingestion. Exposed via an internal load balancer, allowing all clusters to connect seamlessly.

Logging Agent: Fluent Bit

For log collection, we chose Fluent Bit as the logging agent. Each log line is enriched with custom labels (e.g. Cluster: cluster1) before being pushed to the master cluster, ensuring efficient filtering and retrieval.
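
As an illustration of what the enrichment amounts to on the wire, the sketch below builds the JSON body that Loki's push endpoint (POST /loki/api/v1/push) accepts: a stream identified by its labels, plus timestamped log lines. This is not Fluent Bit's own code; the helper names and the lowercase label key `cluster` are assumptions made for the example.

```python
import json
import time
from typing import Optional


def enrich(labels: dict, cluster: str) -> dict:
    """Return a copy of the stream labels with the cluster label added."""
    return {**labels, "cluster": cluster}


def loki_push_payload(line: str, labels: dict, ts_ns: Optional[int] = None) -> str:
    """Build the JSON body for one log line sent to /loki/api/v1/push."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    body = {
        "streams": [
            {
                "stream": labels,
                # Loki expects [timestamp in nanoseconds as a string, log line].
                "values": [[str(ts_ns), line]],
            }
        ]
    }
    return json.dumps(body)


labels = enrich({"namespace": "default", "app": "checkout"}, "cluster1")
print(loki_push_payload("GET /healthz 200", labels))
```

Because the cluster name travels as a stream label rather than inside the log text, a query can select one cluster's logs without scanning the others.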

This setup provides a streamlined, scalable logging solution while leveraging Loki’s simplicity and GCS’s reliability.

Architecture Overview

“Building a High-Availability Monitoring Stack — Here’s How We Did It”

Our production monitoring setup consists of three Kubernetes clusters:

Clusters 2 and 3: Handle workloads and other components.
Cluster 1: Master cluster (hosts the VictoriaMetrics cluster, Loki, and centralized Grafana).

Key Details

Log Ingestion via Load Balancer — Logs are sent to Loki-Write through an internal load balancer for centralized processing.
Logging with Fluent Bit — Deployed in each cluster, Fluent Bit appends custom labels to logs, just as vmagent does for metrics, for structured log management.
Metrics sources: blackbox-exporter (endpoint probing), node-exporter (node-level metrics), kube-state-metrics (Kubernetes resource monitoring).
Metrics Collection with vmagent — Each cluster runs vmagent to collect metrics, tagging them with custom labels (e.g. clustername=cluster1) for segregation.
Global VPC Connectivity — All clusters communicate via a global VPC, spanning multiple subnets and regions.
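
The cluster label is what keeps the data separable once it all lands in one place. The sketch below shows the idea in miniature: a label is merged into a series' label set, and a selector is rendered that picks out one cluster. vmagent can attach such labels itself (for example through external labels in its scrape config), so this is only an illustration; the function names are hypothetical.

```python
def tag_with_cluster(labels: dict, cluster: str, label_name: str = "clustername") -> dict:
    """Return a copy of a series' labels with the cluster label set."""
    return {**labels, label_name: cluster}


def selector(metric: str, **matchers: str) -> str:
    """Render a PromQL/MetricsQL selector, e.g. up{clustername="cluster1"}."""
    inner = ",".join(f'{name}="{value}"' for name, value in sorted(matchers.items()))
    return f"{metric}{{{inner}}}"


series = tag_with_cluster({"job": "node-exporter", "instance": "10.0.0.5:9100"}, "cluster1")
print(series)
print(selector("up", clustername="cluster1", job="node-exporter"))
```

In this sketch the cluster label overwrites any label of the same name already on the series, which is a simplification chosen for the example.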

Implementation Highlights

Data Flow

Metrics collected by vmagent in Clusters 1, 2, and 3 are sent to the vminsert component in Cluster 1.
Logs collected by Fluent Bit are sent to Loki-Write in Cluster 1.
Grafana, hosted centrally in Cluster 1, connects to vmselect and Loki-Read for unified monitoring dashboards.

High Availability

Internal load balancing improves reliability and prevents bottlenecks.
Loki indexes logs on the required Kubernetes labels; data is synced to GCS, and the compactor handles data retention.
Replication and sharding ensure data integrity and scalability within vmstorage.
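
To show why sharding plus replication survives the loss of a storage node, here is a deliberately simplified sketch: each series is hashed to a starting node, and copies are written to that node and its neighbours. This is not VictoriaMetrics' actual placement algorithm, and the node names and replication factor are assumptions for the example.

```python
import hashlib


def series_key(metric: str, labels: dict) -> str:
    """Stable identity for a series: metric name plus sorted labels."""
    parts = [metric] + [f"{name}={value}" for name, value in sorted(labels.items())]
    return ";".join(parts)


def pick_nodes(key: str, nodes: list, replication_factor: int) -> list:
    """Choose `replication_factor` distinct storage nodes for one series."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Walk the node list from the hashed position, wrapping around at the end.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["vmstorage-0", "vmstorage-1", "vmstorage-2"]  # hypothetical node names
key = series_key("up", {"clustername": "cluster1", "job": "node-exporter"})
print(pick_nodes(key, nodes, replication_factor=2))
```

With a replication factor of 2, every series lives on two different nodes, so any single node can fail without losing data, while hashing spreads different series across all nodes.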

Visualization

Centralized Grafana dashboards provide a comprehensive view of metrics and logs across clusters, enabling quick identification of issues and performance tuning.
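
Under the hood, a dashboard panel scoped to one cluster boils down to two HTTP queries: one against the Prometheus-compatible query API served by vmselect, and one against Loki's query_range API. The sketch below only builds the URLs and sends nothing; the base URLs and label names are hypothetical.

```python
from urllib.parse import urlencode


def metrics_query_url(vmselect_base: str, promql: str) -> str:
    """Instant query against the Prometheus-compatible API served by vmselect."""
    return f"{vmselect_base}/api/v1/query?{urlencode({'query': promql})}"


def logs_query_url(loki_read_base: str, logql: str, limit: int = 100) -> str:
    """Range query against Loki's HTTP API, returning at most `limit` lines."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{loki_read_base}/loki/api/v1/query_range?{params}"


# Hypothetical endpoints; the cluster label matches the one added at collection time.
print(metrics_query_url("http://vmselect.example:8481/select/0/prometheus",
                        'up{clustername="cluster1"}'))
print(logs_query_url("http://loki-read.example:3100", '{cluster="cluster1"}'))
```

Swapping the label value is all it takes to point the same panel at another cluster, which is what makes a single Grafana instance workable across clusters.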

Performance Comparison

“VictoriaMetrics vs. Prometheus + Thanos: The Ultimate Performance Face-Off”

VictoriaMetrics demonstrated:

Cost Efficiency: Open-source, with lower infrastructure costs compared to paid alternatives.
Reduced Maintenance Overhead: Clustering is natively supported, unlike in Prometheus, which needs Thanos or Cortex for it.
Improved Query Speeds: Optimized for large datasets and distributed setups.

Metric (1.5M time series scraped every 15s, about 100K samples/s)	Thanos	VictoriaMetrics
CPU	4.01 cores	0.86 cores
Memory	21 GiB	8.93 GiB
Bytes/sample	4.72 B	0.91 B
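
The table is easier to read as ratios. The short calculation below checks the ingestion rate implied by the benchmark load and works out how much more Thanos used on each row. The daily storage helper assumes that Bytes/sample means on-disk bytes per stored sample, which the table does not state explicitly.

```python
SERIES = 1_500_000        # active time series in the benchmark
SCRAPE_INTERVAL_S = 15    # seconds between scrapes
SECONDS_PER_DAY = 86_400

samples_per_second = SERIES / SCRAPE_INTERVAL_S  # 100,000 samples/s

thanos = {"cpu_cores": 4.01, "memory_gib": 21.0, "bytes_per_sample": 4.72}
victoria = {"cpu_cores": 0.86, "memory_gib": 8.93, "bytes_per_sample": 0.91}

# How many times more of each resource Thanos used than VictoriaMetrics.
ratios = {name: thanos[name] / victoria[name] for name in thanos}


def daily_bytes(bytes_per_sample: float) -> float:
    """Storage written per day at the benchmark ingestion rate (assumes on-disk bytes)."""
    return samples_per_second * SECONDS_PER_DAY * bytes_per_sample


for name, value in ratios.items():
    print(f"{name}: Thanos used {value:.1f}x as much")
# cpu_cores: Thanos used 4.7x as much
# memory_gib: Thanos used 2.4x as much
# bytes_per_sample: Thanos used 5.2x as much
```

At this load, the gap in bytes per sample is the one that compounds over time, since it scales with every sample retained.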

Conclusion

“Monitoring at Scale? Open-Source is the Way Forward”

By leveraging VictoriaMetrics, Grafana, and Loki, we successfully implemented a high-performing, scalable monitoring solution tailored to our needs. This setup not only ensures optimal performance but also aligns with cost and scalability requirements, making it a sustainable choice for growing infrastructure demands.

In a world where multi-cluster architectures are becoming the norm, open-source tools like these empower organizations to achieve enterprise-grade observability without the enterprise-grade price tag.

Sample config for multi-cluster monitoring setup: multi-cluster-observability-demo.

Tags: DevOps · Monitoring · Observability · VictoriaMetrics · Loki