Thanos · waji // devops notes

Thanos: Prometheus at Scale! - DevConf.CZ 2020 YouTube video

“Running a product without any monitoring is like NOT running the product at all”

It's useless to run a system 24 hours a day if we can't prove that it's actually working; monitoring tools are how we prove it.

Prometheus needs to scale out.

Why multiple Prometheus Instances?

What if the Prometheus node dies, so nothing is scraped and no queries can be served? Or maybe I want to switch to another Prometheus server, or do a rolling update without any downtime.

In these cases, we need highly available Prometheus instances (multiple replicas).

Functional Sharding → a separate Prometheus for scraping each namespace
Consistent Hash Sharding → telling each Prometheus how many Prometheus instances are running so that the scrape targets are divided equally among them
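To make the hash sharding idea concrete, here is a minimal Python sketch. It is similar in spirit to Prometheus's `hashmod` relabel action, but the exact hash used here, the function names, and the target addresses are assumptions for illustration only.

```python
import hashlib

def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target address to a shard index in [0, num_shards)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an integer, then take the modulus.
    return int.from_bytes(digest[-8:], "big") % num_shards

def targets_for_shard(targets, shard, num_shards):
    """Return only the targets that this Prometheus shard should scrape."""
    return [t for t in targets if shard_for(t, num_shards) == shard]

if __name__ == "__main__":
    # Hypothetical node-exporter style targets.
    targets = [f"10.0.0.{i}:9100" for i in range(1, 11)]
    for shard in range(3):
        print(shard, targets_for_shard(targets, shard, 3))
```

Every instance runs the same function with the same shard count, so each target ends up on exactly one instance without any coordination between them.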

But what if we have multiple clusters? We will surely need more than one Prometheus.

Question: Why can’t we have one Prometheus server that collects data from multiple clusters?

Answer: Not recommended, because with the pull model Prometheus should be in the same failure domain and the same network as the services it monitors.

Prometheus doesn’t work like a distributed application/system. So there are some scaling out challenges.

Challenges we face when we have multiple Prometheus servers:

First, the datastores are not replicated across clusters, so data has to be queried individually from each Prometheus server → no global view, and global aggregation is not possible.
Another issue arises when multiple Prometheus servers scrape the same service: each can miss samples, and not the same ones. One Prometheus may miss a sample in time series A while the other misses a sample in time series B. We need HA to solve this, not just for scraping but for querying as well.
Prometheus typically keeps data for a short time range (retention). For example, a retention rule might keep data in the Prometheus database for 15 days and then drop it. But what if we want to look at long-term data, for example from last year? Keeping all of that data on local disk makes Prometheus hard to maintain at such long retention times.

So far summary:

Monitoring is essential
Monitoring is easy with Prometheus
We often have to run more than one Prometheus → HA or to manage the load
Using distributed Prometheus brings some challenges
Unlimited retention of metrics has some challenges

All of the above challenges can be solved with a tool called Thanos. Thanos focuses on simplicity and low maintenance, and exists just to scale Prometheus.

Global Query View
HA
Long term Storage

So how do we enable Thanos on multiple Prometheus servers in Seven Steps:

Step 1: Add a Thanos sidecar to each Prometheus. It is a small Go binary with a few features, and every Prometheus needs one.
Step 2: Add a stateless Thanos Querier. It essentially performs PromQL evaluation at the global level. It connects to the StoreAPI via the gRPC protocol. The Querier has a built-in deduplication layer: using replica labels, it detects data missing from each Prometheus replica and fills the gaps from the other replicas.
Long-Term Retention: Thanos sidecars get credentials for object storage. When Prometheus writes out a block in its TSDB, the Thanos sidecar uploads it to object storage. To actually use this data, we need another Thanos component called the Thanos Store Gateway. Instead of reading data from the Prometheus API, it reads from object storage. So recent data comes from the Thanos sidecars and long-term data comes from the Store Gateway. There is a slight issue, though: long-term data can run into many millions of samples. This is where the Thanos Compactor comes in. It does the downsampling (high-resolution to low-resolution data). It also combines multiple Prometheus TSDB blocks into fewer, larger blocks, e.g. many 2-hour TSDB blocks into a few blocks that each cover a longer time range.
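The gap filling done by the Querier's deduplication layer in Step 2 can be pictured with a small Python sketch. This is a deliberate simplification and not the actual Thanos algorithm: it assumes both replicas produce samples with identical timestamps, which real HA replicas generally do not, and the function name is made up for illustration.

```python
def deduplicate(replica_a, replica_b):
    """Merge two replicas' samples for the same series into one list.

    Each replica is a list of (timestamp, value) tuples. When both
    replicas have a sample for a timestamp, replica_a's value is kept.
    """
    merged = dict(replica_b)
    merged.update(dict(replica_a))  # replica_a wins on overlapping timestamps
    return sorted(merged.items())

# Replica A missed the scrape at t=30, replica B missed the one at t=15.
replica_a = [(0, 1.0), (15, 2.0), (45, 4.0)]
replica_b = [(0, 1.0), (30, 3.0), (45, 4.0)]
merged = deduplicate(replica_a, replica_b)
```

Neither replica has the full series on its own, but the merged result covers every scrape interval.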

So what is a storeAPI?

Every component in Thanos serves data via gRPC StoreAPI

sidecar mode

💡

Deployed alongside each Prometheus instance, the sidecar provides the following:
When thanos query sends a query, the sidecar receives the request and returns data from the last 2 hours by reading the Prometheus TSDB.
Every 2 hours it uploads a TSDB block to object storage.
A drawback is that object storage is the only supported storage type.
When creating an Ingress object (nginx), the following annotations have to be added for gRPC communication to work:
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
nginx.ingress.kubernetes.io/protocol: "h2c"
nginx.ingress.kubernetes.io/grpc-backend: "true"
nginx.ingress.kubernetes.io/proxy-read-timeout: "160" # this option can be left out

receive mode

💡

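Each Prometheus instance sends its data to the Thanos Receiver in real time.
Data can be stored and managed centrally.
All metric data can be accessed through a single Receiver, without access to the individual Prometheus instances.
Several storage types can be used.
A drawback is that the remote write requests arrive in real time, which can overload resources.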

external label

global:
  external_labels:
    cluster: <cluster name> ## can also be used when configuring alert rules, etc.

thanos query

💡

thanos query provides deduplication and endpoint discovery.
Deduplication merges the data from multiple Prometheus data sources (HA) that scrape the same target, so that it looks as if it had been collected from a single data source. The --query.replica-label flag sets the label used as the deduplication criterion.
Endpoint service discovery can be done in three ways: using the --store flag, configuring the --store.sd-files and --store.sd-interval flags, or using the dns+ / dnssrv+ schemes in the --store flag.
When using an ingress, configure it as follows:
--store=dnssrv+_grpc._tcp.thanos-sidecar-<ingress_domain>:10901

thanos query frontend

💡

A component that improves read-path performance for thanos query.
Range queries (the /api/v1/query_range API call) are handled with splitting and caching; all other API calls are forwarded to the Querier specified by the downstream URL.
Use it when a large volume of samples per second has to be processed.
Splitting: To keep queries over a large range from taking too long or causing OOM, the query is divided into sub-ranges of a fixed length and executed per sub-range. The flag is --query-range.split-interval.
Caching: Query results are stored in a cache DB so that subsequent queries can reuse them. A subsequent query here means a follow-up query, for example querying CPU usage and then querying both CPU usage and memory usage.
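The splitting and caching behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not how the query frontend is implemented: `split_range`, `CachingFrontend`, and the `run_query` callback are hypothetical names, and time is expressed in plain seconds.

```python
def split_range(start: int, end: int, interval: int):
    """Split [start, end) into sub-ranges aligned to multiples of `interval`."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    ranges = []
    cur = start
    while cur < end:
        boundary = (cur // interval + 1) * interval  # next aligned boundary
        nxt = min(boundary, end)
        ranges.append((cur, nxt))
        cur = nxt
    return ranges

class CachingFrontend:
    """Runs a range query per sub-range and reuses cached sub-results."""

    def __init__(self, run_query, interval):
        self.run_query = run_query  # callable(query, start, end) -> result
        self.interval = interval
        self.cache = {}

    def query_range(self, query, start, end):
        results = []
        for sub in split_range(start, end, self.interval):
            key = (query, sub)
            if key not in self.cache:
                self.cache[key] = self.run_query(query, *sub)
            results.append(self.cache[key])
        return results
```

With a one-day interval, a two-day query followed by a three-day query from the same start only sends one new sub-query to the Querier, because the first two days are served from the cache.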

thanos compactor

💡

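thanos compactor is the component that compacts and downsamples the data in object storage. If storage space is limited and a retention period has to be set, a retention value must be configured separately for raw, 5-minute, and 1-hour data. Older data usually isn't inspected in detail, so retention can be set in the order raw < 5 minutes < 1 hour. Running the compactor as a cronjob can save resources, but if more than 10GB of data accumulates per day, it is better to keep it running continuously than to use a cronjob.
Downsampling: If metrics were sampled at 1-minute intervals, the sampling interval is coarsened to 10 minutes or 1 hour (that is, the resolution is lowered). This can reduce the total volume of stored data, but the main purpose is to speed up queries, not to save space.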

Downsampling

Downsampling is the process of reducing the resolution of metric data over time, mainly to improve query performance over long time ranges. In Thanos, downsampling is achieved by aggregating the original high-resolution data into lower-resolution data points. Thanos works with three resolutions:

Raw resolution: The original high-resolution data points collected by Prometheus.
5-minute resolution: The first level of downsampling, which aggregates raw data points into 5-minute intervals.
1-hour resolution: The coarsest level of downsampling, where data points are further aggregated into 1-hour intervals.

The primary purpose of downsampling is to keep detailed data available for recent metrics while making queries over older data cheap, since older data typically does not require as much detail. Downsampled blocks are stored in addition to the raw blocks, so storage is only reduced when raw data has a shorter retention than the downsampled data. Queries over short, recent ranges typically use raw data, while queries over long ranges of older data can use the 5-minute or 1-hour resolution for efficiency.
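As a rough picture of what aggregating into intervals means, the Python sketch below groups raw samples into fixed windows and keeps a few aggregates per window. The choice of aggregates (count, sum, min, max) and the function name are assumptions made for illustration; this does not reproduce the Thanos block format.

```python
from collections import defaultdict

def downsample(samples, window_seconds):
    """Aggregate (timestamp, value) samples into fixed-size windows.

    Returns one dict per window, ordered by window start time.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_seconds].append(value)
    return [
        {"start": start, "count": len(v), "sum": sum(v), "min": min(v), "max": max(v)}
        for start, v in sorted(buckets.items())
    ]

# Four raw samples collapsed into 5-minute (300 s) windows.
raw = [(0, 1.0), (60, 3.0), (240, 2.0), (300, 5.0)]
five_minute = downsample(raw, 300)
```

Keeping several aggregates per window, rather than a single averaged value, is what lets functions such as min, max, or average still be answered from the low-resolution data.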

Compaction

Compaction in Thanos refers to the process of consolidating smaller blocks of metric data into larger blocks. This serves multiple purposes:

Reducing the number of blocks: Smaller blocks are merged into larger ones to reduce the overall number of blocks stored. This reduces the overhead of managing many small files and improves query performance.
Data deduplication: When configured to do so, the compactor can also deduplicate data during compaction, removing redundant data points (typically from multiple Prometheus replicas).
Retention enforcement: The compactor applies the retention policy configured for each resolution, so raw and downsampled data can be kept for different periods and older blocks are deleted. This way, it manages the data lifecycle.

The compaction process in Thanos is managed by the thanos compact component, which runs periodically to identify blocks that need compaction and performs the necessary merging and downsampling.
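The block-merging part can be illustrated with a simplified Python sketch that groups blocks by the aligned time window they start in. The real compactor's planning has more rules than this; `plan_compaction` and the 8-hour target range are assumptions chosen only to show the idea.

```python
def plan_compaction(blocks, target_range):
    """Group (min_time, max_time) blocks into windows of `target_range` seconds.

    Returns the time range of the single larger block each group would become.
    """
    groups = {}
    for min_t, max_t in sorted(blocks):
        window = min_t - min_t % target_range  # aligned window the block starts in
        groups.setdefault(window, []).append((min_t, max_t))
    # Merging a group yields one block spanning from its first to its last member.
    return [(members[0][0], members[-1][1]) for _, members in sorted(groups.items())]

HOUR = 3600
# Eight consecutive 2-hour blocks covering 16 hours.
two_hour_blocks = [(h * HOUR, (h + 2) * HOUR) for h in range(0, 16, 2)]
compacted = plan_compaction(two_hour_blocks, 8 * HOUR)
```

Eight small blocks become two larger ones, so a query over the same 16 hours has to open fewer blocks.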

How These Processes Work Together

Initial Data Ingestion: Prometheus instances collect and store high-resolution metric data in blocks, which are then uploaded to object storage (e.g., S3, GCS) by Thanos sidecars.
Compaction: The thanos compact component periodically runs to merge smaller blocks into larger ones and perform downsampling as configured.
Querying Data: When a query is made, Thanos Querier can select the appropriate resolution (raw, 5-minute, or 1-hour) based on the time range and resolution the query requires. Queries over recent data typically use raw or 5-minute resolution, while queries over older data can use 1-hour resolution to improve performance.

By combining downsampling and compaction, Thanos ensures that metric data is stored efficiently and remains accessible for both short-term and long-term analysis, providing a scalable solution for monitoring large-scale systems.

Regarding Compactor

The compactor performs the following steps (in order)

Compaction
Downsampling
Retention

So if compaction hasn't completed, downsampling won't occur, and vice versa.

Troubleshooting the compactor (if there's a bottleneck, e.g. slow queries)

https://thanos.io/tip/operating/compactor-backlog.md/

Important thing to remember

Calculate Size for PVC & Obj Storage in Thanos

To calculate the PVC size for thanos compactor

We have the following formula (from Cortex)

min_disk_space_required = compactor.compaction-concurrency * max_compaction_range_blocks_size * 2

(Search in cortex docs regarding this → #compactor-disk-utilization)
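The formula above is easy to turn into a small calculation. The Python sketch below is only a worked example; the concurrency of 1 and the 50 GiB block size are made-up numbers, not recommendations.

```python
def min_disk_space_required(compaction_concurrency, max_compaction_range_blocks_size):
    """Apply the Cortex compactor disk formula; result is in the unit of the size."""
    return compaction_concurrency * max_compaction_range_blocks_size * 2

# Hypothetical values: 1 concurrent compaction, 50 GiB of source blocks
# in the largest compaction range.
pvc_size_gib = min_disk_space_required(1, 50)
```

The factor of 2 covers holding the source blocks and the newly compacted block on disk at the same time.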

For object storage, we can start from the biggest block in the Thanos bucket (a 14-day block by default in Thanos)

Then add up the data sizes of that block for all the resolutions (raw, 5m & 1h)

Then divide by 14 to get 1 day of data

Then multiply by 30 to get 1 month of data (depends on the situation)

The result also depends on the retention period configured for each resolution.
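The estimate can be written out as a short Python sketch. All sizes and retention periods below are invented numbers for illustration, and both function names are hypothetical.

```python
def estimate_monthly_storage(block_sizes, block_days=14, days=30):
    """Sum the biggest block's size over all resolutions, scale to `days`."""
    per_day = sum(block_sizes.values()) / block_days
    return per_day * days

def estimate_storage_with_retention(block_sizes, retention_days, block_days=14):
    """Same idea, but each resolution is kept for its own retention period."""
    return sum(
        block_sizes[res] / block_days * retention_days[res] for res in block_sizes
    )

# Hypothetical sizes (GiB) of the biggest 14-day block per resolution.
sizes = {"raw": 70, "5m": 14, "1h": 7}
monthly_gib = estimate_monthly_storage(sizes)

# Hypothetical retention: raw kept 14 days, 5m for 28 days, 1h for 56 days.
retention = {"raw": 14, "5m": 28, "1h": 56}
total_gib = estimate_storage_with_retention(sizes, retention)
```

The second function shows why the retention settings matter: a small 1-hour resolution kept for a long time can cost as much as a larger resolution kept for a short time.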