Pitfalls

Error when adding the Loki data source in Grafana

After clicking Save & test, Grafana reports: Unable to connect with Loki. Please check the server logs for more details.

Solution:

  • Under HTTP headers (Pass along additional context and metadata about the request/response), click Add header
    • Header: X-Scope-OrgID
    • Value: any tenant ID (for example anonymous), then click Save & test

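Why is X-Scope-OrgID required?

  • Loki runs in **multi-tenancy** mode whenever auth_enabled is true, which is the default. Even with a single Loki instance, every request must then carry a tenant ID, which Loki uses to keep the logs of different users, projects, or namespaces apart.
    • The tenant ID is passed in the X-Scope-OrgID HTTP header.
    • Without the header, Loki rejects the request with 401 Unauthorized.
    • With the header, Loki accepts the request and treats the value as the tenant. Any value passes the connection test, but logs are stored and queried per tenant, so Grafana must use the same tenant ID that the log shipper writes with; otherwise queries come back empty.
    • Alternatively, setting auth_enabled: false puts Loki in single-tenant mode, where the header is not needed.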
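
To check the header outside Grafana, the same request can be built by hand. This is a minimal sketch using only the Python standard library; the gateway URL and the tenant value anonymous are assumptions for illustration, and /loki/api/v1/labels is used simply as a cheap read endpoint.

```python
import urllib.request

# Assumed values for illustration; replace with your own gateway address and tenant.
LOKI_URL = "http://loki-gateway.monitoring.svc.cluster.local"
TENANT_ID = "anonymous"


def build_labels_request(base_url: str, tenant_id: str) -> urllib.request.Request:
    """Build a GET request for Loki's label listing endpoint with the tenant header set."""
    req = urllib.request.Request(f"{base_url.rstrip('/')}/loki/api/v1/labels")
    # urllib normalises the header name to "X-scope-orgid"; HTTP header names
    # are case-insensitive, so Loki still reads it as X-Scope-OrgID.
    req.add_header("X-Scope-OrgID", tenant_id)
    return req


def fetch_labels(req: urllib.request.Request) -> bytes:
    """Send the request; raises urllib.error.HTTPError on a 401 or other error status."""
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read()


labels_request = build_labels_request(LOKI_URL, TENANT_ID)
print(labels_request.full_url, labels_request.header_items())
```

Calling fetch_labels(labels_request) from a pod inside the cluster sends the request. If the 401 comes from the missing tenant header, the same call without add_header should fail while this one succeeds.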

References:

  • Loki documentation, Multi-tenancy: https://grafana.com/docs/loki/latest/operations/multi-tenancy/
  • Loki documentation, auth_enabled configuration: https://grafana.com/docs/loki/latest/configuration/#auth_enabled

Error when updating the Vector configuration through Helm

Configuration file and error output:

# cat values-vector-0.35.3.yaml
...
customConfig:
...
  sinks:
    loki_push:
        type: loki
        inputs:
          - k8s_logs
        endpoint: http://loki-write.monitoring.svc.cluster.local:3100
        encoding:
          codec: json
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          container: "{{ kubernetes.container_name }}"
          node: "{{ kubernetes.pod_node_name }}"
          stream: "{{ kubernetes.container_name }}"
...

# helm upgrade --install vector ./vector-0.35.3.tgz --namespace monitoring --create-namespace -f values-vector-0.35.3.yaml
Error: UPGRADE FAILED: template: vector/templates/daemonset.yaml:30:28: executing "vector/templates/daemonset.yaml" at <include (print $.Template.BasePath "/configmap.yaml") .>: error calling include: template: vector/templates/configmap.yaml:11:3: executing "vector/templates/configmap.yaml" at <tpl (toYaml .Values.customConfig) .>: error calling tpl: error during tpl function execution for "api:\n address: 127.0.0.1:8686\n enabled: true\n playground: false\ndata_dir: /vector-data-dir\nsinks:\n loki_push:\n encoding:\n codec: json\n endpoint: http://loki-write.monitoring.svc.cluster.local:3100\n healthcheck:\n enabled: true\n inputs:\n - k8s_logs\n labels:\n container: '{{ kubernetes.container_name }}'\n namespace: '{{ kubernetes.pod_namespace }}'\n node: '{{ kubernetes.pod_node_name }}'\n pod: '{{ kubernetes.pod_name }}'\n stream: '{{ kubernetes.container_name }}'\n out_of_order_action: drop\n remove_label_fields: true\n remove_timestamp: true\n request:\n buffer:\n max_size: 1073741824\n type: disk\n when_full: block\n eviction_policy:\n strategy: drop_newest\n type: loki\nsources:\n k8s_logs:\n auto_partial_merge: true\n delay_deletion_ms: 300000\n exclude_paths_glob_patterns:\n - '**/*.gz'\n - '**/*.tmp'\n extra_label_selector: vector.dev/exclude!=true\n extra_namespace_label_selector: vector.dev/exclude!=true,kubernetes.io/metadata.name!=kube-system\n glob_minimum_cooldown_ms: 30000\n ignore_older_secs: 3600\n include_paths_glob_patterns:\n - /var/log/pods/*/*/*.log\n ingestion_timestamp_field: .ingest_timestamp\n max_line_bytes: 1048576\n oldest_first: false\n pod_annotation_fields:\n container_image: .kubernetes.container_image\n container_name: .kubernetes.container_name\n pod_labels: .kubernetes.pod_labels\n pod_name: .kubernetes.pod_name\n pod_namespace: .kubernetes.pod_namespace\n pod_node_name: .kubernetes.pod_node_name\n read_from: end\n type: kubernetes_logs\n use_apiserver_cache: true": parse error at (vector/templates/daemonset.yaml:16): function 
"kubernetes" not defined
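  • Cause: values-vector-0.35.3.yaml contains literal "{{ kubernetes.pod_name }}" entries. When rendering customConfig, the chart passes the whole YAML string through Helm's tpl function (see <tpl (toYaml .Values.customConfig) .> in the error), so {{ kubernetes.xxx }} is parsed as a Go template function call. Helm has no function named kubernetes, so rendering fails.

Solution: escape the Vector template variables

  • Rewrite every {{ xxxx }} that Vector needs in customConfig in a form that Helm does not evaluate, but that renders to the syntax Vector recognizes.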
...
sinks:
    loki_push:
      type: loki
      inputs:
        - k8s_logs
      endpoint: http://loki-write.monitoring.svc.cluster.local:3100
      encoding:
        codec: json
      labels:
        # Wrap each Vector template as {{ "{{ xxxx }}" }} (a Go template string literal).
        # Helm renders it to {{ kubernetes.pod_name }}, which is the syntax Vector expects.
        namespace:  '{{ "{{ kubernetes.pod_namespace }}" }}'
        pod:        '{{ "{{ kubernetes.pod_name }}" }}'
        container:  '{{ "{{ kubernetes.container_name }}" }}'
        node:       '{{ "{{ kubernetes.pod_node_name }}" }}'
...
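
The escaping rule is mechanical, so it can be sketched as a small helper. The following is an illustration only: helm_escape and simulate_helm_render are hypothetical names, and simulate_helm_render is a toy stand-in that handles just the string-literal case, not a real Go template engine.

```python
import re

# Matches a plain Vector template such as {{ kubernetes.pod_name }}.
VECTOR_TEMPLATE = re.compile(r'\{\{\s*([^{}"]+?)\s*\}\}')

# Matches a Go template action that contains only a string literal: {{ "..." }}.
STRING_LITERAL_ACTION = re.compile(r'\{\{\s*"([^"]*)"\s*\}\}')


def helm_escape(value: str) -> str:
    """Wrap each Vector template in a string literal so Helm's tpl emits it verbatim.

    Apply this to unescaped values only; running it twice escapes twice.
    """
    return VECTOR_TEMPLATE.sub(lambda m: '{{ "{{ ' + m.group(1) + ' }}" }}', value)


def simulate_helm_render(value: str) -> str:
    """Toy model of what Helm outputs for {{ "..." }}: the literal's contents."""
    return STRING_LITERAL_ACTION.sub(lambda m: m.group(1), value)


escaped = helm_escape("{{ kubernetes.pod_name }}")
print(escaped)
print(simulate_helm_render(escaped))
```

The first line printed is what goes into the values file, and the second is what Vector receives after Helm has rendered the chart.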

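Vector gets 401 when writing logs to loki-write

  • When Loki is deployed as separate components (loki-read and loki-write here, which Loki's documentation calls simple scalable mode), Vector should write logs through loki-gateway instead of sending them directly to loki-write. The 401 is consistent with the missing tenant ID described above: writes need a tenant ID just like reads, and the Vector loki sink can set one through its tenant_id option.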
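
Before pointing Vector at the gateway, a hand-built push can confirm that the write path accepts a tenant. This is a rough sketch under assumptions: the gateway URL, tenant, and labels are placeholders, and the payload follows the JSON format of Loki's /loki/api/v1/push endpoint (a list of streams, each with a label set and [timestamp in nanoseconds as a string, line] pairs).

```python
import json
import time
import urllib.request


def build_push_request(base_url, tenant_id, labels, line, ts_ns=None):
    """Build a POST to Loki's push endpoint carrying one log line for one tenant."""
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    payload = {
        "streams": [
            # Loki expects the timestamp as a string of nanoseconds since the epoch.
            {"stream": labels, "values": [[str(ts_ns), line]]}
        ]
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Scope-OrgID": tenant_id},
        method="POST",
    )


# Assumed values for illustration only.
push_request = build_push_request(
    "http://loki-gateway.monitoring.svc.cluster.local",
    "anonymous",
    {"namespace": "monitoring", "pod": "smoke-test"},
    "hello from a manual push",
)
print(push_request.get_method(), push_request.full_url)
```

Sending it with urllib.request.urlopen(push_request, timeout=5) from inside the cluster should succeed once the gateway route and tenant are right; a 401 at this point suggests the tenant header is still not reaching Loki.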

loki-gateway cannot resolve service hostnames

Key loki-gateway log excerpt (could not be resolved):

[error] 13#13: *91364 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-4ffch%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 13#13: *91370 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: : , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-q8vwr%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 13#13: *91377 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: : , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-zgc7l%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 11#11: *89581 loki-write.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "POST /loki/api/v1/push HTTP/1.1", host: "loki-gateway.monitoring.svc.cluster.local"
[error] 13#13: *89639 loki-write.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "POST /loki/api/v1/push HTTP/1.1", host: "loki-gateway.monitoring.svc.cluster.local"
[error] 12#12: *91425 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-4ffch%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 12#12: *91430 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: : , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-q8vwr%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 10#10: *89536 loki-write.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "POST /loki/api/v1/push HTTP/1.1", host: "loki-gateway.monitoring.svc.cluster.local"
[error] 9#9: *91438 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: 10. request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-zgc7l%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 13#13: *89639 loki-write.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "POST /loki/api/v1/push HTTP/1.1", host: "loki-gateway.monitoring.svc.cluster.local"
[error] 11#11: *89581 loki-write.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "POST /loki/api/v1/push HTTP/1.1", host: "loki-gateway.monitoring.svc.cluster.local"
[error] 10#10: *91489 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-4ffch%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 10#10: *91492 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: : , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-q8vwr%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 10#10: *91497 loki-read.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: : , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-zgc7l%22%7D+ HTTP/1.1", host: ring.svc.cluster.local.:80"
[error] 10#10: *89536 loki-write.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: r: , request: "POST /loki/api/v1/push HTTP/1.1", host: "loki-gateway.monitoring.svc.cluster.local"
[error] 13#13: *89639 loki-write.monitoring.svc.cluster.local could not be resolved (110: Operation timed out), client: 10.233.89.150, server: , request: "POST /loki/api/v1/push HTTP/1.1", host: "loki-gateway.monitoring.svc.cluster.local"