Troubleshooting Pod startup failures

What should I do when a Pod is stuck in CrashLoopBackOff, Pending, or ImagePullBackOff?

Solution

Recommended systematic troubleshooting workflow

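# 1. Check the Pod status
kubectl get pod <pod-name> -n <namespace> -o wide

# 2. Inspect the Pod details (pay particular attention to the Events section)
kubectl describe pod <pod-name> -n <namespace>

# 3. View the container logs
kubectl logs <pod-name> -n <namespace>
# If the container has restarted, view the logs of the previous instance
kubectl logs <pod-name> -n <namespace> --previous

# 4. View all events in the namespace, sorted by time
kubectl get events -n <namespace> --sort-by=.lastTimestamp

# 5. Check resource quotas
kubectl describe resourcequota -n <namespace>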

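Working through status → Pod events → logs → namespace events in that order pinpoints most problems quickly. The Events section of the kubectl describe output usually contains the most important error message. For Pending and ImagePullBackOff the container has not started yet, so there are usually no logs and the events are the main source of information.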
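
The five steps above can be wrapped in a small shell function so that they always run in the same order. This is only a sketch: triage_pod is a hypothetical helper name, and the default namespace and the --tail=50 limit are assumptions rather than part of the original workflow.

```shell
#!/bin/sh
# triage_pod POD [NAMESPACE]
# Hypothetical helper: runs the five checks in order and prints a header
# before each one. Namespace defaults to "default" (an assumption).
triage_pod() {
  pod="$1"
  ns="${2:-default}"
  if [ -z "$pod" ]; then
    echo "usage: triage_pod <pod-name> [namespace]" >&2
    return 2
  fi

  echo "== 1. status =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== 2. describe (see Events at the end) =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== 3. logs (current, then previous instance) =="
  kubectl logs "$pod" -n "$ns" --tail=50
  # --previous fails if the container has never restarted; keep going.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true

  echo "== 4. namespace events =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp

  echo "== 5. resource quota =="
  kubectl describe resourcequota -n "$ns"
}
```

After sourcing the file, call it as: triage_pod <pod-name> <namespace>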

Applies when: a Pod fails to start or restarts repeatedly.

For ImagePullBackOff

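# Check that the image name and tag are correct
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
# The pull error itself appears under Events
kubectl describe pod <pod-name> | grep -A10 'Events'

# Check the imagePullSecrets configuration
kubectl get pod <pod-name> -o jsonpath='{.spec.imagePullSecrets}'

# Create an image pull credential (in the same namespace as the Pod)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass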

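ImagePullBackOff is usually caused by a wrong image name or tag, an unreachable registry, or missing credentials. A newly created Secret only takes effect once the Pod (or its ServiceAccount) references it in imagePullSecrets.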
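
To illustrate how the three causes can be told apart, the sketch below maps a pull-failure event message to one of them. The function name classify_pull_error and the matched substrings are assumptions: the exact wording depends on the container runtime and the registry, so treat the patterns as starting points and not as a complete list.

```shell
#!/bin/sh
# classify_pull_error MESSAGE
# Hypothetical helper: guesses the likely cause of an image pull failure
# from the event message text. The substrings are illustrative only.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*|*"not found"*)
      echo "image-name" ;;            # wrong image name or tag
    *unauthorized*|*"authentication required"*|*"access denied"*)
      echo "credentials" ;;           # missing or invalid imagePullSecrets
    *"no such host"*|*"i/o timeout"*|*"connection refused"*)
      echo "registry-unreachable" ;;  # DNS or network problem
    *)
      echo "unknown" ;;
  esac
}
```

Copy the failing message from the Events section of kubectl describe pod and pass it as a single quoted argument.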

Applies when: the Pod events show that the image pull failed.

For the Pending state

# Find out why scheduling failed
kubectl describe pod <pod-name> | grep -A10 'Events'

# Check node resources (kubectl top requires metrics-server)
kubectl top nodes
kubectl describe nodes | grep -A5 'Allocated resources'

# Check node taints
kubectl describe nodes | grep Taints

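A Pod usually stays Pending because of insufficient resources (CPU or memory), node taints that it does not tolerate, or a PVC that cannot be bound.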
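
The commands above cover resources and taints but not the PVC case. A minimal sketch for that check follows; unbound_pvcs is a hypothetical helper name, and it assumes the default column layout of kubectl get pvc, where the second column is STATUS.

```shell
#!/bin/sh
# unbound_pvcs [NAMESPACE]
# Hypothetical helper: lists PersistentVolumeClaims whose status is not
# "Bound". Assumes the default column order NAME STATUS ... of
# `kubectl get pvc`. Namespace defaults to "default" (an assumption).
unbound_pvcs() {
  ns="${1:-default}"
  kubectl get pvc -n "$ns" --no-headers 2>/dev/null |
    awk '$2 != "Bound" { print $1 " (" $2 ")" }'
}
```

An empty result means every claim in the namespace is bound, so the cause is more likely resources or taints. For any claim that is listed, kubectl describe pvc <pvc-name> shows why binding failed.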

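Applies when: a Pod stays Pending and cannot be scheduled.

Notes

--previous shows only the logs of the immediately preceding container instance; anything older requires an external logging system.

When resources are insufficient, do not scale out blindly; first check whether something is leaking resources.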
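
One way to start looking for a leak is to list the Pods that currently use the most memory. The sketch below assumes metrics-server is installed (kubectl top depends on it); top_memory_pods is a hypothetical helper name and the default of five rows is arbitrary. A single snapshot cannot prove a leak, so compare several runs over time.

```shell
#!/bin/sh
# top_memory_pods [NAMESPACE] [COUNT]
# Hypothetical helper: prints the COUNT Pods with the highest current
# memory usage in NAMESPACE. Requires metrics-server in the cluster.
top_memory_pods() {
  ns="${1:-default}"
  count="${2:-5}"
  kubectl top pods -n "$ns" --sort-by=memory --no-headers | head -n "$count"
}
```

A Pod whose usage keeps growing between runs while its load stays flat is a candidate for closer inspection before any nodes are added.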