Etcd Cluster Restore
Many users have run into a situation where only one of the three etcd pods still works: the other two keep restarting and never come back. In our experience this comes down to one thing: the cluster is broken (it has lost quorum), so only one pod can keep working.
Let's bring the cluster back to a working state.
If only one pod is failing, you can simply delete that pod's PVC and then delete the pod itself; it should come back working and rejoin the cluster.
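As a minimal sketch, assuming the default Bitnami PVC naming (`data-<pod-name>`), the `openobserve` namespace, and that the failing pod is `openobserve-etcd-1` (adjust to your setup):

```shell
# Remove the failing pod's volume claim, then the pod itself.
# The StatefulSet recreates the pod with a fresh volume and it rejoins the cluster.
kubectl -n openobserve delete pvc data-openobserve-etcd-1
kubectl -n openobserve delete pod openobserve-etcd-1
```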
1. Start the etcd cluster with bash
Because etcd can't start normally, we need to start the etcd pods with bash instead. Change the `values.yaml` of the OpenObserve Helm chart and add these lines under the `etcd` section:
```yaml
command: ["/bin/bash", "-c", "while true; do sleep 1; done"]
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false
```
After the change, the etcd section of the config should look like this:
```yaml
etcd:
  enabled: true # if true then etcd will be deployed as part of openobserve
  externalUrl: "my_custom_host.com:2379" # if bundled is false then this is required
  replicaCount: 3 # if bundled is true then this is required. should be odd number
  auth:
    token:
      enabled: false
    rbac:
      create: false
      allowNoneAuthentication: true
      rootPassword: ""
  logLevel: "info"
  extraEnvVars:
    - name: ETCD_AUTO_COMPACTION_RETENTION
      value: "1"
  command: ["/bin/bash", "-c", "while true; do sleep 1; done"]
  livenessProbe:
    enabled: false
  readinessProbe:
    enabled: false
```
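Then apply the change. A minimal sketch, assuming the Helm release is named `openobserve`, lives in the `openobserve` namespace, and uses the `openobserve/openobserve` chart (adjust to your setup):

```shell
# Re-deploy with the bash-loop command and disabled probes,
# so the etcd pods stay Running without actually starting etcd.
helm -n openobserve upgrade openobserve openobserve/openobserve -f values.yaml
```

After the upgrade, all three etcd pods should be Running (but idle), so you can exec into them.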
2. Start the pod that still works
Normally at least one pod can still work. We should start that pod first, because it holds the latest data.
Use any tool, or directly run `kubectl -n openobserve exec -it openobserve-etcd-N -- bash`, to log into the pod.
Run `env | grep ETCD_INITIAL_CLUSTER` to check the current initial-cluster configuration:
```shell
I have no name!@openobserve-etcd-2:/opt/bitnami/etcd$ env | grep ETCD_INITIAL_CLUSTER
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-k8s
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_CLUSTER=openobserve-etcd-0=http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380,openobserve-etcd-1=http://openobserve-etcd-1.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380,openobserve-etcd-2=http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380
```
The result tells us that on startup this etcd expects all 3 nodes, but we know the other two can't join, so we change the environment so that only the current node is used to start. From the shell prompt we can see that the current pod is etcd-2, so we change the variable like this (note: the pod that still works for you may be a different one, e.g. etcd-1; check first):
```shell
export ETCD_INITIAL_CLUSTER=openobserve-etcd-2=http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380
```
Then start the etcd of this node: yes, just type `etcd` and press Enter. etcd will start and keep running in the foreground of this shell.
Notice: don't close the shell; closing it will shut down the etcd server.
After a few seconds you should see that etcd is working.
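If you want to confirm it, a quick check from a second shell into the same pod (`etcdctl` talks to `127.0.0.1:2379` by default):

```shell
# Both commands should report a healthy single-member cluster.
etcdctl endpoint health
etcdctl endpoint status -w table
```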
Now let's restore the other nodes.
Before switching to another pod, we need to add a new member to this cluster. Create a new shell into this pod (in our case etcd-2), since the first shell is occupied by the running etcd.
Check the current members of the cluster; it should list only itself:
```shell
I have no name!@openobserve-etcd-2:/opt/bitnami/etcd$ etcdctl member list
3f59fc06477e49f8, started, openobserve-etcd-2, http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380, http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2379,http://openobserve-etcd.ziox-dev.svc.cluster.local:2379, false
```
Let's add a new member to this cluster:

```shell
I have no name!@openobserve-etcd-2:/opt/bitnami/etcd$ etcdctl member add openobserve-etcd-0 --peer-urls=http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380
```
We use the peer URL from the `ETCD_INITIAL_CLUSTER` environment variable we saw earlier. Here we add the member openobserve-etcd-0; of course you can choose etcd-1 instead, it doesn't matter, but whichever node you add is the one you must restore next.
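`etcdctl member add` also prints the settings the new member should start with; the output should look roughly like this (the IDs will differ in your cluster):

```shell
Member 51daebb86180a114 added to cluster ...

ETCD_NAME="openobserve-etcd-0"
ETCD_INITIAL_CLUSTER="openobserve-etcd-0=http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380,openobserve-etcd-2=http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
```

These are exactly the values we will export in the next step.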
3. Start the other pods that can't work
Earlier we added etcd-0 to the cluster, so we need to restore it first. Log into that pod.
First, delete the old data of this node (see the sketch below).
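A minimal sketch, assuming the default Bitnami etcd layout, where the data directory is `/bitnami/etcd/data`:

```shell
# Wipe the stale data so this node can join the cluster as a fresh member.
rm -rf /bitnami/etcd/data
```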
Second, change the environment for the cluster:
```shell
export ETCD_INITIAL_CLUSTER=openobserve-etcd-0=http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380,openobserve-etcd-2=http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380
export ETCD_INITIAL_CLUSTER_STATE=existing
```
Please notice that it now includes 2 pods: etcd-0 and etcd-2.
Then start etcd: as before, just type `etcd` and press Enter. etcd will start, and keep this shell open.
Notice: don't close the shell; closing it will shut down the etcd server.
After a few seconds you should see that etcd is working.
You can create a new shell, log into the pod, and check:
```shell
I have no name!@openobserve-etcd-2:/opt/bitnami/etcd$ etcdctl member list
3f59fc06477e49f8, started, openobserve-etcd-2, http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380, http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2379,http://openobserve-etcd.ziox-dev.svc.cluster.local:2379, false
51daebb86180a114, started, openobserve-etcd-0, http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380, http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2379,http://openobserve-etcd.ziox-dev.svc.cluster.local:2379, false
```
It should list 2 members.
Let's add the last member, repeating the same steps.
First, add etcd-1 to the cluster:
```shell
I have no name!@openobserve-etcd-2:/opt/bitnami/etcd$ etcdctl member add openobserve-etcd-1 --peer-urls=http://openobserve-etcd-1.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380
```
Second, log into the pod etcd-1, delete the old data, set the environment the same way (this time `ETCD_INITIAL_CLUSTER` should include all three pods), and start etcd; the whole sequence is sketched below.
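A minimal sketch of the commands to run inside the etcd-1 pod, assuming the same Bitnami data path as above:

```shell
# 1. Wipe the stale data.
rm -rf /bitnami/etcd/data

# 2. Point the initial cluster at all three members and join as "existing".
export ETCD_INITIAL_CLUSTER=openobserve-etcd-0=http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380,openobserve-etcd-1=http://openobserve-etcd-1.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380,openobserve-etcd-2=http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380
export ETCD_INITIAL_CLUSTER_STATE=existing

# 3. Start etcd in the foreground; keep this shell open.
etcd
```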
Finally, `etcdctl member list` should show 3 members:
```shell
I have no name!@openobserve-etcd-2:/opt/bitnami/etcd$ etcdctl member list
3f59fc06477e49f8, started, openobserve-etcd-2, http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380, http://openobserve-etcd-2.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2379,http://openobserve-etcd.ziox-dev.svc.cluster.local:2379, false
51daebb86180a114, started, openobserve-etcd-0, http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380, http://openobserve-etcd-0.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2379,http://openobserve-etcd.ziox-dev.svc.cluster.local:2379, false
8f96db53130680d4, started, openobserve-etcd-1, http://openobserve-etcd-1.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2380, http://openobserve-etcd-1.openobserve-etcd-headless.ziox-dev.svc.cluster.local:2379,http://openobserve-etcd.ziox-dev.svc.cluster.local:2379, false
```
Okay, the cluster is restored.
Now we need to restore the Helm configuration:
```yaml
etcd:
  enabled: true # if true then etcd will be deployed as part of openobserve
  externalUrl: "my_custom_host.com:2379" # if bundled is false then this is required
  replicaCount: 3 # if bundled is true then this is required. should be odd number
  auth:
    token:
      enabled: false
    rbac:
      create: false
      allowNoneAuthentication: true
      rootPassword: ""
  logLevel: "info"
  extraEnvVars:
    - name: ETCD_AUTO_COMPACTION_RETENTION
      value: "1"
  # command: ["/bin/bash", "-c", "while true; do sleep 1; done"]
  # livenessProbe:
  #   enabled: false
  # readinessProbe:
  #   enabled: false
```
Comment out the command and probe overrides we added at the start, then deploy to Kubernetes again. You will see the etcd pods recreated one by one, and each pod needs about 2 minutes to become healthy. After about 5 minutes everything should work.
You may also need to recreate the OpenObserve pods, because etcd was recreated and some pods may not be able to reconnect to it for a while.
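One way to recreate them, assuming the OpenObserve workloads run as Deployments and StatefulSets in the `openobserve` namespace (adjust to your release):

```shell
# Roll every OpenObserve workload so it picks up the fresh etcd cluster.
# Note: this also rolls the etcd StatefulSet, which is harmless right after
# the Helm upgrade has already recreated the etcd pods.
kubectl -n openobserve rollout restart deployment
kubectl -n openobserve rollout restart statefulset
```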