Saturday, June 28, 2025

MiPad 4 custom ROM installation guide

Steps (assuming the bootloader is already unlocked)

  1. Prepare the materials listed in the "Materials" section

  2. Navigate to the References section and use the 3rd link to download the ADB / fastboot drivers. Make sure the necessary executables are added to the system environment variables (PATH), then test the setup by typing the adb command in a CLI.

  3. Boot the MiPad 4 into fastboot mode by holding Volume Down together with the power button, then connect the device to the Windows host

  4. If this is the first time the Windows host has ever connected to an Android device in debug or ADB flash mode, install the driver from the "android-bootloader-interface-driver" folder

  5. You may see "Other devices > Android" with an exclamation mark beside it in Device Manager. Right-click the item, choose "Update driver > Browse my computer for drivers > Let me pick from a list of available drivers on my computer > Show all devices", then browse to the driver folder from step 4 and choose the "android_winusb.inf" file to install the driver

  6. Choose "Android Bootloader Interface" and install it.

  7. Run "usb-conn-fix-run-as-admin.bat" as administrator.

  8. Reconnect the MiPad 4 in fastboot mode.

  9. Flash "recovery-20241226-clover.img" with "fastboot flash recovery recovery-20241226-clover.img"; make sure to run the command from the directory where the image is located, and wait for completion.

  10. Shut down the MiPad 4, then hold Volume Up together with the power button to enter the recovery flashed in step 9

  11. Format data first, then flash the LineageOS ROM "lineage-22.2-20250616-UNOFFICIAL-clover.zip" via adb sideload; make sure ADB sideload mode is enabled in the recovery interface first. Back on the Windows host, run "adb sideload lineage-22.2-20250616-UNOFFICIAL-clover.zip" and wait for completion. (The full command sequence is consolidated in the sketch after this list.)

  12. Do not reboot to system yet; run "adb sideload MindTheGapps-15.0.0-arm64-20250214_082511.zip" twice

  13. Reboot, and the device should boot into LineageOS
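
A hedged, consolidated sketch of the flashing commands from steps 9-12, assuming the downloaded files all sit in the current directory (file names are the ones listed under Materials; adjust them if your builds differ):

fastboot flash recovery recovery-20241226-clover.img
# reboot into the freshly flashed recovery (Volume Up + power), format data,
# enable ADB sideload in the recovery, then from the Windows host:
adb sideload lineage-22.2-20250616-UNOFFICIAL-clover.zip
adb sideload MindTheGapps-15.0.0-arm64-20250214_082511.zip
adb sideload MindTheGapps-15.0.0-arm64-20250214_082511.zip   # run twice, as noted in step 12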

References

Materials (download)
  • Android tool set (ADB and fastboot tools and drivers)
  • Drivers (.inf) for connecting devices in fastboot mode
  • Fastboot recovery img
  • ROM (.zip)
  • GApps (.zip)
  • USB registry fix (.bat)

Monday, January 13, 2025

8BitDo SN30 Pro controller connectivity issues in Windows 11

 

Problems

The controller, which used to work, suddenly stopped working even though it connects successfully.


Solutions

Flip the controller over to its back side, turn it off by long-pressing the START button, change the mode to "Xinput", then turn the controller back on.

Tuesday, June 25, 2024

Uninstall failure on pipenv uninstall

Problem

Error "AttributeError: module 'enum' has no attribute 'IntFlag'" encountered when running unintall command of pipenv

Solutions

1. Run "pip uninstall enum34"
2. Re-run the pipenv uninstall command (see the sketch below)
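
The enum34 package is a backport of the Python 3.4 enum module; on Python 3 it shadows the standard library enum (which is what actually provides IntFlag), so removing the backport restores the real module. A minimal sketch of the sequence, using a hypothetical package name my-package:

pip uninstall enum34          # remove the backport that shadows the stdlib enum module
pipenv uninstall my-package   # re-run the original uninstall; it should now succeed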

Tuesday, August 1, 2023

Fix GitLab runner error after VM backup and restore process

Background

The GitLab runner service is unable to start after performing a VM backup and restore process (done to increase the storage size)


Problems

The error messages are as follows when I try to start the GitLab runner manually

prodev@storage:~$ sudo gitlab-runner run
Runtime platform                                    arch=amd64 os=linux pid=20037 revision=ac8e767a version=12.6.0
Starting multi-runner from /etc/gitlab-runner/config.toml...  builds=0
Running in system-mode.

Configuration loaded                                builds=0
listen_address not defined, metrics & debug endpoints disabled  builds=0
[session_server].listen_address not defined, session endpoints disabled  builds=0
ERROR: Checking for jobs... forbidden               runner=99ab6a69
ERROR: Checking for jobs... forbidden               runner=eac73045
ERROR: Checking for jobs... forbidden               runner=99ab6a69
ERROR: Checking for jobs... forbidden               runner=eac73045
ERROR: Checking for jobs... forbidden               runner=99ab6a69
ERROR: Runner https://repo.gitlab.dev.hkg.internal.p2mt.com99ab6a6972442b2a52ffe7358cbf8f is not healthy and will be disabled!
ERROR: Checking for jobs... forbidden               runner=eac73045
ERROR: Runner https://gitlab.com/eac73045875b618fea0464590a0fb2 is not healthy and will be disabled!


Solutions

1. Rename config.toml in /etc/gitlab-runner to config.toml.bk

2. Start the gitlab-runner service with "gitlab-runner start"

3. Rename config.toml.bk back to config.toml

4. Restart the service again with "gitlab-runner restart" (the full sequence is consolidated in the sketch below)
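
The four steps above, consolidated into a shell sketch (paths assume a default gitlab-runner installation):

sudo mv /etc/gitlab-runner/config.toml /etc/gitlab-runner/config.toml.bk
sudo gitlab-runner start
sudo mv /etc/gitlab-runner/config.toml.bk /etc/gitlab-runner/config.toml
sudo gitlab-runner restart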

Tuesday, May 9, 2023

Docker / docker compose cheatsheet

Background

Very often there are commands which are useful for debugging and troubleshooting Docker and Docker Compose; this post documents them for future use.

  1. Gather health check status information of a specific container
    docker inspect --format "{{json .State.Health }}" anm11_log_server_1 | jq
    The jq command needs to be installed beforehand. The following result is output (a status-only variant is sketched after it):

    {
      "Status": "healthy",
      "FailingStreak": 0,
      "Log": [
        {
          "Start": "2023-05-09T15:31:55.974151402+08:00",
          "End": "2023-05-09T15:31:56.180881836+08:00",
          "ExitCode": 0,
          "Output": ""
        },
        {
          "Start": "2023-05-09T15:32:56.185509821+08:00",
          "End": "2023-05-09T15:32:56.322503697+08:00",
          "ExitCode": 0,
          "Output": ""
        },
        {
          "Start": "2023-05-09T15:33:56.327354557+08:00",
          "End": "2023-05-09T15:33:56.506142901+08:00",
          "ExitCode": 0,
          "Output": ""
        },
        {
          "Start": "2023-05-09T15:37:16.863314985+08:00",
          "End": "2023-05-09T15:37:17.0210078+08:00",
          "ExitCode": 0,
          "Output": ""
        },
        {
          "Start": "2023-05-09T15:38:17.025502598+08:00",
          "End": "2023-05-09T15:38:17.155554148+08:00",
          "ExitCode": 0,
          "Output": ""
        }
      ]
    }
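
    A shorter variant (a sketch, using the same container name) prints only the health status string instead of the whole JSON object:

    docker inspect --format "{{ .State.Health.Status }}" anm11_log_server_1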
    

Tuesday, April 18, 2023

Learning Kubernetes

Deployment

Update of deployment related items

Sometimes there is a need to update the persistent volume (PV) and persistent volume claim (PVC), e.g. for a wrongly configured mount point. Since the PVC claims the PV, and the PV is used by the deployment, to completely remove a PV the PVC needs to be deleted first with "kubectl delete pvc <pvc_name>", and then the PV removed with "kubectl delete pv <pv_name>", followed by "kubectl patch pv <pv_name> -p '{"metadata": {"finalizers": null}}'" if the PV gets stuck terminating.

After deleting the PVC, a re-created PV may end up in the "Released" status because its claim reference still points to the old PVC's UID; we need to patch the PV with "kubectl patch pv <volume_name> --type json -p '[{"op": "remove", "path": "/spec/claimRef/uid"}]'". The PV can then re-bind with the new PVC.
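
A sketch of the full cleanup sequence, using the names from the redmine example below (PV nfs-redmine, PVC redmine-data-claim):

kubectl delete pvc redmine-data-claim
kubectl delete pv nfs-redmine
# if the PV is stuck in Terminating, clear its finalizers:
kubectl patch pv nfs-redmine -p '{"metadata": {"finalizers": null}}'
# if a re-created PV shows Released, drop the stale claimRef UID so a new PVC can bind:
kubectl patch pv nfs-redmine --type json -p '[{"op": "remove", "path": "/spec/claimRef/uid"}]'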

Deployment example (redmine)

Refer to the following deployment manifest (deployment-redmine.yaml)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redmine
  labels:
    app: redmine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redmine
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: redmine
    spec:
      containers:
        - name: redmine
          image: bitnami/redmine:4.1.1
          imagePullPolicy: IfNotPresent
          ports:
          - containerPort: 3000
            protocol: TCP
          env:
          - name: REDMINE_USERNAME
            value: admin
          - name: REDMINE_PASSWORD
            value: admin
          - name: REDMINE_DB_USERNAME
            value: root
          - name: REDMINE_DB_PASSWORD
            value: P2mobile8!886
          - name: REDMINE_DB_NAME
            value: bitnami_redmine
          - name: REDMINE_DB_MYSQL
            value: mariadb.redmine.svc.cluster.local
          lifecycle:
            postStart:
              exec:
                command: ["sh","-c","/usr/bin/curl http://storage.dev.hkg.internal.p2mt.com/storage/redmine/purple.tar --output purple.tar; tar -xvf purple.tar -C /opt/bitnami/redmine/public/themes/;\n"]
          volumeMounts:
          - mountPath: /bitnami
            name: nfs-redmine
      restartPolicy: Always
      volumes:
      - name: nfs-redmine
        persistentVolumeClaim:
          claimName: redmine-data-claim

Here are some points worth noting

Different types of "labels"

A. metadata.labels (with value app: redmine) - This label identifies the deployment itself and can be used to filter deployments based on it. E.g. when you run "kubectl get deploy --show-labels", the following is output:

NAME      READY   UP-TO-DATE   AVAILABLE   AGE   LABELS
redmine   1/1     1            1           20h   app=redmine

We can see the LABELS field showing "app=redmine", which is what was defined above. So when we want to filter a specific deployment, we can use this label with "kubectl get deployments -l app=redmine -o wide", which outputs:

NAME      READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                  SELECTOR
redmine   1/1     1            1           20h   redmine      bitnami/redmine:4.1.1   app=redmine

B. spec.selector.matchLabels (with value app: redmine) - This selector tells the ReplicaSet managed by the Deployment which pods it owns and how many instances of them to maintain; it must equal the labels specified in spec.template.metadata.labels

C. spec.template.metadata.labels (with value app: redmine) - The label of the pods created from the template; it must be the same as spec.selector.matchLabels

Spec and templates

The spec and template define the desired state of the deployment. Here are some fields worth noting.

1. deployment.spec: Defines the desired state of the pods within the deployment, e.g. how many pod replicas are required for the service containers and how each pod is formed (by defining 1 or more containers within a pod)

2. deployment.spec.selector.matchLabels: A required field which must match deployment.spec.template.metadata.labels; when specified, the deployment knows which pods under the hood are being targeted.

3. deployment.spec.template.metadata.labels: The labels applied to pods created from the template, which must match deployment.spec.selector.matchLabels

4. deployment.spec.template.spec.containers: The set of containers that the deployment's pods are made of; more than 1 container can be specified and, much like docker compose, the container name, image etc. can be designated. deployment.spec.template.spec.containers.[n].ports.containerPort specifies the port used (internally) by the container; it can then be exposed with an external port / IP if the container needs to be accessed externally.

deployment.spec.template.spec.containers.[n].lifecycle can be used to run commands at different points of the container lifecycle, e.g. one may want to execute a command right after the container is up successfully using .lifecycle.postStart.exec.command.

Volumes

In this redmine deployment example, we use NFS (Network File System) served from our company's Synology NAS as the persistent storage source. Here is how to create a volume.

1.  Create PV (Persistent Volume) and PVC (Persistent Volume Claim)

In K8S, we need to define the PV and PVC first before we can actually reference them in the deployment YAML. Here are the YAMLs of the PV and PVC.

Persistent Volume (PV), pv-redmine.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-redmine
spec:
  storageClassName: redmine-class
  capacity:
    storage: 200Gi 
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: /volume1/k8s-nfs/redmine
    server: 10.240.0.3

The PV defines the specifications of the volume; in this case we use NFS as the volume source, so we must ensure connectivity between the node and the NAS (10.240.0.3) before applying the manifest. We also need to create and initialize the NFS share on the NAS first (as described below).


nfs.path is the export path on the NAS; in our case we need to refer to the NAS's NFS mount point settings and fill in that mount point value. On a Synology NAS, this information can be retrieved through "Shared folder > NFS rights"; the mount path can be found at the bottom.


spec.storageClassName and spec.accessModes are worth noting: storageClassName is used to bind the PV with the PVC, so the corresponding storage class name needs to be the same in the PVC. The ReadWriteOnce access mode means the volume can only be mounted read-write by a single node.

Persistent Volume Claim (PVC), pvc-redmine.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redmine-data-claim
spec:
  storageClassName: redmine-class
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi

The PVC is created to claim the volume (PV); spec.storageClassName should match the one in the PV or the claim cannot bind to the PV.
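
A quick sketch to apply both manifests and verify the binding (file and resource names as above):

kubectl apply -f pv-redmine.yaml
kubectl apply -f pvc-redmine.yaml
kubectl get pv nfs-redmine            # STATUS should read Bound
kubectl get pvc redmine-data-claim    # should show it is bound to nfs-redmine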

Linking PVC to deployment

After creating the PV and PVC, deployment.spec.template.spec.volumes should be defined as part of the pod specification; deployment.spec.template.spec.volumes.[n].persistentVolumeClaim.claimName should match pvc.metadata.name so as to link the PV up with the deployment.

Furthermore, deployment.spec.template.spec.containers.[n].volumeMounts should also be configured; volumeMounts.[n].mountPath should be filled with the pod's mount path, e.g. /etc/config, /bitnami etc.

volumeMounts.[n].name should match the name of the corresponding entry in deployment.spec.template.spec.volumes (here nfs-redmine) so that the mountPath knows which volume it refers to.

Service

In k8s, a Service is needed to expose a network application (usually containers running in pods), e.g. a Redmine server.

Manifests

The following fields of the service manifest (redmine-service.yaml) are worth noting.

apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/address-pool: metallb-address-pool
    metallb.universe.tf/loadBalancerIPs: 10.230.0.12
  labels:
    app: redmine
  name: redmine
spec:
  ports:
  - name: "redmine"
    port: 80
    targetPort: 3000
  selector:
    app: redmine
  type: LoadBalancer
status:
  loadBalancer: {}

The "specs.selector.app" instructs the service to be applied to the pod (usually defined in deployment yaml) which also contains the same label selector. In this case, pod with label app:redmine will be selected and apply this service (with name "redmine")

Identification

Labels and annotations are used to identify resources as the cluster becomes more sophisticated. Labels typically carry values such as release, stable, etc., which are identifiable and can be selected by a selector.

Multiple labels can be attached to the same resource.

Annotations can carry informational data such as author, versioning, release contact, etc. These are arbitrary values intended for easy identification.

Troubleshooting

1. Reading logs

System wide

Sometimes logs are useful for tracking down where the problem is; use "kubectl get events --sort-by=.metadata.creationTimestamp" or "kubectl describe pods" to do so.

Deployments

"kubectl logs deployment/redmine" dumps the logs of the redmine container (assuming only 1 container in the deployment); use "kubectl logs deployment/redmine -c <containerName>" for multiple containers

Pods

For single pod logs, use "kubectl logs -l app=redmine" to dump the logs of pods carrying the designated label. A consolidated sketch of these log commands follows.
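
The log-reading commands above, gathered into one sketch (redmine names assumed):

kubectl get events --sort-by=.metadata.creationTimestamp
kubectl describe pods
kubectl logs deployment/redmine                 # single-container deployment
kubectl logs deployment/redmine -c redmine      # pick a container when there are several
kubectl logs -l app=redmine                     # pods selected by label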

2. Throttling error when trying to retrieve information

When receiving "Throttling request took..." while running "kubectl get all -n <ns>", run "sudo chown -R <user>:<group> .kube/" to solve the issue

3. The connection to the server <master>:6443 was refused - did you specify the right host or port? when running any kubectl command

kubectl is unable to reach the K8S API server port; troubleshoot following the steps below

1. Check KUBECONFIG status by 
env | grep -i kub

2. Check status of docker service by
systemctl status docker.service

3. Check kubelet service by 
systemctl status kubelet.service

● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: activating (auto-restart) (Result: exit-code) since Sun 2023-05-28 13:19:36 HKT; 1s ago
       Docs: https://kubernetes.io/docs/home/
    Process: 4048306 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
   Main PID: 4048306 (code=exited, status=1/FAILURE)

The kubelet is unable to initialize; check the system log further

4. Check system log by
journalctl -f -u kubelet
where -f means follow and -u specifies the particular service
-- Logs begin at Tue 2023-04-25 19:25:06 HKT. --
May 28 13:18:45 jacky-MS-7817-master systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
May 28 13:18:45 jacky-MS-7817-master systemd[1]: Started kubelet: The Kubernetes Node Agent.
May 28 13:18:45 jacky-MS-7817-master kubelet[4048052]: Flag --network-plugin has been deprecated, will be removed along with dockershim.
May 28 13:18:45 jacky-MS-7817-master kubelet[4048052]: Flag --network-plugin has been deprecated, will be removed along with dockershim.
May 28 13:18:45 jacky-MS-7817-master kubelet[4048052]: I0528 13:18:45.187814 4048052 server.go:440] "Kubelet version" kubeletVersion="v1.22.1"
May 28 13:18:45 jacky-MS-7817-master kubelet[4048052]: I0528 13:18:45.188249 4048052 server.go:868] "Client rotation is on, will bootstrap in background"
May 28 13:18:45 jacky-MS-7817-master kubelet[4048052]: E0528 13:18:45.189222 4048052 bootstrap.go:265] part of the existing bootstrap client certificate in /etc/kubernetes/kubelet.conf is expired: 2022-06-12 17:11:44 +0000 UTC
May 28 13:18:45 jacky-MS-7817-master kubelet[4048052]: E0528 13:18:45.189256 4048052 server.go:294] "Failed to run kubelet" err="failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory"
May 28 13:18:45 jacky-MS-7817-master systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
May 28 13:18:45 jacky-MS-7817-master systemd[1]: kubelet.service: Failed with result 'exit-code'.
May 28 13:18:55 jacky-MS-7817-master systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1647394.
The certificate is expired; renew the certificates and check the expiration dates again.
sudo kubeadm certs renew all
sudo kubeadm certs check-expiration
which outputs
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
MISSING! certificate the apiserver uses to access etcd
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed
and
CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 May 27, 2024 05:04 UTC   364d                                    no      
apiserver                  May 27, 2024 05:04 UTC   364d            ca                      no      
apiserver-etcd-client      May 27, 2024 05:06 UTC   364d            etcd-ca                 no      
apiserver-kubelet-client   May 27, 2024 05:04 UTC   364d            ca                      no      
controller-manager.conf    May 27, 2024 05:04 UTC   364d                                    no      
etcd-healthcheck-client    May 27, 2024 05:04 UTC   364d            etcd-ca                 no      
etcd-peer                  May 27, 2024 05:04 UTC   364d            etcd-ca                 no      
etcd-server                May 27, 2024 05:04 UTC   364d            etcd-ca                 no      
front-proxy-client         May 27, 2024 05:04 UTC   364d            front-proxy-ca          no      
scheduler.conf             May 27, 2024 05:04 UTC   364d                                    no
indicating the certificate renewal was successful
5. Check the system log again; the service still fails to initialize
May 28 13:07:32 jacky-MS-7817-master kubelet[4044890]: E0528 13:07:32.789268 4044890 server.go:294] "Failed to run kubelet" err="failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory"
May 28 13:07:32 jacky-MS-7817-master systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
May 28 13:07:32 jacky-MS-7817-master systemd[1]: kubelet.service: Failed with result 'exit-code'.

6. Re-initialize the k8s control plane node using the renewed certs
sudo cp -r /etc/kubernetes  /etc/kubernetes-bak  #backup k8s folders
sudo rm -rf $HOME/.kube  # Remove all configurations in .kube folder
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf  /root/.kube/config # Copy the admin config back to .kube directory
sudo rm -rf /etc/kubernetes/*.conf
sudo kubeadm init phase kubeconfig all  # redo the kubeconfig phase; this re-generates all the kubeconfig files
systemctl restart kubelet
The kubelet should be back in its running state and kubectl should run successfully; verify by running
systemctl status kubelet.service
kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Sun 2023-05-28 13:38:05 HKT; 11h ago
       Docs: https://kubernetes.io/docs/home/
   Main PID: 4052761 (kubelet)
      Tasks: 19 (limit: 18434)
     Memory: 149.8M
     CGroup: /system.slice/kubelet.service
             └─4052761 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.4.1

4. The connection to the server localhost:8080 was refused - did you specify the right host or port? when running any kubectl command

This is due to the absence of the admin configuration in the .kube directory; simply follow the official instructions to copy admin.conf to the .kube config directory

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

5. Cannot connect to Docker daemon, Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: no such file or directory" when running pipeline in k8s gitlab runner

6. How do I allow private repository in minikube

Step 1: Install the minikube addon for private registries and configure ECR

https://minikube.sigs.k8s.io/docs/handbook/registry/

Step 2: Configure the deployment YAML

https://minikube.sigs.k8s.io/docs/tutorials/configuring_creds_for_aws_ecr/

7. My minikube LoadBalancer service's external IP status is stuck in "PENDING" state

Step 1: Assign an external IP with "minikube service <service_name>"; a browser window will pop up with the accessible IP and external port

8. I made some mistakes in a ConfigMap / Secret, how can I update it in an already running deployment?

Basically, a more k8s-native way to update the manifests is to create a new YAML file with a -v* suffix to indicate the version; this allows easy rollback to the previous deployment.

e.g. configmap-runner-script-gitlab-runner-v3, copy all the contents of configmap-runner-script-gitlab-runner-v2 and start the modification.

Run 

kubectl apply -f ./configmap-runner-script-gitlab-runner-v3.yaml

to apply the ConfigMap once again. By default the applied changes are propagated to the pods automatically, but only when the ConfigMap itself is mounted as a volume; in our case we need to roll out the StatefulSet to reflect the ConfigMap changes with
kubectl rollout restart statefulset/gitlab-ci-runner -n gitlab

The same applies to Secrets and other related deployment resources. The update itself shouldn't cause any downtime to the service, but it may affect performance if insufficient resources are available during the rollout update process.
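
The whole versioned-ConfigMap update flow as a sketch (names from the example above):

cp configmap-runner-script-gitlab-runner-v2.yaml configmap-runner-script-gitlab-runner-v3.yaml
# edit the v3 file, then apply it and roll the StatefulSet:
kubectl apply -f ./configmap-runner-script-gitlab-runner-v3.yaml
kubectl rollout restart statefulset/gitlab-ci-runner -n gitlab
kubectl rollout status statefulset/gitlab-ci-runner -n gitlab   # wait until the rollout completes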

References

1. Different semantics of "label" in deployment yaml

2. Dash's meaning in YAML

3. Another description of YAML dash and indentation

4. Why do we need pods type when we have deployment type?

5. Kubernetes PV refuses to bind after delete/re-create

6. What is the difference between a Source NAT, Destination NAT and Masquerading?

7. Kubernetes NodePort vs LoadBalancer vs Ingress? When should I use what?

8. Connecting Applications with Services

9. kubectl doc - set selector

10. Kubelet cheatsheet

11. What is endpoint in k8s

12. Access IP outside the cluster

13. Labels in deployment spec and template

14. ReplicaSet

15. Ingress

16. (Metallb) Does kube-proxy not already do this?

17. Handling of "The connection to the server....:6443 was refused - did you specify the right host or port?"

18. How to manage certificates with kubeadm?

19. Renewing kubeadm certificates - v1.22.2

20. The connection to the server localhost:8080 was refused - did you specify the right host or port?

21. Pull an Image from a Private Registry


For detailed steps on how this cluster was built, please refer to the related blog post

Wednesday, January 4, 2023

Renew kubernetes (k8s) certificates

Background

On 12th Dec 2022, the certificates of all the k8s components expired, which caused the failure of CI tasks for the company software builds. However, even after updating the certificates following the Redmine wiki, I still could not make CI work, so I restarted the master node (IP: 10.240.0.59).

The problem then got even worse: the kubelet service did not start up after the reboot, and none of the k8s service containers were running.

Troubleshooting

I tried to find the logs for errors, but that was itself problematic. Before restarting the master node, I tried the following command to retrieve the logs of the GitLab runner pod, but it did not work because there are 2 containers within this pod, and k8s needs us to specify a particular container to read the logs from.

kubectl logs -f -n gitlab runner-yngqxnkk-project-737-concurrent-0zwbtj

Then I tried to get the names of the containers within the pod with

kubectl describe pod -n gitlab runner-yngqxnkk-project-737-concurrent-0zwbtj | more

However, I could not get any information regarding the container names. Then I used

kubectl get pods -n gitlab runner-yngqxnkk-project-737-concurrent-0zwbtj -o jsonpath='{.spec.containers[*].name}'

It then listed "build helper", i.e. the containers in the pod are named build and helper. I further ran

kubectl logs -f -n gitlab runner-yngqxnkk-project-737-concurrent-0zwbtj -c helper

But nothing was output; my guess is that the containers were not actually running anything (since the pod was in Pending state), therefore no container logs were produced.
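
In hindsight, a jsonpath variant like the following could have shown the container states directly (a sketch; pod name and namespace as above):

kubectl get pod -n gitlab runner-yngqxnkk-project-737-concurrent-0zwbtj \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'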

After that I decided to restart the whole master node (10.240.0.59), but sadly the kubelet service did not start up. I again tried to troubleshoot via the system logs with

systemctl status kubelet -l

journalctl -xeu kubelet

and found a useful log line...

kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory

So bootstrap-kubelet.conf really does not exist, and that is what prevented the kubelet service from starting.

Solutions

First, update the certificates for ALL master nodes (in my case: 10.240.0.59 / 51 / 52). The command to check the certificate expiration is

sudo kubeadm alpha certs check-expiration

This command lists all k8s components' certificate status and lets you know whether they are expired or not. Renew the certificates with

sudo kubeadm alpha certs renew all

Check the status with the first command again; the expiry dates should be pushed out by one year. Then manually regenerate the certificates following the commands below

$ cd /etc/kubernetes/pki/

$ mv {apiserver.crt,apiserver-etcd-client.key,apiserver-kubelet-client.crt,front-proxy-ca.crt,front-proxy-client.crt,front-proxy-client.key,front-proxy-ca.key,apiserver-kubelet-client.key,apiserver.key,apiserver-etcd-client.crt} ~/

$ kubeadm init phase certs all --apiserver-advertise-address <IP>

$ cd /etc/kubernetes/

$ mv {admin.conf,controller-manager.conf,kubelet.conf,scheduler.conf} ~/

$ kubeadm init phase kubeconfig all

$ reboot

The apiserver-advertise-address is the node IP (in my case 10.240.0.59). The script above basically backs up the current keys (in case any problems occur) and re-runs the certificate generation phase, then backs up all the kubeconfig files in "/etc/kubernetes" before re-running the kubeconfig generation phase.

After the reboot, copy the admin.conf generated in the kubeconfig generation phase to the kube configuration home directory

cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

Check the status of the kubelet service by

sudo systemctl status kubelet.service

References