Symptoms
- Pods are in a not-ready state because they failed the readiness probe.
The detailed state of a pod, obtained with the command 'kubectl describe pods <pod-full-name>', shows errors like the below in the Events section:
...
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  29m (x3737 over 12h)    kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  4m35s (x5190 over 12h)  kubelet  Readiness probe failed: Get https://10.96.1.35:443/rest/application/readinessProbe: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
...
- At the same time, MN core.log shows that any request sent to a deployed microservice component endpoint URL (e.g., gdpr, ux1-marketplace, rde, psa, idp) fails with the error "Name or service not known".
Example:
... Could not invoke endpoint url 'https://ux1-marketplace-connector:443/aps/marketplace/043154b7-32d7-49d6-8c9f-c0f2d15b37b7/updateIndex?fast=true' of application instance with UUID '48f544c4-f9f1-45b1-954c-ff4a68b9663d'. RESTEASY004655: Unable to invoke request: java.net.UnknownHostException: ux1-marketplace-connector: Name or service not known: ux1-marketplace-connector: Name or service not known ...
Note: if idp is deployed and enabled on your CBC Platform, this can render the platform inaccessible.
- Specifically for the ux1-marketplace-connector pod, entries like the below can also be seen in its logs:
...
Elasticsearch.Net.ElasticsearchClientException: Failed to ping the specified node.. Call: unknown resource
 ---> Elasticsearch.Net.PipelineException: Failed to ping the specified node.
 ---> System.OperationCanceledException: The operation was canceled.
   at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e, CancellationTokenSource cts)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Elasticsearch.Net.HttpConnection.Request[TResponse](RequestData requestData)
   at Elasticsearch.Net.RequestPipeline.Ping(Node node)
   --- End of inner exception stack trace ---
   at Elasticsearch.Net.RequestPipeline.Ping(Node node)
   at Elasticsearch.Net.Transport`1.Ping(IRequestPipeline pipeline, Node node)
   at Elasticsearch.Net.Transport`1.Request[TResponse](HttpMethod method, String path, PostData data, IRequestParameters requestParameters)
   --- End of inner exception stack trace ---
...
- If the component cluster is deployed on AKS, restarted pods remain in a pending state with an error like the below in their logs:
...
21-07-2021;06:15:20,885 null: com.microsoft.azure.storage.StorageException: The server encountered an unknown failure:
at deployment.rateddataexport-backend.war//com.microsoft.azure.storage.StorageException.translateException(StorageException.java:101)
...
...
Caused by: java.net.UnknownHostException: example.blob.core.windows.net
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)
...
- Ultimately, DNS lookup does not work from the k8s node(s).
When the DNS lookup test steps below are performed on the k8s node(s), they fail.
On MN node:
i) Get the kube-dns service ClusterIP:
[root@oss-core ~]# kubectl get svc -n kube-system -o wide | grep dns
kube-dns   ClusterIP   192.0.2.2   <none>   53/UDP,53/TCP,9153/TCP   250d   k8s-app=kube-dns
ii) Get the coredns pods IP:
[root@oss-core ~]# kubectl get pods -n kube-system -o wide | grep dns
coredns-5644d7b6d9-bt9nn   1/1   Running   1   250d   10.96.0.2   k8s   <none>   <none>
coredns-5644d7b6d9-rdpnz   1/1   Running   1   250d   10.96.0.3   k8s   <none>   <none>
On k8s node(s):
iii) Get the clusterdomain name:
root@k8s:~# grep -i clusterdomain /var/lib/kubelet/config.yaml
clusterDomain: k8s.example.services
iv) Run the nslookup test (against the default namespace, for example) on all available k8s nodes as below:
#/> nslookup kubernetes.default.svc.<clusterDomain> <kube-dns svc ClusterIP>
#/> nslookup kubernetes.default.svc.<clusterDomain> <coredns pods IP_1>
#/> nslookup kubernetes.default.svc.<clusterDomain> <coredns pods IP_2>
v) When the nslookup tests from iv) are performed on the k8s node(s), all of them fail:
root@k8s:~# nslookup kubernetes.default.svc.k8s.example.services 192.0.2.2
Server:         192.0.2.2
Address:        192.0.2.2#53

** server can't find kubernetes.default.svc.k8s.example.services: NXDOMAIN
root@k8s:~# nslookup kubernetes.default.svc.k8s.example.services 10.96.0.2
Server:         10.96.0.2
Address:        10.96.0.2#53

** server can't find kubernetes.default.svc.k8s.example.services: NXDOMAIN
root@k8s:~# nslookup kubernetes.default.svc.k8s.example.services 10.96.0.3
Server:         10.96.0.3
Address:        10.96.0.3#53

** server can't find kubernetes.default.svc.k8s.example.services: NXDOMAIN
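The per-server lookups from steps iii) and iv) can be wrapped in a small POSIX-shell helper so the test is easy to repeat after remediation. This is a minimal sketch: the run_dns_test helper name is made up here, and the IPs in the example invocation are the ones from the outputs above — substitute your own values.

```shell
#!/bin/sh
# Sketch: repeat the step-iv nslookup test against each DNS endpoint.
# Usage: run_dns_test <clusterDomain> <dns-ip> [<dns-ip> ...]
run_dns_test() {
    domain="$1"; shift
    for ip in "$@"; do
        # A healthy cluster resolves the kubernetes service on every endpoint.
        if nslookup "kubernetes.default.svc.${domain}" "$ip" >/dev/null 2>&1; then
            echo "PASS ${ip}"
        else
            echo "FAIL ${ip}"
        fi
    done
}

# Example invocation with the kube-dns service ClusterIP and CoreDNS pod IPs
# gathered in steps i) and ii):
# run_dns_test k8s.example.services 192.0.2.2 10.96.0.2 10.96.0.3
```

In the failure state described above, every line of output would read FAIL; after the fix in the Resolution section, all lines should read PASS.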
Cause
kube-router stops working properly and does not pass traffic to other pods or services (e.g., CoreDNS/kube-dns and other microservice components).
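Before restarting anything, kube-router's pod state and recent log entries can be checked to confirm it is the failing component. A minimal sketch, assuming the upstream kube-router DaemonSet label k8s-app=kube-router — verify the label used by your deployment first:

```shell
#!/bin/sh
# Sketch: show kube-router pod state and the tail of its logs.
# The label selector k8s-app=kube-router is an assumption taken from the
# upstream kube-router manifests; adjust it to match your cluster.
check_kube_router() {
    kubectl get pods -n kube-system -l k8s-app=kube-router -o wide
    kubectl logs -n kube-system -l k8s-app=kube-router --tail=50
}

# check_kube_router
```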
Resolution
Re-initialize the kube-router pods by invoking the command below:
#/> kubectl delete pods <kube-router-fullname> -n kube-system
Once the kube-router pods have been re-created and initialized, verify the fix by running the DNS lookup test on the k8s node(s) again.
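The delete-and-verify cycle can be sketched as a single helper. This is a non-authoritative sketch: it deletes by the assumed label k8s-app=kube-router instead of by full pod name, and the 120-second readiness timeout is an arbitrary choice — adjust both for your cluster.

```shell
#!/bin/sh
# Sketch: delete the kube-router pods (the DaemonSet re-creates them),
# wait until the replacements report Ready, then re-run one DNS lookup.
# Label selector and timeout below are assumptions, not platform defaults.
restart_and_verify() {
    kubectl delete pods -n kube-system -l k8s-app=kube-router
    kubectl wait --for=condition=Ready pods -n kube-system \
        -l k8s-app=kube-router --timeout=120s
    nslookup "kubernetes.default.svc.$1" "$2"
}

# Example with the cluster domain and kube-dns service IP from the
# Symptoms section:
# restart_and_verify k8s.example.services 192.0.2.2
```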