Symptoms

  • Pods are in a not-ready state because they failed the readiness probe.

The detailed state of a pod, obtained with the command 'kubectl describe pods <pod-full-name>', shows errors like the ones below in the Events section:


Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  29m (x3737 over 12h)    kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  4m35s (x5190 over 12h)  kubelet  Readiness probe failed: Get https://10.96.1.35:443/rest/application/readinessProbe: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
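
A quick way to spot every pod failing readiness is to filter the READY column of 'kubectl get pods' output. The sketch below runs the filter against a single hypothetical sample line; in practice, pipe the real command output through the same awk filter:

```shell
# Sketch: flag pods whose READY column shows 0 ready containers.
# The sample line and pod name below are hypothetical; in practice run:
#   kubectl get pods --no-headers | awk '$2 ~ /^0\// {print $1}'
sample='gdpr-backend-7f9c-abcde               0/1     Running   0          12h'
not_ready=$(printf '%s\n' "$sample" | awk '$2 ~ /^0\// {print $1}')
echo "$not_ready"
```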


  • At the same time, MN core.log shows that any request sent to a deployed microservice component endpoint URL (e.g., gdpr, ux1-marketplace, rde, psa, idp) fails with the error "Name or service not known".

Example:

...
Could not invoke endpoint url 'https://ux1-marketplace-connector:443/aps/marketplace/043154b7-32d7-49d6-8c9f-c0f2d15b37b7/updateIndex?fast=true' of application instance with UUID '48f544c4-f9f1-45b1-954c-ff4a68b9663d'. RESTEASY004655: Unable to invoke request: java.net.UnknownHostException: ux1-marketplace-connector: Name or service not known: ux1-marketplace-connector: Name or service not known 
...

Note: If idp is deployed and enabled on your CBC Platform, this can render your platform inaccessible.
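
The short host name in the error (ux1-marketplace-connector) is normally expanded through the pod's DNS search list into a fully qualified in-cluster service name. The sketch below shows how that name is formed; the namespace "default" is an assumption, and the clusterDomain value is the example used throughout this article (see /var/lib/kubelet/config.yaml for your own):

```shell
# Sketch: how the short service name expands to its in-cluster FQDN.
# Namespace "default" is an assumption; substitute your own values.
service="ux1-marketplace-connector"
namespace="default"
cluster_domain="k8s.example.services"
fqdn="${service}.${namespace}.svc.${cluster_domain}"
echo "$fqdn"
```

If this fully qualified name also fails to resolve from inside the cluster, the problem lies with cluster DNS itself rather than with the search list.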


  • Specifically for the ux1-marketplace-connector pod, entries like the ones below can also be seen in its logs:
...
Elasticsearch.Net.ElasticsearchClientException: Failed to ping the specified node.. Call: unknown resource ---> Elasticsearch.Net.PipelineException: Failed to ping the specified node. ---> System.OperationCanceledException: The operation was canceled.
   at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e, CancellationTokenSource cts)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Elasticsearch.Net.HttpConnection.Request[TResponse](RequestData requestData)
   at Elasticsearch.Net.RequestPipeline.Ping(Node node)
   --- End of inner exception stack trace ---
   at Elasticsearch.Net.RequestPipeline.Ping(Node node)
   at Elasticsearch.Net.Transport`1.Ping(IRequestPipeline pipeline, Node node)
   at Elasticsearch.Net.Transport`1.Request[TResponse](HttpMethod method, String path, PostData data, IRequestParameters requestParameters)
   --- End of inner exception stack trace ---
...


  • If the component cluster is deployed on AKS, restarted pods remain in a pending state with errors like the ones below in their logs:
...
21-07-2021;06:15:20,885 null: com.microsoft.azure.storage.StorageException: The server encountered an unknown failure:
        at deployment.rateddataexport-backend.war//com.microsoft.azure.storage.StorageException.translateException(StorageException.java:101)
...
...
Caused by: java.net.UnknownHostException: example.blob.core.windows.net
        at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)
...


  • Ultimately, DNS lookup does not work from the k8s node(s).

When the DNS lookup tests in the following steps are performed on the k8s node(s), they fail.


On MN node:

i) Get the kube-dns service ClusterIP:

[root@oss-core ~]# kubectl get svc -n kube-system -o wide | grep dns
kube-dns        ClusterIP   192.0.2.2    <none>        53/UDP,53/TCP,9153/TCP   250d   k8s-app=kube-dns


ii) Get the coredns pods IP:

[root@oss-core ~]# kubectl get pods -n kube-system -o wide | grep dns
coredns-5644d7b6d9-bt9nn               1/1     Running   1          250d   10.96.0.2     k8s <none>           <none>
coredns-5644d7b6d9-rdpnz               1/1     Running   1          250d   10.96.0.3     k8s  <none>           <none>


On k8s node(s):

iii) Get the clusterdomain name:

root@k8s:~# grep -i clusterdomain /var/lib/kubelet/config.yaml
clusterDomain: k8s.example.services


iv) Perform the nslookup test on the k8s node(s) using the format below:

#/> nslookup kubernetes.default.svc.<clusterDomain> <kube-dns svc ClusterIP>

#/> nslookup kubernetes.default.svc.<clusterDomain> <coredns pod IP_1>

#/> nslookup kubernetes.default.svc.<clusterDomain> <coredns pod IP_2>


v) When the nslookup tests from iv) above are performed on the k8s node(s), all of them fail:

root@k8s:~# nslookup kubernetes.default.svc.k8s.example.services 192.0.2.2
Server:         192.0.2.2
Address:        192.0.2.2#53

** server can't find kubernetes.default.svc.k8s.example.services: NXDOMAIN


root@k8s:~# nslookup kubernetes.default.svc.k8s.example.services 10.96.0.2
Server:         10.96.0.2
Address:        10.96.0.2#53

** server can't find kubernetes.default.svc.k8s.example.services: NXDOMAIN


root@k8s:~# nslookup kubernetes.default.svc.k8s.example.services 10.96.0.3
Server:         10.96.0.3
Address:        10.96.0.3#53

** server can't find kubernetes.default.svc.k8s.example.services: NXDOMAIN
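
Steps i)–iv) above can be combined into one small loop. The IPs and clusterDomain below are the example values from this article; substitute the ones gathered from your own cluster. The loop prints each nslookup command rather than executing it, so it is safe to run anywhere:

```shell
# Sketch: generate the nslookup test commands from steps i)-iv).
# IPs and clusterDomain are the example values from above; replace with yours.
cluster_domain="k8s.example.services"
fqdn="kubernetes.default.svc.${cluster_domain}"
for ip in 192.0.2.2 10.96.0.2 10.96.0.3; do
  echo "nslookup ${fqdn} ${ip}"
done
```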


Cause

kube-router stopped working properly and no longer passed traffic to other pods or services (e.g., CoreDNS/kube-dns and other microservice components).


Resolution

Re-initialize the kube-router pods by invoking the command below:

#/> kubectl delete pods <kube-router-full-name> -n kube-system
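
To restart every kube-router pod in one pass, the pod names can be extracted from 'kubectl get pods -n kube-system' output. The sketch below runs against hypothetical sample lines; pipe real kubectl output and remove the echo to actually delete:

```shell
# Sketch: build a delete command for each kube-router pod.
# The sample lines and pod names are hypothetical; in practice use:
#   kubectl get pods -n kube-system --no-headers | awk '/^kube-router/ {print $1}'
sample='kube-router-abc12   1/1   Running   0   250d
kube-router-def34   1/1   Running   0   250d'
pods=$(printf '%s\n' "$sample" | awk '/^kube-router/ {print $1}')
for pod in $pods; do
  echo "kubectl delete pods ${pod} -n kube-system"
done
```

If your kube-router DaemonSet carries the upstream default label k8s-app=kube-router (verify with kubectl get pods --show-labels), the single command 'kubectl delete pods -n kube-system -l k8s-app=kube-router' achieves the same result.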


Once kube-router has been re-spawned and initialized, verify by performing the DNS lookup test on the k8s node(s) again.


Internal