Symptoms
Linux Provisioning Gateway Host (LINPGH) has become unmanageable from OA. Details that help to identify this particular scenario:
- Server state is Off in PCP
- APS 1.2 tasks fail in OA Task Manager on attempt to perform operation on LINPGH
- pa-agent is started on LINPGH, but it is not accepting new connections from the OA Management Node
- A large number of long-running PHP scripts are observed on LINPGH:
linpgh01 ~ # ps -eo pid,stime,etime,cmd --sort=start_time |grep php |grep -v grep
6082 Jan24 13-23:25:47 php -q some_script_name.php
28948 Jan25 12-18:38:52 php -q some_script_name.php
26316 Jan25 12-17:08:50 php -q some_script_name.php
28255 Jan25 12-12:08:47 php -q some_script_name.php
28257 Jan25 12-12:08:47 php -q some_script_name.php
10661 Jan25 12-08:07:49 php -q some_script_name.php
..
12586 Feb03 3-12:49:53 php -q some_script_name.php
31736 Feb05 1-06:19:47 php -q some_script_name.php
4462 Feb06 1-02:56:53 php -q some_script_name.php
4463 Feb06 1-02:56:53 php -q some_script_name.php
29342 Feb06 06:09:55 php -q some_script_name.php
APS PHP scripts are started by pa-agent. This can be verified in the output of ps auxww:
root 10108 0.0 0.2 38828 4772 ? S Feb05 0:00 /usr/local/pem/sbin/pa-agent --props-file /usr/local/pem/etc/pleskd.props --send-signal
root 4460 0.0 0.0 65948 1344 ? S Feb06 0:00  \_ /bin/sh /usr/local/pem/APS/scripts/143/1.0-60/r26565_3399482224.sh
root 4462 0.0 0.4 161564 8320 ? S Feb06 0:00      \_ php -q some_script_name.php
As soon as all pa-agent workers (10 by default) are occupied by such scripts, LINPGH becomes unmanageable from OA.
A debugger (strace) shows that the PHP process waits for input indefinitely (example for the process with PID 29342):
linpgh01 ~ # strace -tTfFs10000 -p 29342
Process 29342 attached - interrupt to quit
05:38:29 read(3, <unfinished ...>
There is a TCP connection created by the same process:
linpgh01 ~ # netstat -antpl | egrep 'Local Address|29342'
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 LOCAL_IP:39018 REMOTE_IP:443 ESTABLISHED 29342/php
- On the remote server (the server with REMOTE_IP), no such connection exists, i.e., there is no TCP connection from LINPGH port 39018 to local port 443
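As a quick check for the symptom described above, the number of running php processes can be compared with the pa-agent worker limit. The following is a rough sketch, not an OA tool: the names WORKER_LIMIT, count_php and report_workers are hypothetical, the limit of 10 is the default mentioned above, and the check assumes that every process named php on the host was started by pa-agent, which may not hold on every LINPGH.

```shell
#!/bin/sh
# Rough heuristic only: counts ALL processes named "php", on the assumption
# that they were started by pa-agent. WORKER_LIMIT is a hypothetical variable;
# 10 is the default worker count mentioned in the article.
WORKER_LIMIT=${WORKER_LIMIT:-10}

# Read command names (one per line, as printed by `ps -eo comm=`) on stdin
# and print how many of them are exactly "php".
count_php() {
    awk '$1 == "php" { n++ } END { print n + 0 }'
}

# $1 = number of php processes, $2 = worker limit
report_workers() {
    if [ "$1" -ge "$2" ]; then
        echo "WARNING: $1 php processes, worker limit $2 reached"
    else
        echo "OK: $1 php processes, worker limit $2"
    fi
}

report_workers "$(ps -eo comm= | count_php)" "$WORKER_LIMIT"
```

A WARNING line only suggests that the workers may be exhausted. The strace and netstat checks above are still needed to confirm that the scripts are actually stuck.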
Cause
The PHP script did not receive a response from the remote server due to a network issue. The corresponding TCP connection is still open on LINPGH, but it does not exist on the remote server side.
PHP-settings that control maximum time of a script execution are ignored in this situation (values are in seconds):
linpgh01 ~ # egrep '^max_execution_time|^max_input_time' /etc/php.ini
max_execution_time = 30
max_input_time = 60
In this case, the PHP script never finishes on its own.
Resolution
Configure a cron task on LINPGH that checks the execution time of PHP scripts. If a script has been running longer than the specified timeout, it is forcibly aborted. The default timeout is 36000 seconds (10 hours).
- Download the script kill_stuck_php.sh into the folder /root/scripts/ on LINPGH
- Add the following lines to /etc/crontab:
linpgh01 ~ # grep -i php /etc/crontab
# Forcibly stop PHP-scripts that got stuck for more than 10 hours
0 */1 * * * root /root/scripts/kill_stuck_php.sh >> /var/log/kill_stuck_php.log 2>&1
- Execute service crontab restart
The script will be started every hour. It writes its execution results to /var/log/kill_stuck_php.log. Example:
linpgh01 ~ # cat /var/log/kill_stuck_php.log
Wed Feb 7 06:36:01 CET 2018
killing php process pid=4462 executable=php state=S user=root etime=1-03:54:56
killing php process pid=4463 executable=php state=S user=root etime=1-03:54:56
killing php process pid=31736 executable=php state=S user=root etime=1-07:17:50
# 3 process(es) killed
Wed Feb 7 07:00:01 CET 2018
Wed Feb 7 08:00:01 CET 2018
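The contents of kill_stuck_php.sh are not reproduced in this article. The sketch below only illustrates the logic described in the Resolution: find php processes whose elapsed time exceeds the timeout and stop them forcibly. It is not the actual script. The names TIMEOUT, DRY_RUN, select_stuck and main are hypothetical, the use of SIGKILL is an assumption, and the sketch defaults to a dry run that only reports what it would kill.

```shell
#!/bin/sh
# Sketch only: NOT the actual kill_stuck_php.sh.
TIMEOUT=${TIMEOUT:-36000}   # seconds; 36000 = 10 hours, as in the article
DRY_RUN=${DRY_RUN:-1}       # assumption: report-only unless DRY_RUN=0

# Read "pid etime comm" lines on stdin (as printed by
# `ps -eo pid=,etime=,comm=`) and print "pid etime" for every process
# named exactly "php" whose elapsed time is greater than $1 seconds.
select_stuck() {
    awk -v limit="$1" '
        # Convert the ps etime format [[dd-]hh:]mm:ss to seconds.
        function to_seconds(e,    d, t, p, n) {
            d = 0; t = e
            if (split(e, p, "-") == 2) { d = p[1]; t = p[2] }
            n = split(t, p, ":")
            if (n == 3) return d * 86400 + p[1] * 3600 + p[2] * 60 + p[3]
            return d * 86400 + p[1] * 60 + p[2]
        }
        $3 == "php" && to_seconds($2) > limit { print $1, $2 }
    '
}

main() {
    date
    # Word splitting yields alternating pid / etime tokens.
    set -- $(ps -eo pid=,etime=,comm= | select_stuck "$TIMEOUT")
    count=0
    while [ "$#" -ge 2 ]; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "would kill php process pid=$1 etime=$2"
        else
            echo "killing php process pid=$1 etime=$2"
            # SIGKILL is an assumption; the process may already be gone.
            kill -KILL "$1" || echo "failed to kill pid=$1"
        fi
        count=$((count + 1))
        shift 2
    done
    if [ "$count" -gt 0 ]; then
        echo "# $count process(es) matched"
    fi
}

main
```

With the default DRY_RUN=1 the sketch only prints the matching processes. Running it as DRY_RUN=0 sends the signal. Matching on the command name php alone is a simplification, so any unrelated long-running php process on the host would also be selected.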