WeBWorK Main Forum

Mysql error - "Got an error reading communication packets"

Mysql error - "Got an error reading communication packets"

by Balagopal Pillai -
Number of replies: 10
Hi,

       I noticed that the webwork server's mysql log has  a lot of the following errors - 
 
       Aborted connection 38 to db: 'webwork' user: 'webwork' host: 'localhost' (Got an error reading communication packets)
 
        max_allowed_packet is already set to 256 MB and it didn't fix the issue. Webwork seems to work normally. The default db engine is myisam, OS is 
ubuntu 16.04 running as a vm. I was wondering if anybody has seen this before
with webwork. 
 
 Thanks
 Balagopal
 
In reply to Balagopal Pillai

Re: Mysql error - "Got an error reading communication packets"

by Balagopal Pillai -
I checked the following mysql variables -

 mysql> SHOW GLOBAL VARIABLES WHERE variable_name = 'max_allowed_packet';
+--------------------+-----------+
| Variable_name      | Value     |
+--------------------+-----------+
| max_allowed_packet | 268435456 |
+--------------------+-----------+
1 row in set (0.01 sec)

mysql> SHOW GLOBAL STATUS WHERE variable_name = 'aborted_clients';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| Aborted_clients | 59    |
+-----------------+-------+
1 row in set (0.00 sec)

            Does webwork need a larger packet size than 256 MB? Thanks
In reply to Balagopal Pillai

Re: Mysql error - "Got an error reading communication packets"

by Jason Aubrey -
Hi Balagopal,

I have not seen such errors before, but I have seen db connection errors in the apache error log on high demand servers. Do you see any db connection errors in the apache error log? Perhaps these are corresponding error messages. In that case, you probably have some frustrated users who haven't (yet) complained. Usually we have been able to resolve the db connection errors by tuning the settings and/or increasing the horsepower of the vm.

Jason

PS - I think it could be that the packet size may need to be larger than 256MB. I think that sometimes when a page loads, the db calls grab way more data than the page needs. But, if so, most likely this would only be a problem on the instructor side, not the student side.
In reply to Jason Aubrey

Re: Mysql error - "Got an error reading communication packets"

by Balagopal Pillai -
Hi Jason,

            There is no db related errors on apache logs. I read it on the mysql documentation that it might also be  caused by the application not properly closing mysql connections. But I have another issue that I am  troubleshooting at the moment. The server vm has 36 logical cpu cores  (out of 20 hyper threaded  physical cores) and 240 GB ram allocated (out of 256 GB total) So vm has quite a bit of resources  at its disposal. I have apache ServerLimit set to 1100 (assuming 200 MB per process) Last night  webwork server stopped responding. I checked the collectd monitor and both mysql and apache threads
created had ballooned up to their maximums . Logs show apache processes blocked for a long time, probably kernel panicked by then. It could have been some kind of dos attack. But I am trying to see if webwork contributed to that issue. The vm never ran out of ram. It used only about 25 GB or so ram.  

           I see that apache segfaults at least 20 times or so a day like the following and apparently it did that ever since I started using this vm  -
 
          AH00051: child pid 9386 exit signal Segmentation fault (11), possible coredump in /etc/apache2
 
The debian 8 install I had last term had zero issues with apache 2.4.10 and serving about 1,600 students on the same hardware. (non virtualized) Thanks

Balagopal
In reply to Balagopal Pillai

Re: Mysql error - "Got an error reading communication packets"

by Balagopal Pillai -

           Please see https://drive.google.com/drive/folders/0B6SKp-TALcHDU292VHVCZWVrdmc?usp=sharing and this is what happened last night. So I don't think I can blame a webwork issue for this. Having a spike of that many 
tcp connections for apache and lighttpd may point to external factors. Thanks

Balagopal
In reply to Balagopal Pillai

Re: Mysql error - "Got an error reading communication packets"

by Jason Aubrey -
Wow - those are some nice specs. I think your webwork server should *easily* be able to handle this load. I think you're right that it probably has something to do with tuning the ServerLimit, MaxRequestsPerChild, and related settings. (The ServerLimit of 1100 seems very high to me, even if it does correspond to the recommendations.) I'm no expert on these settings, but Danny Glin often has the right advice about them.

Jason
In reply to Jason Aubrey

Re: Mysql error - "Got an error reading communication packets"

by Balagopal Pillai -
Hi Jason,

          I think I am getting to the bottom of this. It turned out that all the 4 courses we have (~300 students or more per course) had an assignment deadline set for yesterday evening!  As you  can see from the ps count graph, both apache and mysql are hitting their max set limits (apache - 1100, mysql - 1200) The only odd thing is the kernel going wonky by the time I checked this in the morning. This is something I was worried about from the beginning. Each browser connection these days open 4 - 6 tcp connections and apache needs to fork a process per connection (that is my understanding of the pre fork model) So 1100 connections should only serve about 200-300 concurrent clients. The static content is served by
lighttpd.

             On the plus side, looks like I overestimated the memory needed by apache processes. During this time, 
collectd reported only <50 GB of usage. So may be I will look 
into bumping the max connections accepted by apache and mysql 4 fold to fit into 240 GB of available ram and keep an eye on this next time a deadline comes for an assignment. 

Thanks
Balagopal
In reply to Balagopal Pillai

Re: Mysql error - "Got an error reading communication packets"

by Balagopal Pillai -
Hi,

           I have been digging a bit into looking at what exactly happened that day when the vm stopped responding. I think I can exclude apache and mysql from the trouble makers. The mysql aborted connection issue is present even in the older debian setup. Just that the old mysql version didn't explicitly report those at the current log levels. A mysqlreport would show similar number of aborted connections on both the ubuntu vm and the old debian install. None of the students have reported any issues. Some mysql 
 documentation mentions that this abort error may be caused by not properly closing mysql connections. Seems like that might be the case here.
 
         Now with apache - Each process does use ~200 MB or so and my setting of 1100 as ServerLimit seems correct to correspond to  using ~220 GB of ram out of 240 GB allocated to the vm. The only change I made after the vm went down is to turn KeepAlive to off. It turns out that browsers like chrome and firefox open 4 to 6 established tcp connections and with the default keepalive timeouts, apache was keeping the established  tcp connections open for quite a bit more time after the browser is closed. With keepalive off, closing the tab would immediately drop the  established connection (except in chrome which needs the browser to close, otherwise established sessions are kept open for a little while - still less  time than when keepalive was on)  Yesterday night I accidentally noticed that the vm was a bit sluggish. Turns out that there were ~360 tcp connections  in TIME_WAIT state to our vendor hosted brightspace ip address! Looks like grade passback on submit may be the culprit here. If I had enabled KeepAlive  to on, all of these connections on TIME_WAIT might be established connections instead (it is a reasonable assumption) and would need almost as many  apache processes to serve them. So it is possible that the grade passback on submit may have contributed to something like a fork bomb for apache to cause that disruption. I am really hoping that the mass grade update to brightspace would start working before the fall term in 2017 so that I can disable this  grade passback on submit.
 
        To confirm this theory, I have enabled collectd tcp connections monitoring for remote port 443. I was only monitoring local port 443 and 8080 for incoming connections. The grade passback connections should be outgoing connections.  I was looking at ways to cut down the number of connections on time wait state. There is a good write up here on why tw_recycle and tw_reuse may be a bad idea when nat is involved on incoming connections -
https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux  

Balagopal
In reply to Balagopal Pillai

Re: Mysql error - "Got an error reading communication packets"

by Balagopal Pillai -
Now that I think about it a bit more, my keepalive theory could be wrong. 
Keeping the KeepAlive off would be what is conserving the apache pool of connections on high load. But that could also be the reason for not reusing the connection to brightspace ip address for grade passback so that a new connection gets created every time for passback. That could explain the large number of connections in time_wait state. But the number of apache processes at that point was only about 20. So why would grade passback create over 360 connections back to brightspace ip address is a mystery. 

Balagopal
In reply to Balagopal Pillai

Re: Mysql error - "Got an error reading communication packets"

by Danny Glin -
Without doing any actual digging, I have one theory:
It's possible that the WeBWorK code for the grade passback is not closing the connections once grades are sent. This would explain your 360 connections to brightspace from only 20 apache processes. Since an apache process serves multiple requests, it may be creating a connection each time an answer is submitted, and then not cleaning those up.
In reply to Danny Glin

Re: Mysql error - "Got an error reading communication packets"

by Balagopal Pillai -
Let me put a second thought on this scenario. I don't have to worry too much now if there is a large number of time_wait connections unless I am running out of the ip local port 
range (just over 28k by default). When the vm went down, I had apache keepalive turned on (now its off). If the connections on grade passback doesn't get closed properly and assuming that students would be submitting the assignments in a hectic pace (I watched the activity log for apache yesterday night and the "POST" lines are hectic indeed) wouldn't it be possible that apache (with KeepAlive on) could fork a lot more processes in a very quick time instead of relying on a handful of connections to brightspace keeping within the "KeepAlive" spirit? That would mean a lot more established connections to brightspace instead of time_wait in the current scenario and eventually would hit the 1,100 ceiling for apache processes (which it did, I have the collectd graph). Thanks Danny.