Symptoms: At seemingly random intervals, and independent of server load, all requests to the WeBWorK server froze (browsers continued to wait for a response). After some time (as long as 15-20 minutes), the server would resume without any apparent side effects. The timing log (on our server at /opt/webwork/webwork2/logs/timing.log) gave no evidence of these stalls; its timings were comparable to those at any other time of day. Logging directly onto the server showed no unusual load.
The Culprit: The MySQL database was being backed up. To maintain database integrity, the backup placed a READ lock on all of the tables so that no changes could be made while it ran. Any request that needed to write to the database therefore had to wait until the lock was released. Once the backup was complete, the lock was freed and the server resumed as before.
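One way to confirm this kind of stall while it is happening is to look at the MySQL process list (for example, the output of SHOW FULL PROCESSLIST) for threads waiting on a lock. The following is a minimal sketch in Python, not part of WeBWorK. The list of state strings is an assumption (the wording varies between MySQL versions), and the sample rows and table names are hypothetical.

```python
# State strings that suggest a thread is blocked on a lock.
# Assumed list: the exact wording differs between MySQL versions.
LOCK_WAIT_PREFIXES = (
    "Waiting for global read lock",
    "Waiting for table",  # e.g. "Waiting for table level lock"
    "Locked",             # reported by older MySQL versions
)


def blocked_threads(processlist_rows):
    """Return the rows whose State suggests the thread is waiting on a lock."""
    blocked = []
    for row in processlist_rows:
        state = row.get("State") or ""  # State may be NULL/None
        if state.startswith(LOCK_WAIT_PREFIXES):
            blocked.append(row)
    return blocked


# Hypothetical rows shaped like SHOW FULL PROCESSLIST output.
sample = [
    {"Id": 11, "User": "backup", "Time": 840, "State": "Sending data",
     "Info": "SELECT * FROM demo_user"},
    {"Id": 12, "User": "webwork", "Time": 600,
     "State": "Waiting for table level lock",
     "Info": "UPDATE demo_key SET timestamp = 0"},
    {"Id": 13, "User": "webwork", "Time": 580, "State": "Locked",
     "Info": "UPDATE demo_key SET timestamp = 0"},
    {"Id": 14, "User": "webwork", "Time": 0, "State": None, "Info": None},
]

for row in blocked_threads(sample):
    print(row["Id"], row["Time"], row["State"])
```

Many threads with large Time values in one of these states, alongside a single long-running backup query, would be consistent with the behavior described above.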
Observations in the Process of Debugging:
I took advantage of the option to turn on debugging in WEBWORK_ROOT/lib/WeBWorK/Constants.pm so that the apache2/error.log file gave step-by-step updates on which stage of responding to a request had been reached. I then noticed that during the stalls the log entries stopped partway through user authentication, indicating that the stall was occurring during the authentication process. (Caution: I also had to make a few edits so that unencrypted passwords did not end up in the log file.)
The reason the timing.log file did not show evidence of the stalls is that the timing routine only covers the rendering stage, so a delay during authentication is never recorded. I have since modified our local installation so that the timing.log file records the full time to respond to a request as well as the rendering time.
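The difference between the two measurements can be illustrated with a small sketch. This is Python written for illustration only, not the actual (Perl) WeBWorK code; handle_request, authenticate, and render are hypothetical names, and the fake clock stands in for a 15-minute stall during authentication.

```python
import time


def handle_request(authenticate, render, clock=time.monotonic):
    """Run both stages and return (render_seconds, total_seconds)."""
    start = clock()
    authenticate()          # a stall here is invisible to a render-only timer
    render_start = clock()
    render()
    end = clock()
    return end - render_start, end - start


# Fake clock: authentication "takes" 900 s, rendering 0.5 s.
ticks = iter([0.0, 900.0, 900.5])
render_s, total_s = handle_request(
    lambda: None, lambda: None, clock=lambda: next(ticks)
)
print(f"render: {render_s} s, total: {total_s} s")
```

A log that records only the first number looks normal during a stall; recording the second number as well makes the delay visible.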
When I looked at the timing of the stalls, they turned out not to be random: they were occurring every 6 hours. When I sent the IT department my hypothesis for why we saw the stalls, they confirmed that these times corresponded to the SQL server being backed up.
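A regular schedule like this is easy to miss by eye but simple to check once the stall times are written down. The following is a minimal sketch, assuming the stall start times have been noted by hand in "YYYY-MM-DD HH:MM" form. The timestamps shown are hypothetical and are not taken from our server logs.

```python
from collections import Counter
from datetime import datetime


def common_interval_hours(stall_times):
    """Return the most common gap between consecutive stalls, in whole hours.

    Returns None when fewer than two stall times are given.
    """
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M") for t in stall_times)
    gaps = [
        round((later - earlier).total_seconds() / 3600)
        for earlier, later in zip(times, times[1:])
    ]
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0][0]


# Hypothetical observations, for illustration only.
observed = [
    "2012-03-05 00:02",
    "2012-03-05 06:01",
    "2012-03-05 12:03",
    "2012-03-05 18:02",
    "2012-03-06 00:04",
]

print(common_interval_hours(observed))
```

If most gaps round to the same number of hours, a scheduled job (such as a backup) is a more likely cause than server load.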
Resolution: We have rescheduled the backup of the database tables to run once a day, during the early morning hours when there should not be much activity.
Ongoing Issue: I think there is still a separate issue, as there is a time in the morning when the actual machine load goes up and response times increase (but remain manageable). My hypothesis is that some type of auto-update or other scheduled task is running. The server admin has since modified some auto-update settings, but I have yet to confirm whether this has made any difference.