WeBWorK 2.18 Crashing

by Mark Watney -

Hi all,

I am hoping to pick the collective brain here regarding a persistent issue with my local WeBWorK install.  

I am running WeBWorK 2.18 on Ubuntu 20.04.6 LTS. This is a virtual appliance that lives on one of the servers owned by my college’s IT department. It was originally spooled up from the virtual image that was packaged with version 2.16, which I updated to version 2.18 two years later. A while back, it began to crash infrequently, maybe once every week or two, and would require an SSH session to reboot. Lately, however, it has been crashing multiple times per day. By "crash" I mean that it times out if you access it through a web browser.

When it does crash, it does not appear to be a problem with Apache, as I’m still able to pull up the “It works!” Apache2 Ubuntu Default Page in a browser by subtracting the /webwork2 suffix from the URL.  Regardless, since I only have 4gb ram assigned to the VM, I still tried playing around with some of the settings, adjusting MaxRequestWorkers and MaxConnectionsPerChild in mpm_prefork.conf and mpm_event.conf.  This made no difference. 

I’ve looked at the logs in /var/log/, /var/log/apache2, and /opt/webwork/webwork2/logs right after a crash, but I didn’t see anything that seemed out of the ordinary.

Can someone point me in the right direction to properly troubleshoot this?  Is there a log that I am missing and should be combing through for an obvious cause for the application to hang up?  In an ideal world, with Ubuntu 20.04.6's EOL approaching soon, I would just start from scratch and spool up a new virtual image with version 2.19.  However, for the sake of good digital hygiene practices, I have a deal with my IT partners that they handle the network configuration end and I handle the day-to-day admin responsibilities and troubleshooting for our WeBWorK server, and thus I would have to have them do a fresh install (and they are very much under-staffed).  I hope to keep 2.18 going just a little longer to get through the semester.

Thank you so much in advance. 



In reply to Mark Watney

Re: WeBWorK 2.18 Crashing

by Danny Glin -
2.18 has a very different architecture than previous versions, so all of the tuning information on the forums for earlier versions is no longer relevant.  As you discovered, changing MaxRequestWorkers and MaxConnectionsPerChild will likely not have the desired effect.

Based on having similar experiences, it sounds like the Linux out-of-memory (OOM) killer might be killing processes.  When a system starts running out of memory, the OOM killer looks for processes that are using a lot of memory and kills them indiscriminately, which could be the cause of your problems.  Where this is logged depends on your system settings, but you can take a look at /var/log/messages or the output of the dmesg command to see if there are instances of the OOM killer being invoked.
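One way to run that check is to filter the kernel log for the phrases the OOM killer normally leaves behind. This is a minimal sketch: the function name oom_lines is made up for illustration, and the search phrases are an assumption based on typical kernel messages, which vary between kernel versions.

```shell
# Hypothetical helper: keep only the lines on stdin that look like OOM-killer activity.
# The patterns are an assumption; exact kernel wording differs between versions.
# -i makes the match case-insensitive, since the kernel writes "Killed process"
# with a capital K and a case-sensitive search for 'killed process' would miss it.
oom_lines() {
  grep -i -E 'out of memory|oom-killer|oom_kill|killed process'
}
```

On the server it could be fed from either source, for example 'sudo dmesg -T | oom_lines' or 'sudo journalctl -k --no-pager | oom_lines'. On Ubuntu the kernel log usually also ends up in /var/log/syslog and /var/log/kern.log rather than /var/log/messages.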

There are a couple of things you can do to mitigate this:

Tuning for 2.18 and newer
Apache no longer does the actual work of serving WeBWorK pages. If you are using apache it is just proxying connections to hypnotoad, which is the service that actually does the work. Hypnotoad is what you are starting or stopping when you run "sudo systemctl start/stop webwork2". The settings to tweak for 2.18 and up are in /opt/webwork/webwork2/conf/webwork2.mojolicious.yml:
  • workers is the number of processes that are started to serve WeBWorK requests.
  • accepts is the number of requests a worker serves before it is killed and a new one is spawned.

Lowering those two settings may help.
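Before changing anything it can help to see what the file currently sets. The following is a small sketch, assuming each setting sits on its own "key: value" line; the function name show_hypnotoad_settings is hypothetical, and spare is included only because it is adjusted later in this thread.

```shell
# Hypothetical helper: print the workers, accepts and spare lines of a config file,
# with line numbers so they are easy to find in an editor.
# Assumes one "key: value" setting per line, possibly indented; commented-out
# lines (starting with #) are skipped.
show_hypnotoad_settings() {
  grep -n -E '^[[:space:]]*(workers|accepts|spare):' "$1"
}
```

For example: 'show_hypnotoad_settings /opt/webwork/webwork2/conf/webwork2.mojolicious.yml'.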

Setting hypnotoad to restart automatically if it exits.

I added the following two lines to the end of the [Service] section of /opt/webwork/webwork2/conf/webwork2.service. They tell systemd to restart the webwork2 service, after a 5 second delay, whenever it stops for any reason.

Restart=always
RestartSec=5

At least then if it crashes there is a good chance that it will be automatically restarted, though I suspect that this is not foolproof.

You'll have to restart the webwork2 service after you make any of these changes.  There's another step you need to complete if you change the service file: it's something to do with daemon-reload and I don't remember the exact command, but IIRC the system will warn you about this when you restart webwork2 and tell you what it is.

In reply to Danny Glin

Re: WeBWorK 2.18 Crashing

by Glenn Rice -

The command Danny is looking for is

sudo systemctl daemon-reload

although, as Danny said, if you change a service file and then try to reload or restart that service it will tell you that you need to run that command.
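Put together, the sequence after editing the service file might look like the sketch below. The function name apply_unit_change is hypothetical, and it assumes sudo is available and that systemd manages the unit.

```shell
# Hypothetical helper: apply an edited unit file and restart the service.
# $1 = unit name, e.g. webwork2
apply_unit_change() {
  unit="$1"
  sudo systemctl daemon-reload &&     # re-read unit files from disk
  sudo systemctl restart "$unit" &&   # restart the service with the new settings
  systemctl show "$unit" -p Restart -p RestartUSec   # print what systemd actually loaded
}
```

Running 'apply_unit_change webwork2' should end by printing the Restart and RestartUSec properties. If they do not reflect the edit, systemd may be reading a different copy of the unit file than the one that was changed.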

In reply to Danny Glin

Re: WeBWorK 2.18 Crashing

by Mark Watney -

Thanks guys, these are great insights!

I executed a few grep commands and couldn't find any instance of 'killed process' in the logs, so I wonder whether the OOM killer really is the culprit.  Still, I might simply have been poking around in the wrong places.

Danny, you are correct that we are serving via proxy by Apache2 rather than have Hypnotoad serve directly; my IT partner is wary of directly serving content without a "name brand web server" in-between for security reasons, as he put it, so I'll have to stick to this arrangement for the time being.  Yet it didn't occur to me to tweak mojolicious settings (and it seems obvious now). 

I made the following changes in /opt/webwork/webwork2/conf/webwork2.mojolicious.dist.yml  (boldface added, as webwork2.mojolicious.yml did not exist in conf):

Changed workers from 25 to 15, and spare from 8 to 5

The comments in the file provide a good baseline for configuration per GB of RAM, but also a reminder to allow 2-4 GB for system processes.  My VM is only allotted 4 GB, so there is not much elbow room here.  I can ask my IT partner to up this if the machine on which it lives has enough resources.  4 GB was enough when we first spooled up the image with 2.16, but is this a bit too low now?

I also went ahead and added the restart command to /opt/webwork/webwork2/conf/webwork2.service.  This is far more elegant than the rudimentary band-aid I had set up on crontab to do nightly reboots.  

Hopefully these changes are effective.  Why do you all think that the crashing became far more frequent over the last few months, in comparison to when I first updated to 2.18 in 2023?  Is it a matter of newer browsers being more resource hungry or something similar? 

In reply to Mark Watney

Re: WeBWorK 2.18 Crashing

by Sean Fitzpatrick -

For evidence of the oom-killer, look in the systemctl logs. I think 'systemctl status webwork2' will give you what you want.

In reply to Sean Fitzpatrick

Re: WeBWorK 2.18 Crashing

by Mark Watney -
Great idea. 'systemctl status webwork2' gives me a good live sense of how long the webwork2 service has been running, as well as how many processes (currently 16) and how much memory (currently 552 MB) are associated with it. Does systemctl produce a dedicated log file somewhere?
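On systemd systems such as Ubuntu 20.04, service output is collected in the systemd journal rather than in a per-service file, and journalctl is the tool that reads it. Below is a minimal sketch, with hypothetical helper names, of pulling the webwork2 entries and the kernel messages for a window around a hang.

```shell
# Hypothetical helpers around journalctl.
service_log() {
  # $1 = unit name, $2 = start time in a form journalctl accepts, e.g. "2 hours ago"
  journalctl -u "$1" --since "$2" --no-pager
}

kernel_log() {
  # Kernel messages (where OOM kills are reported) since the given time
  journalctl -k --since "$1" --no-pager
}
```

For example: "service_log webwork2 '2 hours ago'" or "kernel_log today". Seeing every entry may require sudo, and whether entries from before a reboot are still available depends on how journald storage is configured on the machine.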
In reply to Mark Watney

Re: WeBWorK 2.18 Crashing

by Danny Glin -

Proxying with apache is a pretty standard configuration.  There are cases where it would be mandatory, such as if you want to serve other websites/applications from the same server.  The only downside is that it adds a level of complexity.  You now have one more service running, which uses additional resources and gives you another thing to troubleshoot if things go wrong.

It's odd that crashes would become more frequent over time without any change in usage.  I'm assuming that the server has been restarted occasionally over the last year, otherwise it's possible that other tasks have been eating up memory over time, leaving less for WeBWorK.  This is all assuming that the crashes are due to memory.

WeBWorK is set up to look for webwork2.mojolicious.yml.  If that file isn't present, then it reads from the distribution version at webwork2.mojolicious.dist.yml.  If you're going to customize anything in that file, then it's recommended to copy webwork2.mojolicious.dist.yml to webwork2.mojolicious.yml and make the changes in webwork2.mojolicious.yml.  Although it will work to edit webwork2.mojolicious.dist.yml, it will cause headaches when you want to upgrade to a new version of WeBWorK.
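The copy step can be sketched as follows. The function name make_local_config is hypothetical; it only creates webwork2.mojolicious.yml when that file is missing, so existing local changes are never overwritten.

```shell
# Hypothetical helper: create the local config from the distribution copy if it is missing.
# $1 = conf directory, e.g. /opt/webwork/webwork2/conf
make_local_config() {
  dist="$1/webwork2.mojolicious.dist.yml"
  local_copy="$1/webwork2.mojolicious.yml"
  if [ -e "$local_copy" ]; then
    echo "kept existing $local_copy"
  else
    # -p preserves the mode and timestamps of the distribution file
    cp -p "$dist" "$local_copy" && echo "created $local_copy"
  fi
}
```

For example: 'make_local_config /opt/webwork/webwork2/conf' (with sudo if the directory is not writable by your user), followed by a restart of the webwork2 service so the new file is read.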

In reply to Danny Glin

Re: WeBWorK 2.18 Crashing

by Mark Watney -
I, too, find it odd that crashes have become more frequent without any notable change in usage. This is a low-volume deployment; it serves around 50-75 students, depending on enrollments and who is teaching what each semester. This has been fairly consistent throughout the years.

To your question about the server: I used to restart it manually every few months, or when I would do routine housecleaning like updating the OPL at the start of a semester. With the recent frequent crashes, I've had to restart daily (so disruptive that I scheduled it in crontab).

Thanks for the webwork2.mojolicious.yml tip: I went ahead and made a copy from webwork2.mojolicious.dist.yml
In reply to Mark Watney

Re: WeBWorK 2.18 Crashing

by Mark Watney -

Update with some interesting results:

I tried to access WeBWorK via the standard URL this morning and it hung up again.  I SSHed in and ran 'systemctl status webwork2', which reported that the webwork2 service had actually been up and running for 3 and a half hours (since my scheduled nightly system reboot).  Interestingly, manually stopping the service and starting it again via 'sudo systemctl stop webwork2' and 'sudo systemctl start webwork2' made no difference.  The only thing that got it up and running again was a full system reboot.

I can't imagine there was heavy usage between the reboot at 4:00am and when I failed to access the login page at 7:30am. And according to systemctl status it seems like the OOM killer played no role in this. 

I'm stumped. 

In reply to Mark Watney

Re: WeBWorK 2.18 Crashing

by Danny Glin -
When you say that it hung up, what exactly are you seeing? Does the page time out after some period of time?

If you catch the system locked up again, try connecting directly to the webwork service via http://webwork.yourschool.edu:8080/webwork2. Depending on your firewall rules you might not be able to access this from a browser, so from the command line via ssh you can try
curl http://localhost:8080/webwork2

If that returns the html of the webwork landing page, then the webwork service is running properly and the problem is likely somewhere else.
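Because the symptom is a timeout rather than an error page, it helps to give curl a time limit and to compare the direct route with the route through Apache. This is a sketch under assumptions: the function name probe is hypothetical, the 10 second limit is arbitrary, and the second URL in the usage note assumes Apache answers on port 80 of the same machine.

```shell
# Hypothetical helper: print the HTTP status code for a URL,
# or "failed" if curl times out or cannot connect.
probe() {
  # -s: quiet, -o /dev/null: discard the body, --max-time: give up after 10 seconds,
  # -w '%{http_code}': print only the status code
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1") || code="failed"
  printf '%s -> %s\n' "$1" "$code"
}
```

Running 'probe http://localhost:8080/webwork2' tests hypnotoad directly, while 'probe http://localhost/webwork2' goes through the Apache proxy on the same machine. Any status code from the first together with "failed" from the second would point at the proxy layer rather than at WeBWorK.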

Also check the status of the apache2 service, and try restarting it. It's possible that apache is the thing that crashed and not WeBWorK.

The other thing to check is that the disk is not full, though if this were the case I would expect that you would be seeing other symptoms as well.
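A quick way to check disk usage is to filter the output of df. This is a minimal sketch; the function name full_mounts and the 90 percent threshold in the usage note are arbitrary choices for illustration.

```shell
# Hypothetical helper: read "df -P" output on stdin and print every mount point
# whose usage is at or above the given percentage.
# $1 = limit in percent
full_mounts() {
  awk -v limit="$1" 'NR > 1 {
    use = $5                # fifth column is the usage, e.g. "95%"
    sub(/%/, "", use)       # strip the percent sign
    if (use + 0 >= limit) print $6, $5
  }'
}
```

For example: 'df -P | full_mounts 90'. No output means no filesystem is at or above the limit. Running 'df -i' as well shows inode usage, which can run out even when free space remains.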
In reply to Danny Glin

Re: WeBWorK 2.18 Crashing

by Mark Watney -

Yes, when I say that it hangs up, I mean that the page times out after some period of time.  Nobody can access WeBWorK via web browser, despite systemctl reporting that the webwork2 service is up and running.  

I will try accessing the URL you suggested next time it hangs up, which should be fairly soon; I have crontab scheduling a mid day reboot and a student reported 36 minutes after the reboot that it hung up again.  This has gone from hang ups every few weeks to multiple times per day now.  

I can also check the status of apache2 via systemctl, but I suspect this will be up and running since I'm able to see the "It works!" apache landing page at http://webwork.myschool.edu, even when http://webwork.myschool.edu/webwork2 hangs up. 

In reply to Mark Watney

Re: WeBWorK 2.18 Crashing

by Danny Glin -

I forgot that you mentioned that the apache landing page still works, so that means that apache is still running and doing its job.

I wonder if the individual webwork processes are getting tied up by some request that doesn't finish, and once all of them are stuck there are none left to serve new requests.  The "top" command will show you information about running processes.  Hit "shift-M" to sort by memory usage, or "shift-P" to sort by processor usage.

Next time it happens, try stopping the webwork2 service, and then double-check to see if any webwork processes are still running.  You can use "ps -ef |grep webwork".  When you stop a service it tries to stop all related processes, but I've seen cases where some processes still end up running.
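An alternative to "ps -ef |grep webwork" is pgrep, which does not list itself in the results. The sketch below uses a hypothetical function name and assumes the processes show up under the path /opt/webwork/webwork2/bin/webwork2, as they do in the listing posted later in this thread.

```shell
# Hypothetical helper: list any webwork2 processes that are still alive.
# -f matches against the full command line, -a prints that command line next to the PID.
leftover_webwork() {
  pgrep -a -f '/opt/webwork/webwork2/bin/webwork2' || echo "no webwork2 processes found"
}
```

Run 'leftover_webwork' right after 'sudo systemctl stop webwork2'. Any PIDs it prints belong to processes that survived the stop.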

In reply to Danny Glin

Re: WeBWorK 2.18 Crashing

by Mark Watney -
Just did some testing at the most recent hang up. Accessing http://webwork.myschool.edu:8080/webwork2 directly (which only works when on my school's vpn due to firewall rules) actually did bring up the webwork login page, whereas http://webwork.myschool.edu/webwork2 was still timing out. I also verified that I'm able to access it directly with the curl command, which did indeed return the html of the landing page.

Running 'ps -ef |grep webwork' yielded the following output:
www-data    1377       1  0 12:26 ?        00:00:02 /opt/webwork/webwork2/bin/webwork2
www-data    1379    1377  0 12:26 ?        00:00:04 /opt/webwork/webwork2/bin/webwork2
www-data    1380    1377  0 12:26 ?        00:00:03 /opt/webwork/webwork2/bin/webwork2
www-data    1381    1377  0 12:26 ?        00:00:01 /opt/webwork/webwork2/bin/webwork2
www-data    1383    1377  0 12:26 ?        00:00:02 /opt/webwork/webwork2/bin/webwork2
www-data    1384    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1386    1377  0 12:26 ?        00:00:01 /opt/webwork/webwork2/bin/webwork2
www-data    1387    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1389    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1390    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1391    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1394    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1395    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1397    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1398    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
www-data    1401    1377  0 12:26 ?        00:00:00 /opt/webwork/webwork2/bin/webwork2
wwadmin     4192    4176  0 13:42 pts/0    00:00:00 grep --color=auto webwork
Am I to take this to mean that there were 16 processes running?  I'm not sure what to make of this time stamp (note: I ran this command at 13:42).  Upon stopping the webwork2 service via systemctl, only the wwadmin process on the last line above remained. 


In reply to Mark Watney

Re: WeBWorK 2.18 Crashing

by Glenn Rice -

The STIME is the fifth column.  That is the start time of the process.

There will be as many processes as you have set for the number of workers in conf/webwork2.mojolicious.yml.  Those processes are the workers.
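In the listing above, PID 1377 has parent PID 1 and the other 15 processes have 1377 as their parent, which is consistent with one hypnotoad manager process plus 15 workers. A small sketch for producing that summary from "ps -ef" output follows; the function name summarise_webwork is hypothetical, and treating a parent PID of 1 as the manager is an assumption that fits a service started by systemd.

```shell
# Hypothetical helper: summarise "ps -ef" output on stdin for the webwork2 binary.
# Column 3 of "ps -ef" is the parent PID and column 8 is the command.
summarise_webwork() {
  awk '$8 == "/opt/webwork/webwork2/bin/webwork2" {
         if ($3 == 1) manager++; else workers++
       }
       END { printf "manager=%d workers=%d\n", manager, workers }'
}
```

For example: 'ps -ef | summarise_webwork'. The grep line from the original listing is ignored because its command column is "grep", not the webwork2 path.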

In reply to Mark Watney

Re: WeBWorK 2.18 Crashing

by Danny Glin -

Based on this it looks like webwork is running properly, but connections are not getting through for some reason.

It looks like your institution uses Cloudflare to proxy off-campus connections to your WeBWorK server.  You may need to engage your IT folks to see if there is anything showing up in the logs for the public-facing IP address that proxies to your WeBWorK server.

In reply to Danny Glin

Re: WeBWorK 2.18 Crashing

by Mark Watney -
My institution does indeed use Cloudflare to proxy the requests to our server. However, when a user is on campus and connected to wifi with the proper credentials, it bypasses Cloudflare. Yet in this scenario the hang up is identical for the on-campus user, so would that rule out something funny going on with proxied connections?