WeBWorK Main Forum

Estimating server load for online exams

by Paul Vojta -
Number of replies: 10

Dear all,

With the shelter in place restrictions now in effect, some of our instructors are using WeBWorK's gateway/quiz feature to give midterm exams (and they may give finals, too, depending on what I can come up with).  However, the midterm exams that they gave last week basically failed -- the server became overloaded and unusable.

Our server was working well for homework, but switching to giving exams caused students to log in all at once (different times for each class, enrollments varying from 170 to 400).

Are there any guidelines for how to determine how big of a (virtual) machine would suffice for the new usage mode (other than trial and epic fail)?

In reply to Paul Vojta

Re: Estimating server load for online exams

by Danny Glin -
The problem with gateway quizzes is that when a student generates a new quiz, a single apache process has to load and render all of the problems on the quiz, which can lead to several problems (as you've experienced).
  1. If there are a large number of problems on the quiz, then the apache process is tied up until they are all rendered.
  2. Each time a problem is rendered the memory usage of the apache process grows, and this memory is not recovered until that process is killed.  If your available RAM is used up, then these processes start using swap (disk-backed virtual memory), which is extremely slow.

If this gets too far, then it eventually locks up the system.

Most of this can be mitigated with the right tuning of apache settings.  Throwing more resources at the problem may not even help if apache is not using them efficiently.

This thread has some useful information about apache settings.  In particular, make sure that in your apache configuration you have

MaxConnectionsPerChild 100

If that is set to 0, then your WeBWorK server will eventually crash.  You will also want to tweak MaxRequestWorkers, and possibly a couple of other settings.  See the quoted thread for more details.

For reference, we have two WeBWorK web servers (load balanced) each with 8GB of RAM, and we have run exams with >100 students starting at the same time.  We recently ran a gateway quiz on WeBWorK with ~600 students with no issues, but the exam was open for 24 hours so we didn't have all the students starting at once.

There are a couple of things you can do on the setup of the exam to ease the load on your servers:

  1. Check for inefficiencies in the code of the problems you're using.  Having a while loop that may require a large number of iterations can slow things down.  In an assignment you may never notice, but when you are rendering a lot of problems at once this can add up.
  2. Widen the window for the exam beyond the allowable time.  e.g. if you want to run the exam from 1:00 to 4:00, set the open date to 12:45 and the close date to 4:15, but still have a time limit of 180 minutes.  This way you won't have all of the students scrambling to start right at 1:00, which will spread out the load.
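
The window arithmetic in item 2 can be checked with a small sketch. This is not WeBWorK code: the function name and the calendar date are made up for illustration, and it assumes that an attempt ends at the start time plus the time limit or at the close date, whichever comes first.

```python
from datetime import datetime, timedelta

def latest_finish(start, close, limit_minutes):
    """Return when a student who starts at `start` has to be done.

    Assumption: the attempt ends at start + time limit, or at the
    close date, whichever comes first.
    """
    return min(start + timedelta(minutes=limit_minutes), close)

# Window from the post: open 12:45, close 4:15, time limit 180 minutes.
# The calendar date is hypothetical.
open_at = datetime(2020, 4, 1, 12, 45)
close_at = datetime(2020, 4, 1, 16, 15)

early = latest_finish(open_at, close_at, 180)                      # starts at 12:45
late = latest_finish(datetime(2020, 4, 1, 13, 15), close_at, 180)  # starts at 1:15
print(early.strftime("%H:%M"), late.strftime("%H:%M"))
```

Under that assumption, anyone who starts between 12:45 and 1:15 still gets the full 180 minutes, so the starts can spread over half an hour without anyone losing time.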
In reply to Danny Glin

Re: Estimating server load for online exams

by Edward Sternin -

Danny,

I want to predict that this topic will become very popular very soon.  We have just had a massive crash of the WebWork server that tried to provide a gateway quiz to 800 students simultaneously. This particular one consisted of about 100 multiple-choice questions.

See the students' view here: https://www.reddit.com/r/brocku/comments/gkaraz/webwork_no_ready_for_primetime/

Our sysadmins are tuning the server and giving it more resources for next week's test, but starting in the Fall, we will have 2000 students trying to attempt the same.  Would it even be possible to administer a quiz to this many students simultaneously, or must it be staggered?

Since the master apache process never reduces in size, this means staggering just the start times (by 10-15 minutes for each subgroup of students) will not do it, right? - we need to reduce the number of simultaneously active users? Staggering by the whole length of a test has some scheduling implications, and also means cheating will be even more rampant.

I found some useful suggestions on tuning the server for WebWork load here: https://hirebenjam.in/tag/webwork/ but perhaps the time is nigh to develop an official server load scaling guide?
[Danny? Michael?]

And most importantly: is there any way to reduce the WebWork overhead of many simultaneous users? I assume it is strictly a RAM problem: the CPU load comes in many small bursts, as the problems are rendered or answers checked.  Is this a fair assumption?

In reply to Edward Sternin

Re: Estimating server load for online exams

by Nathan Wallach -
I'm not experienced with gateway quizzes, and am a pretty new member of the WeBWorK community. However, I have great interest in capacity planning and issues of "scale" for WeBWorK for reasons unrelated to online exams - so I am going to toss in my 2 cents and some.

My impression is that in recent years "capacity planning" of WeBWorK servers became of little interest to the community, as modern servers have enough RAM and CPU resources that typical homework usage does not often create noticeable stress and performance problems for most institutions using WeBWorK. We simply allocate enough resources to our servers and do a passable job in tuning the configuration to avoid any significant level of complaints from the students and leave "well enough alone".

The discussions I have seen about how multiple problems are rendered at once in a gateway quiz make it clear that this is a very demanding manner of using WeBWorK.

Large online exams (and very large numbers of students) can apparently quickly leave the realm of where current setups suffice, and leave us uncertain of what should be done to efficiently and cost-effectively provide the "scale" we may need.

There seems to be a real need for "the community" to work together on investigating and determining best practices for "server load scaling". At present, it does not seem that there is enough accumulated experience to provide a guide of the sort many of us (myself included) would all like to have, which is why there is no such guide available anywhere we know of.

It would also be very nice to have instructions of how to install and operate a "cluster" using a load-balancer and multiple (virtual) servers which could "scale up" and "scale down" (horizontally) as necessary based on expected demand (ex. scale up in advance for online exams). Note: The database capacity would probably also need to scale up/down, and not only the WeBWorK Apache server capacity.

It could be that using public cloud providers (with their capabilities to do elastic scaling and bill based on "usage") might be a good option for online exams, were we to know how to do that and be able to turn it on/off as needed. Such an approach would hopefully avoid the need for each institution to operate local WW servers/clusters whose capacity is large enough for their large online exams, but overkill for the rest of the time. Hopefully usage based costs would make this an affordable approach. It is probably also possible to get a similar result using on-site solutions using ad-hoc approaches to horizontal scaling, additional VMs, etc. all managed by the local staff as necessary for "high demand" events. Determining the pros and cons of these two options and the costs (both financial and "staff time") involved would be very helpful in my opinion.

Getting from where we are today to where some of us want to be in the future will require the investment of effort by several people with a vested interest in the outcome.  Many of us (I speak at least for myself) do not have the background/experience to really do the configuration/testing/engineering needed to design the best practices for server/cluster scaling/capacity planning to be prepared for very high spurts of demand. It is very likely that a team of experienced WeBWorK "experts" together with some IT "scaling" experts would be far more able to advance the necessary investigations and planning effectively than just the "regular" WeBWorK community alone. More of our employers now have a need for WeBWorK at a large scale so hopefully some of them will allocate resources to help with finding the solutions.

I recently did some load testing to try to understand system capacity for large numbers of single problem render requests arriving in short periods of time. See https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4748 for details. The test put a moderately sized WW virtual server (3vCPU, 10GB RAM, WW in Docker on CentOS base OS) under some pretty intensive constant demand. That server was able to support about 2500 single problem render requests per minute from about 100 "very demanding clients". Designing methodologies to load-test different use cases of different WW installations would be helpful in providing real data useful in the capacity planning decisions.
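
To put the measured figure in other units, here is a rough back-of-envelope sketch. The function names are made up, and the 800-request burst is a hypothetical input; it also assumes a gateway page costs no more per problem than a single problem render, so the result is only a lower bound.

```python
def summarize_load_test(requests_per_minute, clients):
    """Convert a measured render rate into per-second and per-client figures."""
    return {
        "per_second": requests_per_minute / 60,
        "per_client_per_minute": requests_per_minute / clients,
    }

def minutes_to_clear(burst_renders, requests_per_minute):
    """Lower bound on the time needed to work through a burst of renders."""
    return burst_renders / requests_per_minute

# Figures reported in this post: about 2500 renders/minute from about 100 clients.
stats = summarize_load_test(2500, 100)
print(round(stats["per_second"], 1), stats["per_client_per_minute"])

# Hypothetical burst: 800 students each needing one problem rendered at once.
print(minutes_to_clear(800, 2500))
```

A real exam will not behave this neatly, since memory pressure and process recycling are ignored here, but it gives a first sanity check before running an actual load test.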

The memory ballooning of the Apache processes is certainly a critical restriction, but CPU power is needed to handle the render requests, as well as the need to replace the Apache child processes sufficiently frequently when the server is under significant load. In the gateway setting where all the questions of a quiz are rendered (and graded) at once (if I understand correctly) - CPU demand will probably be pretty high for each "request" so that having many students start a gateway quiz in a short time is very demanding on the server.

In terms of managing memory usage, I do not think that using "MaxConnectionsPerChild" is likely to be sufficient for Gateway quizzes, as each gateway request makes multiple render requests per call. I would recommend looking into also using "Apache2::SizeLimit" as discussed at https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2692#p5887 but with the setting for "$Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS" set to be a very small number (so that after a Gateway request, the memory jump will be detected quickly).

In a slightly different direction - maybe the "gateway quiz" is not the best assignment type for a large online exam. Maybe a new "assignment type" which is more "homework like" in rendering problems one at a time, but having some of the additional features that Gateway quizzes have could provide a better alternative for large online exams in WeBWorK. If the students could navigate from question to question and submit each one individually - the stress on the server would be lower. The price is that students would need to "flip" from question to question and as such make many small "render" requests. I'm not sure what would be needed, but it bears consideration. For now, using the existing "homework" assignment type with the assignment opened for just the few hours of an online exam might be able to support more students with fewer server problems than a Gateway quiz with the same set of questions.

Some other discussion threads about load issues:

https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4645
https://webwork.maa.org/moodle/mod/forum/discuss.php?d=3904
https://webwork.maa.org/moodle/mod/forum/discuss.php?d=3827
https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2590
https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2927
In reply to Edward Sternin

Re: Estimating server load for online exams

by Nathan Wallach -
"We have just had a massive crash of the WebWork server that tried to provide a gateway quiz to 800 students simultaneously. This particular one consisted of about 100 multiple-choice questions."

I personally would find it interesting to know more about the server which had this failure:
  • amount of RAM / CPUs,
  • Apache mpm_prefork settings,
  • base OS,
  • number of students who use this server for homeworks during a typical semester, and
  • some data on system load / performance near "large" homework deadlines.

It seems that https://webwork.maa.org/moodle/mod/data/view.php?d=5 would be a suitable place for such data to be collected, but it does not seem to be functional at present.

Such data would at least be helpful for others who want to plan their capacity for similar "regular" loads, as well as provide an indication that it is far below what is needed for an online exam of this size.

In reply to Nathan Wallach

Re: Estimating server load for online exams

by Edward Sternin -
Our server configuration/planning is done by our central ITS, and I have no access to this information. I have urged the sysadmin to join this discussion. I personally found your use of siege quite illuminating. I think a "standard" load test should be developed along the same lines, and made into a part of the server installation/certification.

The same class has used the same server for homework with few issues, but anecdotally, there have been issues with a 300-student calculus class in the past semester. I have successfully run the final exam in a 120-person class, but it was a two-hour time-limited "homework" in a 96-hr time window. This ran OK, and seems like the only realistic way to do it. In a newly unproctored world, this seems less than satisfactory, but there it is.
In reply to Edward Sternin

Re: Estimating server load for online exams

by Glenn Rice -
One thing you can do that might reduce the load on the server is change the setup of the gateway quiz that you are using. Make it so that only one problem is displayed at a time. This almost reduces the server load to the same as for a homework problem. If a student opens a page of a gateway quiz, then the server parses and executes the pg file for each problem on the page. The pg files for the problems not on the current page are not parsed. So if you set it to one problem per page, then each request from the student will result in one pg file being parsed and executed. This will spread the server load more evenly. Instead of the server needing to load numerous problems for all of the students right at the beginning of the exam, the server will only need to load one problem for each student. Also, each time a student changes a page the server only needs to load one problem.

Of course this won't help you if you already have it set for one problem per page.
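
As a rough illustration of the difference, the sketch below counts how many pg files get parsed when everyone opens the quiz at once. The function name, class size and quiz length are hypothetical, and it assumes one parse per problem on the requested page, as described above.

```python
def renders_at_start(students, problems_per_page):
    """Number of pg files parsed if every student opens the first page at once."""
    return students * problems_per_page

# Hypothetical exam: 400 students, 20 problems.
students, problems = 400, 20

all_on_one_page = renders_at_start(students, problems)  # whole quiz on one page
one_per_page = renders_at_start(students, 1)            # one problem per page
print(all_on_one_page, one_per_page)
```

The total work over the whole exam does not shrink, since every problem still has to be rendered at some point. What changes is that the opening burst is smaller and the rest is spread over the exam as students move from page to page.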
In reply to Glenn Rice

Re: Estimating server load for online exams

by Edward Sternin -
It was already one problem per page, though one of the "problems" was a combination of 25 T/F statements, scored as one.
In reply to Edward Sternin

Re: Estimating server load for online exams

by Danny Glin -

Unfortunately when the Gateway Quiz mode was first built it was to serve a drop-in quiz model where there were typically not a large number of students starting at the same time, which is why many institutions are running into issues with large scale synchronous exams.

The thing that typically kills a WeBWorK server is running out of memory, which is what Edward experienced.  The way to mitigate this is to limit the number of simultaneous apache processes, and to try to limit the size of each process.

The master apache process doesn't typically grow over time, at least not to an extent where it becomes a memory hog.  The master process doesn't serve apache requests, so it doesn't suffer from the memory usage associated with serving PG problems.  The child processes are the ones that grow in memory usage, but there are a couple of ways to mitigate that.

Rendering a PG problem leads to some growth in the size of the apache child that does the rendering.  One of the issues with the gateway quiz module is that all of the problems on a page are rendered by a single process, which can lead to a significant ballooning of memory usage by that process.  The typical solution is to kill off these child processes frequently.  This is handled by the MaxConnectionsPerChild directive in the apache configuration.  I typically recommend setting this to 50 or 100, which means that a child will serve 50 or 100 requests before it is killed and a new process is spawned.  If your server is doing almost exclusively quizzes then it may make sense to go even lower.  The extreme would be to set it to 1, which would mean that a new child is spawned for every request.  This would likely almost solve the memory issue at the expense of an increase in processing/disk usage, as there is some overhead to create new processes.  It would likely mean that it would take longer for pages to load, but at least the server wouldn't end up thrashing.

Also make sure that MaxRequestWorkers is set appropriately for your server.  This is the maximum number of child processes that will be started at a time.  I typically budget 100MB to 200MB of RAM to each child process, but YMMV.  The lower this number the less likely you are to run out of memory, but it also means that fewer requests can be served at a time.
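
The budgeting above amounts to a single division. Here is a minimal sketch; the function name is made up, and the 2 GB held back for the OS and other services is an assumed figure that you should replace with what you measure on your own server.

```python
def max_request_workers(total_ram_mb, reserved_mb, per_child_mb):
    """Largest number of Apache children that fits in the RAM left over
    after reserving some for the OS, database, and so on."""
    usable = total_ram_mb - reserved_mb
    if usable <= 0 or per_child_mb <= 0:
        raise ValueError("no RAM left for Apache children")
    return usable // per_child_mb

# 8 GB server, with 2 GB reserved (the reservation is an assumption).
cautious = max_request_workers(8192, 2048, 200)    # 200 MB per child
optimistic = max_request_workers(8192, 2048, 100)  # 100 MB per child
print(cautious, optimistic)
```

The gap between the two results shows how sensitive the setting is to the per-child estimate, so it is worth checking the real size of your apache children under load before settling on a value.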

Please post your experiences here, as every institution uses WeBWorK slightly differently, so there isn't a one size fits all solution.

In reply to Paul Vojta

Re: Estimating server load for online exams

by Paul Vojta -
Recently, a problem came up in a large exam given at Berkeley. It is probably relevant to any large installation of WeBWorK.
The problem was that, after a point, most students couldn't connect to WeBWorK. Instead, they got errors "Too many connections".
As it turned out, the Apache web server was set up with MaxRequestWorkers = 150 (the installation default value), and the mysql server was set up with max_connections = 100 (also the default). This caused problems because (as I understand it) each WeBWorK worker process caches database connections, and can have as many open connections as there are courses in active use.
In particular, having max_connections < MaxRequestWorkers is just asking for trouble.
Here's a forum thread that discusses this issue in more detail: https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4286
Note that our server has 32 GB of memory, so the recommended value of MaxRequestWorkers (at 5 per 1 GB of memory) comes out to 160, near the default value.
It has been suggested to use max_connections = 5000. See https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2590
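
The two checks above are easy to script. This is only a sketch with made-up function names; the 5-workers-per-GB figure is the rule of thumb quoted in this post, not a measured value.

```python
def workers_for_ram(ram_gb, workers_per_gb=5):
    """Rule-of-thumb MaxRequestWorkers for a server with the given RAM."""
    return ram_gb * workers_per_gb

def connections_sufficient(max_connections, max_request_workers):
    """True if the database allows at least one connection per Apache worker."""
    return max_connections >= max_request_workers

print(workers_for_ram(32))               # our 32 GB server
print(connections_sufficient(100, 150))  # the defaults we were running with
```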
In reply to Paul Vojta

Re: Estimating server load for online exams

by Danny Glin -
Be careful about taking advice from old posts. The post suggesting using max_connections=5000 is from 2011, and much has changed since then.

The database interaction from an apache process is completely independent of which course it is accessing, so the only time an apache process will open an additional connection to the database is if it can't use the existing one because of some error. We have two web servers each with MaxRequestWorkers set to 50. I had max_connections on the DB server set to 500 because it was also hosting a Moodle database, and we never saw the "too many connections" error.

In fact, a current check shows the following:
MariaDB [(none)]> show global status like 'Max_used_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 101   |
+----------------------+-------+

This means that there have never been more than 101 simultaneous connections since the DB was last restarted (which in this case was a couple of weeks ago, but there have been a number of gateway quizzes run since then). From this I'm concluding that each apache process only opens one connection.

You are correct, max_connections < MaxRequestWorkers is very bad, but you should only need to increase max_connections to a number larger than MaxRequestWorkers by some buffer amount. I would think that 200 would be more than enough.
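
As a sketch of that sizing rule: the function name is made up, and it assumes one database connection per apache child, which is what the numbers above suggest but is not guaranteed.

```python
def suggested_max_connections(workers_per_web_server, headroom):
    """Sum MaxRequestWorkers over all web servers and add a buffer.

    Assumption: each Apache child holds at most one database connection.
    """
    return sum(workers_per_web_server) + headroom

# One web server with MaxRequestWorkers = 150, buffer of 50.
print(suggested_max_connections([150], 50))

# Two load-balanced web servers with 50 workers each; the buffer of 25 is hypothetical.
print(suggested_max_connections([50, 50], 25))
```

If other applications share the database server, their connections need to be added to the total as well.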