Forums

In addition to the changes to the Linux OOM settings and to mpm_prefork.conf (lowering MaxConnectionsPerChild), you may want to try using Apache2::SizeLimit to limit the size of Apache server processes.

The idea is to prevent any individual apache process from getting too large. Since you are encountering problems, make sure that it polls the size often by using a small value of $Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS.

To tune $Apache2::SizeLimit::MAX_PROCESS_SIZE, keep an eye on the initial size of an Apache process and on how it grows when many requests are made (e.g. when loading many problems). You want to allow moderate growth in size, but not to allow any process to get very big, as multiple oversized Apache processes are what triggers the OOM condition.

See: https://webwork.maa.org/moodle/mod/forum/discuss.php?d=5008#p15317 and search the forums for older posts on this.

Installation -> mpm_prefork and RAM -> Re: mpm_prefork and RAM

by Nathan Wallach -

The library browser is a memory hog, as it renders many problems in a single Apache worker. Gateway quizzes also have similar issues.

You can reduce the memory it will use somewhat by the following (an illustrative snippet is shown after the list):

  • lowering MaxSpareServers    (don't waste too much space on idle workers)
  • lowering MaxRequestWorkers    (do not allow too many to run, to avoid swapping and performance degradation)
  • lowering MaxConnectionsPerChild   (meant to kill workers before they get very large)
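
For example (the values here are purely illustrative and should be tuned to the RAM actually available), the relevant lines in mpm_prefork.conf (on Debian/Ubuntu usually under /etc/apache2/mods-available/) might look something like:

  <IfModule mpm_prefork_module>
      StartServers              2
      MinSpareServers           2
      MaxSpareServers           5
      # cap simultaneous workers so (workers x per-process size) fits in RAM
      MaxRequestWorkers        50
      # retire each worker after a modest number of requests so it cannot grow without bound
      MaxConnectionsPerChild   50
  </IfModule>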

Adding the following in /etc/apache2/conf-enabled/webwork.conf (or wherever the real webwork.conf is on your system)

  # size limiter for Apache2
  use Apache2::SizeLimit;
  $Apache2::SizeLimit::MAX_PROCESS_SIZE = 340000;
  $Apache2::SizeLimit::MAX_UNSHARED_SIZE = 340000;
  $Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS = 5;

near the top of the <Perl> section. Search the forums for more about this.

I'm not experienced with gateway quizzes, and am a pretty new member of the WeBWorK community. However, I have great interest in capacity planning and issues of "scale" for WeBWorK for reasons unrelated to online exams - so I am going to toss in my 2 cents and some.

My impression is that in recent years "capacity planning" of WeBWorK servers has been of little interest to the community, as modern servers have enough RAM and CPU resources that typical homework usage does not often create noticeable stress or performance problems for most institutions using WeBWorK. We simply allocate enough resources to our servers, do a passable job of tuning the configuration to avoid any significant level of complaints from the students, and leave "well enough alone".

The discussions I have seen about how multiple problems are rendered at once in a gateway quiz make it clear that this is a very demanding manner of using WeBWorK.

Large online exams (and very large numbers of students) can apparently quickly leave the realm where current setups suffice, leaving us uncertain about what should be done to efficiently and cost-effectively provide the "scale" we may need.

There seems to be a real need for "the community" to work together on investigating and determining best practices for "server load scaling". At present, there does not seem to be enough accumulated experience to produce the sort of guide many of us (myself included) would like to have, which is probably why no such guide is available.

It would also be very nice to have instructions for how to install and operate a "cluster" using a load-balancer and multiple (virtual) servers which could "scale up" and "scale down" (horizontally) as necessary based on expected demand (e.g. scale up in advance for online exams). Note: the database capacity would probably also need to scale up/down, not only the WeBWorK Apache server capacity.
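
Purely as an illustration (not a tested recipe), the front end of such a cluster could be sketched with Apache's own mod_proxy_balancer; the back-end host names below are made up, and any other load balancer (nginx, HAProxy, a cloud load balancer) could play the same role:

  # hypothetical load-balancer front end
  # (requires mod_proxy, mod_proxy_http, mod_proxy_balancer, mod_lbmethod_byrequests)
  <Proxy "balancer://ww-backends">
      # each BalancerMember is one WW Apache server; hosts are placeholders
      BalancerMember "http://ww-node-1.example.edu"
      BalancerMember "http://ww-node-2.example.edu"
      ProxySet lbmethod=byrequests
  </Proxy>
  ProxyPass        "/webwork2"       "balancer://ww-backends/webwork2"
  ProxyPassReverse "/webwork2"       "balancer://ww-backends/webwork2"
  ProxyPass        "/webwork2_files" "balancer://ww-backends/webwork2_files"
  ProxyPassReverse "/webwork2_files" "balancer://ww-backends/webwork2_files"

The back-ends would still need shared storage for course data and a common database, and session stickiness (or shared session state) would have to be worked out; this sketch ignores those issues.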

It could be that using public cloud providers (with their ability to do elastic scaling and to bill based on "usage") would be a good option for online exams, if we knew how to do that and could turn it on/off as needed. Such an approach would hopefully avoid the need for each institution to operate local WW servers/clusters whose capacity is large enough for their large online exams but overkill the rest of the time. Hopefully usage-based costs would make this an affordable approach. It is probably also possible to get a similar result on-site, using ad-hoc approaches to horizontal scaling, additional VMs, etc., all managed by the local staff as necessary for "high demand" events. Determining the pros and cons of these two options and the costs (both financial and "staff time") involved would be very helpful, in my opinion.

Getting from where we are today to where some of us want to be in the future will require an investment of effort by several people with a vested interest in the outcome. Many of us (I speak at least for myself) do not have the background/experience to really do the configuration/testing/engineering needed to design best practices for server/cluster scaling and capacity planning for very high spurts of demand. A team of experienced WeBWorK "experts" together with some IT "scaling" experts would very likely be far more able to advance the necessary investigations and planning than the "regular" WeBWorK community alone. More of our employers now need WeBWorK at a large scale, so hopefully some of them will allocate resources to help with finding the solutions.

I recently did some load testing to try to understand system capacity for large numbers of single-problem render requests arriving in short periods of time; see https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4748 . Under some pretty intensive constant demand, a moderately sized WW virtual server (3 vCPU, 10GB RAM, WW in Docker on a CentOS base OS) was able to support about 2500 single-problem render requests per minute from about 100 "very demanding" clients. Designing methodologies to load-test different use cases of different WW installations would be helpful in providing real data for capacity-planning decisions.

The memory ballooning of the Apache processes is certainly a critical restriction, but CPU power is also needed to handle the render requests and to replace the Apache child processes sufficiently frequently when the server is under significant load. In the gateway setting, where all the questions of a quiz are rendered (and graded) at once (if I understand correctly), CPU demand will probably be pretty high for each "request", so having many students start a gateway quiz in a short time is very demanding on the server.

In terms of managing memory usage, I do not think that using "MaxConnectionsPerChild" alone is likely to be sufficient for gateway quizzes, as each gateway request makes multiple render requests per call. I would recommend looking into also using "Apache2::SizeLimit" as discussed at https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2692#p5887 but with "$Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS" set to a very small number (so that after a gateway request, the memory jump is detected quickly).
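
For example (the sizes are placeholders, in KB, to be tuned to the server), the relevant lines near the top of the <Perl> section of webwork.conf on a gateway-heavy server might look something like:

  use Apache2::SizeLimit;
  # placeholder limits (values are in KB); tune to the RAM actually available
  $Apache2::SizeLimit::MAX_PROCESS_SIZE  = 400000;
  $Apache2::SizeLimit::MAX_UNSHARED_SIZE = 400000;
  # check after every request so the memory jump from a gateway render
  # is detected (and the child process retired) as soon as possible
  $Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS = 1;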

In a slightly different direction - maybe the "gateway quiz" is not the best assignment type for a large online exam. A new "assignment type" which is more "homework like" in rendering problems one at a time, but has some of the additional features that gateway quizzes have, might be a better alternative for large online exams in WeBWorK. If students could navigate from question to question and submit each one individually, the stress on the server would be lower. The price is that students would need to "flip" from question to question and as such make many small "render" requests. I'm not sure what would be needed, but it bears consideration. For now, using the existing "homework" assignment type with the assignment opened for just the few hours of an online exam might be able to support more students with fewer server problems than a gateway quiz with the same set of questions.

Some other discussion threads that discuss load issues:

https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4645
https://webwork.maa.org/moodle/mod/forum/discuss.php?d=3904
https://webwork.maa.org/moodle/mod/forum/discuss.php?d=3827
https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2590
https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2927

Edfinity was willing to tell me that they do load-balancing with multiple servers and shared storage, and that their back-end WW servers are stateless (their front-end handles all the database work). This is a positive indication of the feasibility of such large-scale back-end WW clusters.

---

Capacity planning remains quite challenging without much data, so I decided to stress-test my development server using a simple free tool to try to get a reasonable handle on the capacity it provides. I used siege ( https://www.joedog.org/siege-manual/ ), as it is very simple and sufficed for the basic sort of testing I wanted to do, but it may not be sufficient for testing a much larger server.

Below is information on the server, and the test results.

The VM is a development/testing server with 10GB RAM and 3 vCPUs (WW running inside Docker) on a CentOS base OS. siege was pointed at URLs for embedding problems in HTML pages, and only requested the main HTML part of problems via the html2xml interface. The tests were run against a list of 40 html2xml problem URLs. During these load tests I also used a browser to load one of those pages (not very often); it would render, but the delay/latency was certainly noticeable when the server was under real stress (150 clients), and much less so when it was handling 80 clients, when the time to load/render on screen was more or less typical.
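
For reference, the file passed to siege with -f is just a plain text list of URLs, one per line; the host name and query strings below are placeholders rather than the actual URLs used in these tests:

  # /home/tani/.siege/url-01.txt (placeholder entries, one html2xml URL per line)
  https://ww.example.edu/webwork2/html2xml?[query string selecting problem 1]
  https://ww.example.edu/webwork2/html2xml?[query string selecting problem 2]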

This server gets quite stressed when 150 siege clients are hitting it in parallel (lots of swap activity was triggered), but functions reasonably well for 100 siege clients (very moderate swap activity, though CPU usage would max out at times - apparently when Apache processes were being started and stopped). Average response times were a bit better with only 80 siege clients. I suspect that testing with somewhere between 180 and 200 clients would probably DoS the server due to excessive swapping and the OOM issues, but I did not try it in practice.

Note: The WW timing.log file does not seem to keep up with the load, and only a small fraction of the render calls are getting logged to it. In my case the file is on NFS, which may be hindering the logging code more than would occur on local storage, but I suspect that the WW logging code for timing.log is simply not up to handling this sort of load.

These simple tests seem to show that this server config can support up to about 2750 "renders" per minute when 100 clients are making constant streams of sequential requests, and about 2650 "renders" per minute when there are only 80 such clients (but with faster average response times). That seems to be a reasonable estimate of maximum capacity for this server instance.

After some tuning of the mpm_prefork settings and Apache2::SizeLimit (see https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2692#p5887 ) as follows, I got the results shown below from several "siege" runs with different settings.
  • MaxRequestWorkers set to 200
  • MaxConnectionsPerChild set to 25
  • $Apache2::SizeLimit::MAX_PROCESS_SIZE = 420000;
  • $Apache2::SizeLimit::MAX_UNSHARED_SIZE = 420000;
  • $Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS = 5;
There is certainly a tradeoff between the memory growth of the Apache processes and the CPU costs of starting up new Apache workers. It is certainly possible that more careful tuning could somewhat improve performance on the tests, but I'm not sure how much more effort on this is worthwhile at present. I do hope to arrange to run tests with some additional RAM/vCPU resources in the near future to see what sort of scaling / performance behavior I can observe.

(Note: I found it helpful to put the server's IP address in my /etc/hosts file to avoid DNS delays and some failures I saw during earlier stress tests with the same approach.)
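
For example (the IP address and host name are placeholders), the entry on the machine running the load test might look like:

  # /etc/hosts entry on the client machine running siege (placeholder values)
  192.0.2.17    mywebworkserver.example.edu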

siege -c 150 -t120S -f /home/tani/.siege/url-01.txt

** SIEGE 4.0.4
** Preparing 150 concurrent users for battle.
The server is now under siege...
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out

Lifting the server siege...
Transactions:                   4575 hits
Availability:                  99.93 %
Elapsed time:                 119.37 secs
Data transferred:              49.18 MB
Response time:                  3.83 secs
Transaction rate:              38.33 trans/sec
Throughput:                     0.41 MB/sec
Concurrency:                  146.89
Successful transactions:        4575
Failed transactions:               3
Longest transaction:           55.34
Shortest transaction:           0.33

siege -c 100 -t360S -f /home/tani/.siege/url-01.txt

[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out


Lifting the server siege...
Transactions:                  16685 hits
Availability:                  99.98 %
Elapsed time:                 359.95 secs
Data transferred:             191.27 MB
Response time:                  2.14 secs
Transaction rate:              46.35 trans/sec
Throughput:                     0.53 MB/sec
Concurrency:                   99.37
Successful transactions:       16685
Failed transactions:               3
Longest transaction:           43.86
Shortest transaction:           0.10


siege -c 80 -t180S -f /home/tani/.siege/url-01.txt
** SIEGE 4.0.4
** Preparing 80 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out

Lifting the server siege...
Transactions:        7530 hits
Availability:       99.91 %
Elapsed time:      179.84 secs
Data transferred:       85.16 MB
Response time:        1.87 secs
Transaction rate:       41.87 trans/sec
Throughput:        0.47 MB/sec
Concurrency:       78.19
Successful transactions:        7531
Failed transactions:           7
Longest transaction:       32.54
Shortest transaction:        0.10

siege -c 80 -t180S -f /home/tani/.siege/url-01.txt
** SIEGE 4.0.4
** Preparing 80 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out

Lifting the server siege...
Transactions:        8006 hits
Availability:       99.99 %
Elapsed time:      179.30 secs
Data transferred:       90.26 MB
Response time:        1.77 secs
Transaction rate:       44.65 trans/sec
Throughput:        0.50 MB/sec
Concurrency:       79.14
Successful transactions:        8006
Failed transactions:           1
Longest transaction:       36.11
Shortest transaction:        0.10
 
tani@lxtani:~$ siege -c 80 -t180S -f /home/tani/.siege/url-01.txt
** SIEGE 4.0.4
** Preparing 80 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out

Lifting the server siege...
Transactions:        8048 hits
Availability:       99.96 %
Elapsed time:      179.57 secs
Data transferred:       91.50 MB
Response time:        1.76 secs
Transaction rate:       44.82 trans/sec
Throughput:        0.51 MB/sec
Concurrency:       78.86
Successful transactions:        8048
Failed transactions:           3
Longest transaction:       34.93
Shortest transaction:        0.10


Thanks. CentOS 7 also ships with Apache 2.4.x, like Debian and Ubuntu. Could the low memory consumption on Red Hat systems be attributed to older versions of Apache and mod_perl? (a wild guess) I will test it out with CentOS 6 and 7. Here are the settings I have now on the 128 GB server. This keeps the resident size mostly under control -

KeepAliveTimeout 2


StartServers              2
MinSpareServers           2
MaxSpareServers           5
MaxRequestWorkers       600    (assuming 120 GB available for webwork)
ServerLimit             600
MaxConnectionsPerChild  100


$Apache2::SizeLimit::MAX_PROCESS_SIZE = 600000; (virtual size)
$Apache2::SizeLimit::MAX_UNSHARED_SIZE = 600000;
$Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS = 20;



From what I can see, Google Chrome is the worst case, keeping 6 connections active. With an 800+ student class, if at least half the students come in at the same time, theoretically I could exhaust all 256 GB of RAM, assuming the ~200 MB prefork process size. Would an nginx SSL proxy solution work in this case? Say Chrome hits nginx with 6 connections, but nginx engages only one prefork Apache connection. Would this break LTI? Thanks again for the responses and help.

Installation -> apache prefork memory consumption

by Balagopal Pillai -
Our department operates a new webwork server (12 cores, 128 GB RAM, Debian 8, WeBWorK 2.11) that is getting loaded now with more courses, and one more is on order (20 cores, 256 GB RAM) to serve a large course in the fall (800+ students). So I was looking at the resource requirements for Apache. Please see below from a test setup -

apache with ssl, no webwork enabled in the config -

 17835 www-data  20   0  277716  15616   8912 S   0.0  0.8   0:00.66 /usr/sbin/apach

 
apache with ssl, with the webwork apache config enabled -

22130 www-data  20   0  489012 112412   1944 S   0.0  5.5   0:00.00 /usr/sbin/apach

As you can see, enabling webwork bumps up the resident size to over 100 MB per process. On the production webwork server, I see that it grows continuously to a gigabyte and beyond. I have somewhat worked around it by dialing down MaxSpareServers, KeepAliveTimeout and MaxConnectionsPerChild, adjusting MaxRequestWorkers and ServerLimit on the assumption of 200 MB per apache process (instead of the 50 MB mentioned in one of the install guides), and using Apache2::SizeLimit with a 600 MB virtual size as the limit for graceful termination of an apache process (which turns out to be 200 - 250 MB resident size). But if the apache process could use less than 50 MB resident size, it might be possible for me to squeeze in more courses on the server.

Please see the pmap output below. I can see a large anon allocation under apache that might be the reason for this ballooned-up resident size -

0000558760782000       0       0       0 rw--- apache2
0000558760786000      12       8       8 rw---   [ anon ]
0000558760786000       0       0       0 rw---   [ anon ]
0000558761d48000  105160  105096  105096 rw---   [ anon ]

I did some testing, starting with a blank apache webwork config file and adding all lines up to eval "use lib '$pg_dir/lib'"; die $@ if $@;  Here is the process -

21945 www-data  20   0  324636  19200   1976 S   0.0  0.9   0:00.00 /usr/sbin/apach

Then I added require Apache::WeBWorK; and restarted apache. Please see below. This adds about 85 MB -

22136 www-data  20   0  480216 105604   1880 S   0.0  5.2   0:00.00 /usr/sbin/apach

        Then adding the rest of the lines back adds another few megabytes to the resident size -

22396 www-data  20   0  489004 112432   1972 S   0.0  5.5   0:00.00 /usr/sbin/apach

         Is there something in the webwork config I could modify to get the resident sizes to reasonable values and stop the ballooning effect when students access the courses? Thanks very much.

Balagopal

Hi Tony,

The problem may be that some of your apache processes are allowed to grow too large. You can limit them using the Apache2::SizeLimit module. (This module is already installed with apache2, so you don't want to download it from CPAN!). See discussion here:


If I recall correctly, the only thing we did was to add the line 

PerlCleanupHandler Apache2::SizeLimit

to the end of /etc/apache2/apache2.conf

and our slowdown problem went away. I don't think we did any of the other configuration mentioned in above link.

Lars

Installation -> WW 2.7, lighttpd, ApacheSizeLimit

by Hal Sadofsky -

Hi everyone,

I wonder if anyone has really explicit instructions for how to alter the configuration files that come with the WW 2.7 LiveDVD in order to use lighttpd.

I've already installed lighttpd and started it. I tried following the instructions here: http://webwork.maa.org/wiki/Installation_Manual_for_2.7_on_Ubuntu_12.04#Checking_for_and_installing_hotfixes which are quite good, but apparently not rewritten for WW 2.7, since they still make reference to the global.conf file.

I tried to adapt them and my attempt didn't quite work.  I'm sure I can do this by trial and error, but at the moment I only have a production server and not a development server (a long story) so I'm not anxious to do trial and error while students are working on their homework, nor am I anxious to get up at 4am to do this while the server is idle.  (We have about 3000 students using the server this term.)

I also have the same question for implementing ApacheSizeLimit.  Alex Basyrov has nice instructions that didn't seem to quite work in my situation,  but I'm trying to avoid the trial and error approach on my live server.

thanks,  Hal