Features & Development

Supporting WW for very large user bases - request for information

by Nathan Wallach -
Number of replies: 3

Various factors have led to significant interest in understanding what would be needed to run WW at a "very large" scale. By this I mean having a "server" or "cluster" or whatever be able to support, say, 100,000 users making use of embedded problems. Due to the ability of web pages to embed multiple problems in a single page, and the "scale" involved, peak loads could potentially require "rendering" and serving many thousands of questions per minute.

In the long term, it seems to me that such needs are probably best met by using "cloud technologies" and horizontal scaling via something like Kubernetes. I have corresponded in the past with several people potentially interested in helping run WW well in Kubernetes.

In the short term, having information available on existing WW servers/clusters which serve large user bases (estimated number of users per semester, some data on server resources and loads) would help me and others understand how far WW can scale using current options (large servers, basic load-balancing and redundant servers).

Anyone able to share data about their "large" WW installation is invited to reply.

In reply to Nathan Wallach

Re: Supporting WW for very large user bases - request for information

by Nathan Wallach -
Edfinity was willing to tell me that they do load-balancing with multiple servers and shared storage, and that their back-end WW servers are stateless (their front-end handles all the database work). This is a positive indication about the possibility of such large-scale back-end WW clusters.

---

Capacity planning remains quite challenging without much data, so I decided to stress-test my development server using a simple free tool to try to get a reasonable handle on the capacity it provides. I used siege (https://www.joedog.org/siege-manual/) as it is very simple and sufficed for the basic sort of testing I wanted to do, but it may not be sufficient to test a much larger server.

Below is information on the server, and the test results.

The development/testing server is a VM with 10 GB RAM and 3 vCPUs on a CentOS base OS, with WW running inside Docker. siege was pointed at URLs for embedding problems in HTML pages, and only requested the main HTML part of each problem via the html2xml interface. The tests were run against a list of 40 html2xml problem URLs. During these load tests, I also occasionally loaded one of those pages in a browser; it would render, but the delay/latency was certainly noticeable when the server was under real stress (150 clients), and much less so when it was handling 80 clients, when the time to load/render on screen was more or less typical.
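
For anyone wanting to set up a similar test, the siege URL file is just a plain list of html2xml requests, one per line. The host name, course, credentials, and problem path below are placeholders, and the exact parameter list may vary between WW versions, so treat this as an illustrative sketch rather than an exact copy of my test file:

# /home/tani/.siege/url-01.txt - one html2xml URL per line (all values below are placeholders)
https://wwdev.example.edu/webwork2/html2xml?courseID=daemon_course&userID=daemon&course_password=daemon&answersSubmitted=0&problemSeed=1234&displayMode=MathJax&outputformat=simple&sourceFilePath=Library/SomeDir/someProblem.pg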

This server gets quite stressed when 150 siege clients are hitting it in parallel (lots of swap activity was triggered), but it functions reasonably well with 100 siege clients (very moderate swap activity, though CPU usage would max out at times - apparently when Apache processes were being started and stopped). Average response times were a bit better with only 80 siege clients. I suspect that testing with somewhere between 180 and 200 clients would probably DoS the server due to excessive swapping and OOM issues, but I did not try it in practice.
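
For anyone repeating this kind of run, the swap and CPU behavior described above can be watched from the host during a test with standard tools; nothing WW-specific is needed, for example:

vmstat 5     # the si/so columns show swap-in/swap-out activity
top          # overall CPU usage and the number/size of Apache processes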

Note: The WW timing.log file does not seem to keep up with the load, and only a small fraction of the render calls are getting logged to it. In my case the file is on NFS, which may be hindering the logging code more than would occur on local storage, but I suspect that the WW logging code for timing.log is simply not up to handling this sort of load.

These simple tests seem to show that this server config can support up to about 2750 "renders" per minute when 100 clients are making constant streams of sequential requests, and about 2650 "renders" per minute when there are only 80 such clients (but with faster average response times). That seems to be a reasonable estimate of maximum capacity for this server instance.
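
(For reference, these per-minute figures are just the siege transaction rates scaled up: the 100-client run below reports about 46.35 transactions/sec, i.e. roughly 2780 renders per minute, and the better 80-client runs report about 44.7 transactions/sec, i.e. roughly 2680 per minute, so the round numbers above are slightly on the conservative side.)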

After some tuning of the mpm_prefork settings and Apache2::SizeLimit (see https://webwork.maa.org/moodle/mod/forum/discuss.php?d=2692#p5887 ) as follows, I got the results shown below from several "siege" runs with different settings; a consolidated sketch of this tuning appears after the list.
  • MaxRequestWorkers set to 200
  • MaxConnectionsPerChild set to 25
  • $Apache2::SizeLimit::MAX_PROCESS_SIZE = 420000;
  • $Apache2::SizeLimit::MAX_UNSHARED_SIZE = 420000;
  • $Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS = 5;
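
Pulled together, the settings above would look roughly like the snippet below. The file locations are assumptions (they depend on the distribution and on how the WW Apache configuration is organized), so treat this as a sketch of the tuning rather than the exact files on my server:

# mpm_prefork settings (e.g. in the Apache mpm_prefork configuration file)
<IfModule mpm_prefork_module>
    MaxRequestWorkers        200
    MaxConnectionsPerChild    25
</IfModule>

# Apache2::SizeLimit settings in the mod_perl / WW Apache configuration (sizes in KB)
$Apache2::SizeLimit::MAX_PROCESS_SIZE       = 420000;
$Apache2::SizeLimit::MAX_UNSHARED_SIZE      = 420000;
$Apache2::SizeLimit::CHECK_EVERY_N_REQUESTS = 5;
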
There is certainly a tradeoff between the memory growth of the Apache processes and the CPU cost of starting up new Apache workers. It is quite possible that more careful tuning could somewhat improve performance on these tests, but I'm not sure how much more effort on this is worthwhile at present. I do hope to arrange to run tests with some additional RAM/vCPU resources in the near future to see what sort of scaling/performance behavior I can observe.

(Note: I found it helpful to put the server's IP address in my /etc/hosts file to avoid DNS delays and some failures during earlier stress tests I tried with the same approach.)
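
Concretely, that just means adding a line like the following to /etc/hosts (the address and hostname here are placeholders):

192.0.2.10    wwdev.example.edu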

siege -c 150 -t120S -f /home/tani/.siege/url-01.txt

** SIEGE 4.0.4
** Preparing 150 concurrent users for battle.
The server is now under siege...
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out

Lifting the server siege...
Transactions:                   4575 hits
Availability:                  99.93 %
Elapsed time:                 119.37 secs
Data transferred:              49.18 MB
Response time:                  3.83 secs
Transaction rate:              38.33 trans/sec
Throughput:                     0.41 MB/sec
Concurrency:                  146.89
Successful transactions:        4575
Failed transactions:               3
Longest transaction:           55.34
Shortest transaction:           0.33

siege -c 100 -t360S -f /home/tani/.siege/url-01.txt

[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out


Lifting the server siege...
Transactions:                  16685 hits
Availability:                  99.98 %
Elapsed time:                 359.95 secs
Data transferred:             191.27 MB
Response time:                  2.14 secs
Transaction rate:              46.35 trans/sec
Throughput:                     0.53 MB/sec
Concurrency:                   99.37
Successful transactions:       16685
Failed transactions:               3
Longest transaction:           43.86
Shortest transaction:           0.10


siege -c 80 -t180S -f /home/tani/.siege/url-01.txt
** SIEGE 4.0.4
** Preparing 80 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out

Lifting the server siege...
Transactions:        7530 hits
Availability:       99.91 %
Elapsed time:      179.84 secs
Data transferred:       85.16 MB
Response time:        1.87 secs
Transaction rate:       41.87 trans/sec
Throughput:        0.47 MB/sec
Concurrency:       78.19
Successful transactions:        7531
Failed transactions:           7
Longest transaction:       32.54
Shortest transaction:        0.10

siege -c 80 -t180S -f /home/tani/.siege/url-01.txt
** SIEGE 4.0.4
** Preparing 80 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out

Lifting the server siege...
Transactions:        8006 hits
Availability:       99.99 %
Elapsed time:      179.30 secs
Data transferred:       90.26 MB
Response time:        1.77 secs
Transaction rate:       44.65 trans/sec
Throughput:        0.50 MB/sec
Concurrency:       79.14
Successful transactions:        8006
Failed transactions:           1
Longest transaction:       36.11
Shortest transaction:        0.10
 
tani@lxtani:~$ siege -c 80 -t180S -f /home/tani/.siege/url-01.txt
** SIEGE 4.0.4
** Preparing 80 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(30) sock.c:240: Connection timed out

Lifting the server siege...
Transactions:        8048 hits
Availability:       99.96 %
Elapsed time:      179.57 secs
Data transferred:       91.50 MB
Response time:        1.76 secs
Transaction rate:       44.82 trans/sec
Throughput:        0.51 MB/sec
Concurrency:       78.86
Successful transactions:        8048
Failed transactions:           3
Longest transaction:       34.93
Shortest transaction:        0.10


In reply to Nathan Wallach

Re: Supporting WW for very large user bases - request for information

by Joe Macias -

Hi, 

As discussed, I am attaching a copy of Rederly's architecture to show our implementation of webwork, which allows for scalability with little monitoring needed.

Web Application Details - Brief Overview:

The backend and the renderer are both stand-alone and dockerized. Each service can run on its own; for example, you can make a request to the renderer with a webwork file path and you'll receive the rendered question. No data is stored within the containers of either service, which allows us to leverage auto-scaling: we can scale these services up with minimal disruption to handle increased load on the platform, and scale them down when there isn't enough demand. Our storage lives in two places: EFS storage, which holds our webwork files/content and is mounted to every container, and our database, which stores all data involving the application.
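
As a rough illustration of the pattern described above (this is not Rederly's actual configuration; the image name, port, and paths are made up), a stateless renderer tier with shared content mounted read-only might be sketched in docker-compose form as follows, with the same idea mapping onto ECS task definitions plus an EFS volume in production:

# docker-compose sketch of a stateless renderer tier (illustrative only)
version: "3.8"
services:
  renderer:
    image: example/ww-renderer:latest            # hypothetical image name
    ports:
      - "3000"                                   # container port; host ports assigned behind the load balancer
    volumes:
      - /mnt/efs/webwork-content:/opt/webwork/content:ro   # shared problem files on EFS, read-only
    environment:
      - DB_HOST=db.example.internal              # application data lives outside the containers
    deploy:
      replicas: 4                                # scale out/in; the containers themselves hold no state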

AWS Components:

Load Balancer

  • Elastic Load Balancing automatically distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, IP addresses, and Lambda functions.

  • Here is more information: https://aws.amazon.com/elasticloadbalancing/?nc=sn&loc=1

ECS

  • Amazon Elastic Container Service (Amazon ECS) is a highly scalable, fast container management service that makes it easy to run, stop, and manage containers on a cluster.

  • Here is more information: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html

Auto-Scaling

  • AWS Auto Scaling monitors your applications and automatically adjusts capacity to maintain steady, predictable performance.

  • Here is more information: https://aws.amazon.com/autoscaling/

Relational-DB

  • Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud.

  • Here is more information: https://aws.amazon.com/rds/

Cloudfront

  • Amazon CloudFront is a fast content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to customers globally with low latency and high transfer speeds, all within a developer-friendly environment.

  • Here is more information: https://aws.amazon.com/cloudfront/

In reply to Joe Macias

Re: Supporting WW for very large user bases - request for information

by Nathan Wallach -

Joe -  

I'm really grateful that Rederly is sharing this, so the community can hear about what can be done with WeBWorK at scale. 

Could you share some basic explanation of what roles the "back-end" handles, and what is being stored in the database (as I understand it, Rederly does not use the standard WW databases at all)?

I suppose that the "back-end" is Rederly's alternative both to some of the higher-level management functions provided by the "webwork2" layer of WW in "standard" WW systems and to a "broker" that receives web requests and creates the relevant calls to the PG problem "renderer".

To what extent can a "renderer" container interact directly with external systems without such a "back-end", or with a very minimal "back-end"? The long-term motivation for the question is how best to provide an "html2xml"-like interface for embedding WW questions in web pages of various sorts, MOOC courses, electronic textbooks, etc., which can support "scale" (possibly at the expense of keeping records of answers submitted, etc.).

Would Rederly be able to share Docker files, etc. explaining what is in each image, or at least in the renderer image, and any custom code needed for the renderer Docker image?

At present I am focused on other things, so I will not be following up much on the scale issues/ideas, but I will try to participate in the discussions.