Skip to content

log-y/mini-http-server

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mini-http-server

This is a simple HTTP server that handles multiple requests concurrently using a thread-safe worker queue, include testing and benchmarking showing performance gains.

Overview

Popular open-source tools like Nginx leverage concurrency for large throughput gains at scale. This is because if multiple requests arrive at the server simultaneously and they ask for different data, they should be answered concurrently and not serialized. I built a simple toy server to explore this topic that takes in GET requests, simulates some amount of work, then returns a default "hello world" string. Under the hood, I used a thread-safe worker queue to keep track of requests, a thread pool of workers, and a socket request handler that reads in requests on an open port.

Architecture

Queue

This is thread-safe worker queue. It's a wrapper around the C++ queue object, using a mutex and a lock_guard to ensure multiple threads can't write on the queue at once. This will prevent race conditions, where two threads try to process the first request of the queue at the same nanosecond.

I use a condition_variable to sleep or wake workers depending on the emptiness of the queue. If a worker tries to pop() the empty queue, it will be put to sleep until the queue is non-empty again. I wake other workers using notify_one() in push(), so when a new request is read in, we wake one worker to handle it.

Httpserver

This is responsible for network and request processing. HttpServer opens a port, listens for requests using a socket, and manages the thread logic.

There is one main thread, accept_loop, which runs a while(true) to listen to any requests inbound on our expected port. Once I get a request, I get its data and assign it a file descriptor. Then, I add it to our queue (note this operation is thread-safe, so other threads can't pull from our queue as I write to it. This would cause memory corruption if there's only one element in the queue).

The worker threads process the requests. They pull from the queue, answer the request, then either continue working or sleep depending on the emptiness of the queue. If the queue is not empty, they will continue working because their condition_variable is activated.

I use handle_request to simulate some dummy CPU work and to give a default network response of "hello world". I simulate the dummy CPU work by running one "for loop" for n iterations, which I will be calling "number of operations" for the rest of this report.

Testing

I used Apache Benchmark (ab) to test this. I ran ./server with a specified number of threads and then used ab to send a large number of requests at a certain req/sec to my endpoint. I varied the amount of CPU work, requests per second, and threads to explore how my program could handle different workloads.

Observations and Report

By using more worker threads, I found significant speedups in CPU-intensive and high throughput workloads. For example, when I used 1000 concurrent connections and 5 million operations in a request, I found that my 16-threaded server ran 12.7x faster than my single-threaded server. This is because in high throughput scenarios that use lots of CPU, more workers can process requests in parallel more efficiently.

It is very important to note that if the CPU work was either not intensive or if there were not many concurrent connections (or both), the speedups were either very minimal or nonexistent. Sometimes, the multi-threaded architecture actually was slower than the baseline in these scenarios.

This is because if the CPU work is minimal, each worker will process the request very quickly. Then, the worker would immediately check the queue for more work, lock the queue, then answer the request. If there are not that many inbound requests per second, the queue will NOT grow faster than it shrinks and we actually introduce MORE compute by giving the OS more threads and locking to handle. For example, if we send one request per second and the request is handled in 1 nanosecond, we spend almost the entire second waiting for the next request and the threads will be slept. When the next request is added to the queue, the OS needs to wake up a thread, run it, then sleep it again.

  • This is why cases testing just 5 operations per request and 1 concurrent connection are slower in multi-threaded architectures than single-threaded ones. For example, in answering 10,000 simple requests under 1 concurrent connection, 16 threads takes 0.121 ms/req while 1 thread is actually slightly faster at 0.118 ms/req.

The opposite is true for high compute, high throughput workloads. Under one thread, if tasks are being added to the queue at a faster rate than they are being completed, the queue will grow larger and we will 'fall behind', taking a longer average time to answer all requests. For example, if we add 1000 difficult CPU-intensive tasks to the queue per second and can only answer one task per second, the queue will grow very long and we will not be able to answer them at a reasonble rate. However, if we add multiple threads that could all answer one CPU-intensive task per second, this could be up to (number_of_threads) times faster. Two cases can happen here:

  1. We do not fall behind: If we get 8 new requests per second and we can answer 16 per second using 16 workers, our worker queue will stay empty and once requests stop coming in, we will finish almost instantly. This is very efficient.
  2. We do fall behind: This case still dramatically outperforms the benchmark because after requests stop coming in, we will still be processing the excess work in parallel. The condition variable ensures that as long as there is stuff on the queue, we will wake up workers to answer requests. The critical optimization is that our condition variable runs processing when the queue is non-empty, so the however many workers we had at our peak will still be processing until the entire queue is done, making this significantly more performant at handling high throughput, CPU-intensive tasks.

Here are a few specific cases and tests that illustrate this point.

  • Case 1: Answering 10k total requests @ 20 concurrent connections, each request costing 5mil operations. 20 threads took 6.4ms / req on average. 1 thread took 118ms / req on average. This is a speedup of 18.4x.
  • Case 2: Answering 1k total requests @ 1,000 concurrent connections, each request costing 5mil operations. 16 threads took 392ms / req on average. 1 thread took 7.986 seconds / req. This is a 20.37x speedup.
  • Case 3: Answering 10k total requests @ 1,000 current connections, each request costing 5mil operations. 20 threads took 340ms / req on average. 1 thread took 8.332 seconds / req. This is a 24.5x speedup.
  • Case 4: Answering 100k total requests @ 1,000 current connections, each request costing 5mil operations. 20 threads took 405ms / req on average. 1 thread took 12.608 seconds. This is a 31.13x speedup.

Case 1 illustrates optimal parallel processing. We get nearly 20x faster using 20 threads, which makes a lot of sense. Because there's only 20 connections, the requests do not build up on the queue, letting us handle these requests in parallel. Although the overhead introduced by locking and handling threads will prevent this case from ever being exactly 20x, we can still get pretty close. Note that in cases 2-4, the speedup multiplier is greater than the number of threads, which feels impossible. If we have 20 threads and perfectly parallelize our program, it feels like we can only get a 20x improvement. But as you can see, we had some speedups of 25x and 30x. Observing that our bigger speedups are from a higher total volume of requests, we can infer that our 25x and 30x speedups are because of lower performance of the single-threaded server at scale (in addition to the positive performance of our multi-threaded server).

  • Speculation: I think cache locality can best explain this difference. In a single thread trying to process a queue of thousands, we will see very many cache misses as it walks down the queue and reads in requests it hasn't 'touched' in many CPU cycles. The 20-threaded server might not have to deal with this. This is because the queue will be around 20 times shallower (or even shallower, if we process faster than the queue builds) due to parallel processing. This increases cache hits as well, as fresh nodes that have just been added might be quickly processed by a worker thread while it's still 'hot' in memory. Thus, our single-threaded instance with lots of requests will compound inefficiency. Our single thread in our 1-threaded architecture might not process at the same rate of any thread in our multi-threaded architecture -- it will process slower. And compound this effect over 20 different threads on 20 cores, we can start to see how we have achieved this 30x increase in speed. We can formally check this using 'perf stat' and looking at the cache hits/misses it finds in the two different cases but I am having trouble running it on my WSL system, so this is just speculation (for now).

Conclusion

Under this new implementation of a simple HTTP server using a worker pool, we have seen remarkable speedup increases in high throughput, CPU-intensive requests. In cases that do not have high throughput or CPU intensity, we have seen small performance losses. However, in the real world where there can rapid unpredictable bursts in traffic, this project shows the value of a multi-threaded architecture.

Compile and run

g++ main.cpp httpserver.cpp safequeue.cpp -pthread -o server

./server

ab -n 300000 -c 100 http://127.0.0.1:8080/

About

miniature http server for basic GET requests, leveraging concurrency to handle multiple requests simultaneously

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages