Greetings to my fellow nerds. A month ago, I had zero network programming experience, so I decided to fix that by building an epoll-based HTTP server from scratch and benchmarking every major architectural change along the way.
Performance Benchmarks
Benchmark command:
wrk -t4 -c10000 -d10s http://127.0.0.1:8080/
Request: GET /index.html
Response: Static HTML file (~1500 bytes)
CPU: Intel i5-13420H (13th Gen)
Compiler: Clang (O3)
| Architecture | Throughput (req/sec) | Description |
|---|---|---|
| Blocking | ~15k | Single-threaded blocking accept/read/write |
| Epoll (LT) | ~34k | Single-threaded event loop using non-blocking I/O multiplexing |
| Epoll (LT, keep-alive) | ~37.5k | Single-threaded event loop with persistent connections |
| Epoll (LT, keep-alive, sendfile) | ~41k | Single-threaded event loop with persistent connections and zero-copy file serving |
| Epoll (LT, keep-alive, sendfile, multithreading) | ~125k | Multithreaded architecture running 4 concurrent epoll loops (optimal on test machine) |
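For anyone who hasn't written one, the core of a level-triggered loop looks roughly like this. It's a minimal sketch and not code from the repo: the helper names (`setNonBlocking`, `watchReadable`, `drain`, `pollOnce`) are made up for illustration, and a real server would keep one buffer per connection instead of the single `out` string used here.

```cpp
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical helper: switch fd to non-blocking mode.
bool setNonBlocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return false;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Hypothetical helper: register fd for readability.
// Level-triggered is epoll's default; EPOLLET is deliberately not set.
bool watchReadable(int epfd, int fd) {
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Read everything currently available and stop at EAGAIN.
// Returns false when the peer closed or a hard error occurred.
bool drain(int fd, std::string& out) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { out.append(buf, static_cast<std::size_t>(n)); continue; }
        if (n == 0) return false;          // peer closed the connection
        if (errno == EINTR) continue;      // interrupted, retry
        return errno == EAGAIN || errno == EWOULDBLOCK;
    }
}

// One iteration of the event loop: wait up to timeoutMs, then drain
// every ready fd. Returns what epoll_wait returned.
int pollOnce(int epfd, int timeoutMs, std::string& out) {
    assert(epfd >= 0);
    epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, timeoutMs);
    for (int i = 0; i < n; ++i) {
        int fd = events[i].data.fd;
        if (!drain(fd, out)) close(fd);    // closing also drops it from the epoll set
    }
    return n;
}
```

A server would call `pollOnce` in a `while (running)` loop, with the listening socket registered the same way and `accept` called when it becomes readable.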
Some Surprising Observations
Sendfile mattered less than I expected. For a server whose entire purpose is to serve files, I was expecting a bigger gain than ~37.5k to ~41k req/sec (roughly 9%), but my file was only ~1.5 KB, so there probably wasn't much copying left to save.
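For reference, a zero-copy send path can be sketched like this. It's an illustration under assumptions, not the repo's code: `sendFileRange` and `sendWholeFile` are hypothetical names, and the sketch simply stops on `EAGAIN` and leaves resuming to the caller.

```cpp
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>   // sockFd is expected to be a connected socket
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical helper: push up to `remaining` bytes of fileFd into sockFd,
// starting at `offset` (the kernel advances it as bytes go out).
// Returns the bytes sent by this call, or -1 on a hard error. A short count
// means the socket would block or the file ended; resume later from `offset`.
ssize_t sendFileRange(int sockFd, int fileFd, off_t& offset, std::size_t remaining) {
    assert(sockFd >= 0 && fileFd >= 0);
    std::size_t sent = 0;
    while (sent < remaining) {
        ssize_t n = sendfile(sockFd, fileFd, &offset, remaining - sent);
        if (n > 0) { sent += static_cast<std::size_t>(n); continue; }
        if (n == 0) break;                                   // end of file
        if (errno == EINTR) continue;                        // interrupted, retry
        if (errno == EAGAIN || errno == EWOULDBLOCK) break;  // socket buffer full
        return -1;
    }
    return static_cast<ssize_t>(sent);
}

// Hypothetical helper: the full per-request path (open, fstat, sendfile, close).
ssize_t sendWholeFile(int sockFd, const char* path) {
    int fileFd = open(path, O_RDONLY);
    if (fileFd < 0) return -1;
    struct stat st{};
    ssize_t result = -1;
    if (fstat(fileFd, &st) == 0) {
        off_t offset = 0;
        result = sendFileRange(sockFd, fileFd, offset,
                               static_cast<std::size_t>(st.st_size));
    }
    close(fileFd);
    return result;
}
```

Written this way, every request still pays for `open`, `fstat`, and `close` no matter how small the file is, so for a ~1.5 KB page the copy that `sendfile` avoids is only a small part of the per-request work.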
More threads made things worse once I went past 4 workers:
| Worker Threads | Throughput (req/sec) |
|---|---|
| 1 | ~40k |
| 2 | ~95k |
| 3 | ~115k |
| 4 | ~125k |
| 5 | ~90k |
| 6 | ~90k |
| 8 | ~75k |
| 10 | ~70k |
| 12 | ~65k |
My CPU has 8 physical cores (4 performance + 4 efficiency) and 12 logical processors. I suspect that the cost of the syscalls in every loop, context switching, and lock contention on shared kernel objects dominated at higher thread counts, though I haven't fully investigated it yet.
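One way to cut down on sharing between workers is to give each thread its own listening socket. The sketch below is an assumption on that front, not a description of how the repo distributes connections: it uses `SO_REUSEPORT` so several sockets can bind the same port, and `makeReusePortListener` and `boundPort` are hypothetical helper names.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: create a non-blocking listening socket on
// 127.0.0.1:port with SO_REUSEPORT set, so other sockets created the same
// way can bind the same port. Returns the fd, or -1 on failure.
int makeReusePortListener(std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0) return -1;

    int one = 1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    // SO_REUSEPORT must be set before bind() on every socket sharing the port.
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one) != 0 ||
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) != 0 ||
        listen(fd, SOMAXCONN) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: report which port a listener ended up on
// (useful when port 0 was requested). Returns 0 on failure.
std::uint16_t boundPort(int fd) {
    assert(fd >= 0);
    sockaddr_in addr{};
    socklen_t len = sizeof addr;
    if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) return 0;
    return ntohs(addr.sin_port);
}
```

Each worker thread would call `makeReusePortListener(8080)`, add the returned fd to its own epoll instance, and accept only from that fd, letting the kernel spread incoming connections across the listeners. Whether that changes the numbers in the table above is something I can't say without measuring.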
Profiling with perf
| Function | Approx. CPU Samples |
|---|---|
| readSock() | ~22% |
| writeSock() | ~16% |
| parse() | ~8% |
| std::format() | ~7% |
| open() | ~3% |
| sendfile() | ~2.5% |
Turns out I'm still spending more time reading and parsing requests than sending responses, meaning there might still be room for batched reads or buffer pooling in a future iteration...
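The buffer pooling idea could look something like this. It's a minimal sketch of a free-list pool, assuming one pool per worker thread (it is not thread-safe) and fixed-size read buffers; `BufferPool` is a hypothetical class, not something that exists in the repo.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical per-thread pool of fixed-size buffers. Buffers handed back
// with release() are reused by later acquire() calls instead of being
// freed and reallocated for every connection.
class BufferPool {
public:
    explicit BufferPool(std::size_t bufferSize) : bufferSize_(bufferSize) {}

    // Reuse a pooled buffer if one is available, otherwise allocate.
    std::unique_ptr<char[]> acquire() {
        if (free_.empty()) {
            ++allocations_;
            return std::make_unique<char[]>(bufferSize_);
        }
        std::unique_ptr<char[]> buf = std::move(free_.back());
        free_.pop_back();
        return buf;
    }

    // Return a buffer to the pool, e.g. when its connection closes.
    void release(std::unique_ptr<char[]> buf) {
        assert(buf != nullptr);
        free_.push_back(std::move(buf));
    }

    std::size_t bufferSize() const { return bufferSize_; }
    std::size_t allocations() const { return allocations_; }  // heap allocations so far

private:
    std::size_t bufferSize_;
    std::size_t allocations_ = 0;
    std::vector<std::unique_ptr<char[]>> free_;
};
```

A connection would `acquire()` a buffer when it is accepted and `release()` it on close, so a steady stream of short-lived connections keeps cycling through the same few buffers.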
Final Thoughts
I could hunt for possible micro-optimizations or even experiment with an edge-triggered architecture, but I'm kinda burnt out at this point and this feels like a great point to end this project... The codebase is pretty small (~1k LOC), so if anyone's interested in taking a look: GitHub Repository