The project I worked on this summer was divided into two major parts. First, I optimized how the in-kernel WireGuard implementation parallelizes its cryptography. My goals were to improve overall speed, increase multi-core scalability, and fix some hard-to-reproduce race conditions. I spent the second half of the summer writing an Android application that serves as a convenient GUI frontend for WireGuard. My vision was for the app to be simple, yet very functional.
The work I did for this part of the project is available in the sh/queues-5-for-jason branch of the official repository.
We were unsatisfied with the existing padata-based implementation because of several of its limitations: a hard maximum on the number of outstanding work items, conflicts with softirq processing during work functions, reliance on a timer to work around race conditions, and a still-mysterious race condition that caused list/memory corruption and had been affecting one of my machines.
The design uses a pair of queues (really, sets of queues) to enable parallel encryption of packet data, while ensuring that packets are always kept in order. One set of queues is per peer. Packets are added to this queue as soon as they are received from userspace, and they are kept in this queue until they are passed to the underlying physical network interface.
Once added to the per-peer queue, packets are distributed round-robin among the per-CPU queues to be encrypted. These per-CPU queues are shared among all peers, since the CPU-intensive encryption process can work on packets in any order. Having shared queues for the "slow" part also prevents one peer from starving any others out of CPU time.
Once encryption is finished, the packet is marked with a flag (CTX_FINISHED) allowing it to be transmitted. The per-peer transmission function takes the run of marked packets at the front of the queue (stopping at the first packet that is still being encrypted, which is what preserves ordering), dequeues them, and sends them out together. An equivalent process happens on the receiving side for decryption and forwarding packets to userspace.
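The ordering guarantee described above can be sketched in C as a single-threaded simulation. The struct names, the fixed-size array standing in for the per-peer queue, and the global round-robin cursor are hypothetical, not WireGuard's actual data structures; in the kernel the state flag is set concurrently from other CPUs and would need atomic accesses. The sketch shows packets completing out of order while the transmit step only ever releases the finished prefix of the queue.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CPUS 4
#define MAX_PACKETS 16

/* Packet states; CTX_FINISHED mirrors the flag named in the text. */
enum { CTX_NEW = 0, CTX_FINISHED = 1 };

/* Hypothetical packet context. */
struct packet {
    int seq;   /* order in which userspace handed us the packet */
    int state; /* CTX_NEW until encryption completes (atomic in real code) */
};

/* Hypothetical per-peer queue: keeps packets in their original order. */
struct peer_queue {
    struct packet *items[MAX_PACKETS];
    size_t head; /* next packet to transmit */
    size_t tail; /* next free slot */
};

/* Hypothetical round-robin cursor over the per-CPU encryption queues. */
static int next_cpu;

/* Append a packet to the peer's ordered queue and return the CPU whose
 * encryption queue it should also be placed on. */
static int enqueue_packet(struct peer_queue *q, struct packet *p)
{
    assert(q->tail < MAX_PACKETS);
    q->items[q->tail++] = p;
    int cpu = next_cpu;
    next_cpu = (next_cpu + 1) % NUM_CPUS;
    return cpu;
}

/* Called by whichever CPU finished encrypting the packet, in any order. */
static void finish_packet(struct packet *p)
{
    p->state = CTX_FINISHED;
}

/* Dequeue and "send" the run of finished packets at the front of the queue,
 * stopping at the first unfinished one so that packets never leave out of
 * order. Writes sequence numbers to sent[] and returns how many were sent. */
static size_t transmit_ready(struct peer_queue *q, int *sent, size_t max_sent)
{
    size_t n = 0;
    while (n < max_sent && q->head < q->tail &&
           q->items[q->head]->state == CTX_FINISHED)
        sent[n++] = q->items[q->head++]->seq;
    return n;
}
```

For example, if packets 2 and 3 finish before packet 0, nothing is sent until packet 0 completes; then only packet 0 goes out, and packets 1, 2, and 3 follow together once packet 1 is done.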
Because both queues are single-reader, multiple-writer, I was able to implement them with simple single-cmpxchg loops (found in src/queue.h) instead of full spinlocks. This helps maximize the multi-core scalability of the data pipeline. Additionally, due to better integration with other parts of the WireGuard codebase, the three-stage send pipeline is reduced to two stages in the vast majority of cases; a separate step associating packets with a keypair only needs to happen when waiting for a handshake to complete.
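A single-cmpxchg, multiple-writer, single-reader queue can be sketched with C11 atomics. This is a generic illustration of the technique, not the code in src/queue.h: writers push onto an intrusive list with one compare-and-swap loop, and the single reader detaches the whole list with one atomic exchange and reverses it into FIFO order. The names (struct node, mpsc_push, mpsc_drain) are hypothetical. Because only the reader removes nodes, and it always removes all of them at once, the ABA problem that plagues lock-free pops does not arise.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical intrusive node; a packet would embed one of these. */
struct node {
    struct node *next;
    int value;
};

/* Multiple writers push; exactly one reader drains. */
struct mpsc_queue {
    _Atomic(struct node *) head; /* most recently pushed node, or NULL */
};

static void mpsc_init(struct mpsc_queue *q)
{
    atomic_init(&q->head, NULL);
}

/* Writer side: a single compare-and-swap loop, no lock. Safe to call
 * concurrently from any number of threads. The release ordering publishes
 * n->next (and the node's payload) to the reader. */
static void mpsc_push(struct mpsc_queue *q, struct node *n)
{
    assert(n != NULL);
    struct node *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        n->next = old; /* retried with the fresh head if the CAS fails */
    } while (!atomic_compare_exchange_weak_explicit(&q->head, &old, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Reader side: only one thread may call this. Detach every pushed node with
 * one atomic exchange, then reverse the list so nodes come out in push order. */
static struct node *mpsc_drain(struct mpsc_queue *q)
{
    struct node *list = atomic_exchange_explicit(&q->head, NULL,
                                                 memory_order_acquire);
    struct node *fifo = NULL;
    while (list) {
        struct node *next = list->next;
        list->next = fifo;
        fifo = list;
        list = next;
    }
    return fifo;
}
```

Draining everything at once keeps the sketch short; a queue whose reader must peek at the oldest entry without removing it, as the per-peer transmit step does, needs a different layout.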
Since I was able to remove the dependency on padata, my total contribution is a net reduction of about 850 lines of code. Excluding changes to the compatibility layer, however, I traded a 1000-line dependency for under a hundred lines in WireGuard itself, while at the same time improving performance. Here is the diffstat for WireGuard itself:
```text
 src/config.c  |   2 +-
 src/data.c    | 432 ++++++++++++++++++++++++++++++----------------------------
 src/device.c  |  46 +++----
 src/device.h  |  12 +-
 src/main.c    |  12 +-
 src/packets.h |  21 +--
 src/peer.c    |  28 +++-
 src/peer.h    |   9 +-
 src/queue.h   | 139 +++++++++++++++++++
 src/receive.c |   5 +-
 src/send.c    |  96 ++-----------
 src/timers.c  |   4 +-
 12 files changed, 445 insertions(+), 361 deletions(-)
```
Due to a busy schedule at the end of the summer, Jason, the WireGuard maintainer, did not have an opportunity to fully review and merge my changes before the end of GSoC. However, the changes should be merged soon. Remaining work on the kernel side includes:
- Integration of the BQL (Byte Queue Limits) queue management system. BQL is a library that helps prevent queues from becoming too large by dynamically resizing them based on observed latency. In WireGuard, it would track the total number of bytes held in its queues on all cores. It also allows integration with fq_codel to ensure bufferbloat does not become a problem.
  - We have already started work on this, but wanted to wait until the main changes were merged before putting too much time into it.
- Removal of data.c. All of its functions can and should be moved to send.c or receive.c, now that the infrastructure for working with padata has been removed.
  - Again, I have already done this, but moving code between files complicates rebasing, so it is best left until the main changes are merged.
- Some form of GRO (Generic Receive Offload). Right now, groups of packets up to 64 KiB long are encrypted together when sent in one send(2) system call from userspace. This minimizes queue lengths and the setup/teardown costs of the cryptography functions. However, packets are received one by one, so the receive side has more overhead and much longer queue lengths. GRO would let us group packets for the same peer as they are received.
  - Unfortunately, the normal GRO path does not preserve the original packet length, which WireGuard needs in order to find the end of the encrypted data. Thus, using it for WireGuard would involve "stealing" packets from the networking subsystem and keeping our own list per keypair.
- Evaluate other methods of dividing encryption/decryption work among CPUs, besides round robin.
Some lessons from the kernel work:
- All queues must have size limits, especially when the readers and writers compete for CPU time, because unbounded growth can easily consume all available RAM and bring down the system. I spent a while investigating an out-of-memory condition that I thought was caused by RCU grace periods, but it turned out to be caused by the accidental removal of a queue length limit.
- Multithreaded programming in kernelspace is surprisingly easy, considering the three levels of preemption and the rules around them. Lockups are also quite easy to cause, and the kernel's built-in debugging tools are integral to finding and fixing them.
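The queue-limit lesson and the BQL idea of counting bytes held across all cores can be combined into a sketch: one shared byte counter that every per-CPU queue reserves from before enqueueing. This is a simplification under stated assumptions: the limit is fixed (BQL instead adjusts it at runtime, which is not reproduced here), and the names QUEUE_BYTE_LIMIT, queue_reserve, and queue_release are hypothetical. The reservation uses a compare-and-swap loop, so the total can never exceed the limit even when several CPUs race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed budget; BQL would adjust a limit like this at runtime. */
#define QUEUE_BYTE_LIMIT (1024u * 1024u)

/* One counter shared by every per-CPU queue, so the limit covers all cores.
 * Relaxed ordering suffices: it is only accounting, and the queue itself
 * publishes the packet data. */
static atomic_size_t queued_bytes = 0;

/* Reserve room for a packet before enqueueing it. Returns false if the
 * packet would push the total over the limit; the caller should drop it. */
static bool queue_reserve(size_t len)
{
    size_t cur = atomic_load_explicit(&queued_bytes, memory_order_relaxed);
    do {
        if (len > QUEUE_BYTE_LIMIT - cur) /* cur <= limit, so no underflow */
            return false;
    } while (!atomic_compare_exchange_weak_explicit(&queued_bytes, &cur,
                                                    cur + len,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return true;
}

/* Give the bytes back once the packet has left the queue (sent or freed). */
static void queue_release(size_t len)
{
    size_t prev = atomic_fetch_sub_explicit(&queued_bytes, len,
                                            memory_order_relaxed);
    assert(prev >= len); /* releasing more than was reserved is a bug */
}
```

A producer that gets false from queue_reserve drops the packet instead of enqueueing it, so a stalled consumer costs at most QUEUE_BYTE_LIMIT bytes of memory rather than all available RAM.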
Source code for the app is available in the wireguard-android repository.
WireGuard was designed from the start for seamless roaming, and it does not need to transmit any packets to maintain a "connection" when not in use. Both of these traits make it a great fit for mobile VPN use, but it was not yet integrated with any mobile OS. While the WireGuard kernel module would have to be ported to individual Android devices, another GSoC student wrote a userspace implementation this summer, which will work on any device and function without root access.
In its current state, the app allows creating, editing, enabling, disabling, renaming, and deleting WireGuard VPN configurations. All of the attributes used by the kernel module (through wg(8)) as well as by wg-quick(8) are supported, with the exception of FwMark, as it is used internally by Android.
Interfaces can be enabled or disabled from within the app; multiple configurations can be enabled simultaneously, as long as they use different ports for the underlying UDP socket (ListenPort).
Additionally, on devices running Android 7.0 Nougat or newer, one configuration can be controlled via a custom Quick Settings tile.
The app supports all devices running Android 5.0 Lollipop or newer, though the WireGuard module must be available, wg-quick must be installed, and the app must have root access so it can run wg-quick. The app uses material design, and supports both phone and tablet layouts. Keypairs can be generated from within the app, and the public key can be easily exported to the device's clipboard.
The app was developed and tested on the OnePlus 3/3T running SultanXDA's build of LineageOS 14.1. It should work on other phones and firmwares where root access and the WireGuard module are available; the WireGuard module is compatible with kernels back to Linux 3.10. While the app is functional, there is still a lot of work left to do:
- Allow importing a configuration from a file or QR code. This would replace the "add" button with a floating action menu. This is already implemented in the fab branch, but has not been merged yet due to licensing questions about which FAB implementation to use.
  - Allow comments in config files. This is only relevant for imported configurations.
- Validate more config attributes (IPs, endpoints, etc.). Currently, only the interface name and private key are fully checked (though other fields have character type or length restrictions).
- Add an optional notification when a config is enabled. This would be especially useful on older devices where the quick settings tile is not available.
- Use a switch to toggle configs within the app. This requires implementing a custom View based on the Switch, because it is not possible to hook into the existing switch to programmatically control its state when touched.
- More robust state checking and error reporting. Currently, the app does not provide feedback when enabling a configuration fails, beyond it not transitioning to the "enabled" state. This can happen when the UDP port WireGuard listens on is already in use, or when a configuration attribute contains bad syntax.
- Show runtime status (uptime, transfer stats) on the config detail screen. Basically, copy the relevant parts of what wg(8) shows for a running interface.
- Support calling wg(8) directly (instead of wg-quick(8)), as well as the userspace Go implementation.
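The interface-name and private-key checks mentioned in the validation item above can be sketched as follows. The app itself is written in Java and its exact rules are not spelled out here, so this C version is an illustration under assumptions: it accepts interface names of 1 to 15 characters from [a-zA-Z0-9_=+.-] (the pattern wg-quick(8) applies to configuration file names; 15 is the Linux interface-name limit), and keys in WireGuard's usual text form of 32 bytes encoded as 44 base64 characters ending in '='. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* 32-byte keys encode to 43 base64 digits plus one '=' of padding. */
#define KEY_B64_LEN 44

/* Map a base64 character to its 6-bit value, or -1 if it is not base64. */
static int b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Interface names: 1 to 15 characters from [a-zA-Z0-9_=+.-]. */
static bool valid_interface_name(const char *name)
{
    assert(name != NULL);
    size_t len = strlen(name);
    if (len < 1 || len > 15)
        return false;
    for (size_t i = 0; i < len; i++) {
        char c = name[i];
        bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                  (c >= '0' && c <= '9') || strchr("_=+.-", c) != NULL;
        if (!ok)
            return false;
    }
    return true;
}

/* Keys: 43 base64 digits followed by '='. 256 bits = 42 * 6 + 4, so the last
 * digit carries only 4 key bits and its two low bits must be zero in a
 * canonical encoding. */
static bool valid_key(const char *b64)
{
    assert(b64 != NULL);
    if (strlen(b64) != KEY_B64_LEN || b64[KEY_B64_LEN - 1] != '=')
        return false;
    for (int i = 0; i < KEY_B64_LEN - 1; i++) {
        if (b64_value(b64[i]) < 0)
            return false;
    }
    return (b64_value(b64[KEY_B64_LEN - 2]) & 3) == 0;
}
```

Checking the final digit's low bits rejects strings that would decode to the right length but could never have been produced by encoding 32 bytes.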
Some lessons from working on the app:
- The Android VpnService framework does not work at all with kernel-space VPNs. I spent the better part of a week crawling through layers upon layers of AOSP code, and there's simply no way around using a TUN device with it. Fortunately, this is fine for the userspace implementation, and the kernelspace implementation would have required root access anyway.
- Fragments and "responsive" layouts are hard. I rewrote the state machine for the main Activity at least five times, each time thinking "I finally got it working in all cases" before finding some sequence of actions that was broken due to my app's view of its state not matching the Android system's view.
- Don't put an EditText inside of a ListView. Ever. If you don't trust me, ask any of the many poor souls who have tried it. The recommended solution is actually to reimplement most of the ListView yourself. Fortunately, this was not much more difficult than writing the ListAdapter I had already written to integrate a ListView with data binding.