Hardware hotplug events on Linux, the gory details
tl;dr: skip to the "udev packet format" section.
One day, I suddenly wondered how to detect when a USB device is plugged into or unplugged from a computer running Linux. For most users, this would be solved by relying on libusb. However, the use case I was investigating might not actually want to do so, and so this led me down a poorly-documented rabbit hole.
udev
By browsing the libusb source code, we can see that there are two hotplug backends: linux_netlink.c and linux_udev.c. What is the difference?
If we pull up the original commit introducing those files, the commit description reads:
Add hotplug support to the Linux backend.
There are two ways to configure hotplug support for Linux: udev, and netlink. It is strongly recommened that udev support is used on systems that utilize udev. We reenforce this recommendation by defaulting to --with-udev=yes at configure time. To enable netlink support run configure with --with-udev=no. If udev support is enabled all device enumeration is done with udev.
I've certainly encountered udev before (usually in the context of changing permissions of USB devices so that I can access them without being root), but I suppose it's time to look deeper into what it actually does.
Fortunately, Free Electrons / Bootlin has the history well-covered. TL;DR, the kernel uses netlink to tell udev about devices, udev does its necessary handling of them, and then udev re-broadcasts these events to every other program interested in them.
The reason libusb so strongly recommends using the udev mechanism is to avoid race conditions. For example, udev might be in the process of changing Unix permissions, uploading firmware, or mode-switching USB devices.
But… how do we listen for these rebroadcasted events? Can we do it without linking in libudev? What IPC mechanisms are actually in use here? It turns out that udev and libudev have long since been folded into systemd while I wasn't looking. We're going to have to dive into the code and have a look.
netlink
Before continuing further, a brief overview of netlink may be helpful. Netlink is a Linux-specific "network protocol" that uses the BSD sockets API, usually to communicate between the kernel and userspace. It is particularly suitable for the kernel sending notifications to userspace (unlike syscalls, which need to be initiated by userspace).
Netlink passes datagrams (like UDP) but can also pass ancillary data (like Unix domain sockets). Netlink also supports a somewhat-limited multicast capability, where many programs can receive events sent by one source.
Example code
At this point, it might be easier to have some code to reference:
#define _GNU_SOURCE
#include <ctype.h>
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <poll.h>
#include <arpa/inet.h>
#include <sys/socket.h>
// Define netlink data types, so that we don't need linux-specific header files to compile this
#define NETLINK_KOBJECT_UEVENT 15
#define MONITOR_GROUP_KERNEL 1
#define MONITOR_GROUP_UDEV 2
struct sockaddr_nl {
sa_family_t nl_family;
unsigned short nl_pad;
uint32_t nl_pid;
uint32_t nl_groups;
};
// MurmurHash2 (copy-pasta)
uint32_t MurmurHash2 ( const void * key, int len, uint32_t seed ) {
// 'm' and 'r' are mixing constants generated offline.
// They're not really 'magic', they just happen to work well.
const uint32_t m = 0x5bd1e995;
const int r = 24;
// Initialize the hash to a 'random' value
uint32_t h = seed ^ len;
// Mix 4 bytes at a time into the hash
const unsigned char * data = (const unsigned char *)key;
while (len >= 4) {
// memcpy avoids an unaligned, aliasing-unsafe load through a cast pointer
uint32_t k;
memcpy(&k, data, sizeof(k));
k *= m;
k ^= k >> r;
k *= m;
h *= m;
h ^= k;
data += 4;
len -= 4;
}
// Handle the last few bytes of the input array
switch(len) {
case 3: h ^= data[2] << 16; /* fall through */
case 2: h ^= data[1] << 8; /* fall through */
case 1: h ^= data[0]; /* fall through */
h *= m;
};
// Do a few final mixes of the hash to ensure the last few
// bytes are well-incorporated.
h ^= h >> 13;
h *= m;
h ^= h >> 15;
return h;
}
// Helper function to print a hexdump (mostly not used)
void hexdump(unsigned char *x, size_t sz) {
size_t i;
for (i = 0; i < (sz + 15) / 16 * 16; i += 16) {
// header
printf("%08zx:\t", i);
// hex
for (size_t j = 0; j < 16; j++) {
if (i + j >= sz) {
printf(" ");
} else {
printf("%02x ", x[i + j]);
}
}
// ascii
printf("\t");
for (size_t j = 0; j < 16; j++) {
if (i + j >= sz) {
putchar(' ');
} else {
unsigned char c = x[i + j];
if (isprint(c))
putchar(c);
else
putchar('.');
}
}
// newline
printf("\n");
}
}
// Print a line of * characters
void print_stars() {
for (int _ = 0; _ < 80; _++)
putchar('*');
putchar('\n');
}
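void print_kern_uevent_pkt(void *buf, size_t bufsz) {
// The buffer is a set of null-terminated strings. Bound each string by the
// remaining size so that a missing terminator cannot cause an over-read.
const char *p = buf;
while (bufsz) {
size_t this_sz = strnlen(p, bufsz);
printf("%.*s\n", (int)this_sz, p);
// The final string had no terminator, so there is nothing left to skip
if (this_sz == bufsz)
break;
p += this_sz + 1;
bufsz -= this_sz + 1;
}
}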
struct udev_packet_header {
// contains "libudev" with a null terminator
char libudev_magic[8];
// contains 0xfeedcafe (big-endian)
uint32_t magic;
// size of this header. *native* endian
uint32_t header_sz;
// offset to the (null-terminated strings) properties. *native* endian
uint32_t properties_off;
// size of the properties. *native* endian
uint32_t properties_len;
// hashes etc for filtering. big-endian
uint32_t subsystem_hash;
uint32_t devtype_hash;
uint32_t tag_bloom_hi;
uint32_t tag_bloom_lo;
};
char *startswith(char *buf, size_t sz, const char *thing) {
size_t thinglen = strlen(thing);
if (sz < thinglen)
return 0;
if (!memcmp(buf, thing, thinglen))
return buf + thinglen;
return 0;
}
void print_udev_pkt(void *buf, size_t bufsz) {
struct udev_packet_header hdr;
if (bufsz < sizeof(hdr)) {
printf("Invalid packet!\n");
hexdump(buf, bufsz);
return;
}
memcpy(&hdr, buf, sizeof(hdr));
// Munge header endianness
hdr.magic = ntohl(hdr.magic);
hdr.subsystem_hash = ntohl(hdr.subsystem_hash);
hdr.devtype_hash = ntohl(hdr.devtype_hash);
hdr.tag_bloom_hi = ntohl(hdr.tag_bloom_hi);
hdr.tag_bloom_lo = ntohl(hdr.tag_bloom_lo);
if (memcmp(hdr.libudev_magic, "libudev", 8) || hdr.magic != 0xfeedcafe) {
printf("Invalid packet magic!\n");
hexdump(buf, bufsz);
return;
}
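// Reject headers whose property block does not fit inside the packet
if (hdr.properties_off > bufsz || hdr.properties_len > bufsz - hdr.properties_off) {
printf("Invalid property bounds!\n");
hexdump(buf, bufsz);
return;
}
print_kern_uevent_pkt((char *)buf + hdr.properties_off, hdr.properties_len);
// Compute udev hashes
uint32_t actual_subsystem_hash = 0;
uint32_t actual_devtype_hash = 0;
uint64_t actual_bloom_filter = 0;
uint32_t propsz = hdr.properties_len;
char *prop = (char *)buf + hdr.properties_off;
while (propsz) {
size_t proplen = strnlen(prop, propsz);
// Stop at a final string that lacks its null terminator
if (proplen == propsz)
break;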
// Special properties
char *val;
if ((val = startswith(prop, propsz, "SUBSYSTEM=")))
actual_subsystem_hash = MurmurHash2(val, strlen(val), 0);
else if ((val = startswith(prop, propsz, "DEVTYPE=")))
actual_devtype_hash = MurmurHash2(val, strlen(val), 0);
else if ((val = startswith(prop, propsz, "TAGS="))) {
// The value looks like ":tag1:tag2:"; skip the leading ':'
if (*val == ':')
val++;
while (*val) {
// Find the ':' separator; tolerate a final tag without one
char *sep = strchr(val, ':');
size_t tagentsz = sep ? (size_t)(sep - val) : strlen(val);
uint32_t taghash = MurmurHash2(val, tagentsz, 0);
// Bloom filter impl
const uint32_t mask = 0b111111;
actual_bloom_filter |= 1ULL << ((taghash >> 0) & mask);
actual_bloom_filter |= 1ULL << ((taghash >> 6) & mask);
actual_bloom_filter |= 1ULL << ((taghash >> 12) & mask);
actual_bloom_filter |= 1ULL << ((taghash >> 18) & mask);
val += tagentsz;
if (*val == ':')
val++;
}
}
prop += proplen + 1;
propsz -= proplen + 1;
}
// Print out the udev special fields
printf("\n");
if (hdr.subsystem_hash != actual_subsystem_hash)
printf("Subsystem hash expected %08x actual %08x\n", hdr.subsystem_hash, actual_subsystem_hash);
else
printf("Subsystem hash %08x\n", hdr.subsystem_hash);
if (hdr.devtype_hash != actual_devtype_hash)
printf("DevType hash expected %08x actual %08x\n", hdr.devtype_hash, actual_devtype_hash);
else
printf("DevType hash %08x\n", hdr.devtype_hash);
if (hdr.tag_bloom_hi != actual_bloom_filter >> 32 || hdr.tag_bloom_lo != (actual_bloom_filter & 0xffffffff))
printf("Bloom filter expected %08x%08x actual %016llx\n", hdr.tag_bloom_hi, hdr.tag_bloom_lo, (unsigned long long)actual_bloom_filter);
else
printf("Bloom filter %08x%08x\n", hdr.tag_bloom_hi, hdr.tag_bloom_lo);
}
int main(int argc, char **argv) {
void *buf = 0;
size_t buf_sz = 0;
if (argc < 2) {
printf("Usage: %s kernel|udev\n", argv[0]);
return 1;
}
int udev_mode = !strcmp(argv[1], "udev");
if (!udev_mode)
printf("Listening to kernel events...\n");
else
printf("Listening to udev events...\n");
// Open netlink socket
int nlsock = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_KOBJECT_UEVENT);
if (nlsock == -1) {
perror("socket(AF_NETLINK)");
return -1;
}
printf("netlink fd %d\n", nlsock);
// Enable reading creds
int one = 1;
if (setsockopt(nlsock, SOL_SOCKET, SO_PASSCRED, &one, sizeof(one)) == -1) {
perror("setsockopt(SO_PASSCRED)");
return -1;
}
// Bind netlink socket to events we're interested in
struct sockaddr_nl sa_nl;
memset(&sa_nl, 0, sizeof(sa_nl));
sa_nl.nl_family = AF_NETLINK;
if (!udev_mode)
sa_nl.nl_groups = MONITOR_GROUP_KERNEL;
else
sa_nl.nl_groups = MONITOR_GROUP_UDEV;
if (bind(nlsock, (struct sockaddr *)&sa_nl, sizeof(sa_nl)) == -1) {
perror("bind");
return -1;
}
while (1) {
// Wait for a packet, and peek how big it is
ssize_t pkt_sz = recv(nlsock, 0, 0, MSG_PEEK | MSG_TRUNC);
if (pkt_sz == -1) {
perror("recv(peeking)");
return -1;
}
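if ((size_t)pkt_sz > buf_sz) {
// Keep the old buffer if realloc fails, and bail out cleanly
void *new_buf = realloc(buf, pkt_sz);
if (!new_buf) {
perror("realloc");
return -1;
}
buf = new_buf;
buf_sz = pkt_sz;
}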
// Actually receive the packet
union {
struct cmsghdr cmsg_hdr;
uint8_t buf[CMSG_SPACE(sizeof(struct ucred))];
} aux;
struct iovec iov = {
.iov_base = buf,
.iov_len = pkt_sz,
};
struct msghdr msg = {
.msg_name = 0,
.msg_namelen = 0,
.msg_iov = &iov,
.msg_iovlen = 1,
.msg_control = &aux,
.msg_controllen = sizeof(aux),
.msg_flags = 0,
};
if (recvmsg(nlsock, &msg, 0) != pkt_sz) {
perror("recvmsg");
return -1;
}
if (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC)) {
printf("truncated message!\n");
return -1;
}
// Print packet
print_stars();
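// Find the SCM_CREDENTIALS control message rather than assuming its position
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_CREDENTIALS) {
printf("message without credentials!\n");
continue;
}
struct ucred creds;
memcpy(&creds, CMSG_DATA(cmsg), sizeof(creds));
printf("pid %d uid %u gid %u\n\n", (int)creds.pid, (unsigned)creds.uid, (unsigned)creds.gid);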
// It may be important to check what uid messages come from, but we don't bother here
if (!udev_mode)
print_kern_uevent_pkt(buf, pkt_sz);
else
print_udev_pkt(buf, pkt_sz);
}
return 0;
}
Listening to kernel events
To listen to the events that the kernel normally sends to udev, we need to create an AF_NETLINK socket with protocol NETLINK_KOBJECT_UEVENT (the protocol argument is usually left as 0 for TCP and UDP sockets, but we do need to specify it here):
int nlsock = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_KOBJECT_UEVENT);
We then need to specify which netlink multicast groups we want to listen to, using the bind syscall:
struct sockaddr_nl sa_nl;
memset(&sa_nl, 0, sizeof(sa_nl));
sa_nl.nl_family = AF_NETLINK;
sa_nl.nl_groups = MONITOR_GROUP_KERNEL;
bind(nlsock, (struct sockaddr *)&sa_nl, sizeof(sa_nl));
In this case, we want group MONITOR_GROUP_KERNEL which is 1. This is hardcoded in the kernel.
And… that's it! At this point, recv, recvmsg, and similar syscalls can be used to obtain data. The example code above performs some extra work to dynamically resize buffers and to receive Unix credentials, but we can ignore all of that for now.
Kernel event messages
Messages from the kernel consist of a list of null-terminated strings. The following is an example (there are no newlines in the message, but they have been added for readability):
add@/devices/pci0000:00/0000:00:08.1/0000:04:00.3/dwc3.1.auto/xhci-hcd.2.auto/usb4/4-1/4-1.4␀
ACTION=add␀
DEVPATH=/devices/pci0000:00/0000:00:08.1/0000:04:00.3/dwc3.1.auto/xhci-hcd.2.auto/usb4/4-1/4-1.4␀
SUBSYSTEM=usb␀
MAJOR=189␀
MINOR=386␀
DEVNAME=bus/usb/004/003␀
DEVTYPE=usb_device␀
PRODUCT=b95/1790/200␀
TYPE=0/0/0␀
BUSNUM=004␀
DEVNUM=003␀
SEQNUM=7176␀
The first line consists of an "action", @, and a device path under sysfs (i.e. normally at /sys/devices/pci…). The rest of the lines contain key-value pairs that depend on the individual drivers and subsystems in the kernel. udev is then expected to match this information with rules it knows about in order to set up the new device.
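Since the format is just consecutive null-terminated strings, pulling a single field out of a kernel message takes very little code. The following is a minimal sketch of how that could look. The helper names uevent_get and uevent_header_devpath are my own inventions for illustration and are not part of the example program or of libudev.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: look up KEY in a uevent buffer made of consecutive
// null-terminated "KEY=value" strings. Returns a pointer to the value inside
// buf, or NULL if the key is absent or the buffer is malformed.
const char *uevent_get(const char *buf, size_t len, const char *key) {
    size_t keylen = strlen(key);
    while (len) {
        // Bound every string by the remaining length, so a missing
        // terminator at the end of the buffer cannot cause an over-read.
        const char *end = memchr(buf, '\0', len);
        if (!end)
            return NULL;
        size_t slen = (size_t)(end - buf);
        if (slen > keylen && buf[keylen] == '=' && !memcmp(buf, key, keylen))
            return buf + keylen + 1;
        buf += slen + 1;
        len -= slen + 1;
    }
    return NULL;
}

// Hypothetical helper: return the sysfs device path from the leading
// "action@devpath" string, or NULL if that string has no '@' or is not
// null-terminated within len bytes.
const char *uevent_header_devpath(const char *buf, size_t len) {
    const char *end = memchr(buf, '\0', len);
    if (!end)
        return NULL;
    const char *at = memchr(buf, '@', (size_t)(end - buf));
    return at ? at + 1 : NULL;
}
```

With the example message above, uevent_get(buf, pkt_sz, "SUBSYSTEM") would return a pointer to "usb", and a key that does not occur, such as "ID_MODEL", would yield NULL.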
Listening to udev rebroadcasted events
Recall that kernel events were not the thing we were actually interested in. We wanted udev's version of the events. Browsing through libudev's source code, we can see that udev events are also broadcast using netlink. Even though netlink is often used to communicate with the kernel, NETLINK_KOBJECT_UEVENT allows for userspace-to-userspace communication. We just have to change the multicast group in our sample program to MONITOR_GROUP_UDEV which is 2.
If we hexdump the messages we receive, they look like this:
00000000: 6c 69 62 75 64 65 76 00 fe ed ca fe 28 00 00 00 libudev.....(...
00000010: 28 00 00 00 2f 02 00 00 a9 30 e9 67 00 00 00 00 (.../....0.g....
00000020: 02 08 20 08 00 40 10 09 55 44 45 56 5f 44 41 54 .. ..@..UDEV_DAT
00000030: 41 42 41 53 45 5f 56 45 52 53 49 4f 4e 3d 31 00 ABASE_VERSION=1.
00000040: 41 43 54 49 4f 4e 3d 61 64 64 00 44 45 56 50 41 ACTION=add.DEVPA
00000050: 54 48 3d 2f 64 65 76 69 63 65 73 2f 70 63 69 30 TH=/devices/pci0
…
In addition to the key-value strings, there is now a binary header.
udev packet format
This is the section you probably came here for.
udev's packet format is versioned, and the version in common use for at least the past 10-15 years has been version 0xfeedcafe. Searches of GitHub also show a version 0xcafe1dea, but it's not clear when the transition between the two happened. There does not seem to be any effort at backwards or forwards compatibility.
udev's packet format is defined in its source code. The following is my own equivalent version:
struct udev_packet_header {
// contains "libudev" with a null terminator
char libudev_magic[8];
// contains 0xfeedcafe (big-endian)
uint32_t magic;
// size of this header. *native* endian
uint32_t header_sz;
// offset to the (null-terminated strings) properties. *native* endian
uint32_t properties_off;
// size of the properties. *native* endian
uint32_t properties_len;
// hashes etc for filtering. big-endian
uint32_t subsystem_hash;
uint32_t devtype_hash;
uint32_t tag_bloom_hi;
uint32_t tag_bloom_lo;
};
A number of fields in this header use the native endianness of the udev process. This might cause issues with cross-endianness qemu-user processes, but I haven't personally tested this. There does not appear to be any explicit provision for handling this situation, but header_sz can be used to sniff for the appropriate endianness.
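As a rough illustration of the sniffing idea, the sketch below looks at the four raw bytes of header_sz (offset 12 in the packet) and checks which byte order yields a plausible header size. It assumes the sender's header is exactly 40 bytes, which matches both the struct above and the 0x28 seen in the hexdump. The function and enum names are hypothetical, and I have not tested this against a real cross-endian setup.

```c
#include <stdint.h>

// Size of struct udev_packet_header for the 0xfeedcafe format: 8 + 8 * 4 bytes.
#define UDEV_HEADER_SIZE 40u

enum udev_sender_endian {
    UDEV_ENDIAN_UNKNOWN,
    UDEV_ENDIAN_LITTLE,
    UDEV_ENDIAN_BIG,
};

// Hypothetical helper: guess the sender's byte order from the raw bytes of
// header_sz. Assumes the sender's header is exactly UDEV_HEADER_SIZE bytes.
enum udev_sender_endian sniff_sender_endian(const uint8_t raw[4]) {
    // Decode the same four bytes both ways, independent of host byte order
    uint32_t le = (uint32_t)raw[0] | (uint32_t)raw[1] << 8 |
                  (uint32_t)raw[2] << 16 | (uint32_t)raw[3] << 24;
    uint32_t be = (uint32_t)raw[3] | (uint32_t)raw[2] << 8 |
                  (uint32_t)raw[1] << 16 | (uint32_t)raw[0] << 24;
    if (le == UDEV_HEADER_SIZE)
        return UDEV_ENDIAN_LITTLE;
    if (be == UDEV_HEADER_SIZE)
        return UDEV_ENDIAN_BIG;
    return UDEV_ENDIAN_UNKNOWN;
}
```

For the hexdump shown earlier, the bytes at offset 12 are 28 00 00 00, so this would report a little-endian sender.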
header_sz is not used by libudev, only properties_off. In practice, these two fields contain the same value.
udev transmits several hashes in order to allow message receivers to use BPF for filtering. This avoids the kernel unnecessarily waking up uninterested processes, which could potentially save performance or power. This is not done by the demo program above.
subsystem_hash is a MurmurHash2 hash of the value of the SUBSYSTEM= key. If the key isn't present, the value is 0.
devtype_hash is a MurmurHash2 hash of the value of the DEVTYPE= key. If the key isn't present, the value is 0.
tag_bloom_hi and tag_bloom_lo form a 64-bit Bloom filter of the entries in the TAGS= key (which are normally separated by : characters). If no tags are present, the value is 0. A Bloom filter is a special data structure based on hashes which can either return "this element is certainly not in the set" or "this element might be in the set, but this could also just be a false positive". In this case, it allows BPF to preemptively filter out events that definitely do not contain the proper tags.
The Bloom filter uses 4 small hashes, where each small hash takes a different slice of bits from the MurmurHash2 of the tag. This is shown in the sample code above:
const uint32_t mask = 0b111111;
bloom_filter |= 1ULL << ((taghash >> 0) & mask);
bloom_filter |= 1ULL << ((taghash >> 6) & mask);
bloom_filter |= 1ULL << ((taghash >> 12) & mask);
bloom_filter |= 1ULL << ((taghash >> 18) & mask);
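The receiving side of the Bloom filter is the mirror image: compute the same four bits for the tag you care about, and check that all of them are set. The following is a small sketch of that check in plain C rather than BPF. The function names are hypothetical, and taghash is assumed to be the MurmurHash2 (seed 0) of the tag, as in the sample code.

```c
#include <stdbool.h>
#include <stdint.h>

// Compute the four Bloom filter bits for one tag hash, using the same bit
// slices as the snippet above.
uint64_t tag_bloom_bits(uint32_t taghash) {
    const uint32_t mask = 0x3f;  // 6 bits select one of 64 bit positions
    uint64_t bits = 0;
    bits |= 1ULL << ((taghash >> 0) & mask);
    bits |= 1ULL << ((taghash >> 6) & mask);
    bits |= 1ULL << ((taghash >> 12) & mask);
    bits |= 1ULL << ((taghash >> 18) & mask);
    return bits;
}

// Hypothetical helper: false means the tag is definitely absent; true means
// it may be present (or this is a false positive). bloom_hi and bloom_lo are
// the header fields after conversion to host byte order.
bool tag_maybe_present(uint32_t bloom_hi, uint32_t bloom_lo, uint32_t taghash) {
    uint64_t filter = (uint64_t)bloom_hi << 32 | bloom_lo;
    uint64_t bits = tag_bloom_bits(taghash);
    return (filter & bits) == bits;
}
```

Because several slices can land on the same bit, a tag may set fewer than four bits; the check still works since it only requires that every bit the tag would set is present.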
"Security Considerations" (RFC 6919)
These udev netlink messages are supposed to be sent with the credentials (process ID, user ID, group ID) of the sending process. libudev will not accept messages without this. The SO_PASSCRED option is used to enable receiving credentials, as described in the man page for Unix domain sockets on Linux.
Kernel messages are sent with these values all 0. This is important for udev to check in order to avoid taking action due to spoofed messages.
Messages from udev to generic userspace programs are also expected to come from either uid 0 or else something relating to user namespaces which I don't fully understand. I also don't understand why it is necessary for random programs to check this, since netlink normally only allows uid 0 to send messages.
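For a program that does want to check the sender, the rules described in this section could be sketched roughly as follows. This is a hypothetical policy function of my own, not libudev's actual logic: it treats kernel messages as trusted only when pid, uid, and gid are all 0, treats udev messages as trusted only when they come from uid 0, and deliberately leaves out the user namespace handling that I don't fully understand.

```c
#include <stdbool.h>
#include <sys/types.h>

// Hypothetical policy check based on the rules described above.
// from_kernel_group is true when the socket was bound to the kernel
// multicast group, and false when it was bound to the udev group.
bool uevent_sender_trusted(bool from_kernel_group, pid_t pid, uid_t uid, gid_t gid) {
    if (from_kernel_group) {
        // Kernel uevents are sent with all three credentials set to 0
        return pid == 0 && uid == 0 && gid == 0;
    }
    // udev rebroadcasts are expected to come from uid 0
    return uid == 0;
}
```

In the demo program this could be called with creds.pid, creds.uid, and creds.gid right after the credentials are printed, skipping any packet for which it returns false.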