Setup of a virtualized instance of Windows 11 on a Fedora 37 host.
- CPU: Intel i7-13700K
- GPU: AMD RX6900 XT
- USB: ASM3142 USB 3 Gen 2
- Storage:
- NVMe: WD SN850X 2TB
- NVMe: SP P34A80 1TB
# lstopo-no-graphics --cpukinds
CPU kind #0 efficiency -1 cpuset 0x00ff0000
FrequencyMaxMHz = 4200
FrequencyBaseMHz = 2600
CoreType = IntelAtom
CPU kind #1 efficiency -1 cpuset 0x0000f0ff
FrequencyMaxMHz = 5300
FrequencyBaseMHz = 3400
CoreType = IntelCore
CPU kind #2 efficiency -1 cpuset 0x00000f00
FrequencyMaxMHz = 5400
FrequencyBaseMHz = 3400
CoreType = IntelCore
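The `cpuset` masks above are easier to read as CPU ranges. A minimal sketch, assuming the masks fit in a 64-bit shell integer; `mask_to_cpulist` is a hypothetical helper name, not part of hwloc:

```shell
# Convert a hex cpuset mask (as printed by lstopo) into a CPU range list.
mask_to_cpulist() {
    local mask cpu start out
    mask=$(( $1 ))            # shell arithmetic accepts the 0x prefix
    cpu=0; start=-1; out=''
    while [ "${mask}" -gt 0 ] || [ "${start}" -ge 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            # Bit set: open a new range if none is open
            if [ "${start}" -lt 0 ]; then start=${cpu}; fi
        elif [ "${start}" -ge 0 ]; then
            # Bit clear: close the open range at the previous CPU
            if [ "${start}" -eq $(( cpu - 1 )) ]; then
                out="${out}${start},"
            else
                out="${out}${start}-$(( cpu - 1 )),"
            fi
            start=-1
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    printf '%s\n' "${out%,}"
}

mask_to_cpulist 0x00ff0000   # E-cores (IntelAtom)
mask_to_cpulist 0x0000f0ff   # P-core threads, kind #1
mask_to_cpulist 0x00000f00   # P-core threads, kind #2
```

Kinds #1 and #2 together cover CPUs 0-15, which is the range handed to the VM further down; kind #0 covers 16-23.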
# lspci -nnD
0000:03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf] (rev c0)
0000:03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
0000:04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp Device [15b7:5030] (rev 01)
0000:05:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation E12 NVMe Controller [1987:5012] (rev 01)
0000:0d:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller [1b21:2142]
0000:0f:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller [1b21:2142]
- CPU allocation
  - lstopo output helps identify E-cores
    - Thanks, Intel!
  - Dedicate all P-core threads (0-15) for virtualization
    - 0-1 for emulator
    - 2-15 for vCPUs
  - Leave E-cores for housekeeping
- PCI passthrough targets:
  - GPU: 1002:73bf + 1002:ab28
  - NVMes: 15b7:5030 + 1987:5012
  - USB: 1b21:2142
- Linux Kernel >= 6.1.0
- Virtualization: libvirtd, qemu, virt-manager
- O/S tuning: oslat, tuna
- BIOS settings (+ enabled, - disabled): +Virtualization, +VT-d, -Resizable BAR
Want:
- CPU Isolation
- Hugepage-backed VM memory
Implementation:
- tuna
  - kernel thread + IRQ migration
- systemd
  - cgroups management
  - userland process affinity
# grubby --update-kernel=ALL --args="iommu=pt intel_iommu=on hugepagesz=1G hugepages=20 pci=noaer pci-stub.ids=1b21:2142,1002:73bf,1002:ab28 nohz_full=2-15"
- pci=noaer: qemu will refuse to continue once errors are received through AER
  - Errors not relevant (UnsupportedTLP, ACSViolation): mask so qemu continues
  - (Potential alternative): Tweaking AER error masks using setpci
    - Could guest reset masks?
- pci-stub.ids
  - 1b21:2142 (USB):
    - Fedora 37 has xhci_hcd built-in.
      - Binding to vfio-pci must be done in userspace
  - 1002:73bf, 1002:ab28 (GPU):
    - BIOS resizes BAR2 (doorbells) -> 256MB
      - Code 43
    - Manually resize just BAR0 -> 16GB using resourceN_resize via sysfs
      - Bind to pci-stub, unbind in userspace, resize, then bind to vfio-pci
    - Want:
      Capabilities: [200 v1] Physical Resizable BAR
      BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
      BAR 2: current size: 2MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
    - AMD software in guest still shows Resizable BAR as disabled, but device manager shows a large BAR
      - Manual resize = useless?
- no isolcpus?
  - Deprecated, use cpusets + nohz_full
# /etc/systemd/system/dev-hugepages1G.mount
[Unit]
Description=Mount and reserve 1G hugepages
[Mount]
Type=hugetlbfs
What=nodev
Where=/dev/hugepages1G
Options=pagesize=1G,min_size=20G,mode=01777
[Install]
WantedBy=local-fs.target
# systemctl enable dev-hugepages1G.mount
- Separate slice for VM processes
# /etc/systemd/system/windows11.slice
[Unit]
Description=Windows 11 VM Slice
[Slice]
AllowedCPUs=0-15
CPUAccounting=off
MemoryAccounting=off
TasksAccounting=off
IOAccounting=off
IPAccounting=off
- Sidecar tuning service
  - Unit file
# /etc/systemd/system/windows11-virt-setup.service
[Unit]
Description=System setup for Windows 11 VM
BindsTo=windows11.slice
After=windows11.slice
[Service]
ExecStart=/usr/local/bin/windows11-virt.sh setup
ExecStop=/usr/local/bin/windows11-virt.sh teardown
Type=oneshot
RemainAfterExit=true
[Install]
WantedBy=windows11.slice
BindsTo stops the service when the slice terminates.
  - Tuning script
# /usr/local/bin/windows11-virt.sh
#!/bin/bash
set -exuo pipefail
declare -a SERVICES
SERVICES=('user.slice' 'system.slice' 'init.scope')
ISOLATE='0-15'        # P-core threads reserved for the VM
HOUSEKEEPING='16-23'  # E-cores left to the host
SETUP=0
# "${1:-}" keeps `set -u` from aborting before the usage error is printed
if [[ 'setup' == "${1:-}" ]]; then
    ALLOWED="${HOUSEKEEPING}"
    SETUP=1
elif [[ 'teardown' == "${1:-}" ]]; then
    ALLOWED='0-23'
else
    echo "Unknown argument ${1:-}" >&2
    exit 1
fi
# Teardown: leave the isolated partition first, then let kernel threads and IRQs back in
if ((!SETUP)); then
    echo 'member' > '/sys/fs/cgroup/windows11.slice/cpuset.cpus.partition'
    tuna -G --cpus="${ALLOWED}" --include --no_uthreads --affect_children
fi
# Restrict (setup) or widen (teardown) userland to the allowed CPUs
for service in "${SERVICES[@]}"; do
    systemctl set-property --runtime "${service}" "AllowedCPUs=${ALLOWED}"
done
# Setup: move kernel threads and IRQs off the VM CPUs, then isolate the partition
if ((SETUP)); then
    tuna -G --cpus="${ISOLATE}" --isolate --no_uthreads --affect_children
    echo 'isolated' > '/sys/fs/cgroup/windows11.slice/cpuset.cpus.partition'
fi
- set-property --runtime to discard on reboot
  - Processes' affinity masks automatically expand to a wider set of AllowedCPUs
- cpuset.cpus.partition
  - Write isolated to disable scheduler load balancing in Cgroups v2
  - File should read back as isolated
-
- Start
# systemctl start windows11.slice
# systemctl is-active windows11-virt-setup.service
active
- Affinity check
# tuna -Q
# users affinity
0 timer 0xffffff
8 rtc0 0xff0000
9 acpi 0xff0000
14 INTC1085:00 0xff0000
16 16-fasteoi 0xff0000
17 17-fasteoi 0xff0000
18 i801_smbus 0xff0000
...
# tuna -P
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
1 OTHER 0 0xff0000 8521 6113 systemd
2 OTHER 0 0xff0000 851 22 kthreadd
3 OTHER 0 0xff0000 2 0 rcu_gp
4 OTHER 0 0xff0000 2 0 rcu_par_gp
5 OTHER 0 0xff0000 2 0 slub_flushwq
6 OTHER 0 0xff0000 2 0 netns
8 OTHER 0 0 4 0 kworker/0:0H-events_highpri
10 OTHER 0 0xff0000 2 0 mm_percpu_wq
12 OTHER 0 0xff0000 2 0 rcu_tasks_kthread
...
- Load balancing check
# cat /sys/fs/cgroup/windows11.slice/cpuset.cpus.partition
isolated
- Test with
oslat
# systemd-run --scope --slice windows11 oslat -z -C 0 -c 2-15 -D 5m
Running scope as unit: run-r4c6ae11896114cb3852220ec447a960c.scope
oslat V 2.40
Total runtime: 300 seconds
Thread priority: default
CPU list: 2-15
CPU for main thread: 0
Workload: no
Workload mem: 0 (KiB)
Preheat cores: 14
Pre-heat for 1 seconds...
Test starts...
Test completed.
Core: 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Counter Freq: 3413 3413 3424 3424 3413 3413 3413 3413 3412 3412 3412 3412 3413 3413 (Mhz)
001 (us): 22328499873 22328494667 22324714431 22324667471 22335658938 22335654502 22372294941 22372336577 22373760707 22373760682 22346725751 22346681497 22329952966 22329987288
002 (us): 1 1 1 1 1 1 1 1 1 1 1 1 1 1
003 (us): 1 1 1 1 1 3 1 1 2 1 2 2 1 1
004 (us): 2 2 1 1 2 1 1 1 1 2 0 1 1 1
...
008 (us): 1 1 1 1 1 1 1 1 1 1 1 1 1 1
...
010 (us): 0 0 0 0 0 0 0 0 0 1 0 0 0 0
011 (us): 0 0 0 0 0 0 0 0 1 0 0 1 0 0
012 (us): 1 1 0 0 0 1 0 0 0 0 0 0 0 0
...
032 (us): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (including overflows)
Minimum: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (us)
Average: 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 (us)
Maximum: 12 12 8 8 8 12 8 8 11 10 8 11 8 8 (us)
Max-Min: 11 11 7 7 7 11 7 7 10 9 7 10 7 7 (us)
Duration: 300.390 300.390 299.425 299.425 300.390 300.390 300.390 300.390 300.478 300.478 300.478 300.478 300.390 300.390 (sec)
- Do usual desktop things for 5m. Celebrate after seeing results.
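To make the pass/fail judgement repeatable, the worst per-core maximum can be pulled out of the report. A minimal sketch that only assumes the `Maximum:` line layout shown above; `oslat_worst_max` is a hypothetical helper and any acceptance threshold is left to the reader:

```shell
# Print the largest value on the "Maximum:" line of an oslat report read from stdin.
oslat_worst_max() {
    awk '$1 == "Maximum:" {
        worst = 0
        for (i = 2; i <= NF; i++)
            # Skip the trailing "(us)" unit and anything else non-numeric
            if ($i ~ /^[0-9]+$/ && $i + 0 > worst) worst = $i + 0
        print worst
    }'
}

# Example with the Maximum line from the run above
printf '%s\n' 'Maximum: 12 12 8 8 8 12 8 8 11 10 8 11 8 8 (us)' | oslat_worst_max
```

In practice the oslat output would be piped straight into the function instead of a saved line.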
- Stop slice
# systemctl stop windows11.slice
- Verify affinity masks:
# tuna -Q
# users affinity
0 timer 0xffffff
8 rtc0 0xffffff
9 acpi 0xffffff
14 INTC1085:00 0xffffff
16 16-fasteoi 0xffffff
17 17-fasteoi 0xffffff
...
# tuna -P
1 OTHER 0 0xffffff 8720 6116 systemd
2 OTHER 0 0xffffff 864 22 kthreadd
3 OTHER 0 0xffffff 2 0 rcu_gp
4 OTHER 0 0xffffff 2 0 rcu_par_gp
5 OTHER 0 0xffffff 2 0 slub_flushwq
6 OTHER 0 0xffffff 2 0 netns
...
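The affinity columns printed by tuna are hex masks, while the tuning script works with CPU lists. A minimal sketch for translating a list into the mask to expect; `cpulist_to_mask` is a hypothetical helper and assumes at most 63 CPUs:

```shell
# Convert a CPU list such as "16-23" or "0-7,12-15" into a hex affinity mask.
cpulist_to_mask() {
    local mask part lo hi cpu
    mask=0
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        lo="${part%-*}"     # for a single CPU ("5") both expansions yield "5"
        hi="${part#*-}"
        cpu="${lo}"
        while [ "${cpu}" -le "${hi}" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '0x%x\n' "${mask}"
}

cpulist_to_mask 16-23   # housekeeping set while the slice is active
cpulist_to_mask 0-23    # all CPUs after teardown
```

The two results correspond to the `0xff0000` and `0xffffff` masks in the tuna listings with the slice started and stopped.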
- Stub rebind service
  - Unit file
# /etc/systemd/system/vfio-rebind-stub@.service
[Unit]
Description=Rebind vfio-pci to stubbed devices
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/vfio-pci-rebind.sh %i
[Install]
WantedBy=default.target
  - Rebind script
# /usr/local/bin/vfio-pci-rebind.sh
#!/bin/bash
set -euo pipefail
# One array element per matching PCI address (a quoted $(...) would yield a single element)
mapfile -t ids < <(lspci -m -D -d "${1}" | cut -f 1 -d ' ')
for i in "${ids[@]}"
do
    echo "${i}" > /sys/bus/pci/drivers/pci-stub/unbind
    if [[ "${1}" == "1002:73bf" ]]
    then
        # Resizing BAR 2 to 256MB = Code 43
        #echo 8 > "/sys/bus/pci/devices/${i}/resource2_resize"
        # Resize BAR 0 to 16GB
        echo 14 > "/sys/bus/pci/devices/${i}/resource0_resize"
    fi
    echo "${i}" > /sys/bus/pci/drivers/vfio-pci/bind
done
  - Enable rebind services:
# systemctl enable vfio-rebind-stub@1002:73bf.service \
    vfio-rebind-stub@1002:ab28.service vfio-rebind-stub@1b21:2142.service
-
# lspci -Dnnk
0000:03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf] (rev c0)
Subsystem: ASRock Incorporation Device [1849:5212]
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
0000:03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
0000:04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp Device [15b7:5030] (rev 01)
Subsystem: Sandisk Corp Device [15b7:5030]
Kernel driver in use: vfio-pci
Kernel modules: nvme
0000:05:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation E12 NVMe Controller [1987:5012] (rev 01)
Subsystem: Phison Electronics Corporation E12 NVMe Controller [1987:5012]
Kernel driver in use: vfio-pci
Kernel modules: nvme
0000:0d:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller [1b21:2142]
Subsystem: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller [1b21:2142]
Kernel driver in use: vfio-pci
0000:0f:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller [1b21:2142]
Subsystem: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller [1b21:2142]
Kernel driver in use: vfio-pci
# lspci -s 03:00.0 -vvv | grep BAR
Capabilities: [200 v1] Physical Resizable BAR
BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
BAR 2: current size: 2MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
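The rebind script writes `14` for a 16GB BAR 0 and mentions `8` for a 256MB BAR 2, which is consistent with the value being log2 of the size in MB. A minimal sketch of that mapping, under that assumption; `bar_resize_value` is a hypothetical helper:

```shell
# Map a BAR size in MB to the value written to resourceN_resize,
# assuming the encoding is log2(size in MB) as the script's values suggest.
bar_resize_value() {
    local size_mb n
    size_mb="$1"; n=0
    if [ "${size_mb}" -lt 1 ] || [ $(( size_mb & (size_mb - 1) )) -ne 0 ]; then
        echo "size must be a power of two, in MB" >&2
        return 1
    fi
    while [ $(( 1 << n )) -lt "${size_mb}" ]; do
        n=$(( n + 1 ))
    done
    echo "${n}"
}

bar_resize_value 16384   # 16GB, as used for BAR 0
bar_resize_value 256     # 256MB, the BAR 2 size that led to Code 43
```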
Snippets:
- CPU pinning:
<vcpu placement="static" cpuset="2-15">14</vcpu>
<cputune>
  <vcpupin vcpu="0" cpuset="2"/>
  <vcpupin vcpu="1" cpuset="3"/>
  <vcpupin vcpu="2" cpuset="4"/>
  <vcpupin vcpu="3" cpuset="5"/>
  <vcpupin vcpu="4" cpuset="6"/>
  <vcpupin vcpu="5" cpuset="7"/>
  <vcpupin vcpu="6" cpuset="8"/>
  <vcpupin vcpu="7" cpuset="9"/>
  <vcpupin vcpu="8" cpuset="10"/>
  <vcpupin vcpu="9" cpuset="11"/>
  <vcpupin vcpu="10" cpuset="12"/>
  <vcpupin vcpu="11" cpuset="13"/>
  <vcpupin vcpu="12" cpuset="14"/>
  <vcpupin vcpu="13" cpuset="15"/>
  <emulatorpin cpuset="0-1"/>
</cputune>
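The one-to-one pinning is mechanical, so it can be generated rather than typed. A minimal sketch; `gen_vcpupin` is a hypothetical helper that maps vCPU 0 to the first host CPU and counts upward:

```shell
# Emit one <vcpupin> element per host CPU in the range first..last.
gen_vcpupin() {
    local first last vcpu cpu
    first="$1"; last="$2"
    vcpu=0; cpu="${first}"
    while [ "${cpu}" -le "${last}" ]; do
        printf '<vcpupin vcpu="%d" cpuset="%d"/>\n' "${vcpu}" "${cpu}"
        vcpu=$(( vcpu + 1 ))
        cpu=$(( cpu + 1 ))
    done
}

gen_vcpupin 2 15
```

The output can be pasted inside `<cputune>`; the `<emulatorpin>` line is still added by hand.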
- Memory
<currentMemory unit="KiB">20971520</currentMemory>
<memoryBacking>
  <hugepages>
    <page size="1048576" unit="KiB"/>
  </hugepages>
  <nosharepages/>
</memoryBacking>
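The guest memory size has to line up with the `hugepages=20` kernel argument and the `min_size=20G` mount option. A small arithmetic sketch of that check; `hugepages_needed` is a hypothetical helper and defaults to 1 GiB pages:

```shell
# Number of hugepages needed to back a guest of mem_kib KiB (rounded up).
hugepages_needed() {
    local mem_kib page_kib
    mem_kib="$1"
    page_kib="${2:-1048576}"   # default: 1 GiB pages, expressed in KiB
    echo $(( (mem_kib + page_kib - 1) / page_kib ))
}

hugepages_needed 20971520   # value of <currentMemory> above
```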
- Topology
<cpu mode="host-passthrough" check="none" migratable="off">
  <topology sockets="1" dies="1" cores="7" threads="2"/>
  <cache mode="passthrough"/>
</cpu>
- Cgroup configuration:
<resource>
  <partition>/windows11</partition>
</resource>
- Hyper-V enlightenments:
<hyperv mode="custom">
  <relaxed state="on"/>
  <vapic state="off"/>
  <spinlocks state="on" retries="8191"/>
  <vpindex state="on"/>
  <runtime state="on"/>
  <synic state="on"/>
  <stimer state="on"/>
  <reset state="on"/>
  <frequencies state="on"/>
  <tlbflush state="on"/>
  <ipi state="on"/>
</hyperv>
Libvirt in Fedora 37 does not support avic.
The VM domain was named Windows11.
# cat /etc/libvirt/hooks/qemu
#!/bin/bash
set -euxo pipefail
GUEST_NAME="${1}"
ACTION="${2}"
if [[ "Windows11" != "${GUEST_NAME}" ]]; then
echo "ignoring guest ${GUEST_NAME}"
exit 0
fi
case "${ACTION}" in
'prepare' ) systemctl start windows11.slice;;
'release' ) systemctl stop windows11.slice;;
* ) echo "unknown action ${ACTION}: ignoring"
esac
-
Persistent names for DDC i2c buses:
  - Find i2c-dev device nodes
Display 1
   I2C bus: /dev/i2c-11
   EDID synopsis:
      Mfg id: ACR
      Model: XB273U NV
      Serial number:
      Manufacture year: 2021
      EDID version: 1.3
   VCP version: 2.2
Display 2
   I2C bus: /dev/i2c-15
   EDID synopsis:
      Mfg id: LEN
      Model: LEN P24h-20
      Serial number:
      Manufacture year: 2020
      EDID version: 1.4
   VCP version: 2.2
  - Find attributes that are (hopefully) persistent
# udevadm info --attribute-walk /dev/i2c-11
  looking at device '/devices/pci0000:00/0000:00:02.0/i2c-11/i2c-dev/i2c-11':
    KERNEL=="i2c-11"
    SUBSYSTEM=="i2c-dev"
    DRIVER==""
    ATTR{name}=="i915 gmbus tc3"
    ATTR{power/control}=="auto"
    ATTR{power/runtime_active_time}=="0"
    ATTR{power/runtime_status}=="unsupported"
    ATTR{power/runtime_suspended_time}=="0"

  looking at parent device '/devices/pci0000:00/0000:00:02.0/i2c-11':
    KERNELS=="i2c-11"
    SUBSYSTEMS=="i2c"
    DRIVERS==""
    ATTRS{delete_device}=="(not readable)"
    ATTRS{name}=="i915 gmbus tc3"
    ATTRS{new_device}=="(not readable)"
Good candidate: ATTR{name}
  - Write udev rules
SUBSYSTEM=="i2c-dev", ATTR{name}=="i915 gmbus tc3", SYMLINK+="ddc-igp-hdmi", TAG+="systemd", ENV{SYSTEMD_ALIAS}="/dev/ddc-igp-hdmi"
SUBSYSTEM=="i2c-dev", ATTR{name}=="AUX USBC2/DDI TC2/PHY C", SYMLINK+="ddc-igp-dp", TAG+="systemd", ENV{SYSTEMD_ALIAS}="/dev/ddc-igp-dp"
- Find USB device to trigger on
# lsusb.py -i
usb2 1d6b:0003 09 1IF [USB 3.10, 20000 Mbps, 0mA] (xhci-hcd 0000:00:14.0) hub
 2-1 0557:2415 09 1IF [USB 3.20, 10000 Mbps, 0mA] (ATEN INTERNATIONAL Co USB3.2 Hub) hub
  2-1.2 05e3:0627 00 0IFs [USB 3.00, 5000 Mbps, ] (XXXXXX USB Storage USB Storage)
  2-1.3 05e3:0625 09 1IF [USB 3.20, 10000 Mbps, 0mA] (GenesysLogic USB3.1 Hub) hub
Watch for 05e3:0625 (second USB hub attached to switch)
- Write udev rules
SUBSYSTEM=="usb", ENV{ID_MODEL_ID}=="0625", ENV{ID_VENDOR_ID}=="05e3", TAG+="systemd"
ACTION=="remove", SUBSYSTEM=="usb", ENV{PRODUCT}=="5e3/625/*", TAG+="systemd"
- Create services to trigger display switching through DDC on USB device hotplug
  - Unit file
# /etc/systemd/system/kvm-switch-local.service
[Unit]
Description=Switch monitor inputs to local OS
# Should have used an alias here instead
# Was lazy
BindsTo=sys-devices-pci0000:00-0000:00:14.0-usb2-2\x2d1-2\x2d1.3.device
After=sys-devices-pci0000:00-0000:00:14.0-usb2-2\x2d1-2\x2d1.3.device
After=dev-ddc\x2digp\x2ddp.device
After=dev-ddc\x2digp\x2dhdmi.device
[Service]
Type=oneshot
ExecStart=/usr/local/bin/ddc-switch-input-local.sh
ExecStop=/usr/local/bin/ddc-switch-input-vm.sh
RemainAfterExit=true
[Install]
WantedBy=sys-devices-pci0000:00-0000:00:14.0-usb2-2\x2d1-2\x2d1.3.device
# systemctl enable kvm-switch-local.service
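The `.device` unit names in the unit file are escaped paths: `/` becomes `-` and a literal `-` becomes `\x2d`. The real tool for this is `systemd-escape --path --suffix=device`; the sketch below is a simplified stand-in that only handles paths made of alphanumerics, `:`, `_`, `.`, `-` and `/`, and `escape_path` is a hypothetical name:

```shell
# Simplified path-to-unit-name escaping (illustration only, not a full systemd-escape).
escape_path() {
    local path
    path="${1#/}"            # drop the leading slash
    # Escape literal dashes first, then turn path separators into dashes
    printf '%s\n' "${path}" | sed -e 's/-/\\x2d/g' -e 's|/|-|g'
}

echo "$(escape_path /dev/ddc-igp-dp).device"
echo "$(escape_path /sys/devices/pci0000:00/0000:00:14.0/usb2/2-1/2-1.3).device"
```

Both results match the unit names used in `kvm-switch-local.service`.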
  - Display switching scripts:
# cat /usr/local/bin/ddc-switch-input-local.sh
#!/bin/bash
set -euxo pipefail
# Input source values (VCP feature 0x60) that select the local OS on each monitor
hdmi_desired='x11'
dp_desired='x0f'
# Resolve the persistent symlinks back to i2c bus numbers
hdmi_devpath="$(readlink -f /dev/ddc-igp-hdmi)"
dp_devpath="$(readlink -f /dev/ddc-igp-dp)"
hdmi_busno="${hdmi_devpath#/dev/i2c-}"
dp_busno="${dp_devpath#/dev/i2c-}"
hdmi_curstate="$(ddcutil -t -b "${hdmi_busno}" getvcp 0x60 | cut -d ' ' -f 4)"
dp_curstate="$(ddcutil -t -b "${dp_busno}" getvcp 0x60 | cut -d ' ' -f 4)"
# Only switch when the monitor is not already on the desired input
if [[ $hdmi_curstate != "${hdmi_desired}" ]]
then
    ddcutil -b "${hdmi_busno}" setvcp 0x60 "0${hdmi_desired}"
fi
if [[ $dp_curstate != "${dp_desired}" ]]
then
    ddcutil -b "${dp_busno}" setvcp 0x60 "0${dp_desired}"
fi
# cat /usr/local/bin/ddc-switch-input-vm.sh
#!/bin/bash
set -euxo pipefail
# Input source values (VCP feature 0x60) that select the VM on each monitor
hdmi_desired='x12'
dp_desired='x11'
hdmi_devpath="$(readlink -f /dev/ddc-igp-hdmi)"
dp_devpath="$(readlink -f /dev/ddc-igp-dp)"
hdmi_busno="${hdmi_devpath#/dev/i2c-}"
dp_busno="${dp_devpath#/dev/i2c-}"
hdmi_curstate="$(ddcutil -t -b "${hdmi_busno}" getvcp 0x60 | cut -d ' ' -f 4)"
dp_curstate="$(ddcutil -t -b "${dp_busno}" getvcp 0x60 | cut -d ' ' -f 4)"
if [[ $hdmi_curstate != "${hdmi_desired}" ]]
then
    ddcutil -b "${hdmi_busno}" setvcp 0x60 "0${hdmi_desired}"
fi
if [[ $dp_curstate != "${dp_desired}" ]]
then
    ddcutil -b "${dp_busno}" setvcp 0x60 "0${dp_desired}"
fi
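Both scripts take the fourth space-separated field of the terse `getvcp` output as the current input. A minimal sketch of that parsing step in isolation; the sample line `VCP 60 SNC x0f` is an assumed example of the terse format, and `current_input` is a hypothetical helper:

```shell
# Extract the current input value (e.g. "x0f") from terse getvcp output on stdin.
current_input() {
    awk '$1 == "VCP" && $2 == "60" { print $4 }'
}

# Assumed sample of `ddcutil -t getvcp 0x60` output
printf '%s\n' 'VCP 60 SNC x0f' | current_input
```

Matching on the `VCP 60` prefix, rather than cutting the field blindly, yields an empty result for unexpected output instead of an arbitrary token.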
-
- Isolated + NOHZ_FULL vCPUs not undergoing automatic frequency scaling.