HP Z4 G4 not booting after update to AlmaLinux 8.10

1. HP Z4 G4 not booting after update to AlmaLinux 8.10

A HP Z4 G4 using a Xeon(R) W-2123 CPU was updated from AlmaLinux 8.9 to AlmaLinux 8.10, which updated the Kernel version. Following the update the boot hung. After the GRUB menu timeout, when it went to boot the default Kernel, just got a black screen with a non-flashing text cursor at the top left of the screen. Didn't respond to the keyboard, so had to hold the power button to force a power off. Hung in the same way in multiple boot attempts.

Secure boot is enabled.

2. Initial work-around to boot previous Kernel by default

As an initial work-around, selected the previous Kernel version in the GRUB menu and, once it had booted, used grubby to change the default Kernel to the previous version.

This is the previous version which still boots when changed back to the default:

$ sudo grubby --info=DEFAULT
index=1
kernel="/boot/vmlinuz-4.18.0-513.24.1.el8_9.x86_64"
args="ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet $tuned_params intel_iommu=on"
root="/dev/mapper/almalinux_skylake--alma-root"
initrd="/boot/initramfs-4.18.0-513.24.1.el8_9.x86_64.img $tuned_initrd"
title="AlmaLinux (4.18.0-513.24.1.el8_9.x86_64) 8.9 (Midnight Oncilla)"
id="5de40fe84eb94d6da4f1b208114750bb-4.18.0-513.24.1.el8_9.x86_64"

And this is the updated version which hangs during boot on the HP Z4 G4:

$ sudo grubby --info=0
index=0
kernel="/boot/vmlinuz-4.18.0-553.el8_10.x86_64"
args="ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet $tuned_params"
root="/dev/mapper/almalinux_skylake--alma-root"
initrd="/boot/initramfs-4.18.0-553.el8_10.x86_64.img $tuned_initrd"
title="AlmaLinux (4.18.0-553.el8_10.x86_64) 8.10 (Cerulean Leopard)"
id="5de40fe84eb94d6da4f1b208114750bb-4.18.0-553.el8_10.x86_64"
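The change of default Kernel described in this section can be sketched as a small shell helper. This is a minimal sketch rather than the exact commands used at the time: the function names kernel_path and set_default_kernel are made up for illustration, and it assumes Kernel images follow the /boot/vmlinuz-<version> naming seen in the grubby output. Selecting by path rather than by index avoids depending on the menu order, which changes as Kernels are installed and removed.

```shell
# Hypothetical helper: map a Kernel version string to its image path,
# assuming the /boot/vmlinuz-<version> naming shown by grubby.
kernel_path() {
    printf '/boot/vmlinuz-%s\n' "$1"
}

# Hypothetical helper: make the given Kernel version the default boot entry.
# grubby --set-default takes the path of the Kernel image.
set_default_kernel() {
    sudo grubby --set-default="$(kernel_path "$1")"
}

# Example (needs root, so left commented out):
# set_default_kernel 4.18.0-513.24.1.el8_9.x86_64
```

Running sudo grubby --info=DEFAULT afterwards, as above, confirms which entry is now the default.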

The thread "How to completely remove a recent kernel installaton?" on the Rocky Linux forum reports the same issue, in terms of both the Kernel version which hangs during boot and the previous version which boots. No solution was offered in that forum thread.

3. Same Kernel version boots OK on a different PC

A different PC, with an i5-2310 CPU, was also updated from AlmaLinux 8.9 to 8.10. On this other PC the same Kernel boots OK:

$ sudo grubby --info=DEFAULT
[sudo] password for mr_halfword: 
index=3
kernel="/boot/vmlinuz-4.18.0-553.el8_10.x86_64"
args="ro crashkernel=auto resume=/dev/mapper/almalinux-swap rd.lvm.lv=almalinux/root rd.lvm.lv=almalinux/swap rhgb quiet $tuned_params"
root="/dev/mapper/almalinux-root"
initrd="/boot/initramfs-4.18.0-553.el8_10.x86_64.img $tuned_initrd"
title="AlmaLinux (4.18.0-553.el8_10.x86_64) 8.10 (Cerulean Leopard)"
id="9d2f49d60f7e49e0b06aaed533ddfcd8-4.18.0-553.el8_10.x86_64"

This PC doesn't have secure boot enabled, as not supported in the BIOS.

4. Haven't managed to get much debug output

With the problematic Kernel version, used GRUB to select it and edited the Kernel command line to try and get more debug output.

Removed the rhgb and quiet options. Outputs the following before the boot hangs:

EFI stub: UEFI Secure Boot is enabled.

Adding the following didn't produce more output:

  • debug
  • ignore_loglevel
  • initcall_debug
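The command line edits in this section were made by hand in the GRUB editor. As a sketch of the same transformation, the following shell function drops rhgb and quiet from a command line and appends the three debug options listed above. The function name debug_cmdline is made up for illustration, and the sketch assumes the arguments contain no shell wildcard characters.

```shell
# Hypothetical helper: remove rhgb/quiet from a Kernel command line and
# append the debug options tried in this section.
debug_cmdline() {
    out=""
    # Unquoted on purpose: split the command line into words
    for word in $1; do
        case "$word" in
            rhgb|quiet) ;;                       # drop the options which hide output
            *) out="${out:+$out }$word" ;;       # keep everything else in order
        esac
    done
    printf '%s\n' "$out debug ignore_loglevel initcall_debug"
}

# Example:
# debug_cmdline 'ro rhgb quiet intel_iommu=on'
```

The result is only a string to paste into the GRUB editor for a one-off boot; nothing is written to the boot loader configuration.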

5. BIOS update didn't fix problem

The current BIOS version is 02.91. A new version 02.92 is available. Updated the BIOS using the BIOS option which downloads and flashes the updated BIOS from the HP website.

The BIOS update didn't change the failure mode.

6. Disabling secure boot didn't fix the problem

Disabled secure boot, but the problematic Kernel still hung.

With secure boot disabled, the difference in behaviour was that, with the rhgb and quiet options, it was back to just a blank screen when the boot hung. Therefore, re-enabled secure boot.
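Once a Kernel has booted, the secure boot state can be confirmed from Linux rather than from the EFI stub message. The following is only a sketch: it assumes the mokutil package is installed and that the first line printed by mokutil --sb-state is "SecureBoot enabled" or "SecureBoot disabled". The function name secure_boot_state is made up for illustration.

```shell
# Hypothetical helper: classify the text printed by "mokutil --sb-state".
secure_boot_state() {
    case "$1" in
        "SecureBoot enabled"*)  echo "enabled" ;;
        "SecureBoot disabled"*) echo "disabled" ;;
        *)                      echo "unknown" ;;
    esac
}

# Example on a booted system (assumes mokutil is installed):
# secure_boot_state "$(mokutil --sb-state)"
```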

7. Try different System Security options

The text "UEFI Secure Boot is enabled." appears in the xen_efi_get_secureboot function in arch/x86/xen/efi.c.

Tried changing the System Security options in the BIOS:

  1. Disabled Intel Trusted Execution Technology (TXT). This didn't change the behaviour.
  2. Also disabled Intel Virtualization Technology (VTx). This didn't change the behaviour.
  3. Also disabled Intel Virtualization Technology for Directed I/O (VTd). For the Kernel which boots, this disabled use of the IOMMU. This didn't change the behaviour of the problematic Kernel.

As a result, restored the original System Security options.

8. Also fails to boot following a further Kernel update

Following an update to Kernel 4.18.0-553.5.1.el8_10.x86_64, that later Kernel version also:

  • Causes the HP Z4 G4 to hang during boot in the same way.
  • Boots successfully on the different PC with an i5-2310 CPU.

Set the default Kernel back to the working version:

[mr_halfword@skylake-alma ~]$ sudo grubby --set-default-index=2
The default is /boot/loader/entries/5de40fe84eb94d6da4f1b208114750bb-4.18.0-513.24.1.el8_9.x86_64.conf with index 2 and kernel /boot/vmlinuz-4.18.0-513.24.1.el8_9.x86_64
[mr_halfword@skylake-alma ~]$ sudo grubby --info=DEFAULT
index=2
kernel="/boot/vmlinuz-4.18.0-513.24.1.el8_9.x86_64"
args="ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet $tuned_params intel_iommu=on"
root="/dev/mapper/almalinux_skylake--alma-root"
initrd="/boot/initramfs-4.18.0-513.24.1.el8_9.x86_64.img $tuned_initrd"
title="AlmaLinux (4.18.0-513.24.1.el8_9.x86_64) 8.9 (Midnight Oncilla)"
id="5de40fe84eb94d6da4f1b208114750bb-4.18.0-513.24.1.el8_9.x86_64"

9. Prevented HP Z4 G4 from booting by disabling all PCIe and M.2 slots

Entered the BIOS and disabled all the PCIe and M.2 slots. The theory was that if the newer Kernel then booted, could enable the slots one at a time to try and identify a PCIe device causing the hang during boot.

However, the PC no longer boots. The power LED flashes 3 long blinks in Red and 3 short blinks in White. According to "Table 7-3 Interpreting POST diagnostic front panel lights and audible codes" in the Maintenance and Service Guide that means:

Category | Major/minor code | Description
Hardware | 3.3 | The embedded controller has timed out waiting for BIOS to return from graphics initialization

10. Recovering from the power up failure caused by disabling all PCIe and M.2 slots

Pressed the CMOS button on the motherboard to reset the CMOS settings. To get access to the button had to remove the Intel X722 network card in PCIe slot 5. Took a few attempts for the CMOS settings to be cleared. Not sure whether that was due to not pressing the button for long enough, or not having the power removed for long enough.

Got a physical presence prompt on the monitor to confirm secure boot has been disabled.

The PC power cycled itself several times before reporting "AMT Global Status. 974 AMT global status does not match Bios-setup." This is due to an earlier issue on this HP Z4 G4 where the Intel AMT on the PC was bricked. In the BIOS under Advanced -> Remote Management Options, disabled Intel Management Engine to avoid the AMT error at boot.

The reset of the CMOS settings caused the M.2 SSD with Windows to be the boot device, so the GRUB menu didn't appear.

Selected the boot menu in the BIOS and selected the HDD with AlmaLinux installed. That repaired the boot menu, and the GRUB menu re-appeared at boot.

After this CMOS settings reset, and the minimum further changes to allow it to boot, Kernel 4.18.0-553.5.1.el8_10.x86_64 still hung at boot.

11. Disabling all PCIe and M.2 slots except the NVIDIA graphics card didn't help

Used the BIOS to disable all PCIe and M.2 slots, except the PCIe slot 2 with the NVIDIA Quadro P2000. Kernel 4.18.0-553.5.1.el8_10.x86_64 still hung during boot.

11.1. BIOS Headless Boot option

When the HP BIOS Configuration Utility (BCU) for Windows, BiosConfigUtility64.exe, gets all the BIOS options it reports the parameter:

Headless Boot
	*Disable
	Enable

Tried enabling it, but no other options were exposed about where to redirect the BIOS output to.

To try and boot without a graphics card fitted:

  1. Need some way of re-directing GRUB to a serial console, to be able to change the selected Kernel.
  2. Need some way of re-directing the Kernel output during boot to a serial console, to view any diagnostics.
  3. If disabling the PCIe slot containing the graphics card, rather than removing the graphics card, need the HP BIOS BCU for Linux installed to be able to change the BIOS settings to re-enable the PCIe slot for the graphics card at the end of the test.

Therefore, changed Headless Boot back to Disable and didn't try and boot with it enabled.
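For item 1 in the list above, GRUB itself can be pointed at a serial port from /etc/default/grub. The fragment below is an untested sketch, not something tried on this PC. It assumes the serial port is the first UART of the PCIe serial card at I/O port 0x20c0 (the port reported for ttyS0 later in these notes), and that the GRUB serial driver works with a PCIe UART at that address, which hasn't been verified.

```shell
# Fragment for /etc/default/grub (untested on this PC).
# Show the GRUB menu on both the serial port and the normal console.
GRUB_TERMINAL="serial console"
# Assumption: 0x20c0 is the I/O port of the first UART on the PCIe serial
# card. A motherboard COM port would use --unit=0 instead of --port.
GRUB_SERIAL_COMMAND="serial --port=0x20c0 --speed=115200"
```

The grub.cfg file would need regenerating with grub2-mkconfig for the change to take effect. Item 2 is a Kernel command line matter, for which console=ttyS0,115200 is the usual form.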

12. Problem isn't a corrupt installation

The Kernel 4.18.0-553.5.1.el8_10.x86_64 which boots on the PC with an i5-2310 was installed on an external USB HDD.

Connected the external USB HDD to the HP Z4 G4 and used the BIOS boot menu to boot from the external HDD:

  1. Attempting to boot the Kernel 4.18.0-553.5.1.el8_10.x86_64 hung.
  2. Booting the 4.18.0-513.24.1.el8_9.x86_64 Kernel worked.

This means the issue on the HP Z4 G4 isn't due to the installation of the later Kernel versions being corrupt, but rather an interaction between the later Kernel versions and some component of the HP Z4 G4.

13. Restoring BIOS options

Following having to reset the CMOS settings to default, need to restore the BIOS options.

Re-enabled SDCARD boot option in the BIOS.

For other options used the HP BCU under Windows.

Enable secure boot:

C:\SWSetup\SP143621>BiosConfigUtility64.exe /setvalue:"Configure Legacy Support and Secure Boot","Legacy Support Disable and Secure Boot Enable"
<BIOSCONFIG Version="" Computername="COMPUTERNAME" Date="2024/06/09" Time="20:33:07" UTC="1">
        <SETTING changeStatus="pass" name="Configure Legacy Support and Secure Boot" returnCode="0">
                <OLDVALUE><![CDATA[Legacy Support Disable and Secure Boot Disable]]></OLDVALUE>
                <VALUE><![CDATA[Legacy Support Disable and Secure Boot Enable]]></VALUE>
        </SETTING>
        <SUCCESS msg="No errors occurred" />
        <Information msg="BCU return value" real="0" translated="0" />
</BIOSCONFIG>

The readback of "Configure Legacy Support and Secure Boot" didn't update until after a reboot, during which the BIOS power cycles the PC. Think the power cycle is required for a change to secure boot to take effect.

Disable the internal speakers:

C:\SWSetup\SP143621>BiosConfigUtility64.exe /setvalue:"Internal Speakers","Disable"
<BIOSCONFIG Version="" Computername="COMPUTERNAME" Date="2024/06/09" Time="20:34:32" UTC="1">
        <SETTING changeStatus="pass" name="Internal Speakers" returnCode="0">
                <OLDVALUE><![CDATA[Enable]]></OLDVALUE>
                <VALUE><![CDATA[Disable]]></VALUE>
        </SETTING>
        <SUCCESS msg="No errors occurred" />
        <Information msg="BCU return value" real="0" translated="0" />
</BIOSCONFIG>

Enable Advanced Error Control:

C:\SWSetup\SP143621>BiosConfigUtility64.exe /setvalue:"Advanced Error Control","Enable"
<BIOSCONFIG Version="" Computername="COMPUTERNAME" Date="2024/06/09" Time="20:35:42" UTC="1">
        <SETTING changeStatus="pass" name="Advanced Error Control" returnCode="0">
                <OLDVALUE><![CDATA[Disable]]></OLDVALUE>
                <VALUE><![CDATA[Enable]]></VALUE>
        </SETTING>
        <SUCCESS msg="No errors occurred" />
        <Information msg="BCU return value" real="0" translated="0" />
</BIOSCONFIG>

And disable error handling on the slots with FPGAs fitted, to stop the BIOS reporting a fatal error when the FPGA is re-loaded and therefore the PCIe link dropped:

C:\SWSetup\SP143621>BiosConfigUtility64.exe /setvalue:"Slot 3 Error Handling","Disable"
<BIOSCONFIG Version="" Computername="COMPUTERNAME" Date="2024/06/09" Time="20:37:34" UTC="1">
        <SETTING changeStatus="pass" name="Slot 3 Error Handling" returnCode="0">
                <OLDVALUE><![CDATA[Enable]]></OLDVALUE>
                <VALUE><![CDATA[Disable]]></VALUE>
        </SETTING>
        <SUCCESS msg="No errors occurred" />
        <Information msg="BCU return value" real="0" translated="0" />
</BIOSCONFIG>
C:\SWSetup\SP143621>BiosConfigUtility64.exe /setvalue:"M.2 SSD0 Error Handling","Disable"
<BIOSCONFIG Version="" Computername="COMPUTERNAME" Date="2024/06/09" Time="20:38:20" UTC="1">
        <SETTING changeStatus="pass" name="M.2 SSD0 Error Handling" returnCode="0">
                <OLDVALUE><![CDATA[Enable]]></OLDVALUE>
                <VALUE><![CDATA[Disable]]></VALUE>
        </SETTING>
        <SUCCESS msg="No errors occurred" />
        <Information msg="BCU return value" real="0" translated="0" />
</BIOSCONFIG>

Enable hot-plug on the slot with the TEF1001 FPGA board:

C:\SWSetup\SP143621>BiosConfigUtility64.exe /setvalue:"Slot 3 Hot Plug","Enable"
<BIOSCONFIG Version="" Computername="COMPUTERNAME" Date="2024/06/09" Time="20:41:11" UTC="1">
        <SETTING changeStatus="pass" name="Slot 3 Hot Plug" returnCode="0">
                <OLDVALUE><![CDATA[Disable]]></OLDVALUE>
                <VALUE><![CDATA[Enable]]></VALUE>
        </SETTING>
        <SUCCESS msg="No errors occurred" />
        <Information msg="BCU return value" real="0" translated="0" />
</BIOSCONFIG>

C:\SWSetup\SP143621>BiosConfigUtility64.exe /setvalue:"Slot 3 Hot Plug Buses","8"
<BIOSCONFIG Version="" Computername="COMPUTERNAME" Date="2024/06/09" Time="20:41:30" UTC="1">
        <SETTING changeStatus="pass" name="Slot 3 Hot Plug Buses" returnCode="0">
                <OLDVALUE><![CDATA[0]]></OLDVALUE>
                <VALUE><![CDATA[8]]></VALUE>
        </SETTING>
        <SUCCESS msg="No errors occurred" />
        <Information msg="BCU return value" real="0" translated="0" />
</BIOSCONFIG>

14. Remove unusable Kernel 4.18.0-553.el8_10.x86_64

Until the root cause is fixed, want to prevent any automatic Kernel updates from removing the old Kernel which still boots.

Therefore, remove 4.18.0-553.el8_10.x86_64 which is the 1st Kernel version on which the problem was found.
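The reasoning behind the removal can be sketched as follows. It assumes dnf is using its default installonly_limit of 3 from /etc/dnf/dnf.conf, under which installing a further Kernel removes an old one once three are already installed. dnf normally also protects the running Kernel from removal, so this is a belt-and-braces check. The function name next_kernel_install is made up for illustration.

```shell
# Hypothetical helper: report whether installing one more Kernel would make
# dnf remove an old one.
#   $1 = number of Kernels currently installed
#   $2 = installonly_limit (0 means no limit)
next_kernel_install() {
    installed=$1
    limit=$2
    if [ "$limit" -ne 0 ] && [ "$installed" -ge "$limit" ]; then
        echo "evicts-oldest"
    else
        echo "keeps-all"
    fi
}

# Example on a live system (assumes the default limit of 3 is in effect):
# next_kernel_install "$(rpm -q kernel-core | wc -l)" 3
```

With three Kernels installed this reports evicts-oldest, and after removing one it reports keeps-all, which is the situation the removal below is aiming for.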

The following identifies the kernel-modules-4.18.0-553.el8_10.x86_64 and kernel-core-4.18.0-553.el8_10.x86_64 packages as the candidates to be removed:

$ yum provides /lib/modules/4.18.0-553.el8_10.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz 
Last metadata expiration check: 1 day, 2:07:46 ago on Sat 08 Jun 2024 21:25:59 BST.
kernel-modules-4.18.0-553.el8_10.x86_64 : kernel modules to match the core
                                        : kernel
Repo        : @System
Matched from:
Filename    : /lib/modules/4.18.0-553.el8_10.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz

kernel-modules-4.18.0-553.el8_10.x86_64 : kernel modules to match the core
                                        : kernel
Repo        : baseos
Matched from:
Filename    : /lib/modules/4.18.0-553.el8_10.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz

$ yum provides "/boot/*4.18.0-553.el8_10.x86_64*"
Last metadata expiration check: 1 day, 2:08:41 ago on Sat 08 Jun 2024 21:25:59 BST.
kernel-core-4.18.0-553.el8_10.x86_64 : The Linux kernel
Repo        : @System
Matched from:
Filename    : /boot/.vmlinuz-4.18.0-553.el8_10.x86_64.hmac
Filename    : /boot/System.map-4.18.0-553.el8_10.x86_64
Filename    : /boot/config-4.18.0-553.el8_10.x86_64
Filename    : /boot/initramfs-4.18.0-553.el8_10.x86_64.img
Filename    : /boot/symvers-4.18.0-553.el8_10.x86_64.gz
Filename    : /boot/vmlinuz-4.18.0-553.el8_10.x86_64

kernel-core-4.18.0-553.el8_10.x86_64 : The Linux kernel
Repo        : baseos
Matched from:
Filename    : /boot/.vmlinuz-4.18.0-553.el8_10.x86_64.hmac
Filename    : /boot/System.map-4.18.0-553.el8_10.x86_64
Filename    : /boot/config-4.18.0-553.el8_10.x86_64
Filename    : /boot/initramfs-4.18.0-553.el8_10.x86_64.img
Filename    : /boot/symvers-4.18.0-553.el8_10.x86_64.gz
Filename    : /boot/vmlinuz-4.18.0-553.el8_10.x86_64

kernel-debug-core-4.18.0-553.el8_10.x86_64 : The Linux kernel compiled with
                                           : extra debugging enabled
Repo        : baseos
Matched from:
Filename    : /boot/.vmlinuz-4.18.0-553.el8_10.x86_64+debug.hmac
Filename    : /boot/System.map-4.18.0-553.el8_10.x86_64+debug
Filename    : /boot/config-4.18.0-553.el8_10.x86_64+debug
Filename    : /boot/initramfs-4.18.0-553.el8_10.x86_64+debug.img
Filename    : /boot/symvers-4.18.0-553.el8_10.x86_64+debug.gz
Filename    : /boot/vmlinuz-4.18.0-553.el8_10.x86_64+debug

Removed the packages:

$ sudo yum remove kernel-core-4.18.0-553.el8_10.x86_64 kernel-modules-4.18.0-553.el8_10.x86_64
Dependencies resolved.
================================================================================
 Package                  Arch       Version                  Repository   Size
================================================================================
Removing:
 kernel-core              x86_64     4.18.0-553.el8_10        @baseos      71 M
 kernel-modules           x86_64     4.18.0-553.el8_10        @baseos      25 M
Removing dependent packages:
 kernel                   x86_64     4.18.0-553.el8_10        @baseos       0  
 kernel-modules-extra     x86_64     4.18.0-553.el8_10        @baseos     686 k

Transaction Summary
================================================================================
Remove  4 Packages

Freed space: 97 M
Is this ok [y/N]: y
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1 
  Erasing          : kernel-4.18.0-553.el8_10.x86_64                        1/4 
  Running scriptlet: kernel-4.18.0-553.el8_10.x86_64                        1/4 
  Erasing          : kernel-modules-extra-4.18.0-553.el8_10.x86_64          2/4 
  Running scriptlet: kernel-modules-extra-4.18.0-553.el8_10.x86_64          2/4 
  Erasing          : kernel-modules-4.18.0-553.el8_10.x86_64                3/4 
  Running scriptlet: kernel-modules-4.18.0-553.el8_10.x86_64                3/4 
  Running scriptlet: kernel-core-4.18.0-553.el8_10.x86_64                   4/4 
  Erasing          : kernel-core-4.18.0-553.el8_10.x86_64                   4/4 
  Running scriptlet: kernel-core-4.18.0-553.el8_10.x86_64                   4/4 
  Verifying        : kernel-4.18.0-553.el8_10.x86_64                        1/4 
  Verifying        : kernel-core-4.18.0-553.el8_10.x86_64                   2/4 
  Verifying        : kernel-modules-4.18.0-553.el8_10.x86_64                3/4 
  Verifying        : kernel-modules-extra-4.18.0-553.el8_10.x86_64          4/4 

Removed:
  kernel-4.18.0-553.el8_10.x86_64                                               
  kernel-core-4.18.0-553.el8_10.x86_64                                          
  kernel-modules-4.18.0-553.el8_10.x86_64                                       
  kernel-modules-extra-4.18.0-553.el8_10.x86_64                                 

Complete!

The files for that Kernel version have been removed from /boot:

$ ls /boot/*4.18.0-553.el8_10.x86_64*
ls: cannot access '/boot/*4.18.0-553.el8_10.x86_64*': No such file or directory

And removed from the list of Kernels reported by grubby:

$ sudo grubby --info=ALL
index=0
kernel="/boot/vmlinuz-4.18.0-553.5.1.el8_10.x86_64"
args="ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet $tuned_params"
root="/dev/mapper/almalinux_skylake--alma-root"
initrd="/boot/initramfs-4.18.0-553.5.1.el8_10.x86_64.img $tuned_initrd"
title="AlmaLinux (4.18.0-553.5.1.el8_10.x86_64) 8.10 (Cerulean Leopard)"
id="5de40fe84eb94d6da4f1b208114750bb-4.18.0-553.5.1.el8_10.x86_64"
index=1
kernel="/boot/vmlinuz-4.18.0-513.24.1.el8_9.x86_64"
args="ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet $tuned_params intel_iommu=on"
root="/dev/mapper/almalinux_skylake--alma-root"
initrd="/boot/initramfs-4.18.0-513.24.1.el8_9.x86_64.img $tuned_initrd"
title="AlmaLinux (4.18.0-513.24.1.el8_9.x86_64) 8.9 (Midnight Oncilla)"
id="5de40fe84eb94d6da4f1b208114750bb-4.18.0-513.24.1.el8_9.x86_64"
index=2
kernel="/boot/vmlinuz-0-rescue-5de40fe84eb94d6da4f1b208114750bb"
args="ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet"
root="/dev/mapper/almalinux_skylake--alma-root"
initrd="/boot/initramfs-0-rescue-5de40fe84eb94d6da4f1b208114750bb.img"
title="AlmaLinux (0-rescue-5de40fe84eb94d6da4f1b208114750bb) 8.6 (Sky Tiger)"
id="5de40fe84eb94d6da4f1b208114750bb-0-rescue"

Confirmed that the default Kernel is still 4.18.0-513.24.1.el8_9.x86_64 which boots:

$ sudo grubby --info=DEFAULT
index=1
kernel="/boot/vmlinuz-4.18.0-513.24.1.el8_9.x86_64"
args="ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet $tuned_params intel_iommu=on"
root="/dev/mapper/almalinux_skylake--alma-root"
initrd="/boot/initramfs-4.18.0-513.24.1.el8_9.x86_64.img $tuned_initrd"
title="AlmaLinux (4.18.0-513.24.1.el8_9.x86_64) 8.9 (Midnight Oncilla)"
id="5de40fe84eb94d6da4f1b208114750bb-4.18.0-513.24.1.el8_9.x86_64"
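Note that the index of the working Kernel changed from 2 to 1 when the other Kernel was removed, so an index noted earlier can go stale. As a sketch, the current index for a given Kernel version can be looked up from the grubby --info=ALL output. The function name index_of_kernel is made up for illustration, and the parsing assumes the index= and kernel= line layout shown above.

```shell
# Hypothetical helper: read "grubby --info=ALL" output on stdin and print the
# index of the entry whose Kernel image is /boot/vmlinuz-<version>.
index_of_kernel() {
    awk -F= -v want="\"/boot/vmlinuz-$1\"" '
        $1 == "index"                 { idx = $2 }    # remember the latest index line
        $1 == "kernel" && $2 == want  { print idx }   # matching Kernel: report its index
    '
}

# Example on a live system (needs root):
# sudo grubby --info=ALL | index_of_kernel 4.18.0-513.24.1.el8_9.x86_64
```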

15. Attempt to enable serial console to capture debug output

The HP Z4 G4 PC has a dual port serial card fitted:

$ lspci -nn | grep Serial
09:00.0 Serial controller [0700]: Nanjing Qinheng Microelectronics Co., Ltd. CH352/CH382 PCI/PCIe Dual Port Serial Adapter [1c00:3253] (rev 10)

Which looks to be using the CH382, a PCI-Express based dual UART and printer port chip.

When AlmaLinux had booted, was able to use PuTTY to communicate using /dev/ttyS0 set to 115200 baud to a USB to serial adapter on a different Ubuntu PC.

15.1. Serial output when working Kernel boots

With the Kernel which boots, used GRUB to edit the command line and replace rhgb quiet with console=ttyS0,115200. During boot:

  1. EFI stub: UEFI Secure Boot is enabled. still appeared on the monitor.
  2. The rest of the boot messages appeared on the serial port:
    [    0.000000] Linux version 4.18.0-513.24.1.el8_9.x86_64 (mockbuild@x64-builder02.almalinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-20) (GCC)) #1 SMP Mon Apr 8 11:23:13 EDT 2024
    [    0.000000] Command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-4.18.0-513.24.1.el8_9.x86_64 root=/dev/mapper/almalinux_skylake--alma-root ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap console=ttyS0,115200 intel_iommu=on
    <<snip>>
    AlmaLinux 8.10 (Cerulean Leopard)
    Kernel 4.18.0-513.24.1.el8_9.x86_64 on an x86_64
    
    skylake-alma login: 
    

The ttyS0 device has IO port 0x20C0:

$ sudo cat /proc/tty/driver/serial
[sudo] password for mr_halfword: 
serinfo:1.0 driver revision:
0: uart:XR16850 port:000020C0 irq:16 tx:21019 rx:0 RTS|CTS|DTR|DSR
1: uart:XR16850 port:000020C8 irq:16 tx:0 rx:0
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3
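As a sketch, the I/O port for a given ttyS device can be pulled out of that table with awk, in the 0x form which the port-based console options take. The function name serial_port_of is made up for illustration, and the parsing assumes the "N: uart:... port:XXXXXXXX" layout shown above.

```shell
# Hypothetical helper: read /proc/tty/driver/serial on stdin and print the
# I/O port of ttyS<N> as a lower-case hex value without leading zeros.
serial_port_of() {
    awk -v idx="$1:" '
        $1 == idx {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^port:/) {
                    port = $i
                    sub(/^port:0*/, "", port)   # strip the label and leading zeros
                    print "0x" tolower(port)
                }
            }
        }'
}

# Example on a live system (needs root):
# sudo cat /proc/tty/driver/serial | serial_port_of 0
```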

Tried some other options to get the "EFI stub: UEFI Secure Boot is enabled." message to appear on the serial console, to get more debug about when the problematic Kernel hangs. These attempts weren't successful:

  1. If add initcall_debug console=ttyS0,115200 get additional calling lines which start at time [0.282895], after 401 lines of previous output.
  2. If add console=ttyS0,115200 efi=debug get additional efi: lines which start at time [0.000000], after 59 lines of previous output.
  3. If add earlyprintk=serial,ttyS0,115200 get no serial output.
  4. If add earlycon=uart8250,io,0x20c0,115200 get no serial output. dmesg had earlycon: uart8250 at I/O port 0x20c0 (options '115200') so the option was recognised.
  5. If add console=uart8250,io,0x20c0,115200:
    • Kernel doesn't boot
    • The monitor only displays EFI stub: UEFI Secure Boot is enabled.
    • The only output on the serial port is some zero bytes. The answer to "Is it possible to set serial speed for an early kernel boot log to a MMIO UART?" suggests this may be related to a non-standard UART clock.

16. Update to Kernel 4.18.0-553.8.1.el8_10.x86_64 allows PC to boot

Installed further updates, including Kernel version 4.18.0-553.8.1.el8_10.x86_64, and with that Kernel the PC boots. From scanning the release notes it's not obvious which change fixed the problem with the boot hanging.

The update removed intel_iommu=on from the Linux command line, as per "Custom Linux commandline options lost after Kernel update":

$ cat /proc/cmdline 
BOOT_IMAGE=(hd1,gpt2)/vmlinuz-4.18.0-553.8.1.el8_10.x86_64 root=/dev/mapper/almalinux_skylake--alma-root ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet

Added the option back:

[mr_halfword@skylake-alma ~]$ sudo grubby --update-kernel=DEFAULT --args=intel_iommu=on
[mr_halfword@skylake-alma ~]$ sudo grubby --info=DEFAULT
index=0
kernel="/boot/vmlinuz-4.18.0-553.8.1.el8_10.x86_64"
args="ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet $tuned_params intel_iommu=on"
root="/dev/mapper/almalinux_skylake--alma-root"
initrd="/boot/initramfs-4.18.0-553.8.1.el8_10.x86_64.img $tuned_initrd"
title="AlmaLinux (4.18.0-553.8.1.el8_10.x86_64) 8.10 (Cerulean Leopard)"
id="5de40fe84eb94d6da4f1b208114750bb-4.18.0-553.8.1.el8_10.x86_64"
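Since a Kernel update can silently drop custom options, a check along the following lines could be run after each update. This is a sketch only: the function name cmdline_has is made up for illustration, and it matches whole space-separated words on the command line.

```shell
# Hypothetical helper: succeed if the command line in $1 contains the
# argument $2 as a complete word.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example on a live system: re-add the option only when it is missing.
# cmdline_has "$(cat /proc/cmdline)" intel_iommu=on ||
#     sudo grubby --update-kernel=DEFAULT --args=intel_iommu=on
```

As with the manual steps here, a reboot is still needed before the running Kernel picks up the restored option.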

Rebooted after changing the command line. The IOMMU was on, secure boot was enabled and the Kernel wasn't tainted:

[mr_halfword@skylake-alma ~]$ cat /proc/cmdline 
BOOT_IMAGE=(hd1,gpt2)/vmlinuz-4.18.0-553.8.1.el8_10.x86_64 root=/dev/mapper/almalinux_skylake--alma-root ro resume=/dev/mapper/almalinux_skylake--alma-swap rd.lvm.lv=almalinux_skylake-alma/root rd.lvm.lv=almalinux_skylake-alma/swap rhgb quiet intel_iommu=on
[mr_halfword@skylake-alma ~]$ dmesg|grep lockdown
[    0.000000] Kernel is locked down from EFI secure boot; see man kernel_lockdown.7
[    9.852952] Lockdown: swapper/0: Hibernation is restricted; see man kernel_lockdown.7
[mr_halfword@skylake-alma ~]$ ~/fpga_sio/software_tests/eclipse_project/bind_xilinx_devices_to_vfio.sh 
IOMMU devices present: dmar0  dmar1  dmar2  dmar3
Loading vfio-pci module
[sudo] password for mr_halfword: 
Bound vfio-pci driver to 0000:15:00.0 10ee:7024 [0002:0003]
Waiting for /dev/vfio/41 to be created
Giving user permission to IOMMU group 41 for 0000:15:00.0 10ee:7024 [0002:0003]
Bound vfio-pci driver to 0000:30:00.0 10ee:7011 [0000:0000]
Waiting for /dev/vfio/89 to be created
Giving user permission to IOMMU group 89 for 0000:30:00.0 10ee:7011 [0000:0000]
[mr_halfword@skylake-alma ~]$ ~/Downloads/kernel-chktaint 
Kernel not Tainted

Tests using the IOMMU were successful.
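The taint check above used the kernel-chktaint script. As a rough sketch of the same pass or fail check without that script, the taint mask can be read from /proc/sys/kernel/tainted, where 0 means not tainted. The function name taint_state is made up for illustration and, unlike kernel-chktaint, it doesn't decode the individual taint reasons.

```shell
# Hypothetical helper: turn the numeric taint mask into a one-line summary.
taint_state() {
    if [ "$1" -eq 0 ]; then
        echo "Kernel not Tainted"
    else
        echo "Kernel is Tainted (mask $1)"
    fi
}

# Example on a live system:
# taint_state "$(cat /proc/sys/kernel/tainted)"
```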
