@TheTechRobo
Last active May 13, 2022 23:56
How to recover a drive that fails to open with "No such device or address"

Prerequisites

  • Linux kernel 2.7+ (with headers)

Symptoms

(Your symptoms might be different with the same root problem, but these were mine.)

  • lsblk shows the drive, but DOES NOT SHOW the partitions, like:
    • $ lsblk
      sda    3G
        sda1 1G
        sda2 2G
      sdb    1.8T
      $ # notice that sdb doesn't have any partitions
      $ # even though it does
      
  • Any program trying to open the drive (whether it be mount, ddrescue, even grep) will freeze for a while and say:
    • Failed to open /dev/<device>: No such device or address
    • NOT to be confused with Failed to open /dev/<device>: No such file or directory
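
One way to check the first symptom without opening the device is to look in sysfs, where the kernel lists each partition it has detected as a subdirectory of the disk. A minimal sketch, assuming an sd-style name such as sdb (the function name and the optional second argument are made up for illustration):

```shell
# count_partitions DISK [SYSFS_ROOT]
# Prints how many partitions the kernel currently knows about for DISK.
# Reads only sysfs, so it never opens /dev/DISK and cannot hang on it.
# Assumes sd-style naming (sdb -> sdb1, sdb2, ...). SYSFS_ROOT is a
# hypothetical override for trying the function on a fake directory tree.
count_partitions() {
    disk=$1
    root=${2:-/sys/block}
    n=0
    for p in "$root/$disk/$disk"[0-9]*; do
        # An unmatched glob stays literal, so make sure the path exists.
        if [ -d "$p" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example: count_partitions sdb
```

A result of 0 for a drive you know is partitioned matches the lsblk output shown above.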

Cause + solution

I used to think this was caused by the drive crashing or something. Instead, it's caused by how Linux handles a drive that is slow to respond.

Modern drives, when a read fails or takes a while, will try over and over (or wait it out). The timeout for this in modern drives is very high, and often not configurable. (In this case, you couldn't configure it even if it were configurable, as smartctl wouldn't be able to open the drive.)

If a read takes too long, Linux will decide that the drive is not okay and ask the drive to reset. But the drive will be so focused on processing the already-made request, it will ignore the reset request. After a while, Linux will reset the ATA connection. This is why you see No such device or address rather than No such file or directory: the file was found, but in the process of being opened, it disappeared.

Linux's timeout is fully configurable and is given in seconds (the default is typically 30). While 690 seconds (11 minutes and 30 seconds) is very overboard, I found it better to play it safe than to risk having to power the drive off and on again so it would be recognised.

$ echo 690 | sudo tee /sys/block/sdb/device/timeout
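
To confirm the new value took hold, it helps to read the file back after writing it. A small sketch, assuming it runs as root (the function name and the optional third argument are illustrative, not part of any tool):

```shell
# set_timeout DISK SECONDS [SYSFS_ROOT]
# Writes SECONDS to the disk's command timeout and prints old and new values.
# SYSFS_ROOT is a hypothetical override for testing against a fake tree.
set_timeout() {
    disk=$1
    secs=$2
    root=${3:-/sys/block}
    file="$root/$disk/device/timeout"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (wrong disk name, or not root?)" >&2
        return 1
    fi
    echo "old timeout: $(cat "$file")"
    echo "$secs" > "$file"
    echo "new timeout: $(cat "$file")"
}

# Example (as root): set_timeout sdb 690
```

The value lives in sysfs, so it does not survive the drive being unplugged or the machine rebooting and has to be set again each time.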

Unfortunately, this only takes effect for a read if the read is started after the timeout is set. This means the original timeout will still apply while Linux is probing for partitions. To solve this, prevent it from probing automatically with this livepatch. Then, set the timeout. Finally, if you need the partitions loaded so you can use them directly (like /dev/sdb1), run sudo partprobe /dev/<device>.
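
Since the order matters, one option is to guard the last step so the partition probe only runs once the longer timeout is in place. This is a rough sketch, not a tested recovery tool: it assumes automatic probing is already prevented and that it runs as root. The function name, the SYSFS_ROOT argument, and the PARTPROBE variable are all hypothetical and exist only for illustration and dry runs.

```shell
# probe_when_ready DISK MIN_SECONDS [SYSFS_ROOT]
# Runs partprobe on /dev/DISK only if the disk's timeout is already at
# least MIN_SECONDS, so the probe's reads get the longer timeout.
# PARTPROBE (hypothetical) swaps in another command, e.g. echo for a dry run.
probe_when_ready() {
    disk=$1
    min=$2
    root=${3:-/sys/block}
    file="$root/$disk/device/timeout"
    current=$(cat "$file") || return 1
    if [ "$current" -lt "$min" ]; then
        echo "timeout is ${current}s, want at least ${min}s; not probing" >&2
        return 1
    fi
    ${PARTPROBE:-partprobe} "/dev/$disk"
}

# Example (as root): probe_when_ready sdb 690
```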

If a timeout of 690 isn't enough, the drive probably crashed.

If you have any questions, contact me at TheTechRobo#7420 on Discord or TheTechRobo on hackint IRC.
