I've had mystery eSATA drops on my EX495 ever since WHS V1, and still on WHS 2011. I finally found the fix and am happy.
I let Matt know about this; he's the developer of the SMART (monitoring) WHS add-in. Here is his interesting response.
"You’ve made some interesting discoveries/observations here. It’s intriguing to me that you could reproduce the problem by subjecting the enclosure to intense I/O, or by allowing SMART tools to run against it under minimal I/O.
The link state power management problem seems to affect a lot of different hardware. I have an OCZ Octane 128GB SSD in both my work-issued laptop (by day I’m a Microsoft SharePoint consultant for HP) and my personal laptop. Both exhibited a peculiar behavior of freezing for 30 seconds at seemingly random times during the day. The system would always become responsive again, so it was more of an annoyance than anything. The system event log would show an error along the lines of “the device \\Harddisk0\ did not respond within the timeout period.”
In both cases the guilty party was an Intel ICH SATA/RAID controller and the fix was to go into the Registry and turn off the—you guessed it—link state power management!
According to the SATA specification, there are many different commands you can send to a device—things like DEVICE IDENTIFY, SMART READ DATA, etc. If a device doesn’t recognize a command, the device should return an error code so that the program that issued the command knows the operation failed.
Something I found out in developing WindowSMART and Home Server SMART, particularly when I got to the part where I started supporting device self-testing (short, extended, conveyance), was that there are a LOT of devices that don’t conform to the specification. Rather than returning an error code, the device just seems to die silently.
As an example, you can send a command to the device and it’ll return the length of each test it supports, as a number of minutes. If the return value is zero, the test is not supported by the device. In the example of the OCZ Octane 128, this particular SSD doesn’t support any of the tests. Of course, in the UI, I dim the button that lets you run a test if the device doesn’t support it. Out of curiosity, I did want to see what happens if I send the test command to the OCZ Octane. Doing so effectively renders the laptop inoperative. The resolution is to do a full power cycle on the laptop, after which the device returns to normal operation.
The moral of the story here is that I’m guessing the enclosure and/or some (or all) of the devices within it don’t support the link state power management command, so if the host sends that command to the device(s), they die silently and stop responding to commands. And Windows eventually detects they’re no longer responding and finally takes them offline.
By the same token, when a SMART tool sends commands to the devices, it’s possible one of them doesn’t recognize a command and locks up, and seemingly all of them follow suit, probably because commands start queuing up in the I/O controller behind the locked-up device. Eventually the whole controller goes down.
Matt"
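For anyone curious about the "zero minutes means unsupported" rule Matt describes for self-tests, here is a minimal Python sketch of that check. It is my own illustration, not code from WindowSMART or Home Server SMART. The function names and the sample durations for the "typical" drive are made up, and it assumes the per-test durations have already been read from the device by some other tool.

```python
from typing import Dict, List


def supported_tests(durations_minutes: Dict[str, int]) -> Dict[str, bool]:
    """Map each self-test name to whether the device claims to support it.

    Rule taken from Matt's description: a reported duration of zero
    minutes means the device does not support that test.
    """
    return {name: minutes > 0 for name, minutes in durations_minutes.items()}


def runnable_tests(durations_minutes: Dict[str, int]) -> List[str]:
    """Return only the tests a UI should leave enabled (not dimmed)."""
    support = supported_tests(durations_minutes)
    return [name for name, ok in support.items() if ok]


# The OCZ Octane 128 in the post reports no supported tests, i.e. all zeros.
octane = {"short": 0, "extended": 0, "conveyance": 0}

# Hypothetical values for a drive that supports all three tests.
typical_drive = {"short": 2, "extended": 120, "conveyance": 5}

print(runnable_tests(octane))
print(runnable_tests(typical_drive))
```

The point of gating on the reported duration is to never send a self-test command to a device that says it can't run it, since a non-conforming device may hang instead of returning an error.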
Edited by JazJon, 15 July 2012 - 06:25 PM.