RESET Forums (homeservershow.com)

Gen8 disk issues - a real head scratcher


Robbidoo

Hi folks,

 

I have never had a stable server with the gen8 I bought just over a year ago. It all revolves around disk reliability. I have tried a lot of stuff to remedy this and I am really running out of ideas, so I'm hoping the wisdom on here might be able to help me.

 

I did hijack a thread about a year ago here: http://homeservershow.com/forums/index.php?/topic/9114-raidahci-confusion-and-129-errors/ where the conclusion was that the drives were bad. There is a lot of detail there if you are interested. I will capture the latest and most relevant detail below:

 

Ok here's the setup to start with:

 

Server: Gen8 Microserver with a Xeon 1230 and 8GB RAM. Running Win Server 2012 R2.

 

Running the latest Stablebit Drivepool and Scanner.

 

- Booting using the USB stick with bootloader trick to allow booting from SATA5

- Sandisk SSD for OS, plugged into SATA5 (optical drive cable)

- Using on-board SATA controller in AHCI mode for all disks

- All mechanical disks are using GPT and formatted as a single volume. These have no drive letters and are all presented up through Drivepool.

 

In the four drive bays:
 

2 x 3TB WD Red

2 x 4TB WD Red

 

The 4TB disks have never been replaced, but both 3TB disks are RMA replacements. 

 

After the server has been up for some minutes or hours, the disk controller reports timeouts to one of the disks. It is always one of the 3TB Reds. At that point the IO lag is horrendous, and I have little choice but to shut the machine down and pull a disk.

 

With just 2 x 4TB and 1 x 3TB, under nominal IO load, the server is fine and I see no errors.

 

Now, I have previously RMA'd this 3TB disk twice, and still see the same issues when adding it to the system and rebalancing disks using drivepool. This led me to consider that perhaps I had a cable / backplane issue, or maybe a controller issue. So I bought both a replacement cable to allow me to run the disks sitting outside the server, and a pci-e mini-SAS controller. 

 

First I tried the new controller. Same issues occurred within 30 mins or so. Then I tried the new cable with the new controller. Same issues. New cable with on-board controller - same issues.

 

I want to add another data point. With three disks in the system, 2 x 4TB and 1 x 3TB, if I apply massive IO load (drivepool rebalancing, transcoding movies, bittorrent writing around 15MB/s to those disks), the disks will disappear from the bus. The error logs are massive, but the key entry is "drive x was surprise removed". Failures that show up only under heavy load suggest to me perhaps a PSU issue?

 

So I am running out of commonality, and components to replace.
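To make the "commonality" reasoning explicit, here is a minimal sketch that lists the four failing combinations described above as sets and intersects them. The component labels are my own shorthand for illustration, and the assumption that the PSU, the 3TB Red model, the boot arrangement and DrivePool were present in every run is inferred from the description, not logged anywhere.

```python
# Components assumed to be present in every test run described in the post.
# Labels are illustrative shorthand, not names reported by the server.
ALWAYS_PRESENT = {
    "PSU",
    "3TB WD Red (model, not a single unit)",
    "OS on SSD via USB bootloader",
    "DrivePool",
}

# Each failing run, expressed as the set of components that were in play.
failing_runs = [
    {"on-board controller", "backplane cable"} | ALWAYS_PRESENT,
    {"PCI-E controller", "backplane cable"} | ALWAYS_PRESENT,
    {"PCI-E controller", "replacement cable"} | ALWAYS_PRESENT,
    {"on-board controller", "replacement cable"} | ALWAYS_PRESENT,
]


def common_components(runs):
    """Return the components that appear in every failing run."""
    return set.intersection(*runs)


suspects = common_components(failing_runs)
for name in sorted(suspects):
    print(name)
```

Anything that drops out of the intersection (either controller, either cable) has effectively been swapped without curing the fault, which is why the remaining suspects are the ones worth changing next.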


My next steps:

 

- I have ordered a 4TB HGST disk, to add to the array instead of the 3TB Reds (which I have lost faith in anyway, especially as both are reconditioned RMA replacements).

 

- I am going to try and rig a known-good ATX PSU up to the MS, and see if I can re-create the load issues.

 

Any thoughts, tips, solutions, would be most welcome. This is getting quite boring now but I'm not sure what else to do other than keep chasing down replacement parts.

 

 

 

 


Three thoughts:

 

- Are you running latest firmware on everything?

- Have you checked that you don't have any temperature problems?

- Have you tested RAID mode?

 

This last one would be my biggest recommendation. My Gen8 is running ESXi (currently 6.0 U2), which is notoriously unstable in AHCI mode, so it doesn't fully surprise me that this is also the case under Windows. When you switch to RAID mode, the RAID configuration utility (you can access it from the iLO through the auto-provisioning interface) will give you an option to auto-create one RAID-0 array per disk, and if it behaves the same as mine did, you won't lose any data on the disks (but I'm not guaranteeing that, so best to have backups of any valuable data).

 

You would have to reinstall Windows. You also wouldn't need the "boot from USB hack", which I personally think is a terrible thing to do anyway.


Thanks rotor

 

Will check firmwares but the MS is on the one with the fan speed hack that came out a while ago.

 

With regard to RAID mode, I'm currently seeing all of the above issues when using the PCI-E controller, so I'm not sure how much this will help, but I can try I suppose.


Given you have replaced one of the drives, you've replaced the cable and you've replaced the controller, can I suggest the following?

 

- Get hold of another SSD (just any old thing you have lying around -- a spinning disk will do as well)

- Change the B120i to RAID mode (leaving your 4 data disks plugged into the PCI-E controller)

- Remove the USB device you've been booting from

- Install a clean Windows 2012 R2 onto the temporary SSD (RAID mode allows you to boot off SATA port 5)

 

You've now ruled out issues with the booting from USB functionality, and you haven't touched the data or configuration on the 4 data disks.


When you RMA'd the previous 3TB drive, did you test it in another system to confirm the issue with the drive?



 

Yeah I tried but to be honest I can't remember if this particular one was at fault in another machine. It's the 5th or 6th total from two 3TB drive purchases.

 

Just to rule out options, have you tried running this setup without DrivePool?

 

No not yet. I don't have a good use case for this server without it, but it looks like that's next on the list...

 


 

rotor, over the weekend I tried exactly what you suggested above. New SSD, new install of Windows Server 2012 R2, and 4 x single-disk RAID sets.

 

Everything was looking good. I left it rebalancing with drivepool overnight, this morning one of the disks had been "surprise removed". 

 

It was the 3TB that I suspected anyway, so I rebalanced onto the other three, started two torrents going, and bang, one of the 4TB disks has been "surprise removed".

 

 

So what have I learned?

 

Might be drivepool. I can get rid of it but it seems odd that so many others are using it on these boxes without issue. To start with I'm going to disable its simultaneous replication and read interleaving.

 

I'm open to buying another product if anyone can recommend one. Having file-level duplication across disks is all I'm after really, and when it was working, it was perfect.



If you've paid for DrivePool, can you log a support case with them? This must be hugely frustrating for you, I really hope you get to the bottom of it.


Thanks rotor I appreciate your support. (and yes I'm a drivepool customer)

 

The thing is, when it goes wrong, it manifests as a driver / hardware error - drivepool is operating a layer or two above that so I can't really see how it can be the software's fault. The commands are already abstracted and being processed lower down the stack.

 

If anyone thinks differently about it I'd love to hear your ideas.

Edited by Robbidoo

Just to update, I've hooked up another ATX power supply external to the case and am going to try and thrash the disks to see if it will fail again with that on.
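For anyone wanting a repeatable way to "thrash" a disk, here is a rough sketch of a write-and-verify loop in Python. It is an assumption about how one might generate load, not what was actually used here. The function name `thrash` and its parameters are made up for illustration, and the target folder must be on the disk under test (for disks without drive letters that means temporarily giving one a letter or mount folder). Note the read-back may be served from the OS cache, so this mainly exercises writes and catches I/O errors and disappearing disks, not subtle read corruption.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def thrash(target_dir, file_count=4, file_size=1024 * 1024, passes=1):
    """Write random files, force them to disk, read them back and compare.

    Returns a list of (path, description) tuples, one per problem seen.
    An empty list means every file was written and verified.
    """
    target = Path(target_dir)
    problems = []
    for p in range(passes):
        for i in range(file_count):
            path = target / f"thrash_{p}_{i}.bin"
            data = os.urandom(file_size)
            expected = hashlib.sha256(data).hexdigest()
            try:
                with open(path, "wb") as fh:
                    fh.write(data)
                    fh.flush()
                    os.fsync(fh.fileno())  # push the data past the OS buffers
                with open(path, "rb") as fh:
                    actual = hashlib.sha256(fh.read()).hexdigest()
                if actual != expected:
                    problems.append((str(path), "checksum mismatch"))
            except OSError as exc:
                # A disk dropping off the bus usually surfaces here.
                problems.append((str(path), f"I/O error: {exc}"))
            finally:
                try:
                    path.unlink(missing_ok=True)  # clean up the test file
                except OSError:
                    pass
    return problems


if __name__ == "__main__":
    # Demo against a temporary folder; point this at the disk under test instead.
    with tempfile.TemporaryDirectory() as demo_dir:
        issues = thrash(demo_dir, file_count=2, file_size=64 * 1024)
        print("problems found:", len(issues))
```

Running several copies at once against different disks, with larger `file_size` and `passes`, would get closer to the rebalance-plus-torrent load that triggers the failures.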

