RESET Forums (homeservershow.com)

New BackBlaze article about Seagate ST3000DM001


ikon


Well, we do know that 477 (867 - 390) of the drives that were deployed from Jan to Jun were Internals; 87 more than the Externals. That means that at least 55% (477/867) of the failed drives were Internal, and that Internal drives had a minimum overall failure rate of 27% (477/1751). I don't think that's useless info.
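For anyone who wants to check that arithmetic, here is a minimal Python sketch. It assumes one reading of the figures in the post: 867 failed drives in total, 390 externals deployed, and 1751 internals deployed. The function and parameter names are made up for illustration and do not come from BackBlaze. The lower bound comes from the worst case for externals: even if every one of the 390 externals had failed, the remaining failures must have been internals.

```python
def internal_failure_bounds(total_failed, externals_deployed, internals_deployed):
    """Lower bounds on internal-drive failures (hypothetical helper)."""
    # Worst case for externals: every external drive failed, so the
    # rest of the failures must have been internal drives.
    min_internal_failed = max(0, total_failed - externals_deployed)
    # Smallest possible share of all failures that were internal.
    min_share_of_failures = min_internal_failed / total_failed
    # Smallest possible failure rate among the internal drives.
    min_internal_rate = min_internal_failed / internals_deployed
    return min_internal_failed, min_share_of_failures, min_internal_rate


# Figures as read from the post above (assumed interpretation).
count, share, rate = internal_failure_bounds(867, 390, 1751)
print(f"at least {count} internal failures")
print(f"at least {share:.0%} of all failures were internal")
print(f"internal failure rate of at least {rate:.0%}")
```

These are bounds only: under the same assumptions, the true internal share of failures could be anywhere from 55% up to 100%.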

 

I completely disagree that the BB people think there is no issue with shucked drives. In fact, their statement is that, regardless of whether the drives are Internal or External, the failure rate is unacceptably high. I agree with that.

 

Regardless of whether BB has tracked every last bit of data that would be scientifically ideal, I think there is valuable info that can be gleaned from the stats.

At least 477 of the failed drives were internals.

You have done more analysis above than they have. No one thinks their failure rate is acceptable.

Read the comments at the bottom of the article. They are very vague about working with Seagate. That is extremely odd. Every drive vendor wants to dive into these situations to get to root cause. With proper data capture, they should be able to determine the type of failure by drive and manufacturing location. They should also be able to determine whether there are any issues with inbound handling or in-house handling.


I've said it before, and I'll say it again.  I think BackBlaze have a fundamental issue with over-handling drives.  Every other DAS/NAS has hot-swap bays, which means that when you're swapping a dead drive you don't have to haul the machine out of the rack.  Dragging a heavy-ass StoragePod out of a rack is only going to add excessive knocks and vibrations to the drives.

 

As for them assembling their machines then putting them in the back of a Transit and dragging them to their colo... :o


The storage pod itself isn't good for the drives. They're jamming as many desktop-rated drives as possible into their custom enclosure with no regard to vibration or temperature. The only thing they care about is as much storage as possible, as cheaply as possible. They basically run the storage pods until a large enough number of drives has failed to warrant them unracking it and swapping out the duds. It is basically "how not to make a storage array" but on a massive scale.

 

They've got a blog article about how they don't see any correlation between temperature and failure rates but at the end of the day, it is a high precision mechanical system.

 

Watch the below video to see how vibration affects hard drives. That is just the guy shouting at the array. Imagine all the drives in the enclosure all vibrating with their random seek patterns.

 

Edited by GotNoTime


 

Thanks for adding that. In my initial post attempt I absolutely had "at least" in there, but the forum software did something weird and forced me to restart the post from scratch, so I missed putting it in the second time.

 

Regarding both rates being unacceptable, that was BB's own statement.

 

I'm not sure what's going on about working with Seagate. I'm not sure if they do, or want to, work with any of the manufacturers.

 

Regarding handling, one thing seems clear: the extra handling required in order to shuck drives out of external enclosures doesn't appear to have adversely affected the longevity of those drives, at least compared to the non-shucked versions of the same model. Also, unless BB was treating the ST3000DM001 drives way more harshly than other models, it appears that differences in handling couldn't be more than a minor consideration when it comes to this specific model, again compared to other makes/models.

 

I do not agree that sliding a Storage Pod out to replace a drive is necessarily going to cause "excessive shocks". In fact, I found this on their blog:

 

Technically the system will support hot-swapping drives. However, we never do this. Hot-swapping drives on any system increases the likelihood of something going wrong. Since these pods are top-loaded and require moving the server to replace drives, it’s one more reason to power down before swapping drives.

 

So, it would appear that shocks to HDDs from sliding out a Storage Pod, in order to do hot swapping, are not a factor after all. Even more importantly, it certainly appears to me that BB is not only aware of the importance of minimizing heat and vibration, but that their pods are specifically designed to address these concerns. Would I do even more to minimize vibration? Absolutely, but I'm an HDD anti-heat, anti-vibration fanatic. I would have drives suspended in elastic rubber 'hammocks' :)

 

The same goes for preassembling a pod and transporting it: we do not know how that is done. It's entirely possible that having the drives mounted in the pod actually poses less shock risk to the drives than if they were shipped in the typical packing manufacturers use. A lot of it would depend on how the drives are mounted. From BB's description, it seems like their pods have quite a bit of anti-vibration dampening built in.

 

Other than what they've published in their blog, I don't have any insider info on how BB tracks their drives. I have a sneaking suspicion they have more data than we've seen, but I don't know it as a fact.


 

Regarding handling, one thing seems clear: the extra handling required in order to shuck drives out of external enclosures doesn't appear to have adversely affected the longevity of those drives, at least compared to the non-shucked versions of the same model. Also, unless BB was treating the ST3000DM001 drives way more harshly than other models, it appears that differences in handling couldn't be more than a minor consideration when it comes to this specific model, again compared to other makes/models.

 

How do you come to that conclusion? Where is the data to support that?

I would have drives suspended in elastic rubber 'hammocks' :)

 

That would actually be a very bad thing to do. Drives need to be firmly mounted and isolated.


How do you come to that conclusion? Where is the data to support that?

 

Not sure which part you mean, so I'll respond to both:

  1. the Jan to Jun stat covers the shucking of external drives. As stated, at least 55% of the failed drives were internal. That means, at most, external drives accounted for 45% of the failures. Since internal drives are the majority, it follows that the shucking of the externals could not have caused shortened life span compared to internals. IOW, both types could have failed miserably.

     

    One could try to make the argument that all 390 of the external drives could have failed, leading to a 100% failure rate of them, but that's countered by the second line (Jul to Dec), where the majority of drives were shucked, and yet the failure rate was actually less.

     

  2. since BB is using all kinds of drive models, and the ST3000DM001 model failed way more often than others, it is highly unlikely, to say the least, that handling is responsible for the difference. It's always possible that the ST3000DM001 model is ultra sensitive to handling, making it more vulnerable, but I would argue that that is just another example of how the drive model is faulty.

That would actually be a very bad thing to do. Drives need to be firmly mounted and isolated.

 

Utterly, totally, and completely disagree, and I base that on experience. I have used many drive suspension techniques over the years: rubber grommets, silicone grommets, sponge foam, springs, fiberglass insulation, bungee cord, rubber bands, surgical tubing, latex rubber, etc. The best I ever found was to suspend HDDs in springy rubber, preferably silicone or latex. The silicone seems to be best in that it stands up to heat better. In any case, really good, springy mounts completely dampen vibration, particularly between drives and the chassis, and therefore prevent inter-drive vibration and harmonics.

 

The one thing you can lose, a little bit, is cooling. When drives are not bolted to a chassis or drive cage, there is little to zero heat conduction from the drive to the chassis. Of course, it's not like conduction to a chassis is a very good way to cool drives in any case, but I compensated for it with fan cooling.

 

I am not sure where you got the idea that HDDs need to be firmly mounted, but I urge you to try a soft suspension mount and see for yourself. As an added bonus, you will likely find that the whole computer is quieter with suspended drives.



1. The only argument I am making is there is not enough data to draw any conclusions. Perhaps if we had more information it would become obvious that it is not an internal versus external thing at all. Wouldn't it be odd if we found that all the failed drives came from the same production line? You keep trying to validate this study and there is just not enough data to do that.

2. A drive that is not firmly mounted and isolated is more vulnerable to shock and vibration.

3. I got the idea from drive manufacturers. I cannot remember the last time I had a drive fail. So I think I am doing alright.

Your logic, or lack thereof, reminds me of my first day in Statistical Analysis in college. The professor wrote two facts on the board. Ice cream consumption is highest in the month of August. The number of sexual assaults is highest in the month of August. He then wrote, 'eating ice cream causes people to commit sexual assault'. He then turned to the class and said, 'next class turn in a one page brief either in agreement or disagreement with my statement'.


I have been asking questions on the BB blog. Someone linked me to this, I had not seen this post before:

https://www.backblaze.com/blog/backblaze_drive_farming/

The pictures tell me these guys don't get drive handling in general. The trunk full of drives is bad enough, but how many times do you think the tower of blister-packed drives fell over?
