RESET Forums (homeservershow.com)

Home Server SMART beta (1.7.4.28)


msawyer91

Good evening! I wanted to let everyone know that I posted a new version of Home Server SMART today, version 1.7.4.28, available here: http://www.dojonorthsoftware.net/Freebies/HomeServerSMARTBeta.aspx

 

I've seen that HSS has been talked about quite a bit in the forums here. Some of the feedback has been very positive, and other feedback has shown there is plenty of room for improvement. The discussions seem to focus on the current production builds, mainly the November 2010 release (1.1.36.6) or January 2010 (1.1.0.21).

 

The first major beta released earlier this year made HSS run as a Windows service on WHS, allowing full automation of disk health monitoring, with a configurable disk polling interval. Disk problems will raise server events (warning/critical) and, when using Alex Kuretz's Remote Notification add-in, you can receive email and SMS notifications when there is a problem needing attention.

 

I tried to incorporate features people requested and that I believed would be useful, including the ability to "ignore" certain warnings, such as a low bad sector count that doesn't get worse or that pesky CRC error count.

 

Unfortunately, there are bound to be bugs, and HSS is not immune to them. The latest released beta fixed four major bugs, two of which caused HSS to display very little or no data and one that could cause the WHS Console to crash. Hopefully, with another round of good testing, I can verify that these bugs are indeed put to bed and make sure I didn't introduce any new ones!

 

That all said, I am respectfully requesting that folks download and try out the latest beta. It, like its predecessors, is and always will be 100% free. There are no ads, no spyware, etc. All I ask is that if you find any bugs, please visit https://www.dojonorthsoftware.net/bugtraq and post them. If an item you find is identical to an existing bug, please update the existing issue instead of creating a new one.

 

I would like Home Server SMART to be not just a robust application, but one that provides significant value to your Windows Home Server. If it's not providing good value to you, then I'm not doing a good job of providing you with good software. So all of your feedback, whether praise or criticism, is appreciated and helps make this a solid add-in for your WHS!

 

Best regards,

Matt


Thanks for all your hard work, Matt. It is truly appreciated. I want to say something, but I do not want to get you upset or anything; I'm hoping it is constructive feedback that will make Home Server SMART even better.

 

The one thing I find with Home Server SMART is that it doesn't give me a really quick good/bad picture of my drives. Instead of having to click on each drive to see what its SMART status is, I would love for there to be a page that summarizes the status of all drives at once, with 'traffic light' type icons: Red=imminent failure, Yellow=some parameters are getting worse, Green=no issues found or drive is stable. An option to make this the default page would be awesome.

 

cheers.


Hi ikon,

 

Thank you for your feedback. I certainly appreciate your suggestion, but I believe that the feature you're requesting may already be implemented. If you have not yet installed the beta, you may find that what you're looking for is already there.

 

In the production versions, released in January and November of last year, this was lacking, and it was something lots of folks were asking for.

 

The beta versions do provide, in the physical disk selector, an icon next to each drive--this seemed the most logical place to put such an icon so folks would get an "at a glance" view of their disk health. If any disks were degraded, it would be readily apparent so they could click on the disk in question. I've attached a screen capture of my server as an example. Since I don't have any disks in ill health, I deliberately lowered HSS' temperature thresholds to generate some critical conditions. Healthy disks show green; degraded ones show a yellow exclamation and critical ones show a red X. If HSS cannot get the SMART data, a blue circle with a question mark will appear--this may be the case with some USB disks, physical RAID volumes, etc.

 

hss_unhealthy.png

 

Please let me know if this is what you're looking for, and thank you for your feedback!

 

Matt


Thanks Matt. This looks pretty good in fact. What do the green 'asterisks' at the far left indicate; they're new to me? Are they perhaps the factors you monitor to determine overall health?

 

I have downloaded the latest beta. Will install soon.

 

BTW, thanks for the quick reply.


I see if you double-click the parameter, you get a pop-up box with a lot of information. This is a nice job, Matt.

oooh, cool. Will have to check that out.


This is as good a place as any to ask this question about SMART reporting. Looking at the categories - Threshold | Value | Worst - and taking Reallocated Sectors Count as an example, does the number in the Threshold column mean that it is the actual number of reallocated sectors allowed before the drive is flagged for imminent failure?


 

The green asterisks are for attributes that I regard as "super critical" attributes. Different drive manufacturers can use any set of SMART attributes they like and don't always use the same attributes the same way as other manufacturers, or even across different models of their own drives! So to be able to interpret every attribute is very difficult, and since I have a regular job--HSS is just a hobby freelance development--it's much more difficult for me to track down every manufacturer and model to get their APIs, specs, etc.

 

And so I've selected a few that I've found to be the most crucial indicators of disk death, based on my own experiences as well as reading various research papers on SMART and seeing how other SMART software producers treat them.

 

For all attributes, if the value falls equal to the non-zero threshold, the disk goes to a warning state and if the value falls to less than the non-zero threshold, the disk goes to critical (per the SMART specification, this is called Threshold Exceeds Condition, or TEC). If a value falls to zero, and the threshold is zero, I classify the disk as "geriatric" (old age).
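As a rough illustration of the rules above, here is a minimal Python sketch. It is not HSS's actual code; the function name and the state labels are made up for the example, and it only covers the threshold comparison, not the extra raw-data checks on the "super critical" attributes.

```python
def classify_attribute(value: int, threshold: int) -> str:
    """Classify one SMART attribute from its normalized value and threshold.

    Hypothetical helper: the labels mirror the states described in the post.
    """
    if threshold == 0:
        # A zero threshold can never be crossed, so TEC does not apply.
        # A value that has fallen to zero is treated as old age.
        return "geriatric" if value == 0 else "healthy"
    if value < threshold:
        return "critical"  # Threshold Exceeds Condition (TEC)
    if value == threshold:
        return "warning"
    return "healthy"


if __name__ == "__main__":
    for value, threshold in [(100, 95), (95, 95), (94, 95), (0, 0)]:
        print(value, threshold, classify_attribute(value, threshold))
```

With a threshold of 95, a value of 95 gives a warning and 94 gives critical; a value of 0 against a threshold of 0 gives "geriatric".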

 

For the "super critical" attributes, I monitor their changes more closely, and changes in their raw data will trigger warnings/degraded status. The ones marked with an asterisk (*) are ones where I haven't figured out how best to monitor them, because they're not very straightforward (e.g. Read Error Rate: Seagate drives are constantly changing the raw data value, and most Western Digital drives only use the raw data if problems are detected). The super criticals are:

 

01 - Read Error Rate*

03 - Spin-up Time*

05 - Reallocated Sector Count (aka Bad Sectors)

07 - Seek Error Rate*

0A - Spin Retry Count

B8 - End-to-End Error

C4 - Reallocation Event Count

C5 - Current Pending (Bad) Sector Count

C6 - Offline Uncorrectable Sector Count

 

I also keep track of the Ultra DMA CRC Error Count. In my experience, a high incidence of these means you've got a bad data cable or, in the case of a USB disk, a USB bridge chip that is not compatible with the OS, which can cause data corruption.

 

I might take the green asterisk off of 01, 03 and 07 until I can determine a more "sure-fire" way of determining whether or not the changes to the raw data are a problem.

 

And, I still have to get the help documentation updated before I make the final release! ;)

 

Matt


SMART is, in my opinion, the "standard that isn't" because vendors get a lot of latitude in how they use the values, thresholds, etc. For instance, I've found that my Seagate drives have their raw data for read error rate constantly changing, whereas my Western Digital drives show all zeros. And my WD drives that have the end-to-end error attribute have a threshold/value/worst of 97/100/100 but Seagate is 99/100/100. So I don't know how it can really be called a "standard" when vendors have so many degrees of freedom in how to implement it. Alas, I digress...

 

The "value" is actually a vendor-specific "normalized value" of the raw data. The "worst" column is supposed to be the worst, or lowest, the normalized value has ever fallen to. Values can go up and down. Thresholds never change. When the number in the "value" column is less than the threshold, the Threshold Exceeds Condition, or TEC, exists. When TEC exists, according to the SMART specification, the drive will fail within 24 hours. In reality, I've seen seemingly healthy drives drop dead without warning, and drives with multiple TEC survive for years. I don't try to fight the political battles of SMART...I just try to provide data that is as accurate as possible and leave it up to the user to decide whether they want to chance it if a drive goes amiss.

 

Let's consider the bad sector count. A vendor could decrease the value for every 25 bad sectors detected. So if the value is 100 and the threshold is 95, you'd need 150 bad sectors to cause TEC. In other words, 25 bad sectors decreases the value by one, another 25 drops it another one (down to 98), etc. But a different vendor may require 50 bad sectors to drop the value by one, whereas another could drop the value after only 10 bad sectors.
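To make the arithmetic concrete, here is a small Python sketch of that hypothetical vendor formula. The linear "one point per N bad sectors" model and both function names are assumptions for illustration only; real vendors do not publish their formulas.

```python
def normalized_value(bad_sectors: int, start_value: int, sectors_per_point: int) -> int:
    """Normalized value under an assumed linear vendor formula."""
    return start_value - bad_sectors // sectors_per_point


def bad_sectors_until_tec(start_value: int, threshold: int, sectors_per_point: int) -> int:
    """Bad sectors needed before the value drops below the threshold (TEC)."""
    # TEC requires value < threshold, so the value must fall to threshold - 1.
    points_to_lose = start_value - threshold + 1
    return points_to_lose * sectors_per_point


if __name__ == "__main__":
    # Value 100, threshold 95, one point lost per 25 bad sectors.
    print(bad_sectors_until_tec(100, 95, 25))  # 150
    print(normalized_value(149, 100, 25))      # 95, equal to the threshold
    print(normalized_value(150, 100, 25))      # 94, below the threshold
```

The value has to lose six points (100 down to 94) before it is below 95, which is where the 150 comes from. With 50 sectors per point the same drive would need 300, and with 10 per point only 60.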

 

In the case of bad sectors, reallocation events, end-to-end errors, offline uncorrectable sectors and spin retries, the raw data tells you (in hexadecimal) how many of those exist. So if you see all zeros, that's good! Remember that these are hex, so if you see the number "000000000035" for the raw data of an attribute, you have 53, not 35.
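If you want to check the conversion yourself, Python's built-in `int` will parse a hex string when given base 16. The helper name below is just for this example.

```python
def raw_to_count(raw_hex: str) -> int:
    """Convert a raw-data field shown as a hex string into a decimal count."""
    return int(raw_hex, 16)


if __name__ == "__main__":
    print(raw_to_count("000000000035"))  # 53
    print(raw_to_count("000000000000"))  # 0
```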

 

I regard any number of bad sectors, reallocations, pending, end-to-end errors as bad. Google published a paper on failure trends in disks (http://labs.google.com/papers/disk_failures.pdf) and they found that a great number of drives don't survive too long once bad sectors start popping up.

 

I do give you the option in HSS to "ignore" problems. So let's say you get a bad sector, and the count doesn't go up for a few weeks. You can choose to ignore it, and you will no longer be warned about it. However, if another bad sector pops up, you'll get alerted again. This way if you choose to ignore something, you won't be warned about it unless it gets worse. I've had some disks with a few bad sectors and they didn't get worse for many, many months and so I leave it up to individual users, on a disk-by-disk basis, to decide whether or not they want to ignore a problem. Just click on the degraded attribute in question and an "Ignore Problem" button will appear. You are NOT allowed to ignore TEC conditions as this is regarded as a catastrophic failure condition.
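The "ignore until it gets worse" behavior can be sketched roughly like this in Python. This is only an illustration of the idea, not how HSS is actually written; the class, its fields and its methods are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributeWatch:
    """Tracks whether one attribute's problem has been ignored by the user."""

    # Raw count at the moment the user clicked "Ignore Problem", if ever.
    ignored_at: Optional[int] = None

    def ignore(self, current_raw: int) -> None:
        self.ignored_at = current_raw

    def should_alert(self, current_raw: int, tec: bool = False) -> bool:
        if tec:
            return True  # TEC can never be ignored
        if current_raw == 0:
            return False  # nothing wrong to report
        if self.ignored_at is not None and current_raw <= self.ignored_at:
            return False  # ignored, and it has not gotten worse
        return True


if __name__ == "__main__":
    watch = AttributeWatch()
    print(watch.should_alert(1))            # True: first bad sector
    watch.ignore(1)
    print(watch.should_alert(1))            # False: ignored, unchanged
    print(watch.should_alert(2))            # True: count went up
    print(watch.should_alert(1, tec=True))  # True: TEC overrides ignore
```

Storing the count at the time of the ignore, rather than a plain on/off flag, is what lets the alert come back as soon as the count rises.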

 

Matt


Thanks for that explanation. Let's see if I have this right.

 

  • Once the value of 'Current' falls below the value of 'Threshold', the drive is in danger of imminent failure?

  • Using attribute 05 as an example, a manufacturer could decide that for every 25 reallocated sectors they will drop the value of 'Current' by 1? So unless we knew what their criteria/formula was for decrementing the value of 'Current', we could never know the true Reallocated Sectors Count?

  • Certain manufacturers can decide to hide the actual value of certain attributes, such as Read Error Rate? And they probably have their own internal calculations for when to flag Read Error Rate as critical, but until errors reach that threshold, we won't see it? Or, in the case of WD, where they report zeroes for Read Error Rate, is it that they just don't actually monitor it, period? That would be odd.
