tojoski

SBS2011 Freezes during client backups


OK, I realize this is SBS and not WHS, but here's a doozy for you guys. I'm about to pull my hair out!

One of my clients is running SBS2011 Essentials as their file/application server. All of a sudden, after about 8 months without so much as a hiccup, the server freezes whenever a client PC backs up to it. Server backup still runs fine, and the server runs great as long as no client attempts a backup.

I blew away the Client Computer Backups content and tried starting over, with no luck. I can restore the server to an earlier date, but I'd rather avoid that if possible; there's so much to verify after a restore (SQL databases and configuration for Microsoft Dynamics Point of Sale).

There are 2 partitions, C and D, both on a RAID 5 array. I've run a consistency check on the array and it came out fine.

So I don't think it's a disk issue; there are no problems other than the client backups.

Ideas?


I'll throw out some things to check:

Permissions

RAM

AV interference

Ample space to perform the backups, with room for Shadow Copies

What is in the Event Logs when this starts to happen? Do the clients have any issues accessing files?


I still wouldn't rule out a disk issue. RAID 5 check utilities don't typically (AFAIK) do a surface analysis of the drives.

Also, what about the network? The NIC could be a bit flaky, such that it handles normal traffic OK (with retransmits, but not enough to hang anything up) but falls over when the extra traffic of a backup starts.

I also agree with jmwills about RAM; you might want to run MemTest86 on it.

All that said, at this point I'm not really thinking it's hardware. If it's feasible, I would arrange a time to go in with a spare drive, install it in the box after removing all the existing ones, install SBS on it, and test backups. That would take a while (maybe 1 to 2 hours), but it would isolate whether it's a hardware or software issue. Best of all would be a duplicate system at your place that you could use to pre-install SBS, so you could just take in the already configured drive. I know not many people do it nowadays, but I always used to try to maintain duplicate hardware at my shop to support clients. Of course, I'm not in that business anymore...


When I first started looking at it, I suspected a disk error as well, because when I opened the Recycle Bin I got the error "the recycle bin for drive D is corrupt". Oh great :(

I ran chkdsk on both partitions and they came out fine.

The NIC is a possibility. It's an Asus server board, so there's a second NIC ready to go; it just has to be enabled.

Thanks for pointing out the RAM. I'll run a memtest on it overnight and see if that turns up anything. I'll also try putting another disk in it and moving the client backup files to that drive; that should rule out a disk error on the array.

I've pored over the event viewer for hours and there is nothing in there of any help; nothing of any importance gets logged between the freeze and the reboot. Nothing short of a hard reboot brings it back to life when this happens, either: the cursor still moves, but everything else is unresponsive.

Thanks for the tip about permissions too; I hadn't thought to check them.

Edited by tojoski


Any possibility that the client is closed on weekends so you could bring it home to work on it? Or are you already ahead of me? :)


If you think it could be the NIC, you could open a Performance Monitor applet and then kick off a client backup to see what happens during the event.


Good idea. I'm still liking the idea of putting a different disk, with a fresh install of SBS, into the server and trying a backup in order to absolutely isolate if it's hardware or software. I'm a big believer in the 'divide the problem in half repeatedly until there's only one option left'. Tojoski's idea of putting in a new disk and moving the backups to it is good, but it's not an absolute test of hardware vs software.


Thanks for the pointers, guys. They're open tomorrow, so on Monday I'm going to try moving the backups to a different drive, as well as running an overnight memtest.


Closed Sunday & Monday?


They're open Monday through Saturday, but I wasn't available over the weekend to do any troubleshooting.

The overnight memtest I ran last night came out squeaky clean: 14 passes with no errors.

I did find some interesting errors in the event log that I initially thought were unrelated, but I researched and fixed them, and so far the backups are working again.

The errors I found were:

In the Application log:

Log Name:		 Application
Source:		 VSS
Date:			 3/6/2012 1:33:48 PM
Event ID:		 8193
Task Category: None
Level:		   Error
Keywords:		 Classic
User:			 N/A
Computer:		 SERVER.AUDIOEXPRESS.local
Description:
Volume Shadow Copy Service error: Unexpected error calling routine RegOpenKeyExW(-2147483646,SYSTEM\CurrentControlSet\Services\VSS\Diag,...).  hr = 0x80070005, Access is denied.

This was corrected by granting the "Network Service" account Full Control permission on the registry key "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\VSS".
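
The two numbers in that VSS error are standard Win32 encodings, so they can be decoded by hand. A minimal Python sketch, assuming the usual winerror.h conventions: -2147483646 is the signed 32-bit form of 0x80000002, the predefined handle for HKEY_LOCAL_MACHINE, and HRESULT 0x80070005 wraps Win32 error 5 (ERROR_ACCESS_DENIED) in facility 7 (FACILITY_WIN32). That is consistent with the permission fix on the VSS key. The helper names are hypothetical.

```python
# Sketch: decode the numeric values in the VSS event 8193 message.
# The constants are the standard Win32 values; the helper names are hypothetical.

PREDEFINED_HKEYS = {
    0x80000000: "HKEY_CLASSES_ROOT",
    0x80000001: "HKEY_CURRENT_USER",
    0x80000002: "HKEY_LOCAL_MACHINE",
    0x80000003: "HKEY_USERS",
}


def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer (as printed in the log) as unsigned."""
    return value & 0xFFFFFFFF


def decode_hkey(value: int) -> str:
    """Map a predefined registry handle to its name."""
    return PREDEFINED_HKEYS.get(to_unsigned32(value), "unknown")


def decode_hresult(hr: int) -> dict:
    """Split an HRESULT into severity, facility and code."""
    hr = to_unsigned32(hr)
    return {
        "failure": bool(hr >> 31),        # severity bit: 1 means failure
        "facility": (hr >> 16) & 0x1FFF,  # same mask as HRESULT_FACILITY; 7 is FACILITY_WIN32
        "code": hr & 0xFFFF,              # 5 is ERROR_ACCESS_DENIED
    }


if __name__ == "__main__":
    print(decode_hkey(-2147483646))    # HKEY_LOCAL_MACHINE
    print(decode_hresult(0x80070005))  # {'failure': True, 'facility': 7, 'code': 5}
```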

Then there was this in the System log:

Log Name:	  System
Source:		Microsoft-Windows-WinRM
Date:		  3/6/2012 1:33:48 PM
Event ID:	  10154
Task Category: None
Level:		 Warning
Keywords:	  Classic
User:		  N/A
Computer:	  SERVER.AUDIOEXPRESS.local
Description:
The WinRM service failed to create the following SPNs: WSMAN/SERVER.AUDIOEXPRESS.local; WSMAN/SERVER.

This was corrected by granting the "Network Service" account the "Validated Write to Service Principal Name" permission on the server's computer account in Active Directory.
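
The warning names exactly two SPNs, and they follow a simple pattern visible in the event text: the WSMAN service class followed by the server's fully qualified name, then by its short name. A minimal Python sketch of that naming pattern, useful only for working out which SPNs to expect on a given server; the function name is hypothetical, and the sketch does not register anything in Active Directory (that still requires the permission change described above).

```python
def wsman_spns(fqdn: str) -> list[str]:
    """Return the WSMAN SPNs named in event 10154: one for the FQDN, one for the short name."""
    short_name = fqdn.split(".", 1)[0]  # "SERVER.AUDIOEXPRESS.local" -> "SERVER"
    return [f"WSMAN/{fqdn}", f"WSMAN/{short_name}"]


print("; ".join(wsman_spns("SERVER.AUDIOEXPRESS.local")))
# WSMAN/SERVER.AUDIOEXPRESS.local; WSMAN/SERVER
```

To see which SPNs actually ended up on the computer account after the fix, `setspn -L SERVER` lists the SPNs registered for that account.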

I'm cautiously optimistic at this point.


Good troubleshooting. Don't you just wish the logs had more real-world descriptions though?


I do seem to remember that from a couple of years ago, but in my case the backups were just failing or timing out.

Here is a good resource for SBS:

http://blog.mpecsinc.ca/


The real question is why did it suddenly lose or change its settings? Are automatic updates enabled?


If I remember correctly, there is an update that breaks that permission. I found it on Susan Bradley's site.


"It's not a bug, it's a feature" :D


Well, it turned out to be a short-lived victory. It froze again about 30% into the 3rd client backup.

Yesterday evening, before I left, I added a 2TB drive (attached to one of the extra ports on the RAID controller) and moved the client backups over to it. At the same time, I re-enabled the second NIC and disabled the one we had been using.

I remoted into it last night and was able to do a manual backup of all the machines, and then at 2am each client did another backup without issue.

I woke up again at 4:30am and checked on it, and it was still OK, but apparently shortly after they got in at about 8:30 it froze and they had to restart it.

So it's looking more and more like it isn't really an issue with the backups at all; rather, the stress the backup process puts on the server is causing it to freeze.

Looking at the event viewer, the server was restarted at 9:07a, and the last entry in the log before that was at 8:17a, an informational one:

Log Name:	  Application
Source:	    MSSQL$SQLEXPRESS
Date:		  3/7/2012 8:17:54 AM
Event ID:	  17137
Task Category: Server
Level:		 Information
Keywords:	  Classic
User:		  SYSTEM
Computer:	  SERVER.AUDIOEXPRESS.local
Description:
Starting up database 'ReportServer$SQLEXPRESSTempDB'.

So at this point the event viewer really isn't all that helpful.

My gut says this is a RAID controller/disk issue, but if that's the case I would have expected it to freeze while writing to the 2TB drive as well, since that drive was also attached to the same controller.

I can also see the SMART data for the disks from the RAID controller's interface, and everything looks peachy there:

Disk 1

Device Type  SATA(5001B4D419635010)
Device Location  Enclosure#1 Slot#1
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1405383
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  30 °C
SMART Read Error Rate  200(51)
SMART Spinup Time  139(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)


Disk 2

Device Type  SATA(5001B4D419635011)
Device Location  Enclosure#1 Slot#2
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1315290
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  31 °C
SMART Read Error Rate  200(51)
SMART Spinup Time  141(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)


Disk 3

Device Type  SATA(5001B4D419635012)
Device Location  Enclosure#1 Slot#3
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1313026
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  31 °C
SMART Read Error Rate  200(51)
SMART Spinup Time  144(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)


Disk 4

Device Type  SATA(5001B4D419635013)
Device Location  Enclosure#1 Slot#4
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1304942
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  31 °C
SMART Read Error Rate  200(51)
SMART Spinup Time  142(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)


Disk 5 (Hot Spare)

Device Type  SATA(5001B4D419635014)
Device Location  Enclosure#1 Slot#5
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1315115
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  31 °C
SMART Read Error Rate  100(51)
SMART Spinup Time  142(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)
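
Those value(threshold) pairs are easy to eyeball for five disks, but a small parser makes the comparison explicit. A minimal Python sketch, assuming (this is not confirmed in the thread) that the first number is the normalized current value and the number in parentheses is the failure threshold, with an attribute flagged only once its value has dropped to its threshold. The function names are hypothetical.

```python
import re

# Assumption: each SMART line reads "<attribute>  <value>(<threshold>)", where <value>
# is the normalized current value and <threshold> is the failure threshold.
SMART_LINE = re.compile(r"^(SMART .+?)\s+(\d+)\((\d+)\)$")


def parse_smart(report: str) -> dict[str, tuple[int, int]]:
    """Return {attribute: (value, threshold)} for the SMART lines of one disk."""
    attrs: dict[str, tuple[int, int]] = {}
    for line in report.splitlines():
        match = SMART_LINE.match(line.strip())
        if match:  # non-SMART lines such as "Device State  Normal" are skipped
            name, value, threshold = match.groups()
            attrs[name] = (int(value), int(threshold))
    return attrs


def at_or_below_threshold(attrs: dict[str, tuple[int, int]]) -> list[str]:
    """Attributes whose normalized value has fallen to or below its threshold."""
    return [name for name, (value, threshold) in attrs.items() if value <= threshold]


# Excerpt of the hot spare (Disk 5) report from this thread.
disk5 = """\
SMART Read Error Rate  100(51)
SMART Spinup Time  142(21)
SMART Reallocation Count  200(140)
SMART Spinup Retries  100(0)
"""

attrs = parse_smart(disk5)
print(attrs["SMART Read Error Rate"])  # (100, 51)
print(at_or_below_threshold(attrs))    # []
```

On the data posted here nothing is flagged; the most noticeable difference between disks is the hot spare's Read Error Rate of 100 versus 200 on the other four, which is still well above its threshold of 51.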

I think the next step might be to restore to a single hard drive attached directly to the motherboard. At this point I'm up for any ideas.

Thanks, guys


My apologies if this has been covered before:

  • Have you looked at overheating of the CPU?
    • Air filters plugged?
    • Vents dusty/plugged?
  • Have you checked into power quality issues?
    • Underpowered PSU?
    • Failing UPS, or no UPS and inadequate circuit protection/filtering?
    • Low/high voltage?
    • What new devices have been added to the circuit that feeds your server?
      • Anything with rotating machines will add LOTS of harmonics to your circuit
      • Fluorescent and energy-saving lights?
  • Grounding issues? Floating grounds?

Edited by Joe_Miner


Have you checked the internal temps? Is it possible it's overheating during backups?


I'm here now working on it, and I'm now convinced it's the RAID controller, after reading similar stories around the net about this exact controller.

Temps are fine. In any case, I should know for sure about the controller in a few minutes.


Which controller?


Is there a firmware update for it?



Unfortunately, no. The newest firmware is from 2010, and the card was purchased last summer, so I would assume that is the initial release version.

After a round of restores and clones, I have the server up and running from a single 1TB WD Black drive. So far so good; we'll watch it and see.

On a side note, it never occurred to me that the restore process wouldn't let you restore to a smaller drive (from the 1.5TB array to a 1TB drive).

I just assumed that because very little of the disk was actually used, it would resize the volume as most cloning utilities would. I was wrong :(
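
The 1.5TB figure follows from the disk listing earlier in the thread: four 500GB disks in RAID 5 plus a hot spare. RAID 5 spends one disk's worth of space on parity, so usable capacity is (n - 1) × disk size, and the hot spare adds nothing. A minimal Python sketch of that arithmetic, plus the size rule the restore appeared to enforce here (the target disk must be at least as large as the whole source disk, regardless of how much of it is used). The function names are hypothetical, and the check only mirrors the behavior described in this thread, not the restore tool's actual logic.

```python
def raid5_usable_gb(disk_count: int, disk_size_gb: float) -> float:
    """RAID 5 stores one disk's worth of parity, so usable space is (n - 1) * size."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


def can_restore_to(target_gb: float, source_gb: float) -> bool:
    """Mirror the behavior seen in this thread: only total disk size matters, not used space."""
    return target_gb >= source_gb


array_gb = raid5_usable_gb(4, 500.1)      # Disks 1-4; the hot spare (Disk 5) adds no capacity
print(round(array_gb, 1))                 # 1500.3
print(can_restore_to(1000.0, array_gb))   # False: a 1TB disk is smaller than the ~1.5TB array
```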



Yep. That subject has been covered a number of times in these forums.


1 to 1 for a restore unless you are going to a larger drive.

