SBS2011 Freezes during client backups


24 replies to this topic

#1 tojoski

tojoski

    HSS Pro

  • Members
  • 211 posts
  • Location: McRae, Arkansas

Posted 02 March 2012 - 11:01 AM

OK, I realize this is SBS and not WHS, but here's a doozy for you guys. I'm about to pull my hair out!

One of my clients is running SBS2011 Essentials for their file / application server. All of a sudden, after about 8 months without so much as a hiccup, the server freezes whenever a client PC is backing up to it. Server backup still runs fine, and the server runs great as long as no client attempts to do a backup.

I blew away the Client Computer Backups content and attempted to start over, no dice. I can restore the server back to an earlier date, but I'd rather avoid that if possible. There's just so much to make sure is right if I have to do a restore (SQL databases and config for Microsoft Dynamics Point of Sale).

There are 2 partitions, C and D, both of which are on a RAID 5 array. I've run a consistency check on the array and it came out fine.

So I don't think it's a disk issue; there are no problems other than the client backups.

Ideas?

#2 jmwills

jmwills

    HSS Genius

  • Donating Member
  • 5,370 posts
  • Location: Huntsville, AL - Kandahar, AFG

Posted 02 March 2012 - 11:51 AM

I'll throw some things out to check:

Permissions
RAM
AV interference
Ample space to perform the backups with Shadow Copy room

What is in the Event Logs when this starts to happen? Do the clients have any issues accessing files?
Windows 7 Desktop - Antec 100 Case, Intel D8H67BL, OCZ 550W PSU, Intel i3-530 CPU w/16GB G-Skill DDR3 1333 RAM
Server 2012 - Fractal Arc Midi, CoolerMaster M600 PSU, ASUS P8H67V, Intel i5-2500 CPU w/32GBG-Skill DDR3 1333 RAM, 90 GIG OCZ SSD OS Drive – Roles: Hyper-V (WHS-SharePoint-DC-SQL-Exchange-WSE 2012), Print Server - Rocket RAID 2720 5x2TB
HTPC Build - Silverstone GD05 Case, ASUS P7H55-M PRO, CoolerMaster M600W PSU, Intel i3-530 CPU w/4GB G-Skill DDR3 1333 RAM. OCZ 60GB SSD Drive for the OS with a 120GB WD 2.5" Blue drive for data storage.
Travel Laptop: Dell XPSL502X 15.6"

#3 ikon

ikon

    HSS Genius

  • Donating Member
  • 8,863 posts

Posted 02 March 2012 - 12:04 PM

I still wouldn't rule out a disk issue. RAID5 check utilities don't typically (AFAIK) do a surface analysis of the drives.

Also, what about the network? The NIC could be a bit flaky, such that it can handle normal traffic OK (with retransmits, but not enough to hang anything up) but falls over when the extra traffic of a backup starts.

I also agree with jmwills about RAM; you might want to run MemTest86 on it.

All of that said, I'm not, at this point, really thinking it's hardware. If it's feasible, I would arrange a time when I could go in with a spare drive, install it into the box after removing all the existing ones, install SBS on it and test backups. That would take a while (maybe 1 to 2 hours) but it would isolate whether it's a hardware or software issue. Best of all would be if you have a duplicate system at your place that you could use to pre-install SBS and just take in the already configured drive. I know not many people do it nowadays but I used to always try to maintain duplicate hardware at my shop that I could use to support clients. Course, I'm not in that business any more.....

If at first you don't succeed, do it like your mother told you.


#4 tojoski

Posted 02 March 2012 - 01:08 PM

When I first started looking at it, I suspected a disk error as well, because when I opened the recycle bin I got the error "the recycle bin for drive D is corrupt".. oh great :(

I did a chkdsk on both partitions and they came out fine.

The NIC is a possibility. It's an Asus server board, so there's a 2nd NIC ready to go; it just has to be enabled.

Thanks for pointing out the RAM, I'll run a memtest on it overnight and see if that comes up with anything. I'll also try throwing another disk in it and moving the client backup files to that drive. That should eliminate a disk error on the array.

I've pored through the event viewer for hours and there is nothing in there of any help; nothing of any importance gets logged between the freeze and the time it gets rebooted. Nothing short of a hard boot will bring it back to life when this happens, either: the cursor will still move, but everything else is unresponsive.

Thanks for the tip about the permissions too, I hadn't thought to check them.

Edited by tojoski, 02 March 2012 - 01:09 PM.


#5 ikon

Posted 02 March 2012 - 01:34 PM

Any possibility that the client is closed on weekends so you could bring it home to work on it? Or are you already ahead of me? :)



#6 jmwills

Posted 02 March 2012 - 01:42 PM

If you think it could be the NIC, you could open Performance Monitor and then kick off a client backup to see what happens during the event.

#7 ikon

Posted 02 March 2012 - 02:01 PM

Good idea. I'm still liking the idea of putting a different disk, with a fresh install of SBS, into the server and trying a backup in order to absolutely isolate whether it's hardware or software. I'm a big believer in the 'divide the problem in half repeatedly until there's only one option left' approach. Tojoski's idea of putting in a new disk and moving the backups to it is good, but it's not an absolute test of hardware vs software.



#8 tojoski

Posted 02 March 2012 - 05:14 PM

Thanks for the pointers, guys. They are open tomorrow, so Monday I'm going to attempt moving the backups to a different drive, as well as an overnight memtest.

#9 ikon

Posted 02 March 2012 - 05:31 PM

Closed Sunday & Monday?



#10 tojoski

Posted 06 March 2012 - 05:22 PM

They are open Mon - Sat, but I wasn't available over the weekend to do any troubleshooting.

The overnight memtest I ran last night came out squeaky clean, 14 passes with no issues.

I did find some interesting errors in the event log which I initially thought were unrelated, but I researched them and fixed them and so far the backups are working again.

The errors I found were:

In the Application log:
Log Name:		 Application
Source:		 VSS
Date:			 3/6/2012 1:33:48 PM
Event ID:		 8193
Task Category: None
Level:		   Error
Keywords:		 Classic
User:			 N/A
Computer:		 SERVER.AUDIOEXPRESS.local
Description:
Volume Shadow Copy Service error: Unexpected error calling routine RegOpenKeyExW(-2147483646,SYSTEM\CurrentControlSet\Services\VSS\Diag,...).  hr = 0x80070005, Access is denied.

This was corrected by granting the "Network Service" account Full Control permission on the registry key "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\VSS"

Then there was this in the System log:

Log Name:	  System
Source:		Microsoft-Windows-WinRM
Date:		  3/6/2012 1:33:48 PM
Event ID:	  10154
Task Category: None
Level:		 Warning
Keywords:	  Classic
User:		  N/A
Computer:	  SERVER.AUDIOEXPRESS.local
Description:
The WinRM service failed to create the following SPNs: WSMAN/SERVER.AUDIOEXPRESS.local; WSMAN/SERVER.

This was corrected by adding the "Validated Write to Service Principal Name" permission for the "Network Service" account to the server's computer account in Active Directory.
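
For anyone comparing a pile of these copied Event Viewer records, here is a minimal Python sketch that turns one pasted record into a dictionary and flags the access-denied code. It assumes the "Field: value" layout shown in the two records above; `parse_event`, `is_access_denied`, and `SAMPLE` are hypothetical names made up for this illustration, not part of any Windows tool.

```python
import re

# One header line of a pasted record, e.g. "Event ID:      8193".
FIELD_RE = re.compile(r"^([A-Za-z ]+):\s*(.*)$")

# Trimmed copy of the VSS record quoted in this post (assumed layout).
SAMPLE = r"""Log Name:      Application
Source:        VSS
Date:          3/6/2012 1:33:48 PM
Event ID:      8193
Level:         Error
Computer:      SERVER.AUDIOEXPRESS.local
Description:
Volume Shadow Copy Service error: Unexpected error calling routine RegOpenKeyExW(-2147483646,SYSTEM\CurrentControlSet\Services\VSS\Diag,...).  hr = 0x80070005, Access is denied.
"""


def parse_event(text):
    """Parse one pasted Event Viewer record into a dict.

    Header fields become keys; every line after 'Description:' is
    joined into record['Description'].
    """
    record = {}
    description = []
    in_description = False
    for line in text.splitlines():
        if in_description:
            description.append(line.strip())
            continue
        match = FIELD_RE.match(line)
        if not match:
            continue  # skip blank or unrecognized lines
        key = match.group(1).strip()
        value = match.group(2).strip()
        if key == "Description":
            in_description = True
            if value:
                description.append(value)
        else:
            record[key] = value
    record["Description"] = " ".join(part for part in description if part)
    return record


def is_access_denied(record):
    """True if the description carries hr = 0x80070005 (access denied)."""
    return "0x80070005" in record.get("Description", "")


if __name__ == "__main__":
    event = parse_event(SAMPLE)
    print(event["Source"], event["Event ID"], is_access_denied(event))
```

Running `parse_event` over each pasted record makes it easy to group records by Source and Event ID, which is how the two permission problems above stood out.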

I'm cautiously optimistic at this point..

#11 ikon

Posted 06 March 2012 - 05:35 PM

Good troubleshooting. Don't you just wish the logs had more real-world descriptions though?



#12 jmwills

Posted 06 March 2012 - 06:08 PM

Seems like I do remember that from a couple of years ago, but in my case they were just failing, or timing out.

Here is a good resource for SBS:

http://blog.mpecsinc.ca/

#13 ikon

Posted 07 March 2012 - 10:40 AM

The real question is why did it suddenly lose or change its settings? Are automatic updates enabled?



#14 jmwills

Posted 07 March 2012 - 11:49 AM

If I remember correctly, there is an update that breaks that permission. I found it on Susan Bradley's site.

#15 ikon

Posted 07 March 2012 - 12:20 PM

"It's not a bug, it's a feature" :D



#16 tojoski

Posted 07 March 2012 - 01:52 PM

Well, it turned out to be a short-lived victory. It froze again about 30% into the 3rd client backup.

Yesterday evening before I left, I added a 2TB drive (attached to one of the extra ports on the RAID controller) and moved the client backups over to it. At the same time I re-enabled the 2nd NIC and disabled the one we had been using prior.

I remoted into it last night and was able to do a manual backup of all the machines, and then at 2am each client did another backup without issue.

I woke back up at 4:30am and checked on it, and it was still OK, but apparently it froze shortly after they got there at about 8:30 and they had to restart it.

So it's looking more and more like it's not really an issue with the backups at all; more likely the stress that the backup process puts on the server is what causes it to freeze.

Looking at the event viewer, the server was restarted at 9:07 AM, and the last thing in the log before that was an informational entry at 8:17 AM:

Log Name:	  Application
Source:	    MSSQL$SQLEXPRESS
Date:		  3/7/2012 8:17:54 AM
Event ID:	  17137
Task Category: Server
Level:		 Information
Keywords:	  Classic
User:		  SYSTEM
Computer:	  SERVER.AUDIOEXPRESS.local
Description:
Starting up database 'ReportServer$SQLEXPRESSTempDB'.

So at this point the event viewer really isn't all that helpful.

My gut says this is a RAID controller / disk issue, but if that's the case I would have thought it would have frozen while writing to the 2TB drive as well, since that was also attached to the controller.

I can also see the SMART data for the disks from the RAID controller's interface, and everything looks peachy there:

Disk 1
Device Type  SATA(5001B4D419635010)
Device Location  Enclosure#1 Slot#1
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1405383
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  30 ºC
SMART Read Error Rate  200(51)
SMART Spinup Time  139(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)

Disk 2
Device Type  SATA(5001B4D419635011)
Device Location  Enclosure#1 Slot#2
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1315290
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  31 ºC
SMART Read Error Rate  200(51)
SMART Spinup Time  141(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)

Disk 3
Device Type  SATA(5001B4D419635012)
Device Location  Enclosure#1 Slot#3
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1313026
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  31 ºC
SMART Read Error Rate  200(51)
SMART Spinup Time  144(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)

Disk 4
Device Type  SATA(5001B4D419635013)
Device Location  Enclosure#1 Slot#4
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1304942
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  31 ºC
SMART Read Error Rate  200(51)
SMART Spinup Time  142(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)

Disk 5 (Hot Spare)
Device Type  SATA(5001B4D419635014)
Device Location  Enclosure#1 Slot#5
Model Name  WDC WD5003ABYX-01WERA0
Serial Number  WD-WMAYP1315115
Firmware Rev.  01.01S01
Disk Capacity  500.1GB
Current SATA Mode  SATA300+NCQ(Depth32)
Supported SATA Mode  SATA300+NCQ(Depth32)
Disk APM Support  Yes
Device State  Normal
Timeout Count  0
Media Error Count  0
Device Temperature  31 ºC
SMART Read Error Rate  100(51)
SMART Spinup Time  142(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)
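
To check dumps like the ones above without eyeballing every line, here is a minimal Python sketch. It assumes each `SMART ... N(M)` line is the normalized value followed by the failure threshold in parentheses, which is how these numbers read but is not confirmed by the controller's documentation. `parse_smart`, `failing`, `headroom`, and `SAMPLE_DISK5` are hypothetical names for illustration only.

```python
import re

# Matches lines such as "SMART Read Error Rate  200(51)".
# Assumption: the first number is the normalized value, the second the threshold.
SMART_RE = re.compile(r"^SMART (.+?)\s+(\d+)\((\d+)\)\s*$")

# SMART lines copied from the Disk 5 (hot spare) dump above.
SAMPLE_DISK5 = """\
Device Temperature  31 C
SMART Read Error Rate  100(51)
SMART Spinup Time  142(21)
SMART Reallocation Count  200(140)
SMART Seek Error Rate  200(0)
SMART Spinup Retries  100(0)
SMART Calibration Retries  100(0)
"""


def parse_smart(text):
    """Return {attribute name: (value, threshold)} for every SMART line."""
    attrs = {}
    for line in text.splitlines():
        match = SMART_RE.match(line.strip())
        if match:
            attrs[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return attrs


def failing(attrs):
    """Attributes whose value has dropped to or below a non-zero threshold.

    A threshold of 0 is treated here as 'never fails' (an assumption).
    """
    return [name for name, (value, threshold) in attrs.items()
            if threshold > 0 and value <= threshold]


def headroom(attrs):
    """Distance between each value and its threshold, for comparing disks."""
    return {name: value - threshold
            for name, (value, threshold) in attrs.items()}


if __name__ == "__main__":
    disk5 = parse_smart(SAMPLE_DISK5)
    print("failing:", failing(disk5))
    print("headroom:", headroom(disk5))
```

Under that reading, none of the five disks is at or below a threshold. Comparing headroom across disks does show the hot spare's Read Error Rate at 100 where the array members report 200, which is still well above the threshold of 51.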

I think the next step might be to do a restore back to a single hard drive, attached directly to the motherboard. At this point I'm up for any ideas.

Thanks guys

#17 Joe_Miner

Joe_Miner

    HSS Elite

  • Moderators
  • 1,819 posts
  • Location: Central Illinois

Posted 07 March 2012 - 04:47 PM

My apologies if this has been covered before:
  • Have you looked at overheating of the CPU?
    • Air filters plugged?
    • vents dusty/plugged?
  • Have you checked into power quality issues?
    • underpowered PSU?
    • failing UPS or no UPS and inadequate circuit protection/filtering?
    • low/high voltage?
    • What new devices have been added to the circuit that feeds your server?
      • Anything with rotating machines will add LOTS of harmonics to your circuit
      • fluorescent and energy-saving lights?
  • Grounding issues? Floating grounds?

Edited by Joe_Miner, 07 March 2012 - 04:52 PM.

WHS-V1: HP EX-487: 4*WD20EARX, Athena AP-MFATX30, 4GB G.Skill, E5200, Stablebit Scanner-|-
WHS-2011: HP N54L G7, Kingston ECC 8GB KVR1333D3E9SK2/8G, OS: 256GB M4, 5*ST3000DM001, WD PCIe USB3, R640L, Stablebit DrivePool & Scanner -|-
Labs: HP N40L, G.Skill 16GB F3-1333C9D-16GAO, Misc HDD's -|- HP N40L, KingstonECC KVR1333D3E9SK2/16G, CorsairGT60GB, 1xVR, Misc HDD's -|-
S2012DC VM Lab: Lian-Li K9WX, Z77X-UD5H, i7-3770, 32GBG.Skill, 240GBCorsairGT, 2xSamsung840Pro256G, 2xVR's, + Misc HDD's-|-
Desktop W8P64: HAF 932,GA-X58A-UD3R,i7-930,12GB,240GB Corsair GS + Misc HDD's,HD5850,Samsung Series9 & 213T+Planar PX2710MW,C920 -|-
HTPC3 W8P64WMC: GD05B, GA-Z68MX-UD2H-B3, i5-2500K, 16GB G.Skill, Corsair GTX 240GB, Crucial M4 256GB , C910, Camtasia-|-
Laptop W8P64WMC: Acer 1810T, 4GB RAM, 240GB Corsair GT SSD-|-


#18 ikon

Posted 08 March 2012 - 11:03 AM

Have you checked the internal temps? Is it possible it's overheating during backups?



#19 tojoski

Posted 08 March 2012 - 12:14 PM

I'm here now working on it, and after reading similar stories around the net about this exact controller, I'm now convinced that it's the RAID controller.

Temps are fine. In any case, I should know for sure about the controller in a few minutes.

#20 ikon

Posted 08 March 2012 - 12:29 PM

Which controller?
