isthewebsitedown if you are asking, probably not. if I am asking, probably so

31 Oct 09

When to panic…

I am working on a 225 mailbox migration this weekend. The environment is basically the following:

Old Server: Windows 2003 Std/Exchange 2003 Std, all patches (pretty basic)

New environment: two Windows 2008 Enterprise mailbox servers running the Exchange 2007 Enterprise mailbox role in CCR, with a Windows 2008 Standard machine running the CAS/HT roles and serving as the File Share Witness host. Each of the mailbox servers has three volumes (CCR likes both machines to be as nearly identical as possible): a 40GB C:, a 20GB D: for log files (on a RAID10) and a 300GB E: (on a RAID6). These volumes were set up by a co-worker a few weeks ago and he did a great job with it. The servers are fast and they have great I/O on disk writes. All three machines are hosted in an ESX/Blade server environment with a SAN backend connected via Fibre Channel. This is becoming a pretty popular arrangement. The RAID10 log file volume is considered best practice for performance reasons. The mailbox store lives on the big RAID6 volume for fault tolerance.

Anyway, all machines were updated and I had tested failing over the CCR cluster nodes successfully, so at about midnight last night, I started moving mailboxes. At around 2am, the old mail server went offline. It responded to ping, but I could not RDP to it or get to any SMB shares. I couldn't get to the services either. It was, for my purposes, dead. The big issue here is that the mailbox move process was still trying to work, for all 225 mailboxes. Losing the old server mid-move caused all kinds of errors, and the net effect was to hammer the log files. And since log shipping is pretty much how CCR works, both servers started choking. In two hours, we generated 19.8GB of log files, which filled the 20GB log volume and knocked the mail store offline. I could not remount it, since there was no room for more log files.

Panic mode.

I temporarily suspended replication, created new log file folders on both cluster nodes, changed the log file location in AD, moved the files themselves over to the big data volume, and resumed replication. These steps were originally from EXPTA.com, but that site appears to be down, so I am linking to the Google cache. They should all be run in the Exchange Management Shell (launched as administrator), and only after the new log directories have been created on both cluster nodes in exactly the same location. Obviously, you will also need to change the paths to match your environment.

Step 1:  Suspend-StorageGroupCopy -Identity "First Storage Group" -SuspendComment "Moving transaction logs" -Confirm:$False

Step 2:  Move-StorageGroupPath -Identity 'First Storage Group' -LogFolderPath 'E:\ExchangeLogs' -SystemFolderPath 'E:\ExchangeLogs' -ConfigurationOnly

Step 3: move [oldpath]\*.* [newpath]

Step 4: Resume-StorageGroupCopy -Identity "exchange1\First Storage Group"

After this was completed (step 3 took a while, since I had 20GB of log files) I was able to remount the store and test via OWA. Then it was time to figure out why the Ex2003 box went down. After the moves are complete, I will run a backup so that Exchange can truncate those log files, and then move the logs back to the correct drive, as 20GB should be enough in any normal case.
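
Once Step 4 has run, it is worth confirming that the storage group really points at the new folders and that replication has caught up. The following is a minimal sketch of how that check might look, not part of the original EXPTA steps. It assumes the storage group name and the E: drive used in the steps above, and that it is run in the Exchange Management Shell on the active node.

```powershell
# Sketch only: "First Storage Group" and E: are taken from the steps above;
# change both to match your environment.

# Confirm that the storage group now points at the new log and system paths.
Get-StorageGroup -Identity "First Storage Group" |
    Format-List Name, LogFolderPath, SystemFolderPath

# Check the CCR copy status and the copy/replay queue lengths.
Get-StorageGroupCopyStatus -Identity "First Storage Group" |
    Format-List SummaryCopyStatus, CopyQueueLength, ReplayQueueLength

# Report free space on the volume that now holds the logs.
Get-WmiObject Win32_LogicalDisk -Filter "DeviceID='E:'" |
    ForEach-Object { "{0} free: {1:N1} GB" -f $_.DeviceID, ($_.FreeSpace / 1GB) }
```

If the copy status does not come back healthy, or the queue lengths keep growing instead of shrinking, replication has probably not resumed cleanly and the paths on both nodes deserve a second look.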