iSCSI copy causes Windows 2012 OS to hang / crash / burn

Started by Dieselboy, September 23, 2016, 04:00:30 AM


Dieselboy

Does anyone know, or can anyone think of, any reason why a Windows 2012 server hangs/crashes when trying to copy a reasonable amount of data (65GB) to an iSCSI disk over a 10Gb network?

I did the following:

1. Create the new volume on the SAN as normal
2. Open the iSCSI Initiator properties on the Windows 2012 box and discover/add the target
3. Open Disk Management and initialise/set up the new "disk"
4. In "My Computer", open the disk and try to copy a large amount of data to it

When this issue happened I built a brand new Win 2012 server from the ISO and re-ran the same test. I got the same hang/crash not long after the copy started.

Since the above was on our brand new SAN, I ran the same test with the newly built VM, but this time created the volume on our old SAN and repeated the copy.
I get the same hang, but I notice the copy gets to about 50% before it happens. The copy also runs a lot slower, since our old SAN is slower than the new one: the transfer rate drops from just over 100MB/s to 80MB/s, then to 60MB/s (all rough averages read off the screen, not exact), then climbs back to around 70MB/s and levels out.

I'm going to test with a Windows 7 VM and see what happens.

icecream-guy

:professorcat:

My Moral Fibers have been cut.

deanwebb

Quote from: ristau5741 on September 23, 2016, 06:55:26 AM
hangs because it's Microsoft?  >:D
That's the short version... but if there's something messed up in the I/O with the SAN device, that could be messing things up on the Windows box.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! (Any questions? No questions!) | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Dieselboy

Quote from: deanwebb on September 23, 2016, 08:39:30 AM
Quote from: ristau5741 on September 23, 2016, 06:55:26 AM
hangs because it's Microsoft?  >:D
That's the short version... but if there's something messed up in the I/O with the SAN device, that could be messing things up on the Windows box.

That's what I was wondering.

I've been doing a load of tests today to narrow this down. Here's a summary of the tests:

1. Original Win2012 VM to new SAN = HANG
2. New Win2012 VM built from CD to new SAN = HANG
3. New Win2012 VM built from CD to old SAN = HANG
4. Win7 VM to old SAN = GOOD
5. Win7 VM to new SAN = GOOD

I gave extra CPU to the Win2012 VM and thought that had resolved it, but it hung a short while after. I then made some changes to the NIC policy (removed the NIC policy entirely on a hypervisor for the storage NIC, for testing). Initially I thought this had resolved it too, but it eventually hung.

I did try to run Wireshark, but a full capture fills the disk very quickly.
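If I capture again I'll use a ring buffer so it can't fill the disk - dumpcap ships with Wireshark, and something like this keeps ten 100MB files and overwrites the oldest (interface number and filter are examples):

  dumpcap -i 2 -f "tcp port 3260" -b filesize:102400 -b files:10 -w iscsi.pcapng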

Jumbo frames are enabled on the NIC in Device Manager for the storage VLAN; I've set the MTU to 9000 in the NIC driver.
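One way to sanity-check the jumbo path end to end is a don't-fragment ping sized just under the MTU (8972 = 9000 minus 28 bytes of IP + ICMP headers; the address is a placeholder for the SAN portal):

  ping -f -l 8972 10.0.0.50

If that fails while a normal ping works, something in the path isn't passing jumbo frames.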
I'm now wondering if there could be a NIC driver issue or something similar.

Dieselboy

Came in this morning to see I'd not posted my update:

"I managed to find my issue, sort of: http://arstechnica.com/civis/viewtopic.php?f=17&t=118082

I'm not able to update the NIC driver as we're already on the latest, so I was going to try a copy test with Receive Side Scaling (RSS) disabled.

I went to the storage NIC and disabled RSS.
I then opened the Microsoft iSCSI initiator console - Windows hung immediately.

So I forced a reboot and checked that RSS was still disabled. Then I opened the iSCSI properties again and once more Windows hung immediately."
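For the record, I disabled RSS through the NIC's advanced properties; the PowerShell equivalent would be something along these lines (adapter name is just an example):

  Get-NetAdapterRss -Name "Storage"        # check whether RSS is currently enabled
  Disable-NetAdapterRss -Name "Storage"    # turn it off for that adapter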

So it's looking like a NIC driver issue, possibly something to do with memory. I've asked Red Hat to lab it up in their environment and test. Since the Windows 2012 OS is a clean, fully up-to-date GUI install and the only thing added is the Red Hat Virtualisation drivers/NIC driver, they should be able to confirm.

Going to see if I can mount that volume a different way as a workaround (i.e. mount from the hypervisor instead) so that I'm not stuck waiting on a fix.
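Doing it by hand on the hypervisor would be roughly this with open-iscsi (target IQN and portal IP are placeholders):

  iscsiadm -m discovery -t sendtargets -p 10.0.0.50:3260
  iscsiadm -m node -T iqn.2016-09.com.example:vol1 -p 10.0.0.50:3260 --login

although ideally the Red Hat Virtualisation manager would handle the mount itself.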

Dieselboy

Mounted the volume in the Red Hat Virtualisation Management console, which in turn mounts the volume at each hypervisor.

This produced two positive results:
1. The VM completes the copy and does not hang (over 60GB in one go)
2. Higher throughput - the Windows copy window now peaks at over 230MB/s (roughly 1.84Gb/s of payload), and monitoring the physical 10Gb switch port I'm seeing peaks of 2.81Gb/s

I'll feed this back to the Red Hat Virtualisation team on the support case I raised. But I'm happy with this "workaround". <- I say workaround, but this is probably the better way of doing it anyway. :D

Langly

The NIC driver is most likely the cause, from what you've been able to test. A few other thoughts below, since I do a lot with SAN fabrics these days.

Did you check for any iSCSI logins/logouts on the SAN and host? Any multipathing issues (or is MPIO even in use)?
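On the Windows side the initiator logs session drops to the System event log (under the iScsiPrt source, if I remember rightly), so a quick look is something like:

  Get-WinEvent -LogName System | Where-Object { $_.ProviderName -eq 'iScsiPrt' } | Select-Object -First 20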

Any ODX (Offloaded Data Transfer) support on the storage? Is your Windows host trying to use ODX at all?

For the iSCSI setup, are you fully following your storage vendor's best practices? Some require TCP offload to be disabled; others want proper segmentation of iSCSI initiators/adapters so there are multiple VLANs/subnets for the paths across the SAN. Is flow control enabled/disabled consistently on the switches/host/storage?
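You can dump what the driver is currently doing for those knobs straight from PowerShell - something like this (adapter name is an example, and the advertised property names vary per driver):

  Get-NetAdapterAdvancedProperty -Name "Storage" | Format-Table DisplayName, DisplayValue
  Get-NetAdapterChecksumOffload -Name "Storage"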

Remember, SAN means storage area network, which includes everything, not just the storage array. The storage array also has its own drivers on its NICs, which could be a problem too - I've seen some storage arrays panic due to bad drivers.

If you're using UCS or Nexus equipment, make sure you're not hitting any of the odd bugs that can cause dropped frames, leading to SCSI aborts as the host loses access. There were a few Cisco bugs you could hit just by reseating an SFP/cable, or even by rebooting a Fabric Interconnect or Nexus switch.

Dieselboy

Hi Langly, you've given me a few more things to check, so thanks!

I can't comment on some of these until I check them over, which will be next week now as I'm travelling back from Sri Lanka.

MPIO is not in use on either SAN, yet. I will be setting it up on our new SAN.
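From what I've read, on Server 2012 that's roughly the following plus a reboot (treat this as a sketch - the SAN also has to present multiple paths):

  Install-WindowsFeature Multipath-IO
  Enable-MSDSMAutomaticClaim -BusType iSCSI   # let MPIO claim iSCSI LUNs
  mpclaim -s -d                               # list the disks MPIO has claimed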

We do have UCS, and in the past I have seen many errors propagate within it. That was down to a bad SFP, plus the fact that the switches are cut-through: by the time they notice an errored frame it's already on the wire, so all they can do is log it, and the frame gets forwarded anyway. Re-seating SFPs until the errors stopped was our fix. I've not seen this for a long time, and when I hit this issue it was the first thing I checked. There are zero errors.

Thanks for the info - I'll check them out :)

Langly

Not a problem - keep us updated and I'll help where I can to point you in the right direction (and if you have to engage your vendor, I'm very good at poking the right places to make vendors dance for SAN connectivity :D)

The UCS issue may not actually present with noticeable errors in the UCS logs, so if all else fails, make sure the UCS firmware is on a good code revision and beat up Cisco to confirm you can't be impacted by any of the six or seven known bugs that can cause that type of issue.

Get MPIO going as soon as you can, too - that's where you can really take advantage of the SAN fabric for proper speeds and I/O control with block storage.

I'm thinking I should write up a nice long thread on SAN connectivity for FC and iSCSI. Maybe even dive into NFS and file-based storage.

Dieselboy

Late reply!

The issue was caused by the hypervisor mounting the VMs' iSCSI LUNs on itself as volumes. Before configuring any of this I had raised support cases asking for more information about using iSCSI LUNs as VM disks, but they couldn't really help me and basically said to trust the documentation - which was a bit light on detail, hence the case.

Eventually their virtualisation expert got involved and saw that this and other issues were related to the hypervisor mounting the volumes on itself instead of passing the traffic through to the VM. They had to put a fairly complex LVM filter on each host to prevent this. The hypervisor does need to mount LUNs that are used as "storage domains", which seems to be the most common setup, so it feels like the out-of-the-box default is to mount ALL volumes because the oVirt/KVM developers assume users will be using storage domains rather than LUN disks.
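The filter itself lives in /etc/lvm/lvm.conf on each host. The idea is to accept only the host's own devices and reject everything else, so the host's LVM never scans or activates the guest LUNs. Schematically (the device path is a placeholder - the real filter Red Hat gave us was more involved):

  filter = [ "a|^/dev/sda2$|", "r|.*|" ]

i.e. accept the host's boot device, reject every other block device.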

A storage domain can be a LUN, but the usage is different: a storage domain contains disk "images" for many VMs, and a disk image is basically just a file that the VM uses as a hard disk.

With LUN disks, the LUN is attached to the hypervisor/VM directly and the VM uses the LUN itself as a hard disk.

The difference between the two is that the virtualisation system manages the storage domain and the disk images it contains. Managing the storage domain includes tasks like snapshotting disks, creating disks, deleting disks, cloning and so on, all contained within the "storage domain", i.e. the single LUN.

With direct LUN disks, none of the above is managed by the virtualisation system at all.

deanwebb

So the solution was...? I think I missed something there.

wintermute000

I too am confused. So your KVM hosts are intercepting the iSCSI call from the VM and mounting the LUN at the hypervisor level, even though you mounted it from the VM, not the hypervisor?!
And for some reason it doesn't do this for Win7 guests, just Win Server 2012?