Red Hat Enterprise Virtualisation (RHEV) network limitation (RHEV 3.6)

Started by Dieselboy, July 10, 2016, 09:37:17 PM


Dieselboy

For a while I have noticed that the 10Gb interfaces on our Hypervisor Hosts are never utilised very much. We had a case open with Red Hat because some Guest VMs were not migrating to other Hypervisors. The error states that the Guests are taking too long to migrate, and effectively the RHEV system gives up and cancels the migration.

I queried with Red Hat whether RHEV throttles the migration, because the utilisation on our 10Gb interfaces between Hypervisors is ridiculously low; sometimes not even 1Gbps. Turns out that RHEV throttles live migration to 32MBps.
:zomgwtfbbq:

So this meant that our 10Gb / 20Gb channels were well under-utilised, resulting in extended times for live migration. One Guest used for testing was coming back with a calculated live-migration time of 6.4 minutes. We now have this down to 30s maximum, and sometimes as low as 8s for Guests with less RAM. I'll keep on the lookout for migration times lower than this!

We did two things to combat this:
1. Configure a dedicated live-migration VLAN, with jumbo frames
2. Configure the Hosts to remove the throttle limitation

Point 1 should have been in the initial virtualisation design. Unfortunately, in our case this system has grown somewhat organically. Storage is not yet using jumbo frames because it requires downtime. It's on the list, so once some downtime is planned I hope to work this in, if the outage permits (I don't like multiple simultaneous changes!).
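For anyone wanting a picture of the host side of the jumbo-frame migration network, something like the below is roughly what you end up with on a RHEL 7 hypervisor. The interface name, VLAN ID and addressing are made up for the example; in RHEV you would normally just set the MTU on the logical network in the RHEV-M UI and let VDSM write the host config, and the switch ports along the path obviously need jumbo frames enabled as well:

# /etc/sysconfig/network-scripts/ifcfg-em1.100  (hypothetical migration VLAN)
DEVICE=em1.100
VLAN=yes
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.100.11
NETMASK=255.255.255.0
MTU=9000

You can sanity-check the path end to end with a non-fragmenting ping, e.g. ping -M do -s 8972 <other-hypervisor> (8972 being 9000 minus the IP and ICMP headers).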

Point 2 required a support case. We had to configure three lines in a config file and restart a service on each Hypervisor! In my case, because we have 10Gb/20Gb/40Gb of bandwidth (depending on where the traffic is going to/from), we are currently trialling this config:

1. Remove the max bandwidth limitation completely. This defaults to 32MBps out of the box.
2. Increase the migration timeout to 500s.
3. Remove the limitation on the time allowed per 1GB of Guest VM RAM allocation; the default is 64 seconds per 1GB. Even from this you can see that a VM with 4GB RAM has a time limit of 256 seconds, and the migration timeout is calculated as half of this value: 128s, or just over 2 minutes.
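To put some numbers on those defaults (my arithmetic, assuming the 64 seconds per 1GB figure above, halved as described), it's easy to see why larger guests run out of time:

# time allowed before RHEV gives up on the migration
echo $(( 4 * 64 / 2 ))     # 4GB guest  -> 128 seconds
echo $(( 12 * 64 / 2 ))    # 12GB guest -> 384 seconds
# and at the default 32MBps cap, a 12GB guest needs roughly 12288 / 32 = 384 seconds
# just to copy its RAM across once, before any dirtied pages are re-sent

So a big guest is borderline even in the best case, which matches what we were seeing.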

To make these changes, you need to add the following three lines of config to /etc/vdsm/vdsm.conf on each hypervisor:
migration_max_bandwidth = 0
migration_progress_timeout = 500
migration_max_time_per_gib_mem = 0

After that, restart the vdsmd service: systemctl restart vdsmd
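Nothing clever, but a quick sanity check on each hypervisor after the restart (just confirming the options are in the file and the service came back up cleanly):

grep '^migration' /etc/vdsm/vdsm.conf
systemctl status vdsmd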

The above was done on RHEV 3.6, with Hypervisor appliances built from the .ISO which are running RHEL 7.2.

After the above config changes, Guests are live-migrated in seconds. Most around the 8 or 9 second mark.

deanwebb

Ouch. 32Mb limitation... what are they, planning to use port-channel on a pair of token-ring 16Mb interfaces?
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Dieselboy

All I could think of as a reason for the low limit was that they are trying to protect the environment, storage access and running VMs by limiting the bandwidth used by live migration, so that it does not max out the links and take bandwidth away from storage (most importantly) or access to guest VMs. If that happened then, theoretically, the hypervisors would not be able to communicate with IP storage, which would then wreak havoc on the running VMs.

I can see this being more of an issue on a lower-bandwidth network, where it would be easy to max out, say, 1Gbps.

I was going to say that I've included it here because I expect the same applies to the oVirt virtualisation stuff, which is basically what RHEV is anyway. Like how CentOS is almost RHEL.

Otanx

You (or someone) mentioned the 32M limit recently, and I forwarded the associated Red Hat doc to our VM guys. They have a couple of heavy VMs they can't migrate, and this may fix it.

-Otanx

Dieselboy

That was our problem: we had a couple of OpenShift VMs with 6 and 12GB of RAM that were taking longer than the timeout value to migrate, and so RHEV cancelled the migration. The timeout value is calculated from a few variables, but in our case it was 384 seconds - that's a very long time for one VM.

After this change, I can migrate those VMs in around 10 to 20 seconds each.

I did repeatedly question support about why this limitation is there, and they confirmed my thoughts and then some. They say:
Quote
The 32MBps limitation was considered outdated and was raised to 52MBps in RHEV 3.6 (your vdsm version has 52MBps as default)

These limitations exist because, initially, the default network 'ovirtmgmt' is shared for everything. And this includes host management, vm migration, vm network, and likely includes storage (NFS/iSCSI). Once one moves to more sophisticated setups, using FibreChannel, or with 25/40/100Gbps NICs and switches/vlans dedicated to VM migration, these values can of course be raised. The most critical here would be RHEV-M fencing a host because the network is saturated with migrations. So the idea is to play safe as it's better to, by default, fail a migration than to disconnect from storage or have a perfectly fine host with running VMs fenced. Once migrations start failing, then the situation can be evaluated and, if appropriate, these values raised.

They also went into reasonable detail about how the VM is migrated: the source hypervisor copies the RAM across to a blank, paused VM on the new hypervisor while the VM is still running on the old one; when the copy is almost complete the VM is frozen and the last memory pages are copied (memory that was updated on the running VM during the copy); then the new VM is resumed and the old VM is shut down.

Red Hat link on this here: https://access.redhat.com/solutions/744423

I have still left the max concurrent live migrations at 5. This is for policy reasons: I don't want too many guests migrating at a time, in case there are not enough resources at the destination hypervisor.

If I can pick this up again then I will; otherwise I'd prefer this to be taken up by the server guys so I can get on with my UC studies.  :)